Compare commits

...

823 Commits

Author SHA1 Message Date
Tomasz Grabiec
5e3a52024e sstable/compaction: Use correct schema in the writing consumer
Introduced in 2a437ab427.

regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter duringthat small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.

One effect of this is storing values of some columns under a different
column.

This patch replaces all column_family::schema() accesses with accesses
to the _schema memeber which is obtained once per compaction and is
the same schema which readers use.

Fixes #4304.

Tests:

  - manual tests with hard-coded schema change injection to reproduce the bug
  - build/dev/scylla boot
  - tests/sstable_mutation_test

Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 58e7ad20eb)
2019-03-04 18:16:43 +02:00
Avi Kivity
3869b5ab51 Merge "Fix commitlog chunks overwriting each other" from Paweł
"
This series fixes a problem in the commitlog cycle() function that
confused in-memory and on-disk size of chunks it wrote to disk. The
former was used to decide how much data needs to be actually written,
and the latter was used to compute the offset of the next chunk. If two
chunk writes happened concurrently one the one positioned earlier in
the file could corrupt the header of the next one.

Fixes #4231.

Tests: unit(dev), dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup,test_commitlog_replay_with_alter_table)
"

* tag 'fix-commitlog-cycle/v1' of https://github.com/pdziepak/scylla:
  commitlog: write the correct buffer size
  utils/fragmented_temporary_buffer_view: add remove suffix

(cherry picked from commit d95dec22d9)
2019-03-04 17:58:46 +02:00
Jenkins
3cca6f5384 release: prepare for 3.0.4 by hagitsegev 2019-03-04 10:14:33 +02:00
Nadav Har'El
15188b5ea5 Materialized views: fix accidental zeroing of flow-control delay
The materialized-views flow control carefully calculates an amount of
microseconds to delay a client to slow it down to the desired rate -
but then a typo (std::min instead of std::max) causes this delay to
be zeroed, which in effect completely nullifies the flow control
algorithm.

Before this fix, experiments suggested that view flow control was
not having any effect and view backlog not bounded at all. After this
fix, we can see the flow control having its desired effect, and the
view backlog converging.

Fixes #4143.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190226161452.498-1-nyh@scylladb.com>
(cherry picked from commit da54d0fc7d)
2019-02-27 22:17:44 +02:00
Avi Kivity
96d9ebb67e Update seastar submodule
* seastar b79be02...4a06029 (1):
  > net: fix tcp load balancer accounting leak while moving socket to other shard

Fixes #4269.
2019-02-25 23:22:09 +02:00
Avi Kivity
f18b370198 Merge "sstables: mc: writer: Avoid large allocations for keeping promoted index entries" from Tomasz
"
Currently we keep the entries in a circular_buffer, which uses
a contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.

A similar problem exists for the offset vector built
in write_promoted_index().

This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.

The serialization of the first entry is deferred, so that
serialization is avoided if there will be less than 2
entries. Promoted index is not added for such partitions.

There still remains a problem that large-enough promoted index can cause OOM.

Refs #4217

Tests:
  - unit (release)
  - scylla-bench write

Branches: 3.0
"

* tag 'fix-large-alloc-for-promoted-index-v3' of github.com:tgrabiec/scylla:
  sstables: mc: writer: Avoid large allocations for maintaining promoted index
  sstables: mc: writer: Avoid double-serialization of the promoted index

(cherry picked from commit fdefee696e)
2019-02-24 15:45:32 +02:00
Avi Kivity
8e657e5685 Merge " Fix INSERT JSON with null values" from Piotr
"
Fixes #4256

This miniseries fixes a problem with inserting NULL values through
INSERT JSON interface.

Tests: unit (dev)
"

* 'fix_insert_json_with_null' of https://github.com/psarna/scylla:
  tests: add test for INSERT JSON with null values
  cql3: add missing value erasing to json parser

(cherry picked from commit 5520fc37ba)
2019-02-22 15:52:46 +02:00
Avi Kivity
4fde670abf Merge "Add DEFAULT UNSET support to JSON" from Piotr
"
This series adds DEFAULT UNSET and DEFAULT NULL keyword support
to INSERT JSON statement, as stated in #3909.

Tests: unit (release)
"

* 'add_json_default_unset_2' of https://github.com/psarna/scylla:
  tests: add DEFAULT UNSET case to JSON cql tests
  tests: split JSON part of cql query test
  cql3: add DEFAULT UNSET to INSERT JSON

(cherry picked from commit 447f953a2c)
2019-02-22 15:52:16 +02:00
Avi Kivity
923318e636 sstables: checksummed_file_writer: fix dma alignment
checksummed_file_writer does not override allocate_buffer(), so it inherits
data_source_impl's default allocate_buffer, which does not care about alignment.
The buffer is then passed to the real file_data_sink_impl, and thence to the file
itself, which cannot complete the write since it is not properly aligned.

This doesn't fail in release mode, since the Seastar allocator will supply a
properly aligned buffer even if not asked to do so. The ASAN allocator usually
does supply an aligned buffer, but not always, which causes the test to fail.

Fix by forwarding the allocate_buffer() function to the underlying data_source.

Fixes #4262.
Branches: branch-3.0
Message-Id: <20190221184115.6695-1-avi@scylladb.com>

(cherry picked from commit 34b254381f)
2019-02-22 09:28:55 +02:00
Amnon Heiman
35cc09b150 scylla-housekeeping: Read JSON as UTF-8 string for older Python 3 compatibility
Python 3.6 is the first version to accept bytes to the json.loads(),
which causes the following error on older Python 3 versions:

  Traceback (most recent call last):
    File "/usr/lib/scylla/scylla-housekeeping", line 175, in <module>
      args.func(args)
    File "/usr/lib/scylla/scylla-housekeeping", line 121, in check_version
      raise e
    File "/usr/lib/scylla/scylla-housekeeping", line 116, in check_version
      versions = get_json_from_url(version_url + params)
    File "/usr/lib/scylla/scylla-housekeeping", line 55, in get_json_from_url
      return json.loads(data)
    File "/usr/lib64/python3.4/json/__init__.py", line 312, in loads
      s.__class__.__name__))
  TypeError: the JSON object must be str, not 'bytes'

To support those older Python versions, convert the bytes read to utf8
strings before calling the json.loads().

Fixes #4239
Branches: master, 3.0

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190218112312.24455-1-amnon@scylladb.com>
(cherry picked from commit 750b76b1de)
2019-02-20 11:03:12 +02:00
Asias He
22f41f04ba range_streamer: Futurize add_ranges
It might take long time for get_all_ranges_with_sources_for and
get_all_ranges_with_strict_sources_for to calculate which cause reactor
stall. To fix, run them in a thread and yield. Those functions are used in
the slow path, it is ok to yield more than needed.

Fixes #3639

Message-Id: <63aa7794906ac020c9d9b2984e1351a8298a249b.1536135617.git.asias@scylladb.com>
(cherry picked from commit 8edf3defdf)
2019-02-20 11:03:12 +02:00
Avi Kivity
fae11c0d6b fragmented_temporary_buffer: fix read_exactly() during premature end-of-stream
read_exactly(), when given a stream that does not contain the amount of data
requested, will loop endlessly, allocating more and more memory as it does, until
it fails with an exception (at which point it will release the memory).

Fix by returning an empty result, like input_stream::read_exactly() (which it
replaces). Add a test case that fails without a fix.

Affected callers are the native transport, commitlog replay, and internal
deserialization.

Fixes #4233.

Branches: master, branch-3.0
Tests: unit(dev)
Message-Id: <20190216150825.14841-1-avi@scylladb.com>
(cherry picked from commit 03531c2443)
2019-02-20 11:03:11 +02:00
Nadav Har'El
82016c07f2 Materialized views: limit size of row batching during bulk view building
The bulk materialized-view building processes (when adding a materialized
view to a table with existing data) currently reads the base table in
batches of 128 (view_builder::batch_size) rows. This is clearly better
than reading entire partitions (which may be huge), but still, 128 rows
may grow pretty large when we have rows with large strings or blobs,
and there is no real reason to buffer 128 rows when they are large.

Instead, when the rows we read so far exceed some size threshold (in this
patch, 1MB), we can operate on them immediately instead of waiting for
128.

As a side-effect, this patch also solves another bug: At worst case, all
the base rows of one batch may be written into one output view partition,
in one mutation. But there is a hard limit on the size of one mutation
(commitlog_segment_size_in_mb, by default 32MB), so we cannot allow the
batch size to exceed this limit. By not batching further after 1MB,
we avoid reaching this limit when individual rows do not reach it but
128 of them did.

Fixes #4213.

This patch also includes a unit test reproducing #4213, and demonstrating
that it is now solved.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190214093424.7172-1-nyh@scylladb.com>
(cherry picked from commit fec562ec8f)
2019-02-16 21:54:41 +02:00
Botond Dénes
282ccbb072 service/storage_service: fix pre-bootstrap wait for schema agreement
When bootstrapping, a node should to wait to have a schema agreement
with its peers, before it can join the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.

To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has to checks. First it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.

To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitely make sure that we have really talked to
*at least* one other node, before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.

Fixes: #4196

Branches: 3.0, 2.3

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
(cherry picked from commit 2125e99531)
2019-02-16 19:04:08 +02:00
Avi Kivity
873e0f0e14 auth: password_authenticator: protect against NULL salted_hash
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().

Protect against that by using get_opt() and failing authentication if we see a NULL.

Fixes #4168.

Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
(cherry picked from commit da9628c6dc)
2019-02-11 23:55:53 +02:00
Jenkins
c36c58c64e release: prepare for 3.0.3 by hagitsegev 2019-02-11 18:58:02 +02:00
Duarte Nunes
0685c8f5bc Merge 'Fix misdetection of remote counter shards' from Paweł
"
The code reading counter cells form sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.

Fixes #4206

Tests: unit(release)
"

* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
  tests/sstables: test counter cell header with large number of shards
  sstables/counters: fix remote counter shard detection

(cherry picked from commit d2d885fb93)
2019-02-11 14:09:55 +02:00
Calle Wilund
92cf2934c6 tls: Use a default prio string disabling TLS1.0 forcing min 128bits
Fixes #4010

Unless user sets this explicitly, we should try explicitly avoid
deprecated protocol versions. While gnutls should do this for
connections initiated thusly, clients such as drivers etc might
use obsolete versions.

Message-Id: <20190107131513.30197-1-calle@scylladb.com>
(cherry picked from commit ba6a8ef35b)
2019-02-05 19:45:13 +02:00
Calle Wilund
ed2fb65732 commitlog_replayer: Bugfix: finding truncation positions uses local var ref
"uuid" was ref:ed in a continuation. Works 99.9% of the time because
the continuation is not actually delayed (and assuming we begin the
checks with non-truncated (system) cf:s it works).
But if we do delay continuation, the resulting cf map will be
borked.

Fixes #4187.

Message-Id: <20190204141831.3387-1-calle@scylladb.com>
(cherry picked from commit 9cadbaa96f)
2019-02-04 20:25:17 +02:00
Gleb Natapov
ce2957d106 messaging_service: do not forget to close stream when sending it to another side failed
Fixes #4124

Message-Id: <20190131091857.GC3172@scylladb.com>
(cherry picked from commit a70374d982)
2019-02-03 13:00:20 +02:00
Avi Kivity
b31d94e317 Update seastar submodule
* seastar 5226277...b79be02 (1):
  > rpc: support closing streaming when only sink or source was created

Ref #4124.
2019-02-03 12:59:43 +02:00
Asias He
da80f27f44 migration_manager: Fix nullptr dereference in maybe_schedule_schema_pull
Commit 976324bbb8 changed to use
get_application_state_ptr to get a pointer of the application_state. It
may return nullptr that is dereferenced unconditionally.

In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:

   4 nodes in the tests

   n1, n2, n3, n4 are started

   n1 is stopped

   n1 is changed to use different shard config

   n1 is restarted ( 2019-01-27 04:56:00,377 )

The backtrace happened on n2 right fater n1 restarts:

   0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
   1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
   2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
   3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
   4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
   5 Segmentation fault on shard 0.
   6 Backtrace:
   7 0x00000000041c0782
   8 0x00000000040d9a8c
   9 0x00000000040d9d35
   10 0x00000000040d9d83
   11 /lib64/libpthread.so.0+0x00000000000121af
   12 0x0000000001a8ac0e
   13 0x00000000040ba39e
   14 0x00000000040ba561
   15 0x000000000418c247
   16 0x0000000004265437
   17 0x000000000054766e
   18 /lib64/libc.so.6+0x0000000000020f29
   19 0x00000000005b17d9

We do not know when this backtrace happened, but according to log from n3 an n4:

   INFO 2019-01-27 04:56:22,154 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
   INFO 2019-01-27 04:56:21,594 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL

We can be sure the backtrace on n2 happened before 04:56:21 - 19 seconds (the
delay the gossip notice a peer is down), so the abort time is around 04:56:0X.
The migration_manager::maybe_schedule_schema_pull that triggers the backtrace
must be scheduled before n1 is restarted, because it dereference
application_state pointer after it sleeps 60 seconds, so the time
maybe_schedule_schema_pull is called is around 04:55:0X which is before n1 is
restarted.

So my theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time
n1 has SCHEMA application_state, when n1 restarts, n2 gets new application
state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule
wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.

Fixes: #4148
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Message-Id: <9ef33277483ae193a49c5f441486ee6e045d766b.1548896554.git.asias@scylladb.com>
(cherry picked from commit 28d6d117d2)
2019-02-01 13:00:38 +02:00
Jenkins
5174b1cd13 release: prepare for 3.0.2 by slivne 2019-01-30 16:14:34 +02:00
Nadav Har'El
9ba608cae4 cql3: really ensure retrieval of columns for filtering
Commit fd422c954e aimed to fix
issue #3803. In that issue, if a query SELECTed only certain columns but
did filtering (ALLOW FILTERING) over other unselected columns, the filtering
didn't work. The fix involved adding the columns being filtered to the set
of columns we read from disk, so they can be filtered.

But that commit included an optimization: If you have clustering keys
c1 and c2, and the query asks for a specific partition key and c1 < 3 and
c2 > 3, the "c1 < 3" part does NOT need to be filtered because it is already
done as a slice (a contiguous read from disk). The committed code erroneously
concluded that both c1 and c2 don't need to be filtered, which was wrong
(c2 *does* need to be read and filtered).

In this patch, we fix this optimization. Previously, we used the "prefix
length", which in the above example was 2 (both c1 and c2 were filtered)
but we need a new and more elaborate function,
num_prefix_columns_that_need_not_be_filtered(), to determine we can only
skip filtering of 1 (c1) and cannot skip the second.

Fixes #4121. This patch also adds a unit test to confirm this.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Message-Id: <20190123131212.6269-1-nyh@scylladb.com>
(cherry picked from commit 76f1fcc346)
2019-01-23 21:11:05 +02:00
Avi Kivity
f7c5cbc645 build: fix libdeflate object file corruption during parallel build
libdeflate's build places some object files in the source directory, which is
shared between the debug and release build. If the same object file (for the two
modes) is written concurrently, or if one more reads it while the other writes it,
it will be corrupted.

Fix by not building the executables at all. They aren't needed, and we already
placed the libraries' objects in the build directory (which is unshared). We only
need the libraries anyway.

Fixes #4130.
Branches: master, branch-3.0
Message-Id: <20190123145435.19049-1-avi@scylladb.com>

(cherry picked from commit c83ae62aed)
2019-01-23 21:11:05 +02:00
Duarte Nunes
cf4b4d4878 Merge 'hinted handoff: cache cf mappings' from Vlad
"
Cache cf mappings when breaking in the middle of a segment sending so
that the sender has them the next time it wants to send this segment
for where it left off before.

Also add the "discard" metric so that we can track hints that are being
discarded in the send flow.
"

Fixes #4122

* 'hinted_handoff_cache_cf_mappings-v1' of https://github.com/vladzcloudius/scylla:
  hinted handoff: cache column family mappings for segments that were not sent out in full
  hinted handoff: add a "discarded" metric

(cherry picked from commit 88c7c1e851)
2019-01-23 17:14:29 +02:00
Asias He
45bb1ba1b7 streaming: Futurize estimate_partitions
The loop can take a long time if the number of sstables and/or ranges
are large. To fix, futurize the loop.

Fixes: #4005

Message-Id: <3b05cb84f3f57cc566702142c6365a04b075018e.1545290730.git.asias@scylladb.com>
(cherry picked from commit bcba6b4f4d)
2019-01-22 18:20:21 +02:00
Botond Dénes
28294ed42e auth/service: unregister migration listener on stop()
Otherwise any event that triggers notification to this listener would
trigger a heap-use-after-free.

Refs: #4107

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6bbd609371a2312aed7571b05119d59c7d103d7.1548067626.git.bdenes@scylladb.com>
(cherry picked from commit f229dff210)
2019-01-22 17:54:36 +02:00
Jenkins
3c4f8cf6ed release: prepare for 3.0.1 by hagitsegev 2019-01-20 12:42:00 +02:00
Botond Dénes
7b94264ae5 mutlishard_mutation_query(): use correct reader concurrency semaphore
The multishard mutation query used the semaphore obtained from
`database::user_read_concurrency_sem()` to pause-resume shard readers.
This presented a problem when `multishard_mutation_query()` was reading
from system tables. In this case the readers themselves would obtain
their permits from the system read concurrency semaphore. Since the
pausing of shard readers used the user read semaphore, pausing failed to
fulfill its objective of alleviating pressure on the semaphore the reads
obtained their permits from. In some cases this lead to a deadlock
during system reads.
To ensure the correct semaphore is used for pausing-resuming readers,
obtain the semaphore from the `table` object. To avoid looking up the
table on every pause or resume call, cache the semaphores when readers
are created.

Fixes: #4096

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c784a3cd525ce29642d7216fbe92638fa7884e88.1547729119.git.bdenes@scylladb.com>
(cherry picked from commit 4537ec7426)
2019-01-17 18:08:01 +02:00
Duarte Nunes
22a085fbd3 Merge 'Fix filtering with LIMIT and paging' from Piotr
"
Before this series the limit was applied per page instead
of globally, which might have resulted in returning too many
rows.

To fix that:
 1. restrictions filter now has a 'remaining' parameter
    in order to stop accepting rows after enough of them
    have already been accepted
 2. pager passes its row limit to restrictions filter,
    so no more rows than necessary will be served to the client
 3. results no longer need to be trimmed on select_statement
    level

Tests: unit (release)
"

Fixes #4100

* 'fix_filtering_limit_with_paging_3' of https://github.com/psarna/scylla:
  tests: add filtering+limit+paging test case
  tests: allow null paging state in filtering tests
  cql3: fix filtering with LIMIT with regard to paging

(cherry picked from commit 7505815013)
2019-01-17 18:07:41 +02:00
Tomasz Grabiec
2d181da656 row_cache: Fix crash on memtable flush with LCS
Presence checker is constructed and destroyed in the standard
allocator context, but the presence check was invoked in the LSA
context. If the presence checker allocates and caches some managed
objects, there will be alloc-dealloc mismatch.

That is the case with LeveledCompactionStrategy, which uses
incremental_selector.

Fix by invoking the presence check in the standard allocator context.

Fixes #4063.

Message-Id: <1547547700-16599-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 32f711ce56)
2019-01-15 21:16:13 +02:00
Nadav Har'El
d427a23d42 scylla_util.py: make view_hints_directory setting optional
It is optional to set "view_hints_directory", so we shouldn't insist that
it is defined in scylla.yaml on upgrade.

Fixes #4091.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190114125225.10794-1-nyh@scylladb.com>
(cherry picked from commit 9062750089)
2019-01-14 16:59:40 +02:00
Shlomi Livne
37ab553f02 release: prepare for 3.0.0
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2019-01-12 22:24:08 +02:00
Raphael S. Carvalho
6a3f4fb3f9 database: Fix race condition in sstable snapshot
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.

Refs #4051.

(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
2019-01-11 13:48:12 +02:00
Avi Kivity
8168d13887 Merge "Fix UDTs representation in serialization header" from Piotr
"
Tests: unit(release)
"

Fixes #4073.

* commit 'FETCH_HEAD~1':
  Add test for serialization header with UDT
  Fix UDT names in serialization header

(cherry picked from commit 4a6aeced59)
2019-01-11 07:48:23 +02:00
Benny Halevy
13bdec6eb4 sstables: mc: sign-extend serialization_header min_local_deletion_time_base and min_ttl_base
Refs #4074
Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190110141439.1324-1-bhalevy@scylladb.com>
(cherry picked from commit 2dc3776407)
2019-01-11 07:47:45 +02:00
Benny Halevy
57e7081d86 sstables: mc: sign-extend delta local_deletion_time and delta ttl
Follow Cassandra's encoding so that values that are less than the
baseline encoding_stats will wrap-around in 64-bits rather tham 32.

Fixes #4074
Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190109192703.18371-1-bhalevy@scylladb.com>
(cherry picked from commit 60323b79d1)
2019-01-09 23:16:00 +02:00
Avi Kivity
2fcae36d96 tests: mutation_source_test: generate valid utf-8 data
test_fast_forwarding_across_partitions_to_empty_range uses an uninitialized
string to populate an sstable, but this can be invalid utf-8 so that sstable
cannot be sstabledumped.

Make it valid by using make_random_string().

Fixes #4040.
Message-Id: <20190107193240.14409-1-avi@scylladb.com>

(cherry picked from commit d8adbeda11)
2019-01-08 14:53:55 +02:00
Avi Kivity
ba62dcd5c7 Update seastar submodule
* seastar 618bc23...5226277 (1):
  > iotune: Initialize io_rates member variables

Fixes #4064.
2019-01-08 11:39:50 +02:00
Nadav Har'El
515399ce17 materialized views: move hints to top-level directory
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:

   WARN: database - Skipping undefined keyspace: view_pending_updates

This spurious warning annoyed users. But moreover, we could have bigger
problems if the user actually tries to create a keyspace with that name.

So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
(cherry picked from commit da090a5458)
2019-01-07 22:01:56 +02:00
Benny Halevy
772c4b5fdc sstables: mc: expired_liveness_ttl should be max int32_t rather than max uint32_t
Corresponding to Cassandra's EXPIRED_LIVENESS_TTL = Integer.MAX_VALUE;

Fixes #4060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190107172457.20430-1-bhalevy@scylladb.com>
(cherry picked from commit 40410465d7)
2019-01-07 21:59:59 +02:00
Avi Kivity
874d88c98d Update seastar submodule
* seastar 08f1258...618bc23 (1):
  > perftune.py: tune only active NVMe HW queues on i3 AWS instances

Ref #3831.
2019-01-06 12:59:22 +02:00
Avi Kivity
5a178ff635 compaction_controller: increase minimum shares to 50 (~5%) for small-data workloads
The workload in #3844 has these characteristics:
 - very small data set size (a few gigabytes per shard)
 - large working set size (all the data, enough for high cache miss rate)
 - high overwrite rate (so a compaction results in 12X data reduction)

As a result, the compaction backlog controller assigns very few shares to
compaction (low data set size -> low backlog), so compaction proceeds very slowly.
Meanwhile, we have tons of cache misses, and each cache miss needs to read from a
large number of sstables (since compaction isn't progressing). The end result is
a high read amplification, and in this test, timeouts.

While we could declare that the scenario is very artificial, there are other
real-world scenarios that could trigger it. Consider a 100% write load
(population phase) followed by 100% read. Towards the end of the last compaction,
the backlog will drop more and more until compaction slows to a crawl, and until
it completes, all the data (for that compaction) will have to be read from its
input sstables, resulting in read amplification.

We should probably have read amplification affect the backlog, but for now the
simpler solution is to increase the minimum shares to 50 so that compaction
always makes forward progress. This will result in higher-than-needed compaction
bandwidth in some low write rate scenarios so we will see fluctuations in request
rate (what the controller was designed to avoid), but these fluctioations will be
limited to 5%.

Since the base class backlog_controller has a fixed (0, 0) point, remove it
and add it to derived classes (setting it to (0, 50) for compaction).

Fixes #3844 (or at least improves it).
Message-Id: <20181231162710.29410-1-avi@scylladb.com>

(cherry picked from commit b0980ba7c6)
2019-01-04 13:28:43 +02:00
Avi Kivity
d67439b910 Revert "release: prepare for 3.0-rc4"
This reverts commit 21a5a4c76a. we were already
at rc4, and the commit only changes the syntax (from the incorrect one to
the correct one).
2019-01-04 12:34:38 +02:00
Hagit Segev
21a5a4c76a release: prepare for 3.0-rc4 2019-01-03 23:53:47 +02:00
Tomasz Grabiec
f818d6ee3f tests: cql_test_env: Start the compaction manager
Broken in fee4d2e

Not doing this results in compaction requests being ignored.

One effect of this is that perf_fast_forward produces many sstables instead of one.

Refs #3984
Refs #3983

Message-Id: <1544719540-10178-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 245a0d953a)
2019-01-03 14:56:42 +01:00
Tomasz Grabiec
20c2745592 Merge "Improve times to start / stop the nodes" from Glauber
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.

Because there are no incoming writes, the amount of shares in memtable
flushes decrease as memory used decreases and that can cause the startup
procedure to take a long time.

We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot flushes will still
be fighting for resources with compactions and startup will take longer
than it could.

By making sure that flushes are this point running alone we improve the
user experience by making sure the startup is as fast as it can be.

There is a similar problem at the drain level, which is also fixed in this
series.

Fixes #3958

* git@github.com:glommer/scylla.git faster-restart
  compaction_manager: delay initialization of the compaction manager.
  drain: stop compactions early

(cherry picked from commit 3e70ae1d06)
2019-01-03 14:56:16 +01:00
Avi Kivity
cf5c72561c release: prepare for scylla-3.0-rc4 2019-01-03 13:15:58 +02:00
Botond Dénes
53b85e5d32 querier_cache: unregister queriers evicted due to expired TTL
Currently queriers evicted due to their TTL expiring are not
unregistered from the `reader_concurrency_semaphore`. This can cause a
use-after-free when the semaphore tries to evict the same querier at
some later point in time, as the querier entry it has a pointer to is
now invalid.

Fix by unregistering the querier from the semaphore before destroying
the entry.

Refs: #4018
Refs: #4031

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4adfd09f5af8a12d73c29d59407a789324cd3d01.1546504034.git.bdenes@scylladb.com>
(cherry picked from commit e5a0ea390a)
2019-01-03 13:14:02 +02:00
Avi Kivity
2456cf63f2 querier_cache: unregister querier from reader_concurrency_semaphore during eviction
In insert_querier(), we may evict older queriers to make room for the new one.
However, we forgot to unregister the evicted queriers from
reader_concurrency_semaphore. As a result, when reader_concurrency_semaphore
eventually wanted to evict something, it saw an inactive_read_handle that was
not connected to a querier_cache::entry, and crashed on use-after-free.

Fix by evicting through the inactive_read_handle associated with the querier
to be evicted. This removes traces of the querier from both
reader_concurrency_semaphore and querier_cache. We also have to massage the
statistics since querier_inactive_read::evict() updates different counters.

Fixes #4018.

Tests: unit(release)
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190102175023.26093-1-avi@scylladb.com>
(cherry picked from commit 918d255168)
2019-01-03 13:14:00 +02:00
Pekka Enberg
c1f6ce4251 Merge 'Fixes for the view_update_from_staging_generator' from Duarte
"This series contains a couple of fixes to the
view_update_from_staging_generator, the object responsible for
generating view updates from sstables written through streaming.

Fixes #4021"
* 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla:
  db/view/view_update_from_staging_generator: Break semaphore on stop()
  db/view/view_update_from_staging_generator: Restore formatting
  db/view/view_update_from_staging_generator: Avoid creating more than one fiber

(cherry picked from commit 96172b7bca)
2018-12-29 20:22:54 +02:00
Duarte Nunes
fc82eb5586 streaming/stream_session: Only stage sstables for tables with views
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.

Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>

(cherry picked from commit bab7e6877b)
2018-12-28 20:52:15 +02:00
Avi Kivity
f58e592345 Merge "Fix use-after-free when destroying partition_snapshots in the background"from Tomasz
"
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive, the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.

The fix is to update snapshots's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.

When memtable is destroyed without being moved to cache there is no
problem because the snapshot would be queued into memtable's cleaner,
which will be drained on destruction from all snapshots.

Introduced in f3da043 (in >= 3.0-rc1)

Fixes #4030.

Tests:

  - mvcc_test (debug)

"

* tag 'fix-snapshot-merging-use-after-free-v1.1' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test_snapshot_merging_after_container_is_destroyed
  tests: mvcc: Introduce mvcc_container::migrate()
  tests: mvcc: Make mvcc_partition move-constructible
  tests: mvcc: Introduce mvcc_container::make_not_evictable()
  tests: mvcc: Allow constructing mvcc_container without a cache_tracker
  mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
  mvcc: partition_snapshot: Introduce migrate()
  mutation_cleaner: impl: Store a back-reference to the owning mutation_cleaner

(cherry picked from commit 8e2f6d0513)
2018-12-28 13:37:29 +02:00
Duarte Nunes
6375b1e5b7 streaming/stream_session: Don't use table reference across defer points
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur in a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.

Refs #4021

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
(cherry picked from commit 66e45469b2)
2018-12-28 10:58:26 +02:00
Gleb Natapov
7ca24efb39 streaming: always read from rpc::source until end-of-stream during mutation sending
rpc::source cannot be abandoned until EOS is reached, but current code
does not obey it if error code is received, it throws exception instead that
aborts the reading loop. Fix it by moving exception throwing out of the
loop.

Fixes: #4025

Message-Id: <20181227135051.GC29458@scylladb.com>
(cherry picked from commit 37b4043677)
2018-12-27 18:59:59 +02:00
Avi Kivity
32ebaaa585 Update libdeflate submodule
* libdeflate 17ec6c9...e7e54ea (1):
  > build: improve out-of-tree build with multiple output trees

(cherry picked from commit d6a22c50cb)
2018-12-25 14:41:24 +02:00
Nadav Har'El
a88c722a4c build_ami.sh: need to check out the right branch of scylla-jmx
This patch is for branch 3.0's build_ami.sh.
It checks out the latest master branch of scylla-jmx, which not only
sounds wrong, it also doesn't work: the latest master of scylla-jmx
can only build a "relocatable package" but branch 3.0 doesn't work with
those.

This patch needs to be applied only in branch 3.0.

It should probably be made more general, though... build_ami.sh should
have been able to figure out what is the *current* branch, and if it is
branch-3.0 or next-3.0, check out branch-3.0 of the other repositories.
But I'm not sure how to do this correctly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181217214610.4498-1-nyh@scylladb.com>
2018-12-25 12:37:11 +02:00
Tomasz Grabiec
07582d6c10 sstables: index_reader: Fix abort when _trust_pi == trust_promoted_index::no
data is not moved-from if _trust_pi == trust_promoted_index::no, which
triggers the assert on data.empty(). We should make it empty
unconditionally.

Message-Id: <1545408731-14333-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 419c771791)
2018-12-24 11:45:14 +02:00
Tomasz Grabiec
18c89edbf7 sstables: mc: reader: Use enum class instead of variant
variant is an overkill here.

Message-Id: <1545409014-16289-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 07d153c769)
2018-12-24 11:45:09 +02:00
Duarte Nunes
5558fa8c44 service/storage_proxy: Protect against empty mutation when storing hint
mutation_holder::get_mutation_for() can return nullptr's, so protect
against those when storing a hint.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181221194853.98775-2-duarte@scylladb.com>
(cherry picked from commit e6a8883228)
2018-12-23 12:27:27 +02:00
Duarte Nunes
f678eb52cd service/storage_proxy: Protect against empty mutation in mutation_holder
The per_destination_mutation holder can contain empty mutations,
so make sure release_mutation() skips over those.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181221194853.98775-1-duarte@scylladb.com>
(cherry picked from commit 6c4a34f378)
2018-12-23 12:27:25 +02:00
Tomasz Grabiec
dfb23f4b38 sstables: mc: index_reader: Handle CK_SIZE split across buffers properly
we incorrectly falled-through to the next state instead of returning
to read more data.

This can manifest in a number of ways, an abort, or incorrect read.

Introduced in 917528c

Fixes #4011.

Message-Id: <1545402032-4114-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d2f96a60f6)
2018-12-21 20:40:35 +02:00
Tomasz Grabiec
502ddf158a sstables: mc: reader: Avoid unnecessary index reads on fast forwarding
When the next pending fragments are after the start of the new range,
we know there is no need to skip.

Caught by perf_fast_forward --datasets large-part-ds3 \
                            --run-tests=large-partition-slicing

Refs #3984
Message-Id: <1545308006-16389-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 7afe2bad51)
2018-12-21 20:40:35 +02:00
Paweł Dziepak
0ccb0a127a Merge "Optimize slicing sstable readers" from Tomasz
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:

  - Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
  - Avoiding reading of the head of the partition when slicing
  - Avoiding parsing rows which are going to be skipped [MC-format-only]
"

* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
  sstables: mc: reader: Skip ignored rows before parsing them
  sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
  sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
  sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
  sstables: mc: parser: Allow the consumer to skip the whole row
  sstables: continuous_data_consumer: Introduce skip()
  sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
  sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
  sstables: reader: Do not read the head of the partition when index can be used
  sstables: mc: mutation_fragment_filter: Check the fast-forward window first
  sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()

(cherry picked from commit e6d26a528f)
2018-12-21 20:40:35 +02:00
Avi Kivity
b94997be0d Merge " Extract MC sstable writer to a separate compilation unit" from Tomasz
"
The motivation is to keep code related to each format separate, to make it
easier to comprehend and reduce incremental compilation times.

Also reduces dependency on sstable writer code by removing writer bits from
sstales.hh.

The ka/la format writers are still left in sstables.cc, they could be also extracted.
"

* 'extract-sstable-writer-code' of github.com:tgrabiec/scylla:
  sstables: Make variadic write() not picked on substitution error
  sstables: Extract MC format writer to mc/writer.cc
  sstables: Extract maybe_add_summary_entry() out of components_writer
  sstables: Publish functions used by writers in writer.hh
  sstables: Move common write functions to writer.hh
  sstables: Extract sstable_writer_impl to a header
  sstables: Do not include writer.hh from sstables.hh
  sstables: mc: Extract bound_kind_m related stuff into mc/types.hh
  sstables: types: Extract sstable_enabled_features::all()
  sstables: Move components_writer to .cc
  tests: sstable_datafile_test: Avoid dependency on components_writer

(cherry picked from commit b023e8b45d)
2018-12-21 20:40:35 +02:00
Benny Halevy
d3a5b10cb8 sstables_stats: writer_impl: move common members to base class
To be used by sstable_writer for stats collection.

Note that this patch is factored out so it can be verified with no
other change in functionality.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6853c1677d)
2018-12-21 20:40:35 +02:00
Benny Halevy
48f3f899ac sstable: make write_crc, write_digest, and new_sstable_component_file private methods
Prepare for per-sstable sub directory.
Also, these functions get most of their parameters from the sst at hand so they might
as well be first class members.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ad5f1e4fbb)
2018-12-21 20:40:35 +02:00
Paweł Dziepak
c4f745276c Merge "Optimize sstable writing of large partitions" from Tomasz
"
This series contains several optimizations of the MC format sstable writer, mainly:
  - Avoiding output_stream when serializing into memory (e.g. a row)
  - Faster serialization of primitive types when serializing into memory

I measured the improvement in throughput (frag/s) using perf_fast_forward for
datasets with a single large partition with many small rows:

  - 10% for a row with a single cell of 8 bytes
  - 10% for a row with a single cell of 100 bytes
  -  9% for a row with a single cell of 1000 bytes
  - 13% for a row with 6 cells of 100 bytes
"

* tag 'avoid-output-stream-in-sstable-writer-v2' of github.com:tgrabiec/scylla:
  bytes_ostream: Optimize writing of fixed-size types
  sstables: mc: Write temporary data to bytes_ostream rather than file_writer
  sstables: mc: Avoid double-serialization of a range tombstone marker
  sstables: file_writer: Generalize bytes& writer to accept bytes_view
  sstables: Templetize write() functions on the writer
  sstables: Turn m_format_write_helpers.cc into an impl header
  sstables: De-futurize file_writer
  bytes_ostream: Implement clear()
  bytes_ostream: Make initial chunk size configurable

(cherry picked from commit e3f53542c9)
2018-12-21 20:40:35 +02:00
Hagit Segev
392c7dee3c release: prepare for 3.0-rc3 2018-12-21 20:19:50 +02:00
Gleb Natapov
04e982f909 streaming: hold to sink while close() is running and call close on error as well
Currently if something throws while streaming in mutation sending loop
sink is not closed. Also when close() is running the code does not hold
onto sink object. close() is async, so sink should be kept alive until
it completes. The patch uses do_with() to hold onto sink while close is
running and run close() on error path too.

Fixes #4004.

Message-Id: <20181220155931.GL3075@scylladb.com>
(cherry picked from commit 393269d34b)
2018-12-20 19:50:47 +02:00
Amnon Heiman
97a8cc149e node_exporter_install: switch to node_exporter 0.17
The newer version of node_exporter comes with important bug fixes, that
is especially important for I3.metal is not supported with the older
version of node_exporter.

The dashboards can now support both the new and the old version of
node_exporter.

Fixes #3927

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181210085251.23312-1-amnon@scylladb.com>
(cherry picked from commit 09c2b8b48a)
2018-12-20 19:12:00 +02:00
Avi Kivity
dbe347811c Merge "materialized views: Apply backpressure from view replicas" from Duarte
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.

To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMEs. A loader running on a
n1-standard-8 instance run cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.

More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.

Design document:
    https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo

Fixes #2538
"

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
  service/storage_proxy: Release mutation as early as possible
  service/storage_proxy: Delay replica writes based on view update backlog
  service/storage_proxy: Get the backlog of a particular base replica
  service/storage_proxy: Add counters for delayed base writes
  main: Start and stop the view_update_backlog_broker
  service: Distribute a node's view update backlog
  service: Advertise view update backlog over gossip
  service/storage_proxy: Send view update backlog from replicas
  service/storage_proxy: Prepare to receive replica view update backlog
  service/storage_proxy: Expose local view update backlog
  tests/view_schema_test: Add simple test for db::view::node_update_backlog
  db/view: Introduce node_update_backlog class
  db/hints: Initialize current backlog
  database: Add counter for current view backlog
  database: Expose current memory view update backlog
  idl: Add db::view::update_backlog
  db/view: Add view_update_backlog
  database: Wait on view update semaphore for view building
  service/storage_proxy: Use near-infinite timeouts for view updates
  database: generate_and_propagate_view_updates no longer needs a timeout
  ...

(cherry picked from commit b66f59aa3d)
2018-12-20 19:11:56 +02:00
Avi Kivity
8f2d24bb8f config: remove "to be removed before release" notice mc sstable config
The "enable_sstables_mc_format" config item help text wants to remove itself
before release. Since scylla-3.0 did not get enough mc format mileage, we
decided to leave it in, so the notice should be removed.

Fixes #4003.
Message-Id: <20181219082554.23923-1-avi@scylladb.com>

(cherry picked from commit dd51c659f7)
2018-12-19 19:08:36 +02:00
Nadav Har'El
689e11c892 scylla.spec: add libatomic
In some cases which I've yet to understand, build fails without libatomic.
We need to add it to the mock build machine.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181218154757.25236-1-nyh@scylladb.com>
2018-12-19 12:55:00 +00:00
Nadav Har'El
1766c793a8 build_rpm.sh: put temporary mock in build/, not /var/lib.
build_rpm.sh uses "mock" to build an entire Scylla build environment,
which easily spans more than 15 gigabytes. mock, by defaults, puts this
build directory in a subdirectory of /var/lib/mock. There is no reason
why temporary build products need to be in the root directory: Some machines
(like mine) don't have that much free space in the root directory making it
impossible to use this script on such machines. and it's too easy
to leave temporary files there without noticing.

With this patch, the mock directories are put in build/mock/ instead of
/var/lib/mock.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181217195952.15154-1-nyh@scylladb.com>
2018-12-19 12:54:37 +02:00
Avi Kivity
0b09008cde Merge "Make sstable reader fail on unknown colum names in MC format" from Piotr
"
Before the reader was just ignoring such columns but this creates a risk of data loss.

Refs #2598
"

* 'haaawk/2598/v3' of github.com:scylladb/seastar-dev:
  sstables: Add test_sstable_reader_on_unknown_column
  sstables: Exception on sstable's column not present in schema
  sstables: store column name in column_translation::column_info
  sstables: Make test_dropped_column_handling test dropped columns

(cherry picked from commit b0cb69ec25)
2018-12-18 16:23:51 +00:00
Piotr Jastrzebski
713e60f690 sstables: Extract mp_row_cosumer_m::check_schema_mismatch
This method will contain common logic used in multiple places
and reduce code duplication.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <bbda2f4ea4f9325055f096dc549f63b1bb03d3b6.1543311990.git.piotr@scylladb.com>
(cherry picked from commit 4366302c4c)
2018-12-18 16:23:47 +00:00
Paweł Dziepak
7b6841f947 Merge "Check for schema mismatch after dropping dead cells" from Piotr
"
Previously we were checking for schema incompatibility between current schema and sstable
serialization header before reading any data. This isn't the best approach because
data in sstable may be already irrelevant due to column drop for example.

This patchset moves the check after actual data is read and verified that it has
a timestamp new enough to classify it as nonobsolete.

Fixes #3924
"

* 'haaawk/3924/v3' of github.com:scylladb/seastar-dev:
  sstables: Enable test_schema_change for MC format
  sstables3: Throw error on schema mismatch only for live cells
  sstables: Pass column_info to consume_*_column
  sstables: Add schema_mismatch to column_info
  sstables: Store column data type in column_info
  sstables: Remove code duplication in column_translation

(cherry picked from commit 62ea153629)
2018-12-18 15:27:53 +00:00
Tomasz Grabiec
f124b7026f Merge 'Add tests for schema changes' from Paweł
This series adds a generic test for schema changes that generates
various schema and data before and after an ALTER TABLE operation. It is
then used to check correctness of mutation::upgrade() and sstable
readers and lead to the discovery of #3924 and #3925.

Fixes #3925.

* https://github.com/pdziepak/scylla.git schema-change-test/v3.1
  schema_builder: make member function names less confusing
  converting_mutation_partition_applier: fix collection type changes
  converting_mutation_partition_applier: do not emit empty collections
  sstable: use format() instead of sprint()
  tests/random-utils: make functions and variables inline
  tests: add models for schemas and data
  tests: generate schema changes
  tests/mutation: add test for schema changes
  tests/sstable: add test for schema changes

(cherry picked from commit 564b328b2e)
2018-12-18 14:57:50 +00:00
Avi Kivity
28cca751d1 Merge "Don't binary compare compressed sstables in test_write_many_partitions_* tests" from Piotr
"
Compression is not deterministic so instead of binary comparing the sstable files we just read data back
and make sure everything that was written down is still present.

Tests: unit(release)
"

* 'haaawk/binary-compare-of-compressed-sstables/v3' of github.com:scylladb/seastar-dev:
  sstables: Remove compressed parameter from get_write_test_path
  sstables: Remove unused sstable test files
  sstables: Ensure compare_sstables isn't used for compressed files
  sstables: Don't binary compare compressed sstables
  sstables: Remove debug printout from test_write_many_partitions

(cherry picked from commit 1ff6b8fb96)
2018-12-18 14:53:52 +00:00
Duarte Nunes
21d08aa41e Merge 'Fix evictable shard reader related issues' from Botond
"
Recently some additional issues were discovered related to recent
changes to the way inactive readers are evicted and making shard readers
evictable.
One such issue is that the `querier_cache` is not prepared for the
querier to be immediately evicted by the reader concurrency semaphore,
when registered with it as an inactive read (#3987).
The other issue is that the multishard mutation query code was not
fully prepared for evicted shard readers being re-created, or failing
why being re-created (#3991).

This series fixes both of these issues and adds a unit test which covers
the second one. I am working on a unit test which would cover the second
issue, but it's proving to be a difficult one and I don't want to delay
the fixes for these issues any longer as they also affect 3.0.

Fixes: #3987
Fixes: #3991
"

* 'evictable-reader-related-issues/branch-3.0/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: reset failed readers to inexistent state
  multishard_mutation_query: handle missing readers when dismantling
  multishard_mutation_query: add support for keeping stats for discarded partitions
  multishard_mutation_query: expect evicted reader state when creating reader
  multishard_mutation_query: pretty-print the reader state in log messages
  querier_cache: check that the query wasn't evicted during registering
  reader_concurrency_semaphore: use the correct types in the constructor
  reader_concurrency_semaphore: add consume_resources()
  reader_concurrency_semaphore::inactive_read_handle: add operator bool()
2018-12-18 13:36:42 +00:00
Botond Dénes
f0b5170fa6 multishard_mutation_query: reset failed readers to inexistent state
When attempting to dismantling readers, some of the to-be-dismantled
readers might be in a failed state. The code waiting on the reader to
stop is expecting failures, however it didn't do anything besides
logging the failure and bumping a counter. Code in the lower layers did
not know how to deal with a failed reader and would trigger
`std::bad_variant_access` when trying to process (save or cleanup) it.
To prevent this, reset the state of failed readers to `inexistent_state`
so code in the lower layers doesn't attempt to further process them.

(cherry picked from commit b4c3aab4a7)
2018-12-18 14:46:56 +02:00
Botond Dénes
3b617e873c multishard_mutation_query: handle missing readers when dismantling
When dismantling the combined buffer and the compaction state we are no
longer guaranteed to have the reader each partition originated from. The
reader might have been evicted and not resumed, or resuming it might
have failed. In any case we can no longer assume the originating reader
of each partition will be present. If a reader no longer exists,
discard the partitions that it emitted.

(cherry picked from commit 9cef043841)
2018-12-18 14:46:56 +02:00
Botond Dénes
4eb9836e64 multishard_mutation_query: add support for keeping stats for discarded partitions
In the next patches we will add code that will have to discard some of
the dismantled partitions/fragments/bytes. Prepare the
`dismantle_buffer_stats` struct for being able to track the discarded
partitions/fragments/bytes in addition to those that were successfully
dismantled.

(cherry picked from commit 438bef333b)
2018-12-18 14:46:56 +02:00
Botond Dénes
46af353209 multishard_mutation_query: expect evicted reader state when creating reader
Previously readers were created once, so `make_remote_reader()` had a
validation to ensure readers were not attempted at being created more
than once. This validation was done by checking that the reader-state is
either `inexistent` or `successful_lookup`. However with the
introduction of pausing shard readers, it is now possible that a reader
will have to be created and then re-created several times, however this
validation was not updated to expect this.
Update the validation so it also expects the reader-state to be
`evicted`, the state the reader will be if it was evicted while paused.

(cherry picked from commit ce52436af4)
2018-12-18 14:46:54 +02:00
Botond Dénes
76f70c676e multishard_mutation_query: pretty-print the reader state in log messages
(cherry picked from commit 1effb1995b)
2018-12-18 14:34:33 +02:00
Botond Dénes
afc9f0e177 querier_cache: check that the query wasn't evicted during registering
The reader concurrency semaphore can evict the querier when it is
registered as an inactive read. Make the `querier_cache` aware of this
so that it doesn't continue to process the inserted querier when this
happens.
Also add a unit test for this.

(cherry picked from commit 5780f2ce7a)
2018-12-18 14:34:33 +02:00
Botond Dénes
c899191ad5 reader_concurrency_semaphore: use the correct types in the constructor
Previously there was a type mismatch for `count` and `memory`, between
the actual type used to store them in the class (signed) and the type
of the parameters in the constructor (unsigned).
Although negative numbers are completely valid for these members,
initializing them to negative numbers don't make sense, this is why they
used unsigned types in the constructor. This restriction can backfire
however when someone intends to give these parameters the maximum
possible value, which, when interpreted as a signed value will be `-1`.
What's worse the caller might not even be aware of this unsigned->signed
conversion and be very suprised when they find out.
So to prevent surprises, expose the real type of these members, trusting
the clients of knowing what they are doing.

Also add a `no_limits` constructor, so clients don't have to make sure
they don't overflow internal types.

(cherry picked from commit e1d8237e6b)
2018-12-18 14:34:33 +02:00
Botond Dénes
a3563e5f7d reader_concurrency_semaphore: add consume_resources()
(cherry picked from commit dfd649a6b4)
2018-12-18 14:34:33 +02:00
Botond Dénes
78c5b09694 reader_concurrency_semaphore::inactive_read_handle: add operator bool()
(cherry picked from commit 21b44adbfe)
2018-12-18 14:34:33 +02:00
Amnon Heiman
a51878205a node-exporter.service: Update command line to fix service startup
The upgrade to node_exporter 0.17 commit
09c2b8b48a ("node_exporter_install: switch
to node_exporter 0.17") caused the service to no longer start. Turns out
node_exported broke backwards compatibility of the command line between
0.15 to 0.16. Fix it up.

While fixing the command line, all the collector that are enabled by
default were removed.

Fixes #3989

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
[ penberg@scylladb.com: edit commit message ]
Message-Id: <20181213114831.27216-1-amnon@scylladb.com>
(cherry picked from commit 571755e117)
2018-12-18 08:33:04 +02:00
Avi Kivity
46efc08882 Update seastar submodule
* seastar 6700dc3...08f1258 (1):
  > reactor: disable nowait aio due to a kernel bug

Fixes #3996.
2018-12-17 17:00:14 +02:00
Avi Kivity
c95433c967 Update seastar submodule
* seastar 1651a2a...6700dc3 (3):
  > build: link against libatomic
  > core/semaphore: Allow combining semaphore_units()
  > core/shared_ptr: Allow releasing a lw_shared_ptr to a non-const object

Fixes #3996.
2018-12-17 15:53:01 +02:00
Piotr Sarna
df3b6fb4a8 cql3: refuse to create index on COMPACT STORAGE with ck
To follow C* compatibility, creating an index on COMPACT STORAGE
table should be disallowed not only on base primary keys,
but also when the base table contains clustering keys.
Message-Id: <ab40c39730aff2e164d11ee5159ff62b8ec9e8e8.1544698186.git.sarna@scylladb.com>

(cherry picked from commit 6743af5dbd)
2018-12-17 09:45:43 +02:00
Piotr Sarna
44ee43bb17 cql3: add refusing to create an index on static column
Secondary indexes on static columns are not yet supported,
so creating such index should return an appropriate error.

Fixes #3993
Message-Id: <700b0a71e80da52d2d5250edacc12626b55681fa.1544785127.git.sarna@scylladb.com>

(cherry picked from commit 63bd43e57e)
2018-12-17 09:44:52 +02:00
Asias He
aac363ca86 storage_service: Notify NEW_NODE only when a node is new node
This is a backport of CASSANDRA-11038.

Before this, a restarted node will be reported as new node with NEW_NODE
cql notification.

To fix, only send NEW_NODE notification when the node was not part of
the cluster

Fixes: #3979
Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test
Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com>
(cherry picked from commit 71c1681f6c)
2018-12-16 13:59:19 +02:00
Duarte Nunes
9e6cc5b024 service/storage_proxy: Embed the expire timer in the response handler
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.

It will also make the timeout easily accessible inside the handler,
for future patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
(cherry picked from commit f8878238ed)
2018-12-13 13:24:09 +00:00
Duarte Nunes
13b72c7b92 Merge branch 'gossip: Send node UP event to cql client after cql server is up' from Asias
"
This is a backport of CASSANDRA-8236.

Before this patch, scylla sends the node UP event to cql client when it
sees a new node joins the cluster, i.e., when a new node's status
becomes NORMAL. The problem is, at this time, the cql server might not
be ready yet. Once the client receives the UP event, it tries to
connect to the new node's cql port and fails.

To fix, a new application_sate::RPC_READY is introduced, new node sets
RPC_READY to false when it starts gossip in the very beginning and sets
RPC_READY to true when the cql server is ready.

The RPC_READY is a bad name but I think it is better to follow Cassandra.

Nodes with or without this patch are supposed to work together with no
problem.

Refs #3843
"

* 'asias/node_up_down.upstream.v4.1' of github.com:scylladb/seastar-dev:
  storage_service: Use cql_ready facility
  storage_service: Handle application_state::RPC_READY
  storage_service: Add notify_cql_change
  storage_service: Add debug log in notify_joined
  storage_service: Add extra check in notify_joined
  storage_service: Add notify_joined
  storage_service: Add debug log in notify_up
  storage_service: Add extra check in notify_up
  storage_service: Add notify_up
  storage_service: Make notify_left log debug level
  storage_service: Introduce notify_left
  storage_service: Add debug log in notify_down
  storage_service: Introduce notify_down
  storage_service: Add set_cql_ready
  gossip: Add gossiper::is_cql_ready
  gms: Add endpoint_state::is_cql_ready
  gms: Add application_state::RPC_READY
  gms: Introduce cql_ready in versioned_value

(cherry picked from commit a42b2895c2)
2018-12-13 12:06:59 +00:00
Avi Kivity
6b011fbe0a build: pass C compiler configuration in dist package build
Just like we allow customizing the C++ compiler, we should allow customizing
the C compiler.

Ref #3978
Message-Id: <20181211172821.30830-1-avi@scylladb.com>

(cherry picked from commit fa96e07e6b)
2018-12-12 14:41:38 +02:00
Tomasz Grabiec
9dd4e1b01f sstables: index_reader: Avoid schema copy in advance_to()
Introduced in 7e15e43.

Exposed by perf_fast_forward:

  running: large-partition-skips on dataset large-part-ds1
  Testing scanning large partition with skips.
  Reads whole range interleaving reads with skips according to read-skip pattern:
  read    skip      time (s)     frags     frag/s (...)
  1       0         5.268780   8000000    1518378

  1       1        31.695985   4000000     126199
Message-Id: <1544614272-21970-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0a853b8866)
2018-12-12 14:38:49 +02:00
Nadav Har'El
e91c741ef5 secondary indexes: fail attempts to create a CUSTOM INDEX
Cassandra supports a "CREATE CUSTOM INDEX" to create a secondary index
with a custom implementation. The only custom implementation that Cassandra
supports is SASI. But Scylla doesn't support this, or any other custom
index implementation. If a CREATE CUSTOM INDEX statement is used, we
shouldn't silently ignore the "CUSTOM" tag, we should generate an error.

This patch also includes a regression test that "CREATE CUSTOM INDEX"
statements with valid syntax fail (before this patch, they succeeded).

Fixes #3977

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-2-nyh@scylladb.com>
(cherry picked from commit a0379209e6)
2018-12-12 00:32:35 +00:00
Nadav Har'El
b18e9e115d Fix typo in error message
Interestingly, this typo was copied from the original Cassandra source
code :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-1-nyh@scylladb.com>
(cherry picked from commit 36db4fba23)
2018-12-12 00:32:35 +00:00
Avi Kivity
0b86ab0d2a build: build libdeflate with user selected C compiler
If the user specified a C compiler, use it to build libdeflate.

Fixes #3978.
Message-Id: <20181211145604.14847-1-avi@scylladb.com>

(cherry picked from commit 34a31a807d)
2018-12-11 19:24:24 +02:00
Duarte Nunes
97cd9108d6 db/system_distributed_keyspace: Create the schema with min_timestamp
Different nodes can concurrently create the distributed system
keyspace on boot, before the "if not exists" clause can take effect.

However, the resulting schema mutations will be different since
different nodes use different timestamps. This patch forces the
timestamps to be the same across all nodes, so we save some schema
mismatches.

This fixes a bug exposed by ca5dfdf, whereby the initialization of the
distributed system keyspace is done before waiting for schema
agreement. While waiting for schema agreement in
storage_service::join_token_ring(), the node still hasn't joined the
ring and schemas can't be pulled from it, so nodes can deadlock. A
similar situation can happen between a seed node and a non-seed node,
where the seed node progresses to a different "wait for schema
agreement" barrier, but still can't make progress because it can't
pull the schema from the non-seed node still trying to join the ring.

Finally, it is assumed that changes to the schema of the current
distributed system keyspace tables will be protected by a cluster
feature and a subsequent schema synchronization, such that all nodes
will be at a point where schemas can be transferred around.

Fixes #3976

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181211113407.20075-1-duarte@scylladb.com>
(cherry picked from commit 89ae3fbf11)
2018-12-11 14:53:30 +00:00
Hagit Segev
f81fe96b0b release: prepare for 3.0-rc2 2018-12-11 12:32:34 +02:00
Avi Kivity
91ce3a7957 sstables: fix overflow in clustering key blocks header bit access
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.

Fix by using a 64-bit mask.

Found by Fedora 29's ubsan.

Fixes #3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>

(cherry picked from commit 7c7da0b462)
2018-12-10 14:10:27 +02:00
Takuya ASADA
af7e58f4c5 dist/offline_installer/redhat: fix missing dependencies
Offline installer with Scylla 3.0 causes dependency error on CentOS, added
missing packages.

Fixes #3969

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181207020711.23055-1-syuu@scylladb.com>
(cherry picked from commit a2d0ebf4d9)
2018-12-10 14:10:15 +02:00
Amos Kong
bd3373b511 scylla_setup: only ask for nic in interactive mode
Current scylla_setup still asks for nic even nic is already assigned in cmdline.

Fixes #3908

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <6b867e17a5583c495c771a37d5fa1e8366b1d61b.1542337635.git.amos@scylladb.com>
(cherry picked from commit 09a3b11c2f)
2018-12-09 19:26:34 +02:00
Gleb Natapov
4820130abe storage_proxy: fix crash during write timeout callback invocation
rh_entry address is captured inside timeout's callback lambda, so the
structure should not be moved after it is created. Change the code to
create rh_entry in-place instead of moving it into the map.

Fixes #3972.

Message-Id: <20181206164043.GN25283@scylladb.com>
(cherry picked from commit 9fb79bf379)
2018-12-09 15:25:52 +02:00
Tomasz Grabiec
9b299241e5 Merge "Fixes for collecting stats in SST3 + more tests" from Vladimir
This patchset fixes several remaining issues found during thorough
testing of SSTables 3.x statistics and enriches ~30 unit tests with
statistics validation against Cassandra-generated golden copies.

* https://github.com/argenet/scylla/tree/projects/sstables-30/sst3-tests-statistics/v1:
  sstables: Enforce estimated_partitions in generate_summary() to be
    always positive.
  sstables: Don't enforce default max_local_deletion_time value for 'mc'
    files.
  sstables: Update TTL/local deletion stats for non-expiring and live
    liveness_info.
  sstables: Collect statistics when writing RT markers to SSTables 3.x.
  tests: Return sstable_assertions from validate_read() helper.
  tests: Introduce helper for validating stats metadata in SSTables 3.x
    tests.
  tests: Add stats metadata validation to test_write_static_row.
  tests: Add stats metadata validation to
    test_write_composite_partition_key.
  tests: Add stats metadata validation to
    test_write_composite_clustering_key.
  tests: Add stats metadata validation to test_write_wide_partitions.
  tests: Add stats metadata validation to write_ttled_row
  tests: Add stats metadata validation to write_ttled_column
  tests: Add stats metadata validation to write_deleted_column
  tests: Add stats metadata validation to write_deleted_row
  tests: Add stats metadata validation to write_collection_wide_update
  tests: Add stats metadata validation to
    write_collection_incremental_update
  tests: Add stats metadata validation to write_multiple_partitions
  tests: Add stats metadata validation to write_multiple_rows
  tests: Add stats metadata validation to
    write_missing_columns_large_set
  tests: Add stats metadata validation to write_different_types
  tests: Add stats metadata validation to write_empty_clustering_values
  tests: Add stats metadata validation to write_large_clustering_key
  tests: Add stats metadata validation to write_compact_table
  tests: Add stats metadata validation to write_user_defined_type_table
  tests: Add stats metadata validation to write_simple_range_tombstone
  tests: Add stats metadata validation to
    write_adjacent_range_tombstones
  tests: Add stats metadata validation to
    write_non_adjacent_range_tombstones
  tests: Add stats metadata validation to
    write_mixed_rows_and_range_tombstones
  tests: Add stats metadata validation to
    write_adjacent_range_tombstones_with_rows
  tests: Add stats metadata validation to
    write_range_tombstone_same_start_with_row
  tests: Add stats metadata validation to
    write_range_tombstone_same_end_with_row
  tests: Add stats metadata validation to
    write_two_non_adjacent_range_tombstones
  tests: Delete unused (bogus) Statistics.db file from write_ SST3
    tests.

(cherry picked from commit bb24d378b2)
2018-12-08 14:08:46 +02:00
Avi Kivity
745a98e151 Merge "Fix deadlocking multishard readers" from Botond
"
Multishard combining readers, running concurrently, with limited
concurrency and no timeout may deadlock, due to inactive shard readers
sitting on permits. To avoid this we have to make sure that all shard
readers belonging to a multishard combining readers, that are not
currently active, can be evicted to free up their permits, ensuring that
all readers can make progress.
Making inactive shard readers evictable is the solution for this
problem, however the original series introducing this solution
(414b14a6bd) did not go all they way and
left some loose ends. These loose ends are tied up by this mini-series.
Namely, two issues remained:
* The last reader to reach EOS was not paused (made evictable).
* Readers created/resumed as part of a read-ahead were not paused
  immediately after finishing the read-ahead.

This series fixes both of these.

Fixes: #3865
Tests: unit(release, debug)
"

* 'fix-multishard-reader-deadlock/v1' of https://github.com/denesb/scylla:
  multishard_combining_reader: pause readers after reading ahead
  multishard_combining_reader: pause *all* EOS'd readers

(cherry picked from commit 21b4b2b9a1)
2018-12-08 14:08:46 +02:00
Avi Kivity
b9c99af18b Merge "Fix tombstone histogram when writing SSTables 3.x" from Vladimir
"
This patchset extends a number of existing tests to check SSTables
statistics for 'mc' format and fixes an issue discovered with the help
of one of the tests.

Tests: unit {release}
"

* 'projects/sstables-30/check-stats/v2' of https://github.com/argenet/scylla:
  tests: Run sstable_timestamp_metadata_correcness_with_negative with all SSTables versions.
  tests: Run sstable_tombstone_histogram_test for all SSTables versions.
  tests: Run min_max_clustering_key_test on all SSTables versions.
  tests: Expand test_sstable_max_local_deletion_time_2 to run for all SSTables versions.
  tests: Run test_sstable_max_local_deletion_time on all SSTables versions.
  tests: Extend test checking tombstones histogram to cover all SSTables versions.
  sstables: Properly track row-level tombstones when writing SSTables 3.x.
  tests: Run min_max_clustering_key_test_2 for all SSTables versions.
  tests: Make reusable_sst() helper accept SSTables version parameter.

(cherry picked from commit f073ea5f87)
2018-12-08 14:08:44 +02:00
Asias He
cded9c7ac7 gossip: Fix race in real_mark_alive and shutdown msg
In dtest, we have

   self.check_rows_on_node(node1, 2000)
   self.check_rows_on_node(node2, 2000)

which introduce the following cluster operations:

1) Initially:

- node1 up
- node2 up

2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)

3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)

Since there is no guarantee the order of Operation A and Operation B, it
is possible node2 will mark node1 as status=shutdown and mark node1 is
UP.

In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"

As a result, dtest fails to see node2 reports node1 is up when it boots
node1 and fail the test.

   TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']

In the log we can see node1 marked as DOWN and UP almost at the same time on node2:

   INFO  2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
   INFO  2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown

Fixes #3940

Tests: dtest with 20 consecutive succesful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
(cherry picked from commit eeeb2da7bb)
2018-12-08 13:42:43 +02:00
Gleb Natapov
4acfc5ed8f hints: make hints manager more resilient to unexpected directory content
Currently if hints directory contains unexpected directories Scylla fails to
start with unhandled std::invalid_argument exception. Make the manager
ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>

(cherry picked from commit b4a8802edc)
2018-12-08 13:42:43 +02:00
Gleb Natapov
cb9199bc7f hints: add auxiliary function for scanning high level hints directory
We scan hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates directory name to a shard is duplicated. It is simple now, so
not a bit issue but in case it grows better have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>

(cherry picked from commit 9433d02624)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
695ff5383f Merge "Correct the usage of row ttl and add write-read test" from Piotr
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.

* seastar-dev.git haaawk/sst3/write-read-test/v3:
  Fix use_row_ttl condition
  Add test_all_data_is_read_back

(cherry picked from commit b8c405c019)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
730e48bf60 configure.py: Always add a rule for building gen_crc_combine_table
Fixes a build failure when only the scylla binary was selected for
building like this:

  ./configure.py --with scylla

In this case the rule for gen_crc_combine_table was missing, but it is
needed to build crc_combine_table.o

Message-Id: <1544010138-21282-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit edbef7400b)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
af6d4f40e1 utils/gz: Fix compilation on non-x86 archs
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.

Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 9a4c00beb7)
2018-12-08 13:42:43 +02:00
Avi Kivity
9d8507de09 Merge "Optimize checksum_combine() for CRC32" from Tomek
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.

This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.

Performance results using perf_checksum and second buffer of length 64 KiB:

zlib CRC32 combine:   38'851   ns
libdeflate CRC32:      4'797   ns
fast_crc32_combine():     11   ns

So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.

Tested on i7-5960X CPU @ 3.00GHz

Performance was also evaluated using sstable writer benchmark:

  perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
     --value-size=10000 --rows 1000000 --datasets small-part

It yielded 9% improvement in median frag/s (129'055 vs 117'977).

Refs #3874
"

* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
  tests: perf_checksum: Test fast_crc32_combine()
  tests: Rename libdeflate_test to checksum_utils_test
  tests: libdeflate: Add more tests for checksum_combine()
  tests: libdeflate: Check both libdeflate and default checksummers
  sstables: Use fast_crc_combine() in the default checksummer
  utils/gz: Add fast implementation of crc32_combine()
  utils/gz: Add pre-computed polynomials
  utils/gz: Import Barett reduction implementation from libdeflate
  utils: Extract clmul() from crc.hh

(cherry picked from commit b098b5b987)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
07c980845d utils/crc: Add clmul_u32() implementation
Needed for backporting dependent changes.

Extracted from:

      commit 79136e895f
      Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
      Date:   Thu Nov 1 03:26:16 2018 +0000

        utils/crc: calculate crc in parallel
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
c52b8239d0 configure.py: Compile against Westmere on x86
Needed for backporting dependent changes.

Extracted from:

  commit 79136e895f
  Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
  Date:   Thu Nov 1 03:26:16 2018 +0000

    utils/crc: calculate crc in parallel
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
5a07a4fac8 configure.py: Use armv8-a+crc+crypto ISA on aarch64
Needed for backporting dependent changes.

Extracted from:

  commit 1c48e3fbec
  Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
  Date:   Mon Oct 29 02:58:19 2018 +0000

    utils/crc: leverage arm64 crc extension
2018-12-08 13:42:43 +02:00
Avi Kivity
b9c046b17b Merge "Optimize checksum computation for the MC sstable format" from Tomek
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.

perf_checksum test was introduced to measure performance of various
checksumming operations.

Results for 514 B (relevant for writing with compression enabled):

    test                                      iterations      median         mad         min         max
    crc_test.perf_deflate_crc32_combine            58414    16.711us     3.483ns    16.708us    16.725us
    crc_test.perf_adler_combine                165788278     6.059ns     0.031ns     6.027ns     7.519ns
    crc_test.perf_zlib_crc32_combine               59546    16.767us    26.191ns    16.741us    16.801us
    ---
    crc_test.perf_deflate_crc32_checksum        12705072    83.267ns     4.580ns    78.687ns    98.964ns
    crc_test.perf_adler_checksum                 3918014   206.701ns    23.469ns   183.231ns   258.859ns
    crc_test.perf_zlib_crc32_checksum            2329682   428.787ns     0.085ns   428.702ns   510.085ns

Results for 64 KB (relevant for writing with compression disabled):

    test                                      iterations      median         mad         min         max
    crc_test.perf_deflate_crc32_combine            25364    38.393us    17.683ns    38.375us    38.545us
    crc_test.perf_adler_combine                169797143     5.842ns     0.009ns     5.833ns     6.901ns
    crc_test.perf_zlib_crc32_combine               26067    38.663us    95.094ns    38.546us    40.523us
    ---
    crc_test.perf_deflate_crc32_checksum          202821     4.937us    14.426ns     4.912us     5.093us
    crc_test.perf_adler_checksum                   44684    22.733us   206.263ns    22.492us    25.258us
    crc_test.perf_zlib_crc32_checksum              18839    53.049us    36.117ns    53.013us    53.274us

The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet, it delegates to zlib so it's as slow as the latter.

Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, which has combine() which is faster
than checksum().

SStable write performance was evaluated by running:

  perf_fast_forward --populate --data-directory /tmp/perf-mc \
     --rows=10000000 -c1 -m4G --datasets small-part

Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.

Before:

 [1] MC,lz4: 330'903
 [2] LA,lz4: 450'157
 [3] MC,checksum: 419'716
 [4] LA,checksum: 459'559

After:

 [1'] MC,lz4: 446'917 ([1] + 35%)
 [2'] LA,lz4: 456'046 ([2] + 1.3%)
 [3'] MC,checksum: 462'894 ([3] + 10%)
 [4'] LA,checksum: 467'508 ([4] + 1.7%)

After this series, the performance of the MC format writer is similar to that
of the LA format before the series.

There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"

* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
  tests: perf: Introduce perf_checksum
  tests: Add test for libdeflate CRC32 implementation
  sstables: compress: Use libdeflate for crc32
  sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
  licenses: Add libdeflate license
  Integrate libdeflate with the build system
  Add libdeflate submodule
  sstables: Avoid checksum_combine() for the crc32 checksummer
  sstables: compress: Avoid unnecessary checksum_combine()
  sstables: checksum_utils: Add missing include

(cherry picked from commit 5e759b0c07)
2018-12-08 13:42:43 +02:00
Avi Kivity
979cb636b8 Update seastar submodule
* seastar e64281d...1651a2a (1):
  > tests: perf: Make do_not_optimize() take the argument by const&
2018-12-08 13:42:43 +02:00
Botond Dénes
59cf9d9070 querier: fix evict_one() and evict_all_for_table()
Both of these have the same problem. They remove the to-be-evicted
entries from `_entries` but they don't unregister the `entry` from the
`read_concurrency_semaphore`. This results in the
`reader_concurrency_semaphore` being left with a dangling pointer to the
entries will trigger segfault when it tries to evict the associated
inactive reads.

Also add a unit test for `evict_all_for_table()` to check that it works
properly (`evict_one()` is only used in tests, so no dedicated test for
it).

Fixes: #3962

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <57001857e3791c6385721b624d33b667ccda2e7d.1544010868.git.bdenes@scylladb.com>
(cherry picked from commit 77dbc7d09a)
2018-12-06 11:38:44 +02:00
Duarte Nunes
c9ec9d4087 Merge seastar upstream
* seastar 880826e...e64281d (2):
  > core/semaphore: Change the access of semaphore_units main ctor
  > Merge "Add semaphore_units<>::split() function" from Duarte

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-05 20:25:17 +00:00
Gleb Natapov
2e8fefbc5a storage_proxy: store hint for CL=ANY if all nodes replied with failure
Current code assumes that request failed if all replicas replied with
failure, but this is not true for CL=ANY requests. Take it into account.

Fixed: #3565
(cherry picked from commit 17197fb005)
2018-12-05 20:14:58 +00:00
Gleb Natapov
6be0635029 storage_proxy: complete write request early if all replicas replied with success of failure
Currently if write request reaches CL and all replicas replied, but some
replied with failures, the request will wait for timeout to be retired.
Detect this case and retire request immediately instead.

Fixes #3566

(cherry picked from commit d1d04eae3c)
2018-12-05 20:14:57 +00:00
Gleb Natapov
04a544c0a2 storage_proxy: check that write failure response comes from recognized replica
Before accounting failure response we need to make sure it comes from a
replica that participates in the request.

(cherry picked from commit 76ab3d716b)
2018-12-05 20:14:57 +00:00
Gleb Natapov
028f9b95d1 storage_proxy: move code executed on write timeout into separate function
Currently the callback is in lambda, but we will want to call the code
not only during timer expiration.

(cherry picked from commit 7bc68aa0eb)
2018-12-05 20:14:57 +00:00
Avi Kivity
54258ca8eb Merge "db/hints: Use frozen_mutation in hinted handoff" from Duarte
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.

Tests: unit(release)
"

* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
  db/hints/manager: Use frozen_mutation instead of mutation
  db/hints/manager: Use database::find_schema()
  db/commitlog/commitlog_entry: Allow moving the contained mutation
  service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
  service/storage_proxy: Build a shared_mutation from a frozen_mutation
  service/storage_proxy: Lift frozen_mutation_and_schema
  service/storage_proxy: Allow non-const ranges in mutate_prepare()

(cherry picked from commit 1891779e64)
2018-12-05 20:14:57 +00:00
Gleb Natapov
c9a030f1f0 storage_proxy: count number of timed out write attempts after CL is reached
It is useful to have this counter to investigate the reason for read
repairs. Non zero value means that writes were lost after CL is reached
and RR is expected.

Message-Id: <20181009120900.GF22665@scylladb.com>
(cherry picked from commit 207b57a892)
2018-12-05 20:14:57 +00:00
Gleb Natapov
1c7daef554 storage_proxy: do not pass write_stats down to send_to_live_endpoints
write_stats is referenced from write handler which is available in
send_to_live_endpoints already. No need to pass it down.

Message-Id: <20181009133017.GA14449@scylladb.com>
(cherry picked from commit 319ece8180)
2018-12-05 20:14:57 +00:00
Duarte Nunes
f8195a77b0 db/view/view_builder: Don't timeout waiting for view to be built
Remove the timeout argument to
db::view::view_builder::wait_until_built(), a test-only function to
wait until a given materialized view has finished building.

This change is motivated by the fact that some tests running on slow
environments will timeout. Instead of incrementally increasing the
timeout, remove it completely since tests are already run under an
exterior timeout.

Fixes #3920

Tests: unit release(view_build_test, view_schema_test)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181115173902.19048-1-duarte@scylladb.com>
(cherry picked from commit 6fbf792777)
2018-12-05 19:20:36 +00:00
Duarte Nunes
5b724c80ab db/view: Don't copy keyspace name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181022104527.14555-1-duarte@scylladb.com>
(cherry picked from commit f3a5ec0fd9)
2018-12-05 19:19:26 +00:00
Nadav Har'El
4a7ae81b3f materialized views: update stats.write statistics in all cases
mutate_MV usually calls send_to_endpoint() to push view update to remote
view replicas. This function gets passed a statistics object,
service::storage_proxy_stats::write_stats and, in particular, updates
its "writes" statistic which counts the number of ongoing writes.

In the case that the paired view replica happens to be the *same* node,
we avoid calling send_to_endpoint() and call mutate_locally() instead.
That function does not take a write_stats object, so the "writes" statistic
doesn't get incremented for the duration of the write. So we should do
this explicitly.

Co-authored-by: Nadav Har'El <nyh@scylladb.com>
Co-authored-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 1d5f8d0015)
2018-12-05 19:19:26 +00:00
Piotr Sarna
3cf26a60a2 auth: add abort_source to waiting for schema agreement
When the auth service is requested to stop during bootstrap,
it might have still not reached schema agreement.
Currently, waiting for this agreement is done in an infinite loop,
without taking abort_source into account.
This patch introduces checking if abort was requested
and breaking the loop in such case, so auth service can terminate.

Tests:
unit (release)
dtest (bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test)
Message-Id: <1b7ded14b7c42254f02b5d2e10791eb767aae7fc.1543914769.git.sarna@scylladb.com>

(cherry picked from commit 7b0a3fbf8a)
2018-12-04 14:33:05 +00:00
Tomasz Grabiec
2103d0d52b sstables: Write Statistics.db offset map entries in the same order as Cassandra
Before this patch we were writing offset map enteies in unspecified
order, the one returned by std::unorderd_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.

Fixes #3955.

Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit aa19f98d18)
2018-12-04 14:30:19 +02:00
Avi Kivity
16ee3b3ebe Merge "Make inactive shard readers evictable" from Botond
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
  reads.
* Behave very badly when too many of them is running concurrently
  (trashing).
* May deadlock if enough of them is running without a timeout.

The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
  duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
  handle) as each scan now can make progress by evicting inactive shard
  readers belonging to other scans.
* Will not deadlock at all.

In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrecy_semaphore` to be notified of
  newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
  `foreign_reader`.

I can unbundle these mini series and send them separately, if the
maintainers so prefer, altough considering that this series will have to
be backported to 3.0, I think this present form is better.

Fixes: #3835
"

* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
  tests/multishard_mutation_query_test: test stateless query too
  tests/querier_cache: fail resource-based eviction test gracefully
  tests/querier_cache: simplify resource-based eviction test
  tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
  tests/mutation_reader_test: restore indentation
  tests/mutation_reader_test: enrich pause-related multishard reader test
  multishard_combining_reader: use pause-resume API
  query::partition_slice: add clear_ranges() method
  position_in_partition: add region() accessor
  foreign_reader: add pause-resume API
  tests/mutation_reader_test: implement the pause-resume API
  query_mutations_on_all_shards(): implement pause-resume API
  make_multishard_streaming_reader(): implement the pause-resume API
  database: add accessors for user and streaming concurrency semaphores
  reader_lifecycle_policy: extend with a pause-resume API
  query_mutations_on_all_shards(): restore indentation
  query_mutations_on_all_shards(): simplify the state-machine
  multishard_combining_reader: use the reader lifecycle policy
  multishard_combining_reader: add reader lifecycle policy
  multishard_combining_reader: drop unnecessary `reader_promise` member
  ...

(cherry picked from commit 414b14a6bd)
2018-12-04 12:13:13 +02:00
Duarte Nunes
b0a9c40ab1 service/storage_proxy: Consider target liveness in sent_to_endpoint()
So we don't attempt to send mutations to unreachable endpoints and
instead store a hint for them, we now check the endpoint status and
populate dead_endpoints accordingly in
storage_proxy::send_to_endpoint().

Fixes #3820

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181007100640.2182-1-duarte@scylladb.com>
(cherry picked from commit 30d6ed8f92)
2018-12-03 18:38:05 +00:00
Duarte Nunes
53924e5c7f service/storage_proxy: Fix formatting of send_to_endpoint()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181006204756.32232-1-duarte@scylladb.com>
(cherry picked from commit a69d468101)
2018-12-03 18:37:59 +00:00
Avi Kivity
befe0012f5 Merge "Fix multiple summary regeneration bugs." from Vladimir
"
This patchset addresses two recently discovered bugs both triggered by
summary regeneration:

Tests: unit {release}

+

Validated with debug build of Scylla (ASAN) that no use-after-free
occurs when re-generating Summary.db.
"

* 'projects/sstables-30/summary-regeneration/v1' of https://github.com/argenet/scylla:
  tests: Add test reading SSTables in 'mc' format with missing summary.
  sstables: When loading, read statistics before summary.
  database: Capture io_priority_class by reference to avoid dangling ref.

(cherry picked from commit 009cbd3dcb)
2018-12-02 13:32:09 +02:00
Duarte Nunes
1953c5fa61 Merge 'Fix filtering with LIMIT' from Piotr
"
This series adds proper handling of filtering queries with LIMIT.
Previously the limit was erroneously applied before filtering,
which leads to truncated results.

To avoid that, paged filtering queries now use an enhanced pager,
which remembers how many rows dropped and uses that information
to fetch for more pages if the limit is not yet reached.

For unpaged filtering queries, paging is done internally as in case
of aggregations to avoid returning keeping huge results in memory.

Also, previously, all limited queries used the page size counted
from max(page size, limit). It's not good for filtering,
because with LIMIT 1 we would then query for rows one-by-one.
To avoid that, filtered queries ask for the whole page and the results
are truncated if need be afterwards.

Tests: unit (release)
"

* 'fix_filtering_with_limit_2' of https://github.com/psarna/scylla:
  tests: add filtering with LIMIT test
  tests: split filtering tests from cql_query_test
  cql3: add proper handling of filtering with LIMIT
  service/pager: use dropped_rows to adjust how many rows to read
  service/pager: virtualize max_rows_to_fetch function
  cql3: add counting dropped rows in filtering pager

(cherry picked from commit 1afda28cf3)
2018-12-02 12:07:46 +02:00
Duarte Nunes
b72a94b53e Merge 'Fix checking if system tables need view updates' from Piotr
"
This miniseries ensures that system tables are not checked
for having view updates, because they never do.
What's more, distributed system table is used in the process,
so it's unsafe to query the table while streaming it.

Tests: unit (release), dtest(update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test)
"

* 'fix_checking_if_system_tables_need_view_updates_3' of https://github.com/psarna/scylla:
  streaming: don't check view building of system tables
  database: add is_internal_keyspace
  streaming: remove unused sstable_is_staging bool class

(cherry picked from commit d09d4bbd91)
2018-11-28 15:39:34 +00:00
Piotr Sarna
3f82b697f2 main: fix deinitialization order for view update generator
View update generator should be stopped only after
drain_on_shutdown() is performed on storage service.
Message-Id: <4d2bda4c73422a2ebf46d6dcd06c95d960839889.1543230849.git.sarna@scylladb.com>

(cherry picked from commit 6ab8235369)
2018-11-27 12:34:50 +00:00
Takuya ASADA
ee1ef853e5 dist/common/systemd/scylla-housekeeping-restart.service.mustache: specify correct repo for Debian variants
We do specify correct repo for both Red Hat/Debian variants on -deily, but
mistakenly don't for -restart, so do same on -restart.

Fixes #3906

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181109224509.27380-1-syuu@scylladb.com>
(cherry picked from commit 7740cd2142)
2018-11-27 09:59:05 +02:00
Raphael S. Carvalho
6e7e7f3822 sstables: deprecate sstable metadata's ancestors
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
(cherry picked from commit d29482dce8)
2018-11-24 12:36:40 +02:00
Paweł Dziepak
82a36edc9d Merge "Optimize sstable writing of the MC format" from Tomasz
"
Tested with perf_fast_forward from:

  github.com/tgrabiec/scylla.git perf_fast_forward-for-sst3-opt-write-v1

Using the following command line:

  build/release/tests/perf/perf_fast_forward_g --populate --sstable-format=mc \
     --data-directory /tmp/perf-mc --rows=10000000 -c1 -m4G \
     --datasets small-part

The average reported flush throughput was (stdev for the avergages is around 4k):
  - for mc before the series: 367848 frag/s
  - for lc before the series: 463458 frag/s (= mc.before +25%)
  - for mc after the series: 429276 frag/s (= mc.before +16%)
  - for lc after the series: 466495 frag/s (= mc.before +26%)

Refs #3874.
"

* tag 'sst3-opt-write-v2' of github.com:tgrabiec/scylla:
  sstables: mc: Avoid serialization of promoted index when empty
  sstables: mc: Avoid double serialization of rows
  tests: sstable 3.x: Do not compare Statistics component
  utils: Introduce memory_data_sink
  schema: Optimize column count getters
  sstables: checksummed_file_data_sink_impl: Bypass output_stream

(cherry picked from commit 4aa5d83590)
2018-11-24 12:36:40 +02:00
Avi Kivity
d4efa3c9b2 Update seastar submodule
* seastar d6647df...880826e (1):
  > fstream: Introduce make_file_data_sink()
2018-11-24 12:36:40 +02:00
Avi Kivity
324dae3e12 Merge "compress: Restore lz4 as default compressor" from Duarte
"
Enables sstable compression with LZ4 by default, which was the
long-time behavior until a regression turned off compression by
default.

Fixes #3926
"

* 'restore-default-compression/v2' of https://github.com/duarten/scylla:
  tests/cql_query_test: Assert default compression options
  compress: Restore lz4 as default compressor
  tests: Be explicit about absence of compression

(cherry picked from commit bb85a21a8f)
2018-11-21 16:45:22 +02:00
Tomasz Grabiec
c0ffc9a2b7 utils: phased_barrier: Make advance_and_await() have strong exception guarantees
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) beacuse _gate will be
disengaged.

One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.

This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>

Fixes #3931.

(cherry picked from commit 57e25fa0f8)
2018-11-21 12:17:27 +02:00
Glauber Costa
f81fa5f75c remove monitor if sstable write failed
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.

Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.

But the SSTables are not commited at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletions. At
that point we should propagate that information to the monitor as well,
but we don't.

Fixes #3732 (hopefully)
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
(cherry picked from commit 9f403334c8)
2018-11-20 19:27:54 +02:00
Glauber Costa
6fd1cfcfce sstables: correctly parse estimated histograms
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.

The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.

We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor member
for each element, but We don't really expect the integer specialization
to do any of that.

However, to avoid confusion with future developers that may feel tempted
to converted this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.

Fixes #3918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
(cherry picked from commit c6811bd877)
2018-11-17 17:20:00 +02:00
Nadav Har'El
9d458ffea9 Materialized Views and Secondary Index: no longer experimental
After this patch, the Materialized Views and Secondary Index features
are considered generally-available and no longer require passing an
explicit "--experimental=on" flag to Scylla.

The "--experimental=on" flag and the db::config::check_experimental()
function remain unused, as we graduated the only two features which used
this flag. However, we leave the support for experimental features in
the code, to make it easier to add new experimental features in the future.
Another reason to leave the command-line parameter behind is so existing
scripts that still use it will not break.

Fixes #3917

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115144456.25518-1-nyh@scylladb.com>
(cherry picked from commit 78ed7d6d0c)
2018-11-15 19:50:30 +02:00
Duarte Nunes
9776a048e7 Merge 'Generating view updates during streaming' from Piotr
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.

The design constraints are:
1) The streamed writes should be visible to new writes, but the
   sstable should not participate in compaction, or we would lose the
   ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
   updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.

We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We do these steps upon boot, by looking at the staging
directory, thus achieving 3).

Fixes #3275

* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
  tests: add materialized views test
  tests: add view update generator to cql test env
  main: add registering staging sstables read from disk
  database: add a check if loaded sstable is already staging
  database: add get_staging_sstable method
  streaming: stream tables with views through staging sstables
  streaming: add system distributed keyspace ref to streaming
  streaming: add view update generator reference to streaming
  main: add generating missed mv updates from staging sstables
  storage_service: move initializing sys_dist_ks before bootstrap
  db/view: add view_update_from_staging_generator service
  db/view: add view updating consumer
  table: add stream_view_replica_updates
  table: split push_view_replica_updates
  table: add as_mutation_source_excluding
  table: move push_view_replica_updates to table.cc
  database: add populating tables with staging sstables
  database: add creating /staging directory for sstables
  database: add sstable-excluding reader
  table: add move_sstable_from_staging_in_thread function
  ...

(cherry picked from commit a38f6078fb)
2018-11-15 17:46:20 +02:00
Asias He
10cf97375e streaming: Expose reason for streaming
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for. So the receiver can decide further operation,
e.g., send view updates, beyond applying the streaming data on disk.

Fixes #3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>

(cherry picked from commit 7f826d3343)
2018-11-15 17:45:31 +02:00
Paweł Dziepak
e6355a9a01 Merge "Write static rows for all partitions if there are static columns" from Vladimir
"
It appears that in case when there are any static columns in serialization header,
Cassandra would write a (possibly empty) static row to every partition
in the SSTables file.

This patchset alings Scylla's logic with that of Cassandra.

Note that Scylla optimizes the case when no partition contains a static
row because it keeps track of updated columns that Scylla currently does
not do - see #3901 for details.

Fixes #3900.
"

* 'projects/sstables-30/write-all-static-rows/v1' of https://github.com/argenet/scylla:
  tests: Test writing empty static rows for partitions in tables with static columns.
  sstables: Ignore empty static rows on reading.
  sstables: Write empty static rows when there are static columns in the table.

(cherry picked from commit 6469a1b451)
2018-11-12 15:59:35 -08:00
Raphael S. Carvalho
e57907a1d5 sstables: fix procedure to get fully expired sstables with MC format
MC format lacks ancestors metadata, so we need to workaround it by using
ancestors in metadata collector, which is only available for a sstable
written during this instance. It works fine here because we only want
to know if a sstable recently compacted has an ancestor which wasn't
yet deleted.

Fixes #3852.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20181102154951.22950-1-raphaelsc@scylladb.com>
(cherry picked from commit 1c5934c934)
2018-11-06 16:03:18 +02:00
Pekka Enberg
f94b46e7e0 docker: Switch to 3.0 RPM repository 2018-11-01 19:40:10 +02:00
Avi Kivity
6847c12668 Merge "dist: use perftune.py for disks tuning" from Vlad
"
Use perftune.py for tuning disks:
   - Distribute/pin disks' IRQs:
      - For NVMe drives: evenly among all present CPUs.
      - For non-NVMe drives: according to chosen tuning mode.
   - For all disks used by scylla:
      - Tune nomerges
      - Tune I/O scheduler.

It's important to tune NIC and disks together in order to keep IRQ
pinning in the same mode.

Disk are detected and tuned based on the current content of
/etc/scylla/scylla.yaml configuration file.
"

Fixes #3831.

* 'use_perftune_for_disks-v3' of https://github.com/vladzcloudius/scylla:
  dist: change the sysconfig parameter name to reflect the new semantics
  scylla_util.py::sysconfig_parser: introduce has_option()
  dist: scylla_setup and scylla_sysconfig_setup: change paremeters names to reflect new semantics
  dist: don't distribute posix_net_conf.sh any more
  dist: use perftune.py to tune disks and NIC

(cherry picked from commit f170e3e589)
2018-11-01 19:19:04 +02:00
Avi Kivity
80b86def1f Update seastar submodule
* seastar 0c8a2c8...d6647df (3):
  > scripts: perftune.py: properly merge parameters from the command line and the configuration file
  > scripts: perftune.py: prioritize I/O schedulers
  > Merge "scripts: perftune.py: support different I/O schedulers" from Vlad

Ref #3831.
2018-11-01 19:18:07 +02:00
Vlad Zolotarov
c6de9ea39b config: enable hinted handoff by default
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181019180401.12400-1-vladz@scylladb.com>
(cherry picked from commit 4d1bb719a4)
2018-11-01 10:41:44 +02:00
Avi Kivity
94bed81c1d Update seastar submodule
* seastar 39b89de...0c8a2c8 (1):
  > prometheus: Allow preemption between each metric

See scylladb/seastar#469.
2018-10-31 19:21:21 +02:00
Hagit Segev
0f3a21f0bb release: prepare for 3.0-rc1 2018-10-31 12:08:43 +02:00
Tomasz Grabiec
976db7e9e0 Merge "Proper support for static rows in SSTables 3.x" from Vladimir
This patchset addresses two issues with static rows support in SSTables
3.x. ('mc' format):

1. Since collections are allowed in static rows, we need to check for
complex deletion, set corresponding flag and write tombstones, if any.
2. Column indices need to be partitioned for static columns the same way
they are partitioned for regular ones.

 * github.com/argenet/scylla.git projects/sstables-30/columns-proper-order-followup/v1:
  sstables: Partition static columns by atomicity when reading/writing
    SSTables 3.x.
  sstables: Use std::reference_wrapper<> instead of a helper structure.
  sstables: Check for complex deletion when writing static rows.
  tests: Add/fix comments to
    test_write_interleaved_atomic_and_collection_columns.
  tests: Add test covering inverleaved atomic and collection cells in
    static row.

(cherry picked from commit 62c7685b0d)
2018-10-30 14:51:21 +01:00
Nadav Har'El
996b86b804 Materalized views: fix race condition in resharding while view building
When a node reshards (i.e., restarts with a different number of CPUs), and
is in the middle of building a view for a pre-existing table, the view
building needs to find the right token from which to start building on all
shards. We ran the same code on all shards, hoping they would all make
the same decision on which token to continue. But in some cases, one
shard might make the decision, start building, and make progress -
all before a second shard goes to make the decision, which will now
be different.

This resulted, in some rare cases, in the new materialized view missing
a few rows when the build was interrupted with a resharding.

The fix is to add the missing synchronization: All shards should make
the same decision on whether and how to reshard - and only then should
start building the view.

Fixes #3890
Fixes #3452

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181028140549.21200-1-nyh@scylladb.com>
(cherry picked from commit b8337f8c9d)
2018-10-29 09:52:25 +00:00
Avi Kivity
b7b217cc43 Merge "Re-order columns when reading/writing SSTables 3.x" from Vladimir
"
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
    - all atomic columns go first, then all multi-cell ones
    - columns of both types (atomic and multi-cell) are
      lexicographically ordered by name regarding each other

Scylla needs to store columns and their respective indices using the
same ordering as well as when reading them back.

Fixes #3853

Tests: unit {release}

+

Checked that the following SSTables are dumped fine using Cassandra's
sstabledump:

cqlsh:sst3> CREATE TABLE atomic_and_collection3 ( pk int, ck int, rc1 text, rc2 list<text>, rc3 text, rc4 list<text>, rc5 text, rc6 list<text>, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:sst3> INSERT INTO atomic_and_collection3 (pk, ck, rc1, rc4, rc5) VALUES (0, 0, 'hello', ['beautiful','world'], 'here');
<< flush >>

sstabledump:

[
  {
    "partition" : {
      "key" : [ "0" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 96,
        "clustering" : [ 0 ],
        "liveness_info" : { "tstamp" : "1540599270139464" },
        "cells" : [
          { "name" : "rc1", "value" : "hello" },
          { "name" : "rc5", "value" : "here" },
          { "name" : "rc4", "deletion_info" : { "marked_deleted" : "1540599270139463", "local_delete_time" : "1540599270" } },
          { "name" : "rc4", "path" : [ "45e22cb0-d97d-11e8-9f07-000000000000" ], "value" : "beautiful" },
          { "name" : "rc4", "path" : [ "45e22cb1-d97d-11e8-9f07-000000000000" ], "value" : "world" }
        ]
      }
    ]
  }
]
"

* 'projects/sstables-30/columns-proper-order/v1' of https://github.com/argenet/scylla:
  tests: Test interleaved atomic and multi-cell columns written to SSTables 3.x.
  sstables: Re-order columns (atomic first, then collections) for SSTables 3.x.
  sstables: Use a compound structure for storing information used for reading columns.

(cherry picked from commit 75dbff984c)
2018-10-28 15:51:47 +02:00
Tomasz Grabiec
c274430933 Merge "Properly write static rows missing columns for SSTables 3.x." from Vladimir
Before this fix, write_missing_columns() helper would always deal with
regular columns even when writing static rows.

This would cause errors on reading those files.

Now, the missing columns are written correctly for regular and static
rows alike.

* github.com/argenet/scylla.git projects/sstables-30/fix-writing-static-missing-columns/v1:
  schema: Add helper method returning the count of columns of specified
    kind.
  sstables: Honour the column kind when writing missing columns in 'mc'
    format.
  tests: Add test for a static row with missing columns (SStables 3.x.).

(cherry picked from commit cf2d5c19fb)
2018-10-26 13:30:12 +03:00
Avi Kivity
893a18a7c4 Merge "Properly writing/reading shadowable deletions with SSTables 3.x." from Vladimir
"
This patchset adddresses two problems with shadowable deletions handling
in SSTables 3.x. ('mc' format).

Firstly, we previously did not set a flag indicating the presence of
extended flags byte with HAS_SHADOWABLE_DELETION bitmask on writing.
This would break subsequent reading and cause all types of failures up
to crash.

Secondly, when reading rows with this extended flag set, we need to
preserve that information and create a shadowable_tombstone for the row.

Tests: unit {release}
+

Verified manually with 'hexdump' and using modified 'sstabledump' that
second (shadowable) tombstone is written for MV tables by Scylla.

+
DTest (materialized_views_test.py:TestMaterializedViews.hundred_mv_concurrent_test)
that originally failed due to this issue has successfully passed locally.
"

* 'projects/sstables-30/shadowable-deletion/v4' of https://github.com/argenet/scylla:
  tests: Add tests writing both regular and shadowable tombstones to SSTables 3.x.
  tests: Add test covering writing and reading a shadowable tombstone with SSTables 3.x.
  sstables: Support Scylla-specific extension for writing shadowable tombstones.
  sstables: Introduce a feature for shadowable tombstones in Scylla.db.
  memtable: Track regular and shadowable tombstones separately in encoding_stats_collector.
  sstables: Error out when reading SSTables 3.x with Cassandra shadowable deletion.
  sstables: Support checking row extension flags for Cassandra shadowable deletion.

(cherry picked from commit 8210f4c982)
2018-10-24 19:32:57 +03:00
Tomasz Grabiec
39b39058fc sstable_mutation_reader: Do not read partition index when scanning
Even when we're using a full clustering range, need_skip() will return
true when we start a new partition and advance_context() will be
called with position_in_partition::before_all_clustered_rows(). We
should detect that there is no need to skip to that position before
the call to advance_to(*_current_partition_key), which will read the
index page.

Fixes #3868.

Message-Id: <1539881775-8578-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 9e756d3863)
2018-10-24 19:32:40 +03:00
Avi Kivity
6bf4a73d88 thrift: limit message size
Limit message size according to the configuration, to avoid a huge message from
allocating all of the server's memory.

We also need to limit memory used in aggregate by thrift, but that is left to
another patch.

Fixes #3878.
Message-Id: <20181024081042.13067-1-avi@scylladb.com>

(cherry picked from commit a9836ad758)
2018-10-24 19:32:25 +03:00
Gleb Natapov
ca4846dd63 stream_session: remove unused capture
'Consumer function' parameter for distribute_reader_and_consume_on_shards()
captures schema_ptr (which is a seastar::shared_ptr), but the function
is later copied on another shard at which point schema_ptr is also copied
and its counter is incremented by the wrong shard. The capture is not
even used, so lets just drop it.

Fixes #3838

Message-Id: <20181011075500.GN14449@scylladb.com>
(cherry picked from commit ceb361544a)
2018-10-24 09:47:02 +03:00
Takuya ASADA
2663ff7bc1 dist/common/sysctl.d: add new conf file to set fs.aio-max-nr
We need raise fs.aio-max-nr to larger value since Seastar may allocates
more then 65535 AIO events (= kernel default value)

Fixes #3842

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181023030449.15445-1-syuu@scylladb.com>
(cherry picked from commit 950dbdb466)
2018-10-24 09:45:51 +03:00
Tomasz Grabiec
043a575fcd Merge "Correctly handle dropped columns in SSTable 3" from Piotr J.
Previously we were making assumptions about missing columns
(the size of its value, whether it's a collection or a counter) but
they didn't have to be always true. Now we're using column type
from serialization header to use the right values.

Fixes #3859

* seastar-dev.git haaawk/projects/sstables-30/handling-dropped-columns/v4:
  sstables 3: Correctly handle dropped columns in column_translation
  sstables 3: Add test for dropped columns handling

(cherry picked from commit fc37b80d24)
2018-10-24 09:45:25 +03:00
Vlad Zolotarov
00dc400993 storage_proxy::query_result_local: create a single tracing span on a replica shard
Every call of a tracing::global_trace_state_ptr object instead of a
tracing::tracing_state_ptr or a call to tracing::global_trace_state_ptr::get()
creates a new tracing session (span) object.

This should never be done unless query handling moves to a different shard.

Fixes #3862

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181018003500.10030-1-vladz@scylladb.com>
(cherry picked from commit a87c11bad2)
2018-10-24 09:45:14 +03:00
Duarte Nunes
522a48a244 Merge 'Fix for a select statement with filtered columns' from Eliran
"
This patchset fixes #3803. When a select statement with filtering
is executed and the column that is needed for the filtering is not
present in the select clause, rows that should have been filtered out
according to this column will still be present in the result set.

Tests:
 1. The testcase from the issue.
 2. Unit tests (release) including the
 newly added test from this patchset.
"

* 'issues/3803/v10' of https://github.com/eliransin/scylla:
  unit test: add test for filtering queries without the filtered column
  cql3 unit test: add assertion for the number of serialized columns
  cql3: ensure retrieval of columns for filtering
  cql3: refactor find_idx to be part of statement restrictions object
  cql3: add prefix size common functionality to all clustering restrictions
  cql3: rename selection metadata manipulation functions

(cherry picked from commit 3fe92663d4)
2018-10-24 09:44:46 +03:00
Paweł Dziepak
5faa28ce45 cql3: restore original timeout behaviour for aggregate queries
Commit 1d34ef38a8 "cql3: make pagers use
time_point instead of duration" has unintentionally altered the timeout
semantics for aggregate queries. Such requests fetch multiple pages before
sending a response to the client. Originally, each of those fetches had
a timeout-duration to finish, after the problematic commit the whole
request needs to complete in a single timeout-duration. This,
unsurprisingly, makes some queries that were successful before fail with
a timeout. This patch restores the original behaviour.

Fixes #3877.

Message-Id: <20181022125318.4384-1-pdziepak@scylladb.com>
(cherry picked from commit c94d2b6aa6)
2018-10-24 09:43:59 +03:00
Avi Kivity
52be02558e config: mark range_request_timeout_in_ms and request_timeout_in_ms as Used
This makes them available in scylla --help.

Fixes #3884.
Message-Id: <20181023101150.29856-1-avi@scylladb.com>

(cherry picked from commit d9e0ea6bb0)
2018-10-24 09:43:54 +03:00
Avi Kivity
a7cbfbe63f Merge "hinted handoff: give a sender a low priority" from Vlad
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.

In order to achieve this put its sending in the STEAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.

Fixes #3817
"

* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
  db::hints::manager: use "streaming" I/O scheduling class for reads
  commitlog::read_log_file(): set the a read I/O priority class explicitly
  db::hints::manager: add hints sender to the "streaming" CPU scheduling group

(cherry picked from commit 1533487ba8)
2018-10-24 09:43:39 +03:00
Duarte Nunes
28fd2044d2 Merge 'hinted handoff: add manager::state and split storing and replaying enablement' from Vlad
"
Refs #3828
(Probably fixes it)

We found a few flaws in a way we enable hints replaying.
First of all it was allowed before manager::start() is complete.
Then, since manager::start() is called after messaging_service is
initialized there was a time window when hints are rejected and this
creates an issue for MV.

Both issues above were found in the context of #3828.

This series fixes them both.

Tested {release}:
dtest: materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test
dtest: hintedhandoff_additional_test.py
"

* 'hinted_handoff_dont_create_hints_until_started-v1' of https://github.com/vladzcloudius/scylla:
  hinted handoff: enable storing hints before starting messaging_service
  db::hints::manager: add a "started" state
  db::hints::manager: introduce a _state

(cherry picked from commit 3a53b3cebc)
2018-10-24 09:43:03 +03:00
Calle Wilund
76ff2e5c3d messaging_service: Make rpc streaming sink respect tls connection
Fixes #3787

Message service streaming sink was created using direct call to
rpc::client::make_sink. This in turn needs a new socker, which it
creates completely ignoring what underlying transport is active for the
client in question.

Fix by retaining the tls credential pointer in the client wrapper, and
using this in a sink method to determine whether to create a new tls
socker, or just go ahead with a plain one.

Message-Id: <20181010003249.30526-1-calle@scylladb.com>
(cherry picked from commit 3cb50c861d)
2018-10-23 07:36:21 +00:00
Avi Kivity
7b34d54a96 locator: fix abstract_replication_strategy::get_ranges() and friends violating sort order
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). The broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.

Fixes #3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>

(cherry picked from commit 1ce52d5432)
2018-10-23 07:36:21 +00:00
Duarte Nunes
26c31f6798 Merge "db/hints: Expose current backlog" from Duarte
"
Hints are stored on disk by a hints::manager, ensuring they are
eventually sent. A hints::resource_manager ensures the hints::managers
it tracks don't consume more than their allocated resources by
monitoring disk space and disabling new hints if needed. This series
fixes some bugs related to the backlog calculation, but mainly exposes
the backlog through a hints::manager so upper layers can apply flow
control.

Refs #2538
"

* 'hh-manager-backlog/v3' of https://github.com/duarten/scylla:
  db/hints/manager: Expose current backlog
  db/hints/manager: Move decision about blocking hints to the manager
  db/hints/resource_manager: Correctly account resources in space_watchdog
  db/hints/resource_manager: Replace timer with seastar::thread
  db/hints/resource_manager: Ensure managers are correctly registered
  db/hints/resource_manager: Fix formatting
  db/hints: Disallow moving or copying the managers
2018-10-23 07:36:21 +00:00
Glauber Costa
28fa66591a sstables: print sstable path in case of an exception
Without that, we don't know where to look for the problems

Before:

 compaction failed: sstables::malformed_sstable_exception (Too big ttl: 3163676957)

After:

 compaction_manager - compaction failed: sstables::malformed_sstable_exception (Too big ttl: 4294967295 in sstable /var/lib/scylla/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/mc-832-big-Data.db)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181016181004.17838-1-glauber@scylladb.com>
(cherry picked from commit 7edae5421d)
2018-10-23 07:36:14 +00:00
Piotr Sarna
0fee1d9e43 cql3: add asking for pk/ck in the base query
Base query partition and clustering keys are used to generate
paging state for an index query, so they always need to be present
when a paged base query is processed.
Message-Id: <f3bf69453a6fd2bc842c8bdbd602d62c91cf9218.1538568953.git.sarna@scylladb.com>

Fixes #3855.
(cherry picked from commit 4a23297117)
2018-10-16 19:59:42 +03:00
Piotr Sarna
76e72e28f4 cql3: add checking for may_need_paging when executing base query
It's not sufficient to check for positive page_size when preparing
a base query for indexed select statement - may_need_paging() should
be called as well.
Message-Id: <d435820019e4082a64ca9807541f0c9ad334e6a8.1538568953.git.sarna@scylladb.com>

(cherry picked from commit 50d3de0693)
2018-10-16 19:58:58 +03:00
Piotr Sarna
f969e80965 cql3: move base query command creation to a separate function
Message-Id: <6b48b8cbd6312da4a17bfd3c85af628b4215e9f4.1538568953.git.sarna@scylladb.com>
(cherry picked from commit 11b8831c04)
2018-10-16 19:58:56 +03:00
Vladimir Krivopalov
2029134063 sstables: Reset opened range tombstone when moving to another partition.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <f6dc6b0bd88ca44f2ef84c2a8bee43fde82c89cc.1539396572.git.vladimir@scylladb.com>
(cherry picked from commit 092276b13d)
2018-10-15 13:26:22 +03:00
Vladimir Krivopalov
f30fe7bd17 sstables: Factor out code resetting values for a new partition.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83a3a4ce6942b036be447bcfeb66142828e75293.1539396572.git.vladimir@scylladb.com>
(cherry picked from commit 926b6430fd)
2018-10-15 13:26:20 +03:00
Piotr Sarna
aeb418af9e service/pager: avoid dereferencing null partition key
The pager::state() function returns a valid paging object even
if the pager itself is exhausted. It may also not contain the partition
key, so using it unconditionally was a bug - now, in case there is no
partition key present, paging state will contain an empty partition key.

Fixes #3829

Message-Id: <28401eb21ab8f12645c0a33d9e92ada9de83e96b.1539074813.git.sarna@scylladb.com>
(cherry picked from commit b3685342a6)
2018-10-15 12:47:25 +03:00
Glauber Costa
714e6d741f api: use longs instead of ints for snapshot sizes
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.

Without this patch, listsnapshots is broken in all versions.

Fixes: #3845

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
(cherry picked from commit 98332de268)
2018-10-12 22:01:59 +03:00
Tomasz Grabiec
95c5872450 Merge "Enable sstable_mutation_test with SSTables 3.x." from Vladimir
Introduce uppermost_bound() method instead of upper_bound() in mutation_fragment_filter and clustering_ranges_walker.

For now, this has been only used to produce the final range tombstone
for sliced reads inside consume_partition_end().

Usage of the upper bound of the current range causes problems of two
kinds:
    1. If not all the slicing ranges have been traversed with the
    clustering range walker, which is possible when the last read
    mutation fragment was before some of the ranges and reading was limited
    to a specific range of positions taken from index, the emitted range
    tombstone will not cover the untraversed slices.

    2. At the same time, if all ranges have been walked past, the end
    bound is set to after_all_clustered_rows and the emitted RT may span
    more data than it should.

To avoid both situations, the uppermost bound is used instead, which
refers to the upper bound of the last range in the sequence.

* github.com/scylladb/seastar-dev.git haaawk/projects/sstables-30/enable-mc-with-sstable-mutation-test/v2
  sstables: Use uppermost_bound() instead of upper_bound() in
    mutation_fragment_filter.
  tests: Enable sstable_mutation_test for SSTables 'mc' format.

Rebased by Piotr J.

(cherry picked from commit b89556512a)
2018-10-12 17:46:49 +03:00
Tomasz Grabiec
87f8968553 Merge "Make SST3 pass test_clustering_slices test" from Piotr
* seastar-dev.git haaawk/sst3/test_clustering_slices/v8:
  sstables: Extract on_end_of_stream from consume_partition_end
  sstables: Don't call consume_range_tombstone_end in
    consume_partition_end
  sstables: Change the way fragments are returned from consumer

(cherry picked from commit 193efef950)
2018-10-12 17:46:46 +03:00
Tomasz Grabiec
2895428d44 Merge "Handle dead row markers when writing to SSTables 3.x" from Vladimir
There is a mismatch between row markers used in SSTables 2.x (ka/la) and
liveness_info used by SSTables 3.x (mc) in that a row marker can be
written as a deleted cell but liveness_info cannot.

To handle this, for a dead row marker the corresponding liveness_info is
written as expiring liveness_info with a fake TTL set to 1.
This approach is adapted from the solution for CASSANDRA-13395 that
exercised similar issue during SSTables upgrades.

* github.com/argenet/scylla.git projects/sstables-30/dead-row-marker/v7:
  sstables: Introduce TTL limitation and special 'expired TTL' value.
  sstables: Write dead row marker as expired liveness info.
  tests: Add test covering dead row marker writing to SSTables 3.x.

(cherry picked from commit a7a14e3af2)
2018-10-11 15:03:58 +03:00
Botond Dénes
e18f182cfc multishard_mutation_query(): don't attempt to stop broken readers
Currently, when stopping a reader fails, it simply won't be attempted to
be saved, and it will be left in the `_readers` array as-is. This can
lead to an assertion failure as the reader state will contain futures
that were already waited upon, and that the cleanup code will attempt to
wait on again. To prevent this, when stopping a reader fails, reset it
to nonexistent state, so that the cleanup code doesn't attempt to do
anything with it.

Refs: #3830

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a1afc1d3d74f196b772e6c218999c57c15ca05be.1539088164.git.bdenes@scylladb.com>
(cherry picked from commit d467b518bc)
2018-10-10 10:12:00 +03:00
Vladimir Krivopalov
cf8cdbf87d sstables: Add missing 'mc' format into format strings map in sstable::filename().
Fixes #3832.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <269421fb2ac8ab389231cbe9ed501da7e7ff936a.1539048008.git.vladimir@scylladb.com>
(cherry picked from commit e9aba6a9c3)
2018-10-10 05:53:28 +03:00
Avi Kivity
eb2814067d Update seastar submodule
* seastar 5712816...39b89de (1):
  > prometheus: Fix histogram text representation

Fixes #3827.
2018-10-09 16:35:54 +03:00
Avi Kivity
0c722d4547 Point seastar submodule at scylla-seastar.git
This allows us to freeze this branch's Seastar and only backport selected fixes.
2018-10-09 16:29:53 +03:00
Nadav Har'El
54cf463430 materialized views: refuse to filter by non-key column
A materialized views can provide a filter so as to pick up only a subset
of the rows from the base table. Usually, the filter operates on columns
from the base table's primary key. If we use a filter on regular (non-key)
columns, things get hairy, and as issue #3430 showed, wrong: merely updating
this column in the base table may require us to delete, or resurrect, the
view row. But normally we need to do the above when the "new view key column"
was updated, when there is one. We use shadowable tombstones with one
timestamp to do this, so it cannot take into account the two timestamp from
those two columns (the filtered column and the new key column).

So in the current code, filtering by a non-key column does not work correctly.
In this patch we provide two test cases (one involving TTLs, and one involves
only normal updates), which demonstrate vividly that it does *not* work
correctly. With normal updates, trying to resurect a view row that has
previously disappeared, fails. With TTLs, things are even worse, and the view
row fails to disappear when the filtered column is TTLed.

In Cassandra, the same thing doesn't work correctly as well (see
CASSANDRA-13798 and CASSANDRA-13832) so they decided to refuse creating
a materialized view filtering a non-key column. In this patch we also
do this - fail the creation of such an unsupported view. For this reason,
the two tests mentioned above are commented out in a "#if", with, instead,
a trivial test verifying a failure to create such a view.

Note that as explained above, when the filtered column and new view key
column are *different* we have a problem. But when they are the *same* - namely
we filter by a non-key base column which actually *is* a key in the view -
we are actually fine. This patch includes additional test cases verifying
that this case is really fine and provides correct results. Accordingly,
this case is *not* forbidden in the view creation code.

Fixes #3430.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181008185633.24616-1-nyh@scylladb.com>
(cherry picked from commit b8668dc0f8)
2018-10-09 10:18:58 +01:00
Nadav Har'El
d2a0622edd materialized views: enable two tests in view_schema_test
We had two commented out tests based on Cassandra's MV unit tests, for
the case that the view's filter (the "SELECT" clause used to define the
view) filtered by a non-primary-key column. These tests used to fail
because of problems we had in the filtering code, but they now succeed,
so we can enable them. This patch also adds some comments about what
the tests do, and adds a few more cases to one of the tests.

Refs #3430.

However, note that the success of these tests does not really prove that
the non-PK-column filtering feature works fully correctly and that issue
forbidding it, as explained in
https://issues.apache.org/jira/browse/CASSANDRA-13798. We can probably
fix this feature with our "virtual cells" mechanism, but will need to add
a test to confirm the possible problem and its (probably needed fix).
We do not add such a test in this patch.

In the meantime, issue #3430 should remain open: we still *allow* users
to create MV with such a filter, and, as the tests in this patch show,
this "mostly" works correctly. We just need to prove and/or fix what happens
with the complex row liveness issues a la issue #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181004213637.32330-1-nyh@scylladb.com>
(cherry picked from commit e4ef7fc40a)
2018-10-09 10:18:54 +01:00
Duarte Nunes
60edaec757 Merge 'Fix issues with endpoint state replication to other shards' from Tomasz
Fixes #3798
Fixes #3694

Tests:

  unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)

* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
  gms/gossiper: Replicate enpoint states in add_saved_endpoint()
  gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
  gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
  gms/gossiper: Always override states from older generations

(cherry picked from commit 48ebe6552c)
2018-10-09 10:14:30 +03:00
Avi Kivity
5802532cb3 Merge "Fix mutation fragments clobbering on fast_forward" from Vladimir
"
This patchset fixes a bug in SSTables 3.x reading when fast-forwarding
is enabled. It is possible that a mutation fragment, row or RT marker,
is read and then stored because it falls outside the current
fast-forwarding range.

If the reader is further fast-forwarded but the
row still falls outside of it, the reader would still continue reading
and get the next fragment, if any, that would clobber the currently
stored one. With this fix, the reader does not attempt to read on
after storing the current fragment.

Tests: unit {release}
"

* 'projects/sstables-30/row-skipped-on-double-ff/v2' of https://github.com/argenet/scylla:
  tests: Add test for reading rows after multiple fast-forwarding with SSTables 3.x.
  sstables: mp_row_consumer_m to notify reader on end of stream when storing a mutation fragment.
  sstables: In mp_row_consumer_m::push_mutation_fragments(), return the called helper's value.

(cherry picked from commit 0fa60660b8)
2018-10-09 09:35:51 +03:00
Eliran Sinvani
83ea91055e cql3 : add workaround to antlr3 null dereference bug
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs the CqlParser throws an exception which in
turn processesed for some special cases in scylla to generate a custom
message. The default case however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside scylla
in antlr headers that resides on the host machine os.

Tested manualy 2 testcases, a typo causing scylla to crash and
a cql comment without a newline at the end also caused scylla to crash.
Ran unit tests (release).

Fixes #3740
Fixes #3764

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
(cherry picked from commit 20f49566a2)
2018-10-04 14:07:25 +03:00
Piotr Sarna
e7863d3d54 tests: add missing get() calls in threaded context
One test case missed a few get() calls in order to wait
for continuations, which only accidentally worked,
because it was followed by 'eventually()' blocks.
Message-Id: <69c145575ac81154c4b5f500d01c6b045a267088.1536839959.git.sarna@scylladb.com>

(cherry picked from commit a5570cb288)
2018-10-04 14:06:50 +03:00
Piotr Sarna
57f124b905 tests: add collections test for secondary indexing
Test case regarding creating indexes on collection columns
is added to the suite.

Refs #3654
Refs #2962
Message-Id: <1b6844634b6e9a353028545813571647c92fb330.1536839959.git.sarna@scylladb.com>

(cherry picked from commit 8a2abd45fb)
2018-10-04 14:06:48 +03:00
Piotr Sarna
40d8de5784 cql3: prevent creation of indexes on non-frozen collections
Until indexes for non-frozen collections is implemented,
creating such indexes should be disallowed to prevent unnecessary
errors on insertions/selections.

Fixes #3653
Refs #2962
Message-Id: <218cf96d5e38340806fb9446b8282d2296ba5f43.1536839959.git.sarna@scylladb.com>

(cherry picked from commit 2d355bdf47)
2018-10-04 14:06:47 +03:00
Avi Kivity
1468ec62de Merge "Handle simple column type schema changes in SST3" from Piotr
"
This patchset enables very simple column type conversions.
It covers only handling variable and fixed size type differences.
Two types still have to be compatiple on bits level to be able to convert a field from one to the other.
"

* 'haaawk/sst3/column_type_schema_change/v4' of github.com:scylladb/seastar-dev:
  Fix check_multi_schema to actually check the column type change
  Handle very basic column type conversions in SST3
  Enable check_multi_schema for SST3

(cherry picked from commit b9702222f8)
2018-10-03 17:44:26 +03:00
Avi Kivity
c6ef56ae1e Revert "compaction: demote compaction start/end messages to DEBUG level"
This reverts commit b443a9b930. The compaction
history table doesn't have enough information to be a replacement for this
log message yet.

(cherry picked from commit 7c8143c3c4)
2018-10-03 17:44:21 +03:00
Avi Kivity
ad62313b86 utils: crc32: mark power crc32 assembly as not requiring an executable stack
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).

However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.

Fix by adding the correct incantation to the file.

Fixes #3799.

Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
(cherry picked from commit aaab8a3f46)
2018-10-02 23:22:56 +03:00
Avi Kivity
de87f798e1 release: prepare for 3.0-rc0 2018-10-02 12:00:50 +03:00
Calle Wilund
2996b8154f storage_proxy: Add missing re-throw in truncate_blocking
Iff truncation times out, we want to log it, but the exception should
not be swallowed, but re-thrown.

Fixes #3796.

Message-Id: <20181001112325.17809-1-calle@scylladb.com>
2018-10-01 19:07:04 +02:00
Paweł Dziepak
ad4a50dab6 Merge "multi range reader: add support for range generating functor" from Botond
"
This series adds support for range generator functors to multi range
reader. A range generator functor can lazily generate an uknown amount
of ranges on-the-fly for the reader to read.
The range generator support was added by refactoring
`flat_multi_range_mutation_reader` to work in terms of a generator
functor. The existing overload taking a `dht::partition_range_vector`
is adapted to the generator interface behind the scenes.
"

* 'multi-range-reader-generator/v9' of https://github.com/denesb/scylla:
  tests/flat_mutation_reader_test: extend multi-range reader tests
  make_flat_multi_range_reader: add documentation
  make_flat_multi_range_reader: add generator overload
  flat_multi_range_reader: refactor to work in terms of generator
  make_flat_multi_range_reader(): better handle the 0 range case
  flat_mutation_reader: add move_buffer_content_to()
  flat_multi_range_mutation_reader: drop fwd_mr ctor parameter
2018-10-01 12:53:31 +01:00
Duarte Nunes
e6630c627b Merge 'Add secondary index paging' from Piotr
"
Indexed select statement consists of two queries - the view query
used to extract base keys and the base query that uses those keys
to return base rows.
The main idea of this series is to replace raw proxy.query() call
during the view query to one that uses a pager.
Additionally, paging info from the view query needs to be returned
to the client, in order to be used later for requesting new pages.
"

* 'paging_indexes_7' of https://github.com/psarna/scylla:
  tests: add test for secondary index with paging
  cql3: remove execute(primary_keys) from select statement
  cql3: add incremental base queries to index query
  storage_proxy: make get_restricted_ranges public
  cql3: add base query handling function to indexed statement
  cql3: add generating base key from index keys
  cql3: add paging state generation function
  cql3: move getting index view schema to prepare stage
  pager: make state() defined for exhausted pagers
  cql3: add maybe_set_paging_state function
  cql3: rename set_has_more_pages to set_paging_state
  pager: add setters for partition/clustering keys
  cql3: add paging to read_posting_list
  cql3: add non-const get_result_metadata method
  cql3: make find_index_* functions return paging state
  cql3: make read_posting_list return future<rows>
  cql3: make pagers use time_point instead of duration
2018-10-01 10:42:21 +01:00
Avi Kivity
900ffad979 config: re-add murmur3_ignore_msb_bits to scylla.yaml
Commit d6b0c4dda4 changed the built-in default
murmur3_ignore_msb_bits to 12 (from 0) and removed the scylla.yaml default.

Removal of the scylla.yaml default was a mistake for two reasons:
 - if someone downgrades a cluster, keeping scylla.yaml derived from the
   master branch, they will experience resharding since the built-in default,
   which has changed, will take effect. While that scenario is not supported,
   it already happened and caused much consternation.
 - if, in the future, we wish to change the default, we will cause resharding
   again. Embedding the default in scylla.yaml allows us to change the default
   for new clusters while allowing upgraded clusters to retain older values.

Therefore, this patch restores murmur3_ignore_msb_bits in scylla.yaml. Future
changes to the configuration item should change both scylla.yaml and the
built-in default.

Message-Id: <20180930090053.21136-1-avi@scylladb.com>
2018-10-01 10:01:36 +03:00
Takuya ASADA
0a471c32cb dist/ami/files/scylla_install_ami: enable ssh_deletekeys
For some reason upstream AMI is disabling 'ssh_deletekeys' feature on
cloud-init, but generating SSH host keys should important for public AMI
images, so enable it again.

See: https://cloudinit.readthedocs.io/en/latest/topics/modules.html?highlight=ssh_deletekeys#ssh

Fixes scylladb/scylla-ami#31

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180927122816.27809-1-syuu@scylladb.com>
2018-09-30 16:29:46 +03:00
Paweł Dziepak
2bcaf4309e utils/reusable_buffer: do not warn about large allocations
Reusable buffers are meant to be used when protocol or third-party
library limiations force us to allocate large contiguous buffers. There
isn't much that can be done about this so there is little point in
warning about that.

Fixes #3788.
Message-Id: <20180928085141.6469-1-pdziepak@scylladb.com>
2018-09-30 11:12:23 +03:00
Asias He
91dae0149d token_metadata: Invalidate cached ring in update_normal_tokens
In commit 4a0b561376, "storage_service:
Get rid of moving operation", we removed remove_from_moving() in
update_normal_tokens(). However, remove_from_moving() calls
invalidate_cached_rings(). We should call invalidate_cached_rings() in
update_normal_tokens(), otherwise we will get wrong token range to
address map in the token_metadata cache.

This issue exists in master only. It is not in any of the releases.

Message-Id: <c03f2ed478cfdb84494f36dce9a8cfc05ed9e0cd.1538288364.git.asias@scylladb.com>
2018-09-30 11:06:46 +03:00
Alexys Jacob
6d6764133b dist/common/scripts: coding style fixes
dist/common/scripts/scylla_blocktune.py:24:10: E401 multiple imports on one line
dist/common/scripts/scylla_blocktune.py:27:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:35:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:48:1: E305 expected 2 blank lines after class or function definition, found 1
dist/common/scripts/scylla_blocktune.py:52:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:59:5: E306 expected 1 blank line before a nested definition, found 0
dist/common/scripts/scylla_blocktune.py:74:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:81:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:87:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_config_get.py:26:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_config_get.py:43:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_config_get.py:53:1: E305 expected 2 blank lines after class or function definition, found 1
dist/common/scripts/scylla_util.py:18:22: E401 multiple imports on one line
dist/common/scripts/scylla_util.py:19:22: E401 multiple imports on one line
dist/common/scripts/scylla_util.py:24:1: F401 'string' imported but unused
dist/common/scripts/scylla_util.py:32:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:50:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:61:30: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:75:53: E703 statement ends with a semicolon
dist/common/scripts/scylla_util.py:79:32: E272 multiple spaces before keyword
dist/common/scripts/scylla_util.py:80:25: E703 statement ends with a semicolon
dist/common/scripts/scylla_util.py:85:32: E201 whitespace after '['
dist/common/scripts/scylla_util.py:85:51: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:130:34: E201 whitespace after '['
dist/common/scripts/scylla_util.py:130:65: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:170:1: E266 too many leading '#' for block comment
dist/common/scripts/scylla_util.py:172:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:174:10: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:178:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:181:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:184:17: E201 whitespace after '['
dist/common/scripts/scylla_util.py:184:50: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:186:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:193:16: E201 whitespace after '['
dist/common/scripts/scylla_util.py:193:76: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:195:18: E201 whitespace after '{'
dist/common/scripts/scylla_util.py:195:27: E203 whitespace before ':'
dist/common/scripts/scylla_util.py:195:41: E203 whitespace before ':'
dist/common/scripts/scylla_util.py:195:48: E202 whitespace before '}'
dist/common/scripts/scylla_util.py:203:25: E201 whitespace after '['
dist/common/scripts/scylla_util.py:203:54: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:204:76: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:208:27: E703 statement ends with a semicolon
dist/common/scripts/scylla_util.py:217:27: E201 whitespace after '['
dist/common/scripts/scylla_util.py:217:62: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:238:25: E201 whitespace after '['
dist/common/scripts/scylla_util.py:238:87: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:257:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:258:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:259:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:268:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:277:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:280:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:283:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:286:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:297:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:302:5: E722 do not use bare except'
dist/common/scripts/scylla_util.py:305:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:325:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:329:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:335:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:338:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:341:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:343:81: E231 missing whitespace after ','
dist/common/scripts/scylla_util.py:352:1: E305 expected 2 blank lines after class or function definition, found 1
dist/common/scripts/scylla_util.py:352:21: E231 missing whitespace after ':'
dist/common/scripts/scylla_util.py:352:41: E231 missing whitespace after ':'
dist/common/scripts/scylla_util.py:352:65: E231 missing whitespace after ':'
dist/common/scripts/scylla_util.py:353:1: E302 expected 2 blank lines, found 0
dist/common/scripts/scylla_util.py:358:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:360:22: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:365:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:367:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:370:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:373:15: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:374:14: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:375:14: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:376:20: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:385:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:388:9: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:389:9: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:393:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:396:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:399:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:432:1: E302 expected 2 blank lines, found 1

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180918213707.6069-1-ultrabug@gentoo.org>
2018-09-30 11:00:37 +03:00
Botond Dénes
eba8d68313 tests/flat_mutation_reader_test: extend multi-range reader tests
Add unit tests for the generator version and extend existing ones with
tests for the corner cases (0 and 1 range).
2018-09-28 14:27:55 +03:00
Botond Dénes
bb7447bbe4 make_flat_multi_range_reader: add documentation 2018-09-28 14:27:55 +03:00
Botond Dénes
39bfd5d1df make_flat_multi_range_reader: add generator overload
Allows creating a multi range reader from an arbitrary callable that
return std::optional<dht::partition_range>. The callable is expected to
return a new range on each call, such that passing each successive range
to `flat_mutation_reader::fast_forward_to` is valid. When exhausted the
callable is expected to return std::nullopt.
2018-09-28 14:27:55 +03:00
Botond Dénes
8c5387890d flat_multi_range_reader: refactor to work in terms of generator
Instead of working with a dht::partition_range_vector directly, work
with an abstract generator that returns a pointer to the next range on
each invocation. When exhausted it returns nullptr. This opens up the
possibility to create multi range readers from a generator functor that
creates ranges lazily. This is indeed what the next path does.
2018-09-28 14:27:55 +03:00
Botond Dénes
f3bf2e83dd make_flat_multi_range_reader(): better handle the 0 range case
Previously, when the passed in range of partition ranges contained 0
ranges, an empty reader was returned. This means that the returned
reader was forwardable or not depending on the number of passed in
ranges. This is inconsistent and can lead to nasty surprises.
To solve this problem add `forwardable_empty_mutation_reader`, a
specialized reader that delays creating the underlying reader until
fast_forward_to() is called on it, and thus a range is available.

When `make_flat_multi_range_mutation_reader()` is called with
`mutation_reader::forwarding::no` a simple empty reader is created, like
before.
2018-09-28 14:27:55 +03:00
Botond Dénes
03be9510a7 flat_mutation_reader: add move_buffer_content_to()
`move_buffer_content_to()` makes it possible to implement more efficient
wrapping readers, readers that wrap another flat mutation reader but do
no transformation to the underlying fragment stream.
These readers, when filling their buffers, can simply fill the
underlying reader's buffer, then move its content into their own. When
the reader's own buffer is empty, this is very efficient, as it can be
done by simply swapping the buffers, avoiding the work of moving the
fragments one-by-one.
2018-09-28 14:27:54 +03:00
Botond Dénes
68b6c83ee8 flat_multi_range_mutation_reader: drop fwd_mr ctor parameter
The factory function creating this reader ensures that the passed-in
ranges vector has more then one range, which effectively makes the
`fwd_mr` constructor parameter have no effect. The underlying reader
will always be created with `mutation_reader::forwarding::yes` as it has
to be able to fast-forward between the ranges.
2018-09-28 14:25:03 +03:00
Duarte Nunes
b8749a61dc tests/aggregate_fcts_test: Fix formatting of create_table()
And drop the template.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223315.28254-1-duarte@scylladb.com>
2018-09-28 09:45:27 +02:00
Duarte Nunes
17578c3579 tests/aggregate_fcts_test: Add test case for wrapped types
Provide a test case which checks a type being wrapped in a
reverse_type plays no role in assignment.

Refs #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-2-duarte@scylladb.com>
2018-09-28 07:09:08 +03:00
Duarte Nunes
5e7bb20c8a cql3/selection/selector: Unwrap types when validating assignment
When validating assignment between two types, it's possible one of
them is wrapped in a reverse_type, if it comes, for example, from the
type associated with a clustering column. When checking for weak
assignment the types are correctly unwrapped, but not when checking
for an exact match, which this patch fixes.

Technically, the receiver is never a reversed_type for the current
callers, but this is the morally correct implementation, as the type
being reversed or not plays no role in assignment.

Tests: unit(release)

Fixes #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-1-duarte@scylladb.com>
2018-09-28 07:08:19 +03:00
Piotr Sarna
da3821c598 tests: add test for secondary index with paging
A test case with enough rows to have multiple pages
is added to secondary_index_test suite.
2018-09-27 15:29:28 +02:00
Piotr Sarna
4b4f57747a cql3: remove execute(primary_keys) from select statement
Right now, with specialized execute() that takes primary keys
for indexed_table_select_statement, the original execute()
method implemented in select_statement is not used anywhere,
so it's removed.
2018-09-27 15:29:28 +02:00
Piotr Sarna
9e0b3cad1e cql3: add incremental base queries to index query
Base queries that are part of index queries are allowed to be short,
which can result in wasted work - e.g. when we query all replicas
in parallel, but have to discard most of the result, since the first
one (in token order) resulted in a short read.
Thus, we start by quering 1 range, check if the read is short,
and if not, continue by querying 2x more ranges than before.

Refs #2960
2018-09-27 15:29:28 +02:00
Piotr Sarna
c41e0ade6c storage_proxy: make get_restricted_ranges public
This function is useful for splitting ranges in indexed queries.
2018-09-27 15:29:28 +02:00
Piotr Sarna
5b16aeb395 cql3: add base query handling function to indexed statement
Handling a base query during the indexed statement execution
may require updating its paging state.
2018-09-27 15:29:28 +02:00
Piotr Sarna
bce7232555 cql3: add generating base key from index keys
A function that computes base partition/clustering key from index view
primary key is provided.
2018-09-27 15:29:28 +02:00
Piotr Sarna
2f085848d8 cql3: add paging state generation function
For indexed queries, the paging state needs to be updated
based on the results of base query when the read was short.
2018-09-27 15:29:28 +02:00
Piotr Sarna
f21bcbefdf cql3: move getting index view schema to prepare stage
Searching for index view schema for an indexed statement can be done
once in prepare stage, so it's moved to indexed_table_select_statement
prepare method.
2018-09-27 15:29:28 +02:00
Piotr Sarna
b6d90b2869 pager: make state() defined for exhausted pagers
If service::pager is exhausted, state() function used to return
a nullptr instead of a pointer to a valid paging state and the
documented return type in this case was 'unspecified'.
Sometimes a paging state may be needed anyway, even if the pager
is already exhausted - thus, state() return value becomes defined
after this commit. Exhausted pagers will return a valid object
to a state with _remaining field set to 0.
2018-09-27 15:29:28 +02:00
Piotr Sarna
c1be660c3a cql3: add maybe_set_paging_state function
set_paging_state is split into its unconditional variant and a maybe_
one in order to avoid double checks.
2018-09-27 15:29:28 +02:00
Piotr Sarna
744ac3bf7b cql3: rename set_has_more_pages to set_paging_state
This function's primary goal is to set the paging state passed
as a parameter, so its name is changed to match the semantics better.
2018-09-27 15:29:28 +02:00
Glauber Costa
c3f27784de database: guarantee a minimum amount of shares when manual operations are requested.
We have found issues when a flush is requested outside the usual
memtable flush loop and because there is not a lot of data the
controller will not have a high amount of shares.

To prevent this, this patch guarantees some minimum amount of shares
when extraneous operations (nodetool flush, commitlog-driven flush, etc)
are requested.

Another option would be to add shares instead of guarantee a minimum.
But in my view the approach I am taking here has two main advantages:

1) It won't cause spikes when those operations are requested
2) It is cumbersome to add shares in the current infrastructure, as just
adding backlog can cause shares to spike. Consider this example:

  Backlog is within the first range of very low backlog (~0.2). Shares
  for this would be around ~20. If we want to add 200 shares, that is
  equivalent to a backlog of 0.8. Once we add those two backlogs
  together, we end up with 1 (max backlog).

Fixes #3761

Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180927131904.8826-1-glauber@scylladb.com>
2018-09-27 15:20:31 +02:00
Piotr Sarna
336cc70438 pager: add setters for partition/clustering keys 2018-09-27 15:18:06 +02:00
Piotr Sarna
7c1e4c2deb cql3: add paging to read_posting_list
Instead of a single query, paging is used in order to query
an index.
2018-09-27 15:18:06 +02:00
Piotr Sarna
b83aa69a2e cql3: add non-const get_result_metadata method 2018-09-27 15:18:06 +02:00
Piotr Sarna
430a49f91a cql3: make find_index_* functions return paging state
In order to implement secondary index paging, intermediary query
functions now also return paging state for the view query.
2018-09-27 15:18:06 +02:00
Piotr Sarna
c3dd1775c8 cql3: make read_posting_list return future<rows>
Instead of returning a coordinator result and making a caller parse it
later, read_posting_list now extracts rows by itself.
This change is later needed when querying is replaced with a pager.
2018-09-27 15:18:06 +02:00
Piotr Sarna
1d34ef38a8 cql3: make pagers use time_point instead of duration
A standard way for passing a timeout parameter is specifying
a time_point, while pagers used to take a duration in order
to compute time points on the fly. This patch adds a timeout
parameter, which is a time_point, to fetch_page().
2018-09-27 15:18:06 +02:00
Tomasz Grabiec
78d9205a50 Merge "Multiple fixes to tests/normalizing_reader" from Vladimir
This patchset addresses multiple errors in normalizing_reader
implementation found during review.

I have decided to not make a clustering key full inside
before_key()/after_key() helpers. The reason is that for this they
would need schema to be passed as another parameter so existing
methods don't suit. OTOH, introducing new members for a class using
for testing purposes only seems an overkill.

* github.com/argenet/scylla.git projects/sstables-30/normalizing_reader_fixes/v1:
  range_tombstone: Add constructor accepting position_in_partition_views
    for range bounds.
  tests: Make sure range tombstone is properly split over rows with
    non-full keys.
  tests: Multiple fixes for draining and clearing range tombstones in
    normalizing_reader.
2018-09-27 12:51:47 +02:00
Vladimir Krivopalov
653fb37ea5 range_tombstone: Remove code that duplicates logic.
The actions performed by the call to set_start() were duplicated by the
immediately following code lines that are removed with this patch.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20eaa1338c1719ded34f5c9ada69ec03907936f5.1537989044.git.vladimir@scylladb.com>
2018-09-27 12:05:25 +02:00
Vladimir Krivopalov
b74706a8f5 tests: Multiple fixes for draining and clearing range tombstones in normalizing_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-26 19:24:10 -07:00
Vladimir Krivopalov
26d4d276e9 tests: Make sure range tombstone is properly split over rows with non-full keys.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-26 17:19:43 -07:00
Vladimir Krivopalov
fbccae0d15 range_tombstone: Add constructor accepting position_in_partition_views for range bounds.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-26 17:17:18 -07:00
Avi Kivity
e0b34003b5 tests: sstable_mutation_test: await background jobs
We only wait from the last test case, so if an individual test is executed,
a memory leak may be reported.

Fix by waiting from all test cases.
Message-Id: <20180926203723.18026-1-avi@scylladb.com>
2018-09-26 21:48:32 +01:00
Eliran Sinvani
44d93b4d4c cql3: fix incorrect results returned from prepared select with an IN clause
When executing a prepared select statement with a multicolumn IN, the
system returned incorrect results due to a memory violation (a bytes view
referring to an out of scope bytes object).
Added test for the prepared statement results correctness.

Tests:
1. unit (release) with the new test.
2. Python script.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <36c9cf9ed3fe72e3b4801e3cd120678429ce218a.1537947897.git.eliransin@scylladb.com>
2018-09-26 15:23:41 +03:00
Eliran Sinvani
22ad5434d1 cql3 : fix a crash upon preparing select with an IN restriction due to memory violation
When preparing a select query with a multicolumn in restriction, the
node crashed due to using a parameter after using a move on it.

Tests:
1. UnitTests (release)
2. Preparing a select statement that crashed the system before,
and verify it is not crashing.

Fixes #3204
Fixes #3692

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <7ebd210cd714a460ee5557ac612da970cee03270.1537947897.git.eliransin@scylladb.com>
2018-09-26 15:23:38 +03:00
Avi Kivity
8f5e80e61a Revert "setup: add the lazytime XFS version"
This reverts commit f828fe0d59. It causes
scylla_raid_setup to fail on CentOS 7.

Fixes #3784.
2018-09-26 11:10:07 +01:00
Avi Kivity
e8d988caf8 Merge "Enable existing SSTables unit tests for 'mc' format" from Vladimir and Piotr
"
This patchset fixes several issues in SSTables 3.x ('mc') writing and
parsing and extends existing SSTables unit tests to cover the new
format.

The only test enabled temporarily is check_multi_schema because it
turned out that reading SSTables 3.x with a different schema has not
been implemented in full. This will be addressed in a separate patchset.

This patchset depends on the "Support SSTables 3.x in Scylla runtime"
patchset.

Tests: unit {release}
"

* 'projects/sstables-30/unit-tests/v3' of https://github.com/argenet/scylla: (25 commits)
  tests: Enable existing SSTables tests for 'mc' format.
  tests: Fix test_wrong_range_tombstone_order for 'mc' format.
  tests: Extend reader assertions to check clustering keys made full.
  tests: Disable test_old_format_non_compound_range_tombstone_is_read for 'mc' format.
  tests: Disable check_multi_schema for 'mc' format.
  tests: Fix test_promoted_index_read for 'mc' format by using normalizing_reader.
  tests: Fix promoted_index_read to not rely on a specific index length
  tests: Add 'mc' files for test_wrong_range_tombstone_order
  tests: Add 'mc' files for test_wrong_counter_shard_order
  tests: Add 'mc' files for summary_test
  tests: Add 'mc' files for test_promoted_index_read
  tests: Add 'mc' files for test_partition_skipping
  tests: Add 'mc' files for large_partition tests (promoted_index_read, sub_partition_read, sub_partitions_read
  tests: Add 'mc' files for test_counter_read
  tests: Add 'mc' files for test_broken_promoted_index_is_skipped
  tests: SSTables 'mc' files for sliced_mutation_reads_test.
  tests: Introduce normalizing_reader helper for SSTables tests.
  mutation_fragment: Add range_tombstone_stream::empty() method.
  sstables: Make key full when setting a range tombstone start from end open marker.
  sstables: For 'mc' format, use excl_start when split an RT over a row with a full key.
  ...
2018-09-26 11:10:07 +01:00
Avi Kivity
337ee6153a Merge "Support SSTables 3.x in Scylla runtime" from Vladimir and Piotr
"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.

Several bugs found on this way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.

Tests: unit {release}

+ tested Scylla with both 'la' and 'mc' formats to work fine:

cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};                                                                  [3/1890]
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
    <<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8;
    <<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;

 pk | ck | rc
----+----+------
  2 |  3 | null
  4 |  6 | null

(2 rows)
    <<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;

 pk | ck | rc
----+----+------
  5 |  7 | null
  8 | 10 | null
  2 |  3 | null
  4 |  6 | null
  7 |  9 | null
  6 |  8 | null

(6 rows)
"

* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
  database: Honour enable_sstables_mc_format configuration option.
  sstables: Support SSTables 'mc' format as a feature.
  db: Add configuration option for enabling SSTables 'mc' format.
  tests: Add test for reading a complex column with zero subcolumns (SST3).
  sstables: Fix parsing of complex columns with zero subcolumns.
  sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
  sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
  bytes: Add helper for turning bytes_view into sstring_view.
  sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
  sstables: Fix string formatting for exception messages in m_format_read_helpers.
  sstables: Don't validate timestamps against the max value on parsing.
  sstables: Always store only min bases in serialization_header.
  sstables: Support 'mc' version parsing from filename.
  SST3: Make sure we call consume_partition_end
2018-09-26 11:10:07 +01:00
Vladimir Krivopalov
38c8d1ce05 tests: Enable existing SSTables tests for 'mc' format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
c33e0f3f15 tests: Fix test_wrong_range_tombstone_order for 'mc' format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
ad2b9e44ee tests: Extend reader assertions to check clustering keys made full.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
9239195473 tests: Disable test_old_format_non_compound_range_tombstone_is_read for 'mc' format.
This test is not applicable to the 'mc' format as it covers a backward
compatibility case which may only occur with SSTables generated by older
Scylla versions in 'ka' format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
952536c9f5 tests: Disable check_multi_schema for 'mc' format.
Altering types in schema has been disabled in Origin (see
CASSANDRA-12443). We do the same.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
86aae36e04 tests: Fix test_promoted_index_read for 'mc' format by using normalizing_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
5422203714 tests: Fix promoted_index_read to not rely on a specific index length
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
be5fe11f22 tests: Add 'mc' files for test_wrong_range_tombstone_order
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
3dd6e6f899 tests: Add 'mc' files for test_wrong_counter_shard_order
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
f08a2b35da tests: Add 'mc' files for summary_test
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
7e40947a80 tests: Add 'mc' files for test_promoted_index_read
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
20f3edba61 tests: Add 'mc' files for test_partition_skipping
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
8c37801ae5 tests: Add 'mc' files for large_partition tests (promoted_index_read, sub_partition_read, sub_partitions_read
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
28c32a353a tests: Add 'mc' files for test_counter_read
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
60c9a25b38 tests: Add 'mc' files for test_broken_promoted_index_is_skipped
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
24342dc27d tests: SSTables 'mc' files for sliced_mutation_reads_test.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
4393233a86 tests: Introduce normalizing_reader helper for SSTables tests.
This is a helper flat_mutation_reader that wraps another reader and
splits range tombstones over rows before emitting them.

It is used to produce the same mutation streams for both old (ka/la) and
new (mc) SSTables formats in unit tests.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
7a5c4f0a63 mutation_fragment: Add range_tombstone_stream::empty() method.
The method checks if the underlying range_tombstone_list is empty.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
eddf846c8a sstables: Make key full when setting a range tombstone start from end open marker.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
fa48a78d71 sstables: For 'mc' format, use excl_start when split an RT over a row with a full key.
This fixes the monotonicity issue as otherwise the range tombstone
emitted after such clustering row has a start position that should be
ordered before that of the row.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
45082ef18c sstables: Don't write promoted index consisting of a single block in 'mc' format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
8f5ac1d86f SST3: Make sure we emit range tombstone when slicing/fft
If we go past the slice to be read with a range tombstone being opened
we need to emit an RT corresponding to this slice.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
ade8027960 Add mutation_fragment_filter::upper_bound
This method returns end of current position range.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
82ff29cde8 Add clustering_ranges_walker::upper_bound
This method returns end of current position range.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
bff49345cd Add position_in_partition_view::as_end_bound_view
This will be used in sstables 3.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
cd80d6ff65 database: Honour enable_sstables_mc_format configuration option.
Only enable SSTables 'mc' format if the entire cluster supports it and
it is enabled in the configuration file.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
c98937e04c sstables: Support SSTables 'mc' format as a feature.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
650b245657 db: Add configuration option for enabling SSTables 'mc' format.
This flag will only be used for testing purposes until Scylla 3.o
release and will be removed once SSTables 'mc' testing is completed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
0edd3c57a9 tests: Add test for reading a complex column with zero subcolumns (SST3).
The files are generated by Scylla as a compaction_history table.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
24590fe88c sstables: Fix parsing of complex columns with zero subcolumns.
Before this fix, a complex column with zero subcolumns would be
incorrecty parsed as it would re-read the deletion time twice.

Now, this case is handled properly.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
be3613bdb6 sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
This avoids noisy warnings like "signed value overflow" when ASAN is
turned on.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
0048f4814e sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
abstract_type::parse_type() only deals with simple types and fails to
parse wrapped types such as
org.apache.cassandra.db.marshal.FrozenType(org.apache.cassandra.db.marshal.ListType(org.apache.cassandra.db.marshal.UTF8Type))

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
0f298113c7 bytes: Add helper for turning bytes_view into sstring_view.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
9166badebe sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
It may happen that we hit the end of partition and then get
fast_forward_to() called in which case we attempt to call it from an
already destroyed object. We need to check the _mf_filter value before
doing so.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
fc901eb700 sstables: Fix string formatting for exception messages in m_format_read_helpers.
Before this fix, the code was a potential undefined behaviour and crash
because it would add a large value to a const char* and try to create a
std::string out of it.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
84341821b1 sstables: Don't validate timestamps against the max value on parsing.
Internally, timestamps are represented as signed integers (int64_t) but
stored as unsigned ones. So it is quite possible to store data with
timestamp that is represented as a number larger than the max value of
int64_t type.
One such example is api::min_timestamp() that is used when generating
system schema tables ("keyspaces"). When cast to uint64_t, it turns into
a large value.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
bdca27ae41 sstables: Always store only min bases in serialization_header.
There previously was an inconsistency in treating min values stored in a
serialization_header. They are written to or read from a Statistics.db
as deltas against fixed bases, but when we parse timeouts from the data
file, we need the full bases, not just deltas.

This inconsistency causes wrong timestamp values if we write an sstable
and then read from it using one and the same sstable object because we
turn min values into bases on write and then don't adjust them back
because we already have them in memory.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
057c26f894 sstables: Support 'mc' version parsing from filename.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Piotr Jastrzebski
d8e6d1ed98 SST3: Make sure we call consume_partition_end
even when we slice and fast forward to.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:23:40 -07:00
Raphael S. Carvalho
745e35fa82 database: Fix sstable resharding for mc format
SStable format mc doesn't write ancestors to metadata, so resharding
will not work with this new format because it relies on ancestors to
replace new unshared sstables with old shared ones.
Fix is about not relying on ancestors metadata for this operation.

Fixes #3777.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
2018-09-25 18:37:48 +03:00
Nadav Har'El
05f8ed270b Add docs/metrics.md - documentation on metrics
Today I realised that although we have per-table metrics, they are not
*really* available by default. I was suprised to find that we don't have
(as far as I can tell) a document explaining why it is so, or how to enable
them anyway. Moreover, the more I investigated this issue, the more I
realised how little I know on Scylla's metrics - how they are calculated,
how they are collected, their different types, and so on.

So I sat down to figure out everything I wanted to learn about Scylla metrics,
and then wrote it all down in a new document, docs/metrics.md.

There are some missing pieces in this document marked by TODO, and probably
additional missing pieces that I'm not aware of, but I think this is already
a good start and can be (and should be) improved-on later.

We really need to have more of these documents describing various Scylla
subsystems to new developers - what each subsystem does, why it does what
it does, where is the code, and so on. I am facing these problems every
day as a seasoned developer - I can't even imagine what our new developers
face when trying to understand a subsystem they are not yet familiar with.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180920131103.20590-1-nyh@scylladb.com>
2018-09-25 17:51:20 +03:00
Paweł Dziepak
a3746d3b05 paging: make may_need_paging() more conservative
There is a bad interaction between may_need_paging() and query result
size limiter. The former is trying to avoid the complexity of paged
queries when the number of returned rows is going to be smaller than the
page size. The latter uses the fact that paged queries need not return
all requested rows to limit the size of a query results. Since
may_need_paging() may turn a paged query into non-paged one as a side
effect it disables the oversized result protection.

This patch limits the cases when may_need_paging() disables paging to
the situations when we know for sure that query result size limiter
won't be needed, i.e.: the result is not going to contain more than one
row. If the client knows for sure that the paging is not needed and
the performance impact is worthwhile it can disable paging on its side.
Otherwise, let's default to the safer behaviour.

Fixes #3620.

Message-Id: <20180925134431.24329-1-pdziepak@scylladb.com>
2018-09-25 17:01:04 +03:00
Avi Kivity
c6f651ead4 Merge "Use fragmented buffers in commitlog writes" from Paweł
"
This series changes commitlog write path so that it uses fragmented
buffers and therefore avoids large allocations. This is done by first
switching the code to use seastar memory_output_stream interface, which
can handle fragmented buffer without any additional actions from the
user code needed and then making it use buffers of fixed size 128 kB.

Tests: unit(release, debug) dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table)
"

* tag 'fragmented-commitlog-writes/v3' of https://github.com/pdziepak/scylla:
  commitlog: switch to fragmented buffers
  commitlog: drop buffer pools
  commitlog: drop recovery from bad alloc
  utils: drop data_output
  commitlog: use memory_output_stream
  serialization_visitors: add support for memory_output_stream
  utils: fragmented_temporary_buffer::view: add remove_prefix()
  utils: fragmented_temporary_buffer: add empty() and size_bytes()
  utils: fragmented_temporary_buffer: add get_ostream()
  idl: serializer: don't assume Iterator::value_type is bytes_view
  idl: serializer:  create buffer view from streams
  utils: crc: accept FragmentRange
2018-09-25 12:43:06 +03:00
Avi Kivity
8276ada1c4 tests: sstable_3_x_test: await sstable background tasks
When an sstable is deleted, this work is done as a background task
since it cannot be done from the destructor.  If we don't wait for
that background task, it is detected as a leak by ASAN.

Fix by waiting for background tasks in every test.

A more complete fix would involve having a factory class create
sstables and assume the responsibility for background tasks, and
something similar to with_cql_test_env(), but that is deferred until later.

Tests: sstable_3_x_test (debug).
Message-Id: <20180923111745.8313-1-avi@scylladb.com>
2018-09-24 10:43:58 +02:00
Takuya ASADA
21a12aa458 dist/redhat: specify correct repo file path on scylla-housekeeping services
Currently, both scylla-housekeeping-daily/-restart services mistakenly
specify repo file path as "@@REPOFILES@@", witch is copied from .in
template, need to be replace with actual path.

Fixes #3776

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180921031605.9330-1-syuu@scylladb.com>
2018-09-23 11:38:26 +03:00
Glauber Costa
f828fe0d59 setup: add the lazytime XFS version
Starting with kernel 4.17 XFS will support the lazytime mount option.
That will be beneficial for Scylla as updating times synchronously is
one of our current sources of stalls.

Fortunately, older kernels are able to parse the option and just ignore
it. We verified that to be the case in a 4.15 kernel on ubuntu.
Therefore, just add the option unconditionally.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180920170017.13215-1-glauber@scylladb.com>
2018-09-20 20:12:44 +03:00
Gleb Natapov
0bf9a78c78 sstables: wrap file into checked file after applying extensions
File extensions can also produce errors that checked file wants to
intercept and act upon. The patch changes the order in which files are
wrapped to make checked file the outermost wrapped to be able to handle
exception generated by all inner wrappers.

Message-Id: <20180920124430.GD2326@scylladb.com>
2018-09-20 15:57:38 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Asias He
de05df216f streaming: Use rpc::source on the shard where it is created
rpc::source can only work on the shard where it is created, thus we can
not apply the load distribution optimization. Disable it and let the
multishard_writer to forward the data to the correct shard.

Fixes #3731.

Message-Id: <0d1b4d3e7adcfdc4e392b83aeb2544b95f3f46dd.1537430162.git.asias@scylladb.com>
2018-09-20 12:29:24 +03:00
Avi Kivity
8b2bf73c6f Merge "Fix compaction metadata read/write for SSTables 3.x" from Vladimir
"
In SSTables 3.x, the 'ancestors' field of compaction metadata is no
longer stored in the Statistics.db file

The newly added test has previously failed due to this inconsistency.

Tests: unit {release}
"

* 'projects/sstables-30/empty_clustering_key/v1' of https://github.com/argenet/scylla:
  tests: Add test for reading table with empty clustering key from SSTables 3.x.
  tests: Update Statistics.db files for SSTables 3.x write tests.
  sstables: Do not parse ancestors from compaction metadata for SSTables 3.x
2018-09-20 09:53:46 +03:00
Vladimir Krivopalov
bf351c4a4f tests: Add test for reading table with empty clustering key from SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-19 20:57:23 -07:00
Vladimir Krivopalov
3bbb013ecd tests: Update Statistics.db files for SSTables 3.x write tests.
Those files have been generated with 'ancestors' field in compaction
metadata and so were invalid.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-19 20:57:23 -07:00
Vladimir Krivopalov
48fa088ec6 sstables: Do not parse ancestors from compaction metadata for SSTables 3.x
Ancestors array has been removed starting from 'ma' format
(CASSANDRA-7066).

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-19 17:11:43 -07:00
Vlad Zolotarov
043ced243e fix_system_distributed_tables.sh: adjust newly added 'request_size' and 'response_size' columns
Adjust the script to the new schema of system_traces.sessions. Two
new columns have been added:
   - request_size:  int
   - response_size: int

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180919005504.12498-1-vladz@scylladb.com>
2018-09-19 15:46:11 +01:00
Paweł Dziepak
4469f76e7c commitlog: switch to fragmented buffers
So far commitlog was using contiguous buffers for storing the data that
is about to be written to disk. It was able to coalesce small writes so
that multiple small mutations would use the same buffer, but if a
muation was large the commitlog would attempt to allocate a single,
appropriately large buffer. This excessively stresses the memory
allocator and may cause memory fragmentation to become an issue. The
solution is to use fixed-size buffers of 128 kB, which is the standard
buffer size in Scylla and keep large values fragmented.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
7c1add6769 commitlog: drop buffer pools
Buffer pools were added in 7191a130bb
"Commitlog: recycle buffers to reduce fragmentation." They introduce a
lot of complexity and will become unnecessary once the code is switched
to use fixed-size 128kB buffers.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
9fee8b8d76 commitlog: drop recovery from bad alloc
If a node cannot allocate a 128 kB it is already in a very bad shape, so
there isn't much value in trying to recover by attempting smaller
allocations and it just adds more complexity to the segment allocation.
It actually may be better to let some requests fail and give the node a
chance to recover rather than trying to use every last byte of free
memory and end up with bad_alloc in a noexcept context.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
2e5b375309 utils: drop data_output 2018-09-18 17:22:59 +01:00
Paweł Dziepak
fe48aaae46 commitlog: use memory_output_stream
memory_output_stream deals with all required pointer arithmetic and
allows easy transition to fragmented buffers.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
b9ab058834 serialization_visitors: add support for memory_output_stream 2018-09-18 17:22:59 +01:00
Paweł Dziepak
cbe2ef9e5c utils: fragmented_temporary_buffer::view: add remove_prefix() 2018-09-18 17:22:59 +01:00
Alexys Jacob
24b90ef527 configure.py: coding style fixes
configure.py:23:10: E401 multiple imports on one line
configure.py:39:61: W291 trailing whitespace
configure.py:47:1: E302 expected 2 blank lines, found 1
configure.py:53:16: W291 trailing whitespace
configure.py:55:1: E302 expected 2 blank lines, found 1
configure.py:62:1: E302 expected 2 blank lines, found 1
configure.py:63:53: E251 unexpected spaces around keyword / parameter equals
configure.py:63:55: E251 unexpected spaces around keyword / parameter equals
configure.py:63:68: E251 unexpected spaces around keyword / parameter equals
configure.py:63:70: E251 unexpected spaces around keyword / parameter equals
configure.py:63:92: E251 unexpected spaces around keyword / parameter equals
configure.py:63:94: E251 unexpected spaces around keyword / parameter equals
configure.py:64:33: E251 unexpected spaces around keyword / parameter equals
configure.py:64:35: E251 unexpected spaces around keyword / parameter equals
configure.py:65:54: E251 unexpected spaces around keyword / parameter equals
configure.py:65:56: E251 unexpected spaces around keyword / parameter equals
configure.py:65:69: E251 unexpected spaces around keyword / parameter equals
configure.py:65:71: E251 unexpected spaces around keyword / parameter equals
configure.py:65:94: E251 unexpected spaces around keyword / parameter equals
configure.py:65:96: E251 unexpected spaces around keyword / parameter equals
configure.py:66:33: E251 unexpected spaces around keyword / parameter equals
configure.py:66:35: E251 unexpected spaces around keyword / parameter equals
configure.py:68:1: E302 expected 2 blank lines, found 1
configure.py:72:18: E712 comparison to True should be 'if cond is True:' or 'if cond:'
configure.py:80:1: E302 expected 2 blank lines, found 1
configure.py:83:1: E302 expected 2 blank lines, found 1
configure.py:87:1: E302 expected 2 blank lines, found 1
configure.py:87:33: E251 unexpected spaces around keyword / parameter equals
configure.py:87:35: E251 unexpected spaces around keyword / parameter equals
configure.py:87:45: E251 unexpected spaces around keyword / parameter equals
configure.py:87:47: E251 unexpected spaces around keyword / parameter equals
configure.py:88:56: E251 unexpected spaces around keyword / parameter equals
configure.py:88:58: E251 unexpected spaces around keyword / parameter equals
configure.py:90:1: E302 expected 2 blank lines, found 1
configure.py:94:1: E302 expected 2 blank lines, found 1
configure.py:94:42: E251 unexpected spaces around keyword / parameter equals
configure.py:94:44: E251 unexpected spaces around keyword / parameter equals
configure.py:94:54: E251 unexpected spaces around keyword / parameter equals
configure.py:94:56: E251 unexpected spaces around keyword / parameter equals
configure.py:104:42: E251 unexpected spaces around keyword / parameter equals
configure.py:104:44: E251 unexpected spaces around keyword / parameter equals
configure.py:105:42: E251 unexpected spaces around keyword / parameter equals
configure.py:105:44: E251 unexpected spaces around keyword / parameter equals
configure.py:110:1: E302 expected 2 blank lines, found 1
configure.py:114:29: E251 unexpected spaces around keyword / parameter equals
configure.py:114:31: E251 unexpected spaces around keyword / parameter equals
configure.py:114:61: E251 unexpected spaces around keyword / parameter equals
configure.py:114:63: E251 unexpected spaces around keyword / parameter equals
configure.py:116:1: E302 expected 2 blank lines, found 1
configure.py:123:26: E251 unexpected spaces around keyword / parameter equals
configure.py:123:28: E251 unexpected spaces around keyword / parameter equals
configure.py:123:49: E251 unexpected spaces around keyword / parameter equals
configure.py:123:51: E251 unexpected spaces around keyword / parameter equals
configure.py:123:84: E251 unexpected spaces around keyword / parameter equals
configure.py:123:86: E251 unexpected spaces around keyword / parameter equals
configure.py:129:1: E302 expected 2 blank lines, found 1
configure.py:135:1: E302 expected 2 blank lines, found 1
configure.py:137:35: E251 unexpected spaces around keyword / parameter equals
configure.py:137:37: E251 unexpected spaces around keyword / parameter equals
configure.py:137:53: E251 unexpected spaces around keyword / parameter equals
configure.py:137:55: E251 unexpected spaces around keyword / parameter equals
configure.py:137:83: E251 unexpected spaces around keyword / parameter equals
configure.py:137:85: E251 unexpected spaces around keyword / parameter equals
configure.py:143:1: E302 expected 2 blank lines, found 1
configure.py:148:1: E302 expected 2 blank lines, found 1
configure.py:152:5: E301 expected 1 blank line, found 0
configure.py:159:5: E301 expected 1 blank line, found 0
configure.py:161:5: E301 expected 1 blank line, found 0
configure.py:163:5: E301 expected 1 blank line, found 0
configure.py:165:5: E301 expected 1 blank line, found 0
configure.py:168:1: E302 expected 2 blank lines, found 1
configure.py:169:5: F841 local variable 'mach' is assigned to but never used
configure.py:175:1: E302 expected 2 blank lines, found 1
configure.py:178:5: E301 expected 1 blank line, found 0
configure.py:183:5: E301 expected 1 blank line, found 0
configure.py:185:5: E301 expected 1 blank line, found 0
configure.py:187:5: E301 expected 1 blank line, found 0
configure.py:189:5: E301 expected 1 blank line, found 0
configure.py:192:1: E305 expected 2 blank lines after class or function definition, found 1
configure.py:329:5: E123 closing bracket does not match indentation of opening bracket's line
configure.py:335:5: E123 closing bracket does not match indentation of opening bracket's line
configure.py:340:41: E251 unexpected spaces around keyword / parameter equals
configure.py:340:43: E251 unexpected spaces around keyword / parameter equals
configure.py:340:60: E251 unexpected spaces around keyword / parameter equals
configure.py:340:62: E251 unexpected spaces around keyword / parameter equals
configure.py:340:85: E251 unexpected spaces around keyword / parameter equals
configure.py:340:87: E251 unexpected spaces around keyword / parameter equals
configure.py:341:30: E251 unexpected spaces around keyword / parameter equals
configure.py:341:32: E251 unexpected spaces around keyword / parameter equals
configure.py:342:29: E251 unexpected spaces around keyword / parameter equals
configure.py:342:31: E251 unexpected spaces around keyword / parameter equals
configure.py:343:38: E251 unexpected spaces around keyword / parameter equals
configure.py:343:40: E251 unexpected spaces around keyword / parameter equals
configure.py:343:54: E251 unexpected spaces around keyword / parameter equals
configure.py:343:56: E251 unexpected spaces around keyword / parameter equals
configure.py:344:29: E251 unexpected spaces around keyword / parameter equals
configure.py:344:31: E251 unexpected spaces around keyword / parameter equals
configure.py:345:37: E251 unexpected spaces around keyword / parameter equals
configure.py:345:39: E251 unexpected spaces around keyword / parameter equals
configure.py:345:52: E251 unexpected spaces around keyword / parameter equals
configure.py:345:54: E251 unexpected spaces around keyword / parameter equals
configure.py:346:29: E251 unexpected spaces around keyword / parameter equals
configure.py:346:31: E251 unexpected spaces around keyword / parameter equals
configure.py:349:43: E251 unexpected spaces around keyword / parameter equals
configure.py:349:45: E251 unexpected spaces around keyword / parameter equals
configure.py:349:59: E251 unexpected spaces around keyword / parameter equals
configure.py:349:61: E251 unexpected spaces around keyword / parameter equals
configure.py:349:84: E251 unexpected spaces around keyword / parameter equals
configure.py:349:86: E251 unexpected spaces around keyword / parameter equals
configure.py:350:29: E251 unexpected spaces around keyword / parameter equals
configure.py:350:31: E251 unexpected spaces around keyword / parameter equals
configure.py:351:44: E251 unexpected spaces around keyword / parameter equals
configure.py:351:46: E251 unexpected spaces around keyword / parameter equals
configure.py:351:60: E251 unexpected spaces around keyword / parameter equals
configure.py:351:62: E251 unexpected spaces around keyword / parameter equals
configure.py:351:86: E251 unexpected spaces around keyword / parameter equals
configure.py:351:88: E251 unexpected spaces around keyword / parameter equals
configure.py:352:29: E251 unexpected spaces around keyword / parameter equals
configure.py:352:31: E251 unexpected spaces around keyword / parameter equals
configure.py:353:43: E251 unexpected spaces around keyword / parameter equals
configure.py:353:45: E251 unexpected spaces around keyword / parameter equals
configure.py:353:59: E251 unexpected spaces around keyword / parameter equals
configure.py:353:61: E251 unexpected spaces around keyword / parameter equals
configure.py:353:79: E251 unexpected spaces around keyword / parameter equals
configure.py:353:81: E251 unexpected spaces around keyword / parameter equals
configure.py:354:29: E251 unexpected spaces around keyword / parameter equals
configure.py:354:31: E251 unexpected spaces around keyword / parameter equals
configure.py:355:45: E251 unexpected spaces around keyword / parameter equals
configure.py:355:47: E251 unexpected spaces around keyword / parameter equals
configure.py:355:61: E251 unexpected spaces around keyword / parameter equals
configure.py:355:63: E251 unexpected spaces around keyword / parameter equals
configure.py:355:78: E251 unexpected spaces around keyword / parameter equals
configure.py:355:80: E251 unexpected spaces around keyword / parameter equals
configure.py:356:29: E251 unexpected spaces around keyword / parameter equals
configure.py:356:31: E251 unexpected spaces around keyword / parameter equals
configure.py:359:45: E251 unexpected spaces around keyword / parameter equals
configure.py:359:47: E251 unexpected spaces around keyword / parameter equals
configure.py:359:61: E251 unexpected spaces around keyword / parameter equals
configure.py:359:63: E251 unexpected spaces around keyword / parameter equals
configure.py:359:83: E251 unexpected spaces around keyword / parameter equals
configure.py:359:85: E251 unexpected spaces around keyword / parameter equals
configure.py:360:29: E251 unexpected spaces around keyword / parameter equals
configure.py:360:31: E251 unexpected spaces around keyword / parameter equals
configure.py:361:48: E251 unexpected spaces around keyword / parameter equals
configure.py:361:50: E251 unexpected spaces around keyword / parameter equals
configure.py:361:69: E251 unexpected spaces around keyword / parameter equals
configure.py:361:71: E251 unexpected spaces around keyword / parameter equals
configure.py:361:87: E251 unexpected spaces around keyword / parameter equals
configure.py:361:89: E251 unexpected spaces around keyword / parameter equals
configure.py:362:29: E251 unexpected spaces around keyword / parameter equals
configure.py:362:31: E251 unexpected spaces around keyword / parameter equals
configure.py:363:48: E251 unexpected spaces around keyword / parameter equals
configure.py:363:50: E251 unexpected spaces around keyword / parameter equals
configure.py:363:64: E251 unexpected spaces around keyword / parameter equals
configure.py:363:66: E251 unexpected spaces around keyword / parameter equals
configure.py:363:89: E251 unexpected spaces around keyword / parameter equals
configure.py:363:91: E251 unexpected spaces around keyword / parameter equals
configure.py:364:29: E251 unexpected spaces around keyword / parameter equals
configure.py:364:31: E251 unexpected spaces around keyword / parameter equals
configure.py:365:46: E251 unexpected spaces around keyword / parameter equals
configure.py:365:48: E251 unexpected spaces around keyword / parameter equals
configure.py:365:62: E251 unexpected spaces around keyword / parameter equals
configure.py:365:64: E251 unexpected spaces around keyword / parameter equals
configure.py:365:82: E251 unexpected spaces around keyword / parameter equals
configure.py:365:84: E251 unexpected spaces around keyword / parameter equals
configure.py:365:97: E251 unexpected spaces around keyword / parameter equals
configure.py:365:99: E251 unexpected spaces around keyword / parameter equals
configure.py:366:29: E251 unexpected spaces around keyword / parameter equals
configure.py:366:31: E251 unexpected spaces around keyword / parameter equals
configure.py:367:48: E251 unexpected spaces around keyword / parameter equals
configure.py:367:50: E251 unexpected spaces around keyword / parameter equals
configure.py:367:70: E251 unexpected spaces around keyword / parameter equals
configure.py:367:72: E251 unexpected spaces around keyword / parameter equals
configure.py:368:1: E101 indentation contains mixed spaces and tabs
configure.py:368:1: W191 indentation contains tabs
configure.py:368:4: E128 continuation line under-indented for visual indent
configure.py:368:8: E251 unexpected spaces around keyword / parameter equals
configure.py:368:10: E251 unexpected spaces around keyword / parameter equals
configure.py:369:48: E251 unexpected spaces around keyword / parameter equals
configure.py:369:50: E251 unexpected spaces around keyword / parameter equals
configure.py:369:73: E251 unexpected spaces around keyword / parameter equals
configure.py:369:75: E251 unexpected spaces around keyword / parameter equals
configure.py:370:1: E101 indentation contains mixed spaces and tabs
configure.py:370:13: E128 continuation line under-indented for visual indent
configure.py:370:17: E251 unexpected spaces around keyword / parameter equals
configure.py:370:19: E251 unexpected spaces around keyword / parameter equals
configure.py:371:47: E251 unexpected spaces around keyword / parameter equals
configure.py:371:49: E251 unexpected spaces around keyword / parameter equals
configure.py:371:71: E251 unexpected spaces around keyword / parameter equals
configure.py:371:73: E251 unexpected spaces around keyword / parameter equals
configure.py:372:13: E128 continuation line under-indented for visual indent
configure.py:372:17: E251 unexpected spaces around keyword / parameter equals
configure.py:372:19: E251 unexpected spaces around keyword / parameter equals
configure.py:373:50: E251 unexpected spaces around keyword / parameter equals
configure.py:373:52: E251 unexpected spaces around keyword / parameter equals
configure.py:373:76: E251 unexpected spaces around keyword / parameter equals
configure.py:373:78: E251 unexpected spaces around keyword / parameter equals
configure.py:374:13: E128 continuation line under-indented for visual indent
configure.py:374:17: E251 unexpected spaces around keyword / parameter equals
configure.py:374:19: E251 unexpected spaces around keyword / parameter equals
configure.py:375:52: E251 unexpected spaces around keyword / parameter equals
configure.py:375:54: E251 unexpected spaces around keyword / parameter equals
configure.py:375:68: E251 unexpected spaces around keyword / parameter equals
configure.py:375:70: E251 unexpected spaces around keyword / parameter equals
configure.py:375:94: E251 unexpected spaces around keyword / parameter equals
configure.py:375:96: E251 unexpected spaces around keyword / parameter equals
configure.py:375:109: E251 unexpected spaces around keyword / parameter equals
configure.py:375:111: E251 unexpected spaces around keyword / parameter equals
configure.py:376:29: E251 unexpected spaces around keyword / parameter equals
configure.py:376:31: E251 unexpected spaces around keyword / parameter equals
configure.py:377:43: E251 unexpected spaces around keyword / parameter equals
configure.py:377:45: E251 unexpected spaces around keyword / parameter equals
configure.py:377:59: E251 unexpected spaces around keyword / parameter equals
configure.py:377:61: E251 unexpected spaces around keyword / parameter equals
configure.py:377:79: E251 unexpected spaces around keyword / parameter equals
configure.py:377:81: E251 unexpected spaces around keyword / parameter equals
configure.py:378:29: E251 unexpected spaces around keyword / parameter equals
configure.py:378:31: E251 unexpected spaces around keyword / parameter equals
configure.py:379:30: E251 unexpected spaces around keyword / parameter equals
configure.py:379:32: E251 unexpected spaces around keyword / parameter equals
configure.py:379:46: E251 unexpected spaces around keyword / parameter equals
configure.py:379:48: E251 unexpected spaces around keyword / parameter equals
configure.py:379:62: E251 unexpected spaces around keyword / parameter equals
configure.py:379:64: E251 unexpected spaces around keyword / parameter equals
configure.py:380:30: E251 unexpected spaces around keyword / parameter equals
configure.py:380:32: E251 unexpected spaces around keyword / parameter equals
configure.py:380:44: E251 unexpected spaces around keyword / parameter equals
configure.py:380:46: E251 unexpected spaces around keyword / parameter equals
configure.py:380:58: E251 unexpected spaces around keyword / parameter equals
configure.py:380:60: E251 unexpected spaces around keyword / parameter equals
configure.py:395:36: E251 unexpected spaces around keyword / parameter equals
configure.py:395:38: E251 unexpected spaces around keyword / parameter equals
configure.py:395:76: E251 unexpected spaces around keyword / parameter equals
configure.py:395:78: E251 unexpected spaces around keyword / parameter equals
configure.py:398:18: E127 continuation line over-indented for visual indent
configure.py:424:32: W291 trailing whitespace
configure.py:649:18: E124 closing bracket does not match visual indentation
configure.py:650:17: E127 continuation line over-indented for visual indent
configure.py:650:17: W503 line break before binary operator
configure.py:651:17: W503 line break before binary operator
configure.py:652:17: E124 closing bracket does not match visual indentation
configure.py:784:8: E713 test for membership should be 'not in'
configure.py:790:45: W291 trailing whitespace
configure.py:819:32: E261 at least two spaces before inline comment
configure.py:832:5: E123 closing bracket does not match indentation of opening bracket's line
configure.py:836:35: E251 unexpected spaces around keyword / parameter equals
configure.py:836:37: E251 unexpected spaces around keyword / parameter equals
configure.py:836:49: E251 unexpected spaces around keyword / parameter equals
configure.py:836:51: E251 unexpected spaces around keyword / parameter equals
configure.py:845:45: E251 unexpected spaces around keyword / parameter equals
configure.py:845:47: E251 unexpected spaces around keyword / parameter equals
configure.py:845:59: E251 unexpected spaces around keyword / parameter equals
configure.py:845:61: E251 unexpected spaces around keyword / parameter equals
configure.py:848:43: E251 unexpected spaces around keyword / parameter equals
configure.py:848:45: E251 unexpected spaces around keyword / parameter equals
configure.py:869:1: E302 expected 2 blank lines, found 1
configure.py:879:1: E305 expected 2 blank lines after class or function definition, found 1
configure.py:965:118: E225 missing whitespace around operator
configure.py:967:18: E124 closing bracket does not match visual indentation
configure.py:969:27: F821 undefined name 'python'
configure.py:969:73: E251 unexpected spaces around keyword / parameter equals
configure.py:969:75: E251 unexpected spaces around keyword / parameter equals
configure.py:976:7: E201 whitespace after '{'
configure.py:976:12: E203 whitespace before ':'
configure.py:976:73: E202 whitespace before '}'
configure.py:981:58: E251 unexpected spaces around keyword / parameter equals
configure.py:981:60: E251 unexpected spaces around keyword / parameter equals
configure.py:987:10: E222 multiple spaces after operator
configure.py:1001:17: E124 closing bracket does not match visual indentation
configure.py:1026:29: E251 unexpected spaces around keyword / parameter equals
configure.py:1026:31: E251 unexpected spaces around keyword / parameter equals
configure.py:1100:82: W291 trailing whitespace
configure.py:1110:29: E251 unexpected spaces around keyword / parameter equals
configure.py:1110:31: E251 unexpected spaces around keyword / parameter equals
configure.py:1110:49: E251 unexpected spaces around keyword / parameter equals
configure.py:1110:51: E251 unexpected spaces around keyword / parameter equals
configure.py:1111:64: E251 unexpected spaces around keyword / parameter equals
configure.py:1111:66: E251 unexpected spaces around keyword / parameter equals
configure.py:1112:13: E128 continuation line under-indented for visual indent
configure.py:1112:22: E251 unexpected spaces around keyword / parameter equals
configure.py:1112:24: E251 unexpected spaces around keyword / parameter equals
configure.py:1140:106: W291 trailing whitespace
configure.py:1149:86: E127 continuation line over-indented for visual indent
configure.py:1191:116: E251 unexpected spaces around keyword / parameter equals
configure.py:1191:118: E251 unexpected spaces around keyword / parameter equals
configure.py:1191:139: E251 unexpected spaces around keyword / parameter equals
configure.py:1191:141: E251 unexpected spaces around keyword / parameter equals
configure.py:1197:83: E231 missing whitespace after ','
configure.py:1200:76: E231 missing whitespace after ','
configure.py:1215:99: W291 trailing whitespace
configure.py:1242:31: E251 unexpected spaces around keyword / parameter equals
configure.py:1242:33: E251 unexpected spaces around keyword / parameter equals

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180917155438.12410-1-ultrabug@gentoo.org>
2018-09-18 13:49:23 +03:00
Avi Kivity
e5e59ea9cf Merge "More SSTables 3.x write tests enriched with read after write." from Vladimir
"
Some of the write tests were missing the read after write validation
which has now been added for better coverage.

Tests: unit {release}
"

* 'projects/sstables-30/more-enriched-tests/v1' of https://github.com/argenet/scylla:
  tests: Enrich test_write_adjacent_range_tombstones_with_rows with read after write
  tests: Enrich test_write_many_range_tombstones with read after write
  tests: Enrich test_write_mixed_rows_and_range_tombstones with read after write
  tests: Enrich test_write_non_adjacent_range_tombstones with read after write
  tests: Enrich test_write_adjacent_range_tombstones with read after write
  tests: Enrich test_write_simple_range_tombstone with read after write.
  tests: Enrich test_write_deleted_column with read after write.
2018-09-18 13:45:52 +03:00
Paweł Dziepak
e464ad4f5d utils: fragmented_temporary_buffer: add empty() and size_bytes() 2018-09-18 11:29:37 +01:00
Paweł Dziepak
f4bb219a8b utils: fragmented_temporary_buffer: add get_ostream() 2018-09-18 11:29:37 +01:00
Paweł Dziepak
196c5a5eee idl: serializer: don't assume Iterator::value_type is bytes_view 2018-09-18 11:29:36 +01:00
Paweł Dziepak
953942b256 idl: serializer: create buffer view from streams 2018-09-18 11:29:36 +01:00
Paweł Dziepak
252cf0c681 utils: crc: accept FragmentRange 2018-09-18 11:29:36 +01:00
Avi Kivity
9d90ba470b Merge "Fix deleted counters handling in SSTables 3.x" from Vladimir
"
This patchset fixes the bug in SSTables 3.x parser that did not properly
handle deleted counter cells.

A write test is enriched to validate read after write so that this case
is covered.

Tests: unit {release}
"

* 'projects/sstables-30/fix-deleted-counters-read/v1' of https://github.com/argenet/scylla:
  tests: Read after write in test_write_counter_table.
  sstables: Fix deleted counter cells processing in SSTables 3.x parser.
2018-09-18 12:20:54 +03:00
Vladimir Krivopalov
8c08ccbd3b tests: Enrich test_write_adjacent_range_tombstones_with_rows with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:06:24 -07:00
Vladimir Krivopalov
f0966a935e tests: Enrich test_write_many_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:06:10 -07:00
Vladimir Krivopalov
262874a90c tests: Enrich test_write_mixed_rows_and_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:05:56 -07:00
Vladimir Krivopalov
6fbf4d3589 tests: Enrich test_write_non_adjacent_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:05:42 -07:00
Vladimir Krivopalov
4bf9c87a1a tests: Enrich test_write_adjacent_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:05:26 -07:00
Vladimir Krivopalov
5b087daf91 tests: Enrich test_write_simple_range_tombstone with read after write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:04:57 -07:00
Vladimir Krivopalov
e63d960b8e tests: Enrich test_write_deleted_column with read after write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:04:25 -07:00
Eliran Sinvani
83628f5881 cql3: maintain correctness of multicolumn restriction on mixed order columns
When a query with multicolumn inequality is issued on clustering columns
having mixed order (ASC and DESC together), if the ranges are not
broken to none overlapping lexicographically monotonic ones, the node
return incorrect rows. This is due to the search nature
(prefix comparison). The solution is to break the range imposed
by the restriction into several single column restrictions OR-ed
together that will be logically equivalent and preserve the
monotonicity assumption. This commit also fixes incorrect results
returned by a multicolumn query on an all descending columns.

A unit test have been added to account for both issues fixed.

Fixes #2050
Tests: Unit test, manual tests of the use case in the issue.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <3b96620a3bd8b0614359a3b0757f324d45189dbb.1536478193.git.eliransin@scylladb.com>
2018-09-17 20:35:55 +03:00
Vladimir Krivopalov
e796fa2b02 tests: Read after write in test_write_counter_table.
This covers the case of deleted counter cells.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 10:11:48 -07:00
Vladimir Krivopalov
79ccce147c sstables: Fix deleted counter cells processing in SSTables 3.x parser.
Deleted counter cells should be processed the same way as regular
deleted cells.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 10:10:57 -07:00
Alexys Jacob
cd74dfebfb scripts: coding style fixes
scripts/create-relocatable-package.py:24:1: F401 'shutil' imported but unused
scripts/create-relocatable-package.py:24:1: F401 'tempfile' imported but unused
scripts/create-relocatable-package.py:24:16: E401 multiple imports on one line
scripts/create-relocatable-package.py:26:1: E302 expected 2 blank lines, found 1
scripts/create-relocatable-package.py:47:1: E305 expected 2 blank lines after class or function definition, found 1
scripts/create-relocatable-package.py:93:6: E225 missing whitespace around operator

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180917152520.5032-1-ultrabug@gentoo.org>
2018-09-17 18:40:23 +03:00
Alexys Jacob
c80d7b97cc scyllatop: more coding style fixes
tools/scyllatop/metric.py:2:1: F401 're' imported but unused
tools/scyllatop/metric.py:53:20: E221 multiple spaces before operator
tools/scyllatop/metric.py:69:20: E221 multiple spaces before operator

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180917153308.7240-1-ultrabug@gentoo.org>
2018-09-17 18:39:53 +03:00
Raphael S. Carvalho
5bc028f78b database: fix 2x increase in disk usage during cleanup compaction
Don't hold reference to sstables cleaned up, so that file descriptors
for their index and data files will be closed and consequently disk
space released.

Fixes #3735.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180914194047.26288-1-raphaelsc@scylladb.com>
2018-09-17 17:26:46 +03:00
Alexys Jacob
46d101c1f2 scyllatop: coding style fixes
tools/scyllatop/prometheus.py:3:1: F401 'sys' imported but unused
tools/scyllatop/prometheus.py:7:1: E302 expected 2 blank lines, found 1
tools/scyllatop/prometheus.py:12:5: E301 expected 1 blank line, found 0
tools/scyllatop/prometheus.py:17:1: W293 blank line contains whitespace
tools/scyllatop/prometheus.py:22:82: E225 missing whitespace around operator

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180914110847.1862-1-ultrabug@gentoo.org>
2018-09-17 15:45:43 +03:00
Botond Dénes
a84c26799d tests/mutation_reader_test: fix flaky restricted reader timeout test
The test in question is `restricted_reader_timeout`.

Use `eventually_true()` instead of `sleep()` to wait on the timeout
expiring, making the test more robust on overloaded machines.

Also fix graceful failing, another longstanding issue with this test.
The readers created for the test need different destruction logic
depending whether the test failed or succeeded. Previously this was
dealt with by using the logic that worked in case of success and using
asserts to abort when the test failed, thus avoiding developers
investigating the invalid memory accesses happening due to the wrong
destruction logic.
The solution is to use BOOST_CHECK() macro in the check that validates
whether timeout works as expected. This allows for execution to continue
even if the test failed, and thus allows for running the proper cleanup
code even when the test failed.

Fixes: #3719
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <911921dffc924f1b0a3e86408757467e9be2b65b.1537169933.git.bdenes@scylladb.com>
2018-09-17 09:40:45 +01:00
Nadav Har'El
0006e21c4d tests/view_complex_test: add missing timestamp
test_partial_delete_selected_column() does a long string of various
updates and deletes, each specifies a different timestamp. In one
of these updates, the timestamp was forgotten. This means that the
server picks the current time, a large number.

As the test is currently written, it doesn't matter which timestamp
was chosen, the test would still succeed (if timestamp >= 15, and it
must be since the timestamp is the time from the epoch).
But the intention was probably to use timestamp = 15, so let's make
this intention clear.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180905095552.11883-2-nyh@scylladb.com>
2018-09-17 00:38:55 +01:00
Nadav Har'El
2ae4ed151e tests/view_complex_test - add test passpoints
We recently saw a failure in test_partial_delete_selected_column() but
this is a very long test doing many operations and comparisons of their
results, and without BOOST_TEST_PASSPOINT() we can't know which of them
really failed.

So let's sprinkle BOOST_TEST_PASSPOINT() calls between the different parts
of test_partial_delete_selected_column(). If this test ever fails again,
we'll know where.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180905095552.11883-1-nyh@scylladb.com>
2018-09-17 00:38:55 +01:00
Jesse Haber-Kucharsky
9d27045c76 auth: Shorten random_device instance life-span
On Fedora 28, creating an instance of `std::random_device` opens a file
descriptor for `/dev/urandom` (observed via `strace`).

By declaring static thread-local instances of `std::random_device`,
these descriptors will be open (barring optimization by the compiler)
for the entire duration of the Scylla process's life.

However, the `std::random_device` instance is only necessary for
initializing the `RandomNumberEngine` for generating salts. With this
change, the file-descriptor is closed immediately after the engine is
initialized.

I considered generalizing this pattern of initialization into a
function, but with only two uses (and simple ones) I think this would
only obscure things.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Tests: unit (release)
Message-Id: <f1b985d99f66e5e64d714fd0f087e235b71557d2.1536697368.git.jhaberku@scylladb.com>
2018-09-12 12:14:21 +01:00
Botond Dénes
dfad223ea2 multishard_mutation_reader: shard_reader: don't do concurrent read-aheads
multishard_mutation_reader starts read-aheads on the
shards-to-be-read-soon. When doing this it didn't check whether the
respective shards had an ongoing read-ahead already. This lead to a
single shard executing multiple concurrent read-aheads. This is damaging
for multiple reasons:
    * Can lead to concurrent access of the remote reader's data members.
    * The `shard_reader` was designed around a single read-ahead and
    thus will synchronise foreground reads with only the last one.

The practical implications of this seen so far was that queries reading
a large number of rows (large enough to reliably trigger the
bug) would stop the read early, due the `combined_mutation_reader`'s
internal accounting being messed up by concurrent access.

Also add a unit test. Instead of coming up with a very specific, and
very contrived unit test, use the test-case that detected this bug in
the first place: count(*) on a table with lots of rows (>1000). This
unit-test should serve well for detecting any similar bugs in the
future.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ff1c49be64e2fb443f9aa8c5c8d235e682442248.1536746388.git.bdenes@scylladb.com>
2018-09-12 11:43:18 +01:00
Botond Dénes
6a07b8ae83 multishard_mutation_reader: update shard_reader's comment
The `adandoned` member was renamed to `stopped`. Update the comment
accordingly.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1d655785f28fe1e5fa041f2f49852f0ad88be53e.1536743950.git.bdenes@scylladb.com>
2018-09-12 11:32:08 +02:00
Botond Dénes
d9a2ffad84 mutation_partition: don't move tracing_state early
Currently the `trace_state` is moved into the `querier` object's
constructor when one has to be created. Since the trace_state is used
below this lines this had the effect that on the first page of the
query, when a querier object has to be created, tracing would not work
inside the `querier_cache` which received a move-from `trace_state` (a
nullptr effectively).
Change the move to a copy so the other half of the function doesn't use
a moved-from `trace_state`.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4987419781aa287141aa9dc8ce99c5068b564c84.1536739052.git.bdenes@scylladb.com>
2018-09-12 11:32:08 +02:00
Botond Dénes
49704755b0 combined_mutation_reader: propagate timeout in fill_buffer()
All user reads go through the combined reader. Not propagating the
timeout down from there means that the storage layer's timeout
functionality is effectively disabled. Spotted while reading the code.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7fc10eca1c231dd04ac433913d9e6a51b6b17139.1536657041.git.bdenes@scylladb.com>
2018-09-11 15:44:28 +02:00
Botond Dénes
99ab43a1cc flat_mutation_reader: add timeout parameter to operator()()
For consistency with fast_foward_to() and fill_buffer(), and for
correctness: operator()() calls fill_buffer() and thus should provide a
timeout for the storage layer.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <6e97552ac2372e5846c955d94400b5315dbd2a89.1536657041.git.bdenes@scylladb.com>
2018-09-11 15:44:12 +02:00
Tomasz Grabiec
eb321a0830 Merge "Enrich SSTables 3.x write tests with subsequent read" from Vladimir
As our support for reading SSTables 3.x rows is nearly complete, the
write tests can be extended to read data after write.
This patchset adds reading to a handful of write tests.

* https://github.com/argenet/scylla/tree/projects/sstables-30/enrich-write-tests/v6:
  tests: Factor out the helper building SSTables path for write tests.
  tests: Add validate_read() helper to use in SSTables 3.x write tests.
  tests: Preserve tmpdir in SSTables 3.x write tests upon comparison.
  tests: Read SSTables for write_static_row test after validating write.
  tests: Read SSTables for write_composite_partition_key test after
    validating write.
  tests: Read SSTables for write_composite_clustering_key test after
    validating write.
  tests: Read SSTables for write_wide_partitions test after validating
    write.
  tests: Read SSTables for write_ttled_column test after validating
    write.
  tests: Read SSTables for write_collection_wide_update test after
    validating write.
  tests: Read SSTables for write_collection_incremental_update test
    after validating write.
  tests: Read SSTables for write_missing_columns_large_set test after
    validating write.
  tests: Read SSTables for write_multiple_partitions test after
    validating write.
  tests: Read SSTables for write_multiple_rows test after validating
    write.
  tests: Read SSTables for write_different_types test after validating
    write.
  tests: Read SSTables for write_empty_clustering_values test after
    validating write.
  tests: Read SSTables for write_large_clustering_keys test after
    validating write.
  tests: Read SSTables for write_user_defined_type_table test after
    validating write.
  tests: Read SSTables for write_deleted_row test after validating
    write.
  sstables: Fix SSTables 3.x parsing: check use_row_ttl() for TTLed
    columns.
  tests: Read SSTables for write_ttled_row test after validating write.
  Read SSTables for write_compact_table test after validating write.
  tests: Read SSTables for tests of many partitions after validating
    write.
2018-09-11 15:42:43 +02:00
Duarte Nunes
3f0643f34f Merge 'Misc improvements to stateful range scans' from Botond
"
This series contains miscellaneous improvements to the stateful range
scans. These improvements are either things that I forgot to include in
the original series (tracing), was requested by other developers
(comments) or I discovered them while reading the code (lockup and
cleanup).
"

* 'multishard_mutation_query_fixes/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: add some tracing
  multishard_mutation_query: add comment to `read_context`
  multishard_mutation_query: always cleanup readers properly
  multishard_mutation_query: fix possible deadlock when creating a reader fails
2018-09-11 10:26:05 +01:00
Botond Dénes
7d71b42651 multishard_mutation_query: add some tracing
Add tracing for the following events:
1) Dismantling of the combined buffer.
2) Dismantling of the compaction state.
3) Cleaning up the readers.

(1) and (2) can possibly have adverse effects on the performance of the
query and hence it is important that details about the dismantled
fragments is exposed in the tracing data.
(3) is less critical but still good to know how much readers were
created by the read (in case they aren't saved). Since normally (in
strateful queries) this will always be 0 only trace this when it is
non-zero (and is interesting).
2018-09-11 08:18:16 +03:00
Botond Dénes
b41be7c8e5 multishard_mutation_query: add comment to read_context
Explain the purpose of the class and its intended usage and any gotchas
the reader/modifier of the code has to keep in mind.
2018-09-11 08:18:16 +03:00
Botond Dénes
b6e1a8f32d multishard_mutation_query: always cleanup readers properly
Currently the reader cleanup code, which ensures the readers and their
dependent objects are destroyed in the corect order and a single
smp::submit_to() message, are only run when the readers are attempted to
be saved. However proper cleanup is needed not only then, but also when
the query is not stateful. Rename the current `cleanup()` method to
`stop()`, make it public and call it from a `finally()` block after the
page is finalized to ensure readers are properly cleaned up at all
times.
Also make sure that failures in `stop()` are never propagated so that
a failure in the cleanup doesn't fail the read itself.
2018-09-11 08:18:16 +03:00
Vladimir Krivopalov
c4a4ef6e3c tests: Read SSTables for tests of many partitions after validating write.
This covers five tests, including three for compressed tables:
  - write_many_partitions_deflate
  - write_many_partitions_lz4
  - write_many_partitions_snappy
  - write_many_live_partitions
  - write_many_deleted_partitions

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
f1214bfceb Read SSTables for write_compact_table test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
a39638c0ba tests: Read SSTables for write_ttled_row test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
bcae761d72 sstables: Fix SSTables 3.x parsing: check use_row_ttl() for TTLed columns.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
9b55f06456 tests: Read SSTables for write_deleted_row test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
8869f1a591 tests: Read SSTables for write_user_defined_type_table test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
dae49358d8 tests: Read SSTables for write_large_clustering_keys test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
8c2bc4a16a tests: Read SSTables for write_empty_clustering_values test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
6f23446962 tests: Read SSTables for write_different_types test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
4865f2f5a3 tests: Read SSTables for write_multiple_rows test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
3594b887df tests: Read SSTables for write_multiple_partitions test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
eee775dab7 tests: Read SSTables for write_missing_columns_large_set test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
2d764da415 tests: Read SSTables for write_collection_incremental_update test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
88a3b05210 tests: Read SSTables for write_collection_wide_update test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
abdae2dd9e tests: Read SSTables for write_ttled_column test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
cdf148dc67 tests: Read SSTables for write_wide_partitions test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
5b1a4686eb tests: Read SSTables for write_composite_clustering_key test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
e908d07fe7 tests: Read SSTables for write_composite_partition_key test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
aa5dc16dbb tests: Read SSTables for write_static_row test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
42ab8ed3cd tests: Preserve tmpdir in SSTables 3.x write tests upon comparison.
It can be used to do other checks on written files, like reading them
back.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
bc16304e99 tests: Add validate_read() helper to use in SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
6cddd7500a tests: Factor out the helper building SSTables path for write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Botond Dénes
b3f1fe14e8 multishard_mutation_query: fix possible deadlock when creating a reader fails
Failing to create a reader (`do_make_remote_reader()`) can lead to a
deadlock if the reader is in any of the future_*_state states, as the
`then()` block is not executed and hence the promise of the first
future in the chain is not set. Avoid this by changing the `then()` to a
`then_wrapped()` and using `set_exception()` and `set_value()`
accordingly, such that the future is resolved on both the happy and
error path.
2018-09-10 16:41:13 +03:00
Avi Kivity
4553238653 messaging: fix unbounded allocation in TLS RPC server
The non-TLS RPC server has an rpc::resource_limits configuration that limits
its memory consumption, but the TLS server does not. That means a many-node
TLS configuration can OOM if all nodes gang up on a single replica.

Fix by passing the limits to the TLS server too.

Fixes #3757.
Message-Id: <20180907192607.19802-1-avi@scylladb.com>
2018-09-10 12:11:16 +01:00
Gleb Natapov
9e438933a2 mutation_query_test: add test for result size calculation
Check that digest only and digest+data query calculate result size to be
the same.

Message-Id: <20180906153800.GK2326@scylladb.com>
2018-09-06 20:54:57 +03:00
Gleb Natapov
d7674288a9 mutation_partition: accurately account for result size in digest only queries
When measuring_output_stream is used to calculate result's element size
it incorrectly takes into account not only serialized element size, but
a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells
constructors puts in the beginning. Fix it by taking starting point in a
stream before element serialization and subtracting it afterwords.

Fixes #3755

Message-Id: <20180906153609.GJ2326@scylladb.com>
2018-09-06 20:52:44 +03:00
Takuya ASADA
2136479012 dist/debian: delete mounts.conf on scylla-server.postrm
Since we added mounts.conf on 687372bc48,
we need to delete the file on uninstall the package.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180905204631.9265-1-syuu@scylladb.com>
2018-09-06 16:50:14 +03:00
Gleb Natapov
98092353df mutation_partition: correctly measure static row size when doing digest calculation
The code uses incorrect output stream in case only digest is requested
and thus getting incorrect data size. Failing to correctly account
for static row size while calculating digest may cause digest mismatch
between digest and data query.

Fixes #3753.

Message-Id: <20180905131219.GD2326@scylladb.com>
2018-09-06 13:09:41 +03:00
Takuya ASADA
ab361e9897 dist/redhat: add mounts.conf to ghost file
Since we added mounts.conf on 687372bc48,
we need to delete the file on uninstall the package.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180905191037.1570-1-syuu@scylladb.com>
2018-09-05 22:14:48 +03:00
Jesse Haber-Kucharsky
682805b22c auth: Use finite time-out for all QUORUM reads
Commit e664f9b0c6 transitioned internal
CQL queries in the auth. sub-system to be executed with finite time-outs
instead of infinite ones.

It should have also modified the functions in `auth/roles-metadata.cc`
to have finite time-outs.

This change fixes some previously failing dtests, particularly around
repair. Without this change, the QUORUM query fails to terminate when
the necessary consistency level cannot be achieved.

Fixes #3736.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>
2018-09-05 21:55:26 +03:00
Tomasz Grabiec
82270c8699 storage_proxy: Fix misqualification of reads as foreground or background in some cases
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.

Usually, the gap in which this happens is short, because executor
reference holders timeout quickly as well. It's not always the case
though. For instance, local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.

Fixes #3734.

Another problem is that all reads which received CL responses are
accounted as background, until all replicas respond, but if such read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.

Fixes #3745.

This patch fixes both issues by rearranging accounting to track
foreground reads instead of background reads, and considering all
reads as foreground until the resulting promise is resolved.

Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
2018-09-05 20:42:51 +03:00
Avi Kivity
c168805ca6 Merge "Filtering and fast-forwarding of range tombstones in SSTables 3.x" from Vladimir
"
This patchset adds proper support for sliced reads of partitions
containing range tombstones.

Given the SSTables 3.x repesentation of range tombstones by separate
start and end markers, we refer to the index for the information about
the currently opened range tombstone, if any, when skipping to the next
promoted index block.

Note that for this we have to take the promoted index block immediately
preceding the one we are jumping to.

Tests: unit {release}
"

* 'projects/sstables-30/range-tombstones-slicing/v3' of https://github.com/argenet/scylla:
  tests: Test filtering and forwarding on a partition with interleaved rows and RTs.
  tests: Add tests for reading wide partitions with range tombstones.
  sstables: Support slicing for range tombstones.
  sstables: Set/reset range tombstone start from end open marker.
  sstables: Fix end_open_marker population in promoted index blocks.
  sstables: Add need_skip() helper to data_consume_context.
  sstables: For end_open_marker, return both position in partition and deletion time.
2018-09-05 20:38:39 +03:00
Vladimir Krivopalov
3d13ee3909 tests: Test filtering and forwarding on a partition with interleaved rows and RTs.
In this test, rows lie inside range tombstones so we split them on
reading.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
d39e58a97a tests: Add tests for reading wide partitions with range tombstones.
Test the case where rows lie outside range tombstones.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
ec2047e1e6 sstables: Support slicing for range tombstones.
Both filtering on queried ranges and fast-forwarding are supported.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
d57380f44c sstables: Set/reset range tombstone start from end open marker.
When we skip through a wide partition using promoted index, we may land
to a position that lies in the middle of a range tombstone so we need to
be aware of it. For this, we check if the previous promoted block has an
end open marker and either set the range tombstone start using it or
reset if missing.

Note several things about the implementation.

Firstly, we have to peek back at the previous promoted index block for the
end open marker, and so we have to always preserve one more promoted
index block when we read the next batch so that we can stil access it.

Secondly, we use the previous promoted block end position to build
position in partition for the range tombstone start.

Lastly, we don't have a notion of end open marker in older consumers
that work with SSTables of ka/la formats so we only call the
corresponding methods if the consumer supports them.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
939e4893ef sstables: Fix end_open_marker population in promoted index blocks.
We should not access the internal object stored in std::optional when
passing the end_open_marker, moreover that it can be disengaged.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
84bff86fbc sstables: Add need_skip() helper to data_consume_context.
This methods tells whether we will need to skip to reach the input
position or not.
It can be used for skipping with index when reading SSTables 3.x because
we only want to to set/reset the open range tombstone bound when we
actually move to another promoted index block.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Tomasz Grabiec
cd201d1987 db/batchlog_manager: Do not return a value from timer callback
Timer callbacks are std::function<void()>.

Exposed by changing callback_t to noncopyable_function<>.

Message-Id: <1536138045-29209-1-git-send-email-tgrabiec@scylladb.com>
2018-09-05 12:32:21 +03:00
Asias He
89b769a073 storage_service: Wait for range setup before announcing join status
When a joining node announcing join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of other nodes
that causes the error like below:

   token_metadata - sorted_tokens is empty in first_token_index!
   storage_proxy - Failed to apply mutation from 127.0.4.1#0:
   std::runtime_error (sorted_tokens is empty in first_token_index!)

To fix, wait for the token range setup before announcing the join
status.

Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test

Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
2018-09-05 10:51:43 +03:00
Vlad Zolotarov
dae70e1166 tests: loading_cache_test: configure a validity timeout in test_loading_cache_loading_different_keys to a greater value
Change the validity timeout from 1s to 1h in order to avoid false alarms
on busy systems: for a short value there is a chance that
(loading_cache.size() == num_loaders) check is going to run after some elements
of the cache have already been evicted.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180904193026.7304-1-vladz@scylladb.com>
2018-09-05 10:19:59 +03:00
Vladimir Krivopalov
ac0c71bdc1 sstables: For end_open_marker, return both position in partition and deletion time.
Prior to this fix, the end_open_marker has been only accessible as a
plain deletion_time structure. Now it also contains the start position
of a promoted index block so that it can be used for setting range
tombstone open bound.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-04 18:16:21 -07:00
Piotr Sarna
f494d03c3f tests: add test case for filtering with DESC clustering order
Refs #3741

Message-Id: <1b8eab8d668eb000b306686c15324e6acde8e616.1535981852.git.sarna@scylladb.com>
2018-09-04 16:05:19 +03:00
Piotr Sarna
8e52b66516 cql3: fix filtering with descending clustering order
When slice::is_satisfied_by() restriction check is performed
on raw data represented as bytes, it should always use a regular
type comparator, not a reversed one. Reversed types are used to
preserve descending clustering order, but comparison with constants
should be used with a regular underlying type comparator (for x < 1
to actually mean 'lesser than 1' instead of 'bigger than 1, because
the clustering order is reversed').

Fixes #3741

Message-Id: <3e25fc66688c9253287f2c4f31ede8339b9bbe23.1535981852.git.sarna@scylladb.com>
2018-09-04 16:05:15 +03:00
Piotr Sarna
5b5c9f2707 cql3: fix a 'pratition_key' typo
partition_key got misspelled with 'pratition_key' typo in the original
series.

Message-Id: <de59fe6161df5442b19d8ba4336e2f828b7ede32.1535981852.git.sarna@scylladb.com>
2018-09-04 16:05:09 +03:00
Takuya ASADA
bd8a5664b8 dist/common/scripts/scylla_raid_setup: create scylla-server.service.d when it doesn't exist
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
need to create it.

Fixes #3738

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
2018-09-04 10:12:32 +03:00
Tomasz Grabiec
4fb3f7e8eb managed_vector: Make external_memory_usage() ignore reserved space
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.

It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.

Fixes #3625 (hopefully).

Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
2018-09-03 17:09:54 +03:00
Takuya ASADA
d78762d627 dist/debian: fix broken debian/changelog
It also need $MUSTACHE_DIST.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180903094558.3862-1-syuu@scylladb.com>
2018-09-03 14:04:01 +03:00
Duarte Nunes
e49a14e308 Merge 'Stateful range scans' from Botond
"
This series extends the query statefullness, introduced by f8613a841 to
point queries, to range scans as well. This means that queriers will be
saved and reused for range scans too.
This series builds heavily on the infrastructure introduced by stateful
point queries, namely the querier object and the querier_cache. It also
builds on another critical piece of infrastructure, the
multishard_combining_reader, introduced by 2d126a79b.
To make the range scan on a given node suspendable and resumable we move
away from the current code in
`storage_proxy::query_nonsingular_mutations_locally()` and use a
multishard_combining_reader to execute the read. When the page is filled
this reader is dismantled and its shard readers are saved in the
querier cache.
There are of course a lot more details to it but this is the gist of it.

Tests: unit(release, debug), dtest(paging_test.py, paging_additional_test.py)
"

* '1865/range-scans/v7.1' of https://github.com/denesb/scylla: (33 commits)
  query_pagers: generate query_uuid for range-scans as well
  storage_proxy: use preferred/last replicas
  storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent
  db::consistency_level::filter_for_query() add preferred_endpoints
  storage_proxy: use query_mutations_from_all_shards() for range scans
  tests: add unit test for multishard_mutation_query()
  tests/mutation_assertions.hh: add missing include
  multishard_mutation_query: add badness counters
  database: add query_mutations_on_all_shards()
  mutation_compactor: add detach_state()
  flat_mutation_reader: add unpop_mutation_fragment()
  Move reconcilable_result_builder declaration to mutation_query.hh
  mutation_source_test: add an additional REQUIRE()
  mutation: add missing assert to mutation from reader
  querier: add shard_mutation_querier
  querier: prepare for multi-ranges
  tests/querier_cache: add tests specific for multiple entry-types
  querier: split querier into separate data and mutation querier types
  querier: move consume_page logic into a free function
  querier: move all matching related logic into free functions
  ...
2018-09-03 09:09:17 +01:00
Botond Dénes
cd49c23a66 query_pagers: generate query_uuid for range-scans as well
And thus enable stateful range scans.
2018-09-03 10:31:44 +03:00
Botond Dénes
6486d6c8bd storage_proxy: use preferred/last replicas 2018-09-03 10:31:44 +03:00
Botond Dénes
577a06ce1b storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent 2018-09-03 10:31:44 +03:00
Botond Dénes
6e59cee244 db::consistency_level::filter_for_query() add preferred_endpoints
To the second overload (the one without read-repair related params) too.
2018-09-03 10:31:44 +03:00
Botond Dénes
2f66bde26f storage_proxy: use query_mutations_from_all_shards() for range scans 2018-09-03 10:31:44 +03:00
Botond Dénes
6779b63dfe tests: add unit test for multishard_mutation_query() 2018-09-03 10:31:44 +03:00
Botond Dénes
c678b665b4 tests/mutation_assertions.hh: add missing include 2018-09-03 10:31:44 +03:00
Botond Dénes
253407bdc8 multishard_mutation_query: add badness counters
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves

The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its and any of its
intermediate reader's buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. This together can
amount to a substantial amount of fragments.
(1) counts the amount of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.

The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
2018-09-03 10:31:44 +03:00
Botond Dénes
97364c7ad9 database: add query_mutations_on_all_shards()
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
2018-09-03 10:31:44 +03:00
Botond Dénes
33d72efa49 mutation_compactor: add detach_state()
Allow the state of the compaction to be detached. The detached state is
a set of mutation fragments, which if replayed through a new compactor
object will result in the latter being in the same state as the previous
one was.
This allows for storing the compaction state in the compacted reader by
using `unpop_mutation_fragment()` to push back the fragments that
comprise the detached state into the reader. This way, if a new
compaction object is created it can just consume the reader and continue
where the previous compaction left off.
2018-09-03 10:31:44 +03:00
Botond Dénes
48054ed810 flat_mutation_reader: add unpop_mutation_fragment()
This is the inverse of `pop_mutation_fragment()`. Allow fragments to be
pushed back into the buffer of the reader to undo a previous consumtion
of the fragments.
2018-09-03 10:31:44 +03:00
Botond Dénes
3bcd577907 Move reconcilable_result_builder declaration to mutation_query.hh
It will be used by code outside of mutation_partition.cc so it needs to
be public. The definition remains in mutation_partition.cc.
2018-09-03 10:31:44 +03:00
Botond Dénes
b8b34223a4 mutation_source_test: add an additional REQUIRE()
test_streamed_mutation_forwarding_is_consistent_with_slicing already has
a REQUIRE() for the mutation read with the slicing reader. Add another
one for the forwarding reader. This makes it more consistent and also
helps finding problems with either the forwarding or slicing reader.
2018-09-03 10:31:44 +03:00
Botond Dénes
d347866664 mutation: add missing assert to mutation from reader
read_mutation_from_flat_mutation_reader's internal adapter can build a
single mutation only and hence can consume only a single partition.
If more than one partitions are pushed down from the producer the
adaptor will very likely crash. To avoid unnecessary investigations add
an assert() to fail early and make it clear what the real problem is.
All other consume_ methods have an assert() already for their
invariants so this is just following suit.
2018-09-03 10:31:44 +03:00
Botond Dénes
ecb1e79bcc querier: add shard_mutation_querier
The querier to be used for saving shard readers belonging to a
multishard range scan. This querier doesn't provide a `consume_page`
method as it doesn't support reading from it directly. It is more
of a storage to allow caching the reader and any objects it depends on.
2018-09-03 10:31:44 +03:00
Botond Dénes
07cdf766c5 querier: prepare for multi-ranges
In the next patch a querier will be added that reads multiple ranges as
opposed to a single range that data and mutation queriers read.
To keep `querier_cache` code seamless regarding this difference change all
range-matching logic to work in terms of `dht::partition_ranges_view`.
This allows for cheap and seamless way of having a single code-base for
the insert/lookup logic. Code actually matching ranges is updated to be
able to handle both singular and multi-ranges while maintaining backward
compatibility.
2018-09-03 10:31:44 +03:00
Botond Dénes
88a7effd8d tests/querier_cache: add tests specific for multiple entry-types 2018-09-03 10:31:44 +03:00
Botond Dénes
c12008b8cb querier: split querier into separate data and mutation querier types
Instead of hiding what compaction method the querier uses (and only
expose it via rejecting 'can_be_used_for_page()`) make it very explicit
that these are really two different queriers. This allows using
different indexes for the two queriers in `querier_cache` and
eliminating the possibility of picking up a querier with the wrong
compaction method (read kind).
This also makes it possible to add new querier type(s) that suit the
multishard-query's needs without making a confusing mess of `querier` by
making it a union of all querying logic.

Splitting the queriers this way changes what happens when a lookup finds
a querier of the wrong kind (e.g. emit_only_live::yes for an
emit_only_live::no command). As opposed to dropping the found (but
wrong) querier the querier will now simply not be found by the lookup.
This is a result of using separate search indexes for the different
mutation kinds. This change should have no practical implications.

Splitting is done by making querier templated on `emit_only_live_rows`.
It doesn't make sense to duplicate the entire querier as the two share
99% of the code.
2018-09-03 10:31:44 +03:00
Botond Dénes
e46251ebf6 querier: move consume_page logic into a free function
In preparation of the now single querier being split into multiple more
specialized ones. Make it possible for the multiple queriers sharing the
same implementation. Also, the code can now be reused by outside code as
well, not just queriers.
2018-09-03 10:31:44 +03:00
Botond Dénes
c53f17ddb8 querier: move all matching related logic into free functions
So that they can be used for multiple querier classes easily, without
inheritance. The functions are not visible from the header.
Also update the comments on `querier` to w.r.t. the disappeared
checking functions. Change the language to be more general. In practice
these checks are never done by client code, instead they are done by the
`querier_cache`.
2018-09-03 10:31:44 +03:00
Botond Dénes
43f464c52d querier: inline querier::current_position() and make it public 2018-09-03 10:31:44 +03:00
Botond Dénes
86a61ded7d querier: s/position/position_view/
Also treat it as a view, that is take it by value in functions,
instead of reference.
2018-09-03 10:31:44 +03:00
Botond Dénes
6e4ec53679 querier: move position outside of querier
In preparation for having multiple querier types that can share code
without inheritance.
2018-09-03 10:31:44 +03:00
Botond Dénes
a172dfec4e querier: move clustering_position_tracker outside of querier
In preparation for having multiple querier types that can share code
without inheritance.
2018-09-03 10:31:44 +03:00
Botond Dénes
7bd955e993 querier_cache: move insert/lookup related logic into free functions
In preparations for introducing support multiple entry types in the
querier_cache move all insert/lookup related logic into free functions.
Later these functions will be templated so they can handle multiple
entry types with the same code.
2018-09-03 10:31:44 +03:00
Botond Dénes
cded477b94 querier: return std::optional<querier> instead of using create_fun()
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successfull or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
2018-09-03 10:31:44 +03:00
Botond Dénes
5f726e9a89 querier: move all to query namespace
To avoid name clashes.
2018-09-03 10:31:44 +03:00
Botond Dénes
867f69b9d1 dht::i_partitioner: add partition_ranges_view 2018-09-03 10:31:44 +03:00
Botond Dénes
a011a9ebf2 mutation_reader: multishard_combining_reader: support custom dismantler
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it finished all in-progress
read-aheads and/or pending next partition calls.

The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
2018-09-03 10:31:44 +03:00
Botond Dénes
f13b878a94 mutation_reader: pass all standard reader params to remote_reader_factory
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameters that is
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
2018-09-03 10:31:44 +03:00
Botond Dénes
e67c6d9f39 flat_mutation_reader::impl: add protected buffer() member
To allow implementations to access the buffer in a read-only way.
2018-09-03 10:31:44 +03:00
Botond Dénes
8915293257 multishard_combining_reader: fix incorrect comment 2018-09-03 10:31:44 +03:00
Botond Dénes
75d60b0627 docs: add paged-queries.md design doc 2018-09-03 10:31:44 +03:00
Duarte Nunes
6593226849 Merge branch 'loading_cache: fix a consistency of size() and iterators APIs' from Vlad
"
After we fixed reloading flow it enabled situations when items are no longer cached but
still held in the underlying loading_shared_values object. Since loading_cache::size() returns
the size of its loading_shared_values object and loading_cache::begin()/end()/find() are returning
iterators based on loading_shared_values iterators these APIs may return very weird values, e.g.
size() may return the same value after one of the items have been removed using remove(key) API.

This series fixes this by switching mentioned above APIs to work on top of lru_list object instead
of loading_shared_values.
"

* 'loading_cache_fix_api_semantics-v1' of https://github.com/vladzcloudius/scylla:
  loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
  loading_cache: make size() return the size of lru_list instead of loading_shared_values
2018-09-01 11:05:28 +01:00
Avi Kivity
fd8eae50db build: add relocatable package target
A relocatable package contains the Scylla (and iotune)
executables (in a bin/ directory), any libraries they may need (lib/)
the configuration file defaults (conf/) and supporting scripts (dist/).
The libraries are picked up from the host; including libc and the dynamic
linker (ld.so).

We also provide a thunk script that forces the library path
(LD_LIBRARY_PATH) to point at our libraries, and overrides the
interpreter to point at our ld.so.

With these files, it is possible to run a fully functional Scylla
instance on any Linux distribution. This is similar to chroot or
containers, except that we run in the same namespace as the host.

The packages are created by running

    ninja build/release/scylla-package.tar

or

    ninja --mode debug build/debug/scylla-package.tar
Message-Id: <20180828065352.30730-1-avi@scylladb.com>
2018-08-31 23:14:42 +01:00
Vlad Zolotarov
945d26e4ee loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.

This may create weird situations like this:

<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
    std::out << e << std::endl;
}

<all 10 entries are printed, including the one for "key1">

In order to avoid such situations we are going to make the loading_cache::iterator
to be a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator
because lru_list contains entries only for cached items.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 20:56:44 -04:00
Vlad Zolotarov
1e56c7dd58 loading_cache: make size() return the size of lru_list instead of loading_shared_values
reloading flow may hold the items in the underlying loading_shared_values
after they have been removed (e.g. via remove(key) API) thereby loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size() on the other hand - does.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 15:55:30 -04:00
Paweł Dziepak
dbbd664600 Update seastar submodule
* seastar 12f18ce...5712816 (6):
  > tests: add signal_test to test list
  > Merge "Enhancements for memory_output_stream" from Paweł
  > seastar-addr2line: don't print an empty line between backtrace lines
  > seastar-addr2line: add --verbose option
  > seastar-addr2line: make prefix matching non-greedy
  > future: make available() const
2018-08-30 11:41:27 +01:00
Glauber Costa
8dea1b3c61 database: fix directory for information when loading new SSTables from upload dir
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we will try to
load SSTables that are in the upload directory when we already moved
them out and fail.

Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>
2018-08-30 10:34:25 +03:00
Nadav Har'El
2f02d006b3 materialized views: more tests
Additional tests for cases surrounding issue #3362, where base rows
disappear (or not) and view rows need to disappear (or not) as well.
These new tests focus on checking that view_updates::do_delete_old_entry()
is correct.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-2-nyh@scylladb.com>
2018-08-29 14:33:48 +01:00
Nadav Har'El
16a6f76873 materialized views: simplify do_delete_old_entry()
In previous patches, we gave up on an old (and broken) attempt to track
the timestamps of many unselected base-table columns through one row marker
in the view table - and replaced them by "virtual cells", one per unselected
cell.

The do_delete_old_entry() function still contains old code which maintained
that row marker, and is no longer needed. That old code is no only no longer
needed, it also no longer did anything because all columns now appear in
the view (as virtual columns) so the code ignored them when calculating the
row marker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-1-nyh@scylladb.com>
2018-08-29 14:33:41 +01:00
Duarte Nunes
79d796e710 Merge 'Materialized Views: row liveness correction' from Nadav
"
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness - existance or disappearance -
of a view-table row is tied to the liveness of the base table row. And
that, in turn, depends not only on selected columns (base-table columns
SELECTed to also appear in the view) but also on unselected columns.

This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch set we tried to build a single "row marker" in the view column which
tried to summarize the liveness information in all unselected columns.
But this proved unworkable, as explained in issue #3362 and as will be
demonstrated in unit tests at the end of this series.

Because we can't replace several unselected cells by one row marker, what
we do in this series is to add for each for the unselected cells a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.

Fixes #3362
"

* 'virtual-cols-v3' of https://github.com/nyh/scylla:
  Materialized Views: test that virtual columns are not visible
  Materialized Views: unit test reproducing fixed issue #3362
  Materialized Views: no need for elaborate row marker calculations
  Materialized Views: add unselected columns as virtual columns
  Materialized Views: fill virtual columns
  Do not allow selecting a virtual column
  schema: persist "view virtual" columns to a separate system table
  schema: add "view virtual" flag to schema's column_definition
  Add "empty" type name to CQL parser, but only for internal parsing
2018-08-29 14:32:38 +01:00
Paweł Dziepak
6f1c3e6945 Merge "Convert more execution_stages to inherit scheduling_groups" from Avi
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.

This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"

* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
  database: make database's mutation apply stage inherit its scheduling group from the caller
  database: make database::_mutation_query_stage inherit the scheduling group
  database: make database::_data_query_stage inheriting its caller's scheduling_group
  storage_proxy: make _mutate_stage inherit its caller's scheduling_group
2018-08-28 13:49:31 +01:00
Duarte Nunes
f6aadd8077 Merge 'utils::loading_cache: improve reload() robustness' from Vlad
"This series introduces a few improvements related to a reload flow.

From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.

It may also safely evict as many items from the cache as needed."

Fixes #3606

* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: hold a shared_value_ptr to the value when we reload
  utils::loading_cache::on_timer(): remove not needed capture of "this"
  utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
2018-08-28 10:52:20 +01:00
Piotr Sarna
aa2bfc0a71 tests: add multi-column pk test to INSERT JSON case
Refs #3687
Message-Id: <6ba1328549ed701691ca7cbdacc7d6fa72f2c3de.1534171422.git.sarna@scylladb.com>
2018-08-28 11:34:13 +03:00
Piotr Sarna
fa72422baa cql3: fix handling multi-column partition key in INSERT JSON
Multiple column partition keys were previously handled incorrectly,
now the implementation is based on from_exploded instead of
from_singular.

Fixes #3687
Message-Id: <09e0bdb0f1c18d49b9e67c21777d93ba1545a13c.1534171422.git.sarna@scylladb.com>
2018-08-28 11:34:11 +03:00
Avi Kivity
1fd9974b6b Merge "tests/loading_cache_test: Fix flakiness" from Duarte
"
Fix loading_cache_test flakiness by retrying assertions.

Tests: unit(loading_cache_test(debug, release))

Fixes #3723
"

* 'loading-cache-test-flake/v4' of https://github.com/duarten/scylla:
  tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
  tests/loading_cache_test: Use eventually() instead of open-coding it
  tests/mutation_reader_test: Extract eventually_true() to eventually.hh
  tests/cql_test_env: Lift eventually() to its own header file
2018-08-28 09:35:09 +03:00
Takuya ASADA
4a5157857a dist/debian: support package renaming on build script
To automatically rename packages on enterprise release, added package name
prefix as a variable on build_deb.sh.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180828010445.11920-1-syuu@scylladb.com>
2018-08-28 09:25:07 +03:00
Avi Kivity
22396d57c2 Update seastar submodule
* seastar 9bb1611...12f18ce (17):
  > correctly configure I/O Scheduler for usage with the YAML file
  > Added support for user-defined signal handlers
  > Added reactor method to modify blocked_reactor_notify_ms
  > configure.py: Use the user-specified compiler for dialect detection
  > seastar-addr2line: clear current trace when omitting already seen trace
  > seastar-addr2line: fix redirecting output to a file
  > seastar-addr2line: don't require a space before the addresses
  > tests: Ensure test thread is always joined
  > README.md: Add cute badges
  > iotune: adjust num-io-queues recommendation
  > dns: add SRV record lookup
  > reactor: define max_aio_per_queue for C++14
  > reactor,alien: silence GCC warnings
  > core,json,net: silence GCC warnings
  > fstream: "using data_sink_impl::put" to silence gcc warning
  > Merge 'Ensure Seastar compiles in C++14 mode' from Jesse
  > Revert "foreign_ptr: allow waiting for the destruction of the managed ptr"
2018-08-28 09:10:14 +03:00
Tomasz Grabiec
75cde85349 Merge "Support reading range tombstones" from Piotr and Vladimir
Implement and test support for reading range tombstones in SSTables 3.

Does not yet support reads which are using slicing or fast forwarding.

From github.com/scylladb/seastar-dev.git haaawk/sstables3/tombstones_v11:

Piotr Jastrzebski (5):
  sstables: Add consumer_m::consume_range_tombstone
  sstables: Support null columns in ck
  sstables: Support reading range_tombstones
  sstables: Test reading range_tombstones
  sstables: Add test for RT with non-full key

Vladimir Krivopalov (2):
  sstables: Add operator<< overload for bound_kind_m.
  keys: Add clustering_key_prefix::make_full helper.
2018-08-27 20:43:38 +02:00
Duarte Nunes
40044c0460 tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
The `loading_cache_test::test_loading_cache_loading_reloading` test
case is flaky, and fails in both debug and release mode. In an
over-provisioned environment, it's possible that when the reactor
runs, the timers for the `sleep()` and for reloading the
`loading_cache` are both expired, and continuations are scheduled with
an arbitrary order, causing the test to fail.

Fixes #3723

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:05 +01:00
Duarte Nunes
0cb03b966d tests/loading_cache_test: Use eventually() instead of open-coding it
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:05 +01:00
Duarte Nunes
b89fa0d67b tests/mutation_reader_test: Extract eventually_true() to eventually.hh
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:05 +01:00
Duarte Nunes
636c5ded6c tests/cql_test_env: Lift eventually() to its own header file
Retrying is needed everywhere.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:00 +01:00
Avi Kivity
5792a59c96 migration_manager: downgrade frightening "Can't send migration request" ERROR
This error is transient, since as soon as the node is up we will be able
to send the migration request.  Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).

The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.

Fixes #3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>
2018-08-27 14:49:36 +02:00
Takuya ASADA
10b67c7934 dist/ami: package scylla-ami as rpm
Now scylla-ami is not submodule of scylla repo, it will works as
independent repository just like scylla-jmx and scylla-tools, provides
.rpm package to install AMI scripts on AMI.

Most files are gone from dist/ami/files, but scylla_install_ami copied
from scylla-ami, since it requires to install scylla .rpms, cannot
pacakge in scylla-ami rpm.

On scylla_install_ami, we dropped ixgbevf/ena drivers code, we will
provide 'scylla-ixgbevf' and 'scylla-ena' DKMS .rpm instead.
It will automatically build kernel modules for current kernel.

A repo of the driver packages is on
https://copr.fedorainfracloud.org/coprs/scylladb/scylla-ami-drivers/

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180821201101.4631-1-syuu@scylladb.com>
2018-08-27 11:48:52 +03:00
Avi Kivity
62750eb517 Merge "Prepare for removing Iterator from simple_memory_input_stream" from Paweł
"
Right now, simple_memory_input_stream takes Iterator as a template
parameter. That iterator is supposed to point to fragments in a
underlying fragmented buffer. This makes no sense, since simple streams
deal only with contiguous buffer.

This series removes any assumption that simple_memory_input_stream has
iterator_type member from Scylla so that it can be removed.
"

* tag 'prepare-simple-stream-no-iterator/v1' of https://github.com/pdziepak/scylla:
  idl: deserialized_bytes_proxy do not assume presence of iterator_type
  idl-compiler: specify return type of with_serialized_stream() lambdas
2018-08-26 16:29:06 +03:00
Avi Kivity
16478355be Merge "Refactor password handling" from Jesse
"
This series is a refactor of password management, motivated by a
combination of correctness bugs, improving testability, improving
clarity, and adding documentation.

Tests: unit (release)
"

* 'jhk/passwords_refactor/v2' of https://github.com/hakuch/scylla:
  auth: Clean up implementation comments
  auth: Remove unnecessary local variable
  auth: Allow different random engines for salt
  auth: Correct modulo bias in salt generation
  auth: Extract random byte generation for salt
  auth: Split out test for best supported scheme
  auth: Rename function to use full words
  auth: Add domain-specific exception for passwords
  auth: Document passwords interface
  auth: Move passsword stuff to its own namespace
  auth: Identify password hashing errors correctly
  auth: Add unit tests for password handling
  auth: Move password handling to its own files
  auth: Construct `std::random_device` instances once
2018-08-26 11:18:31 +03:00
Tomasz Grabiec
2afce13967 database: Avoid OOM when soft pressure but nothing to flush
There could be soft pressure, but soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block on earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.

flush will take longer to complete. Due to scheduling group isolation,
the soft-pressure flusher will keep getting the CPU.

This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test

Fixes #3717

Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:03:58 +03:00
Tomasz Grabiec
1e50f85288 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:34 +03:00
Tomasz Grabiec
364418b5c5 logalloc: Make evictable_occupancy() indicate no free space
Doesn't fix any bug, but it's closer to the truth that all segments
are used rather than none is used.

Message-Id: <1535040132-11153-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:32 +03:00
Avi Kivity
54ac334f4b Update scylla-ami submodule
* dist/ami/files/scylla-ami c7e5a70...b7db861 (2):
  > scylla-ami-setup.service: run only on first startup
  > Use fstab to mount RAID volume on every reboot
2018-08-26 10:57:32 +03:00
Takuya ASADA
ff55e3c247 dist/common/scripts/scylla_raid_setup: refuse start scylla-server.service when RAID volume is not mounted
Since the Linux system abort booting when it fails to mount fstab entries,
user may not able to see an error message when we use fstab to mount
/var/lib/scylla on AMI.

Instead of abort booting, we can just abort to start scylla-server.service
when RAID volume is not mounted, using RequiresMountsFor directive of systemd
unit file.

See #3640

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180824185511.17557-1-syuu@scylladb.com>
2018-08-26 10:55:34 +03:00
Avi Kivity
37f9a3c566 database: make database's mutation apply stage inherit its scheduling group from the caller
Like the two preceeding patches, convert the mutation apply stage
to an inheriting_concrete_scheduling_group.  This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
2018-08-24 19:04:49 +03:00
Avi Kivity
ebff1cfc37 database: make database::_mutation_query_stage inherit the scheduling group
Like the preceeding patch and for the same reasons, adjust
database::_mutation_query_stage to inherit the scheduling group from its
caller.
2018-08-24 19:04:49 +03:00
Avi Kivity
596fb6f2f7 database: make database::_data_query_stage inheriting its caller's scheduling_group
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller.  By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
2018-08-24 19:04:49 +03:00
Avi Kivity
908e497f3d storage_proxy: make _mutate_stage inherit its caller's scheduling_group
Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which is correct.

Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statment scheduling groups, later.
2018-08-24 19:04:49 +03:00
Paweł Dziepak
4ca991ea65 idl: deserialized_bytes_proxy do not assume presence of iterator_type
deserialized_bytes_proxy assumes that the provided input stream has
iterator_type that represents the iterator pointing to the next
fragment of the fragmented underlying buffyer. This makes little sense
if the input stream is a contiguous one (i.e.
simple_memory_input_stream) so let's not make such assumptions.
2018-08-24 16:19:40 +01:00
Paweł Dziepak
3b7579aa0e idl-compiler: specify return type of with_serialized_stream() lambdas
IDL-generated code uses with_serialized_stream() to optimise for cases
when the underlying buffer is not fragmented. The provided lambda will
be called with wither simple or fragmented stream as an argument. The
consequence of this is that both instantations of generic lambda need to
return the same type. This is a problem if the type is deduced and
depends on the provided input stream (e.g. different type for fragmented
and simple streams). The solution is to explictly specify the return
type as the type returned by deserialising general utils::input_stream.
This way each instantation of lambda can return whatever it wants as
long as it is convertible to the type that the serialiser would return
if utils::input_stream was given.
2018-08-24 16:07:20 +01:00
Tomasz Grabiec
10f6b125c8 database: Run system table flushes in the main scheduling group
memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.

It can happen that regular is not under pressure, but system is. In
this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.

I observed writes to the sytem keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.

Fixes #3717.

Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
2018-08-23 15:07:05 +03:00
Piotr Sarna
94262cf5d0 tests: add null collection test scenario to INSERT JSON
Refs #3664
Message-Id: <a34b9f5e8b9d7e3dd8906b559957220d74734b41.1534848313.git.sarna@scylladb.com>
2018-08-23 11:22:07 +03:00
Piotr Sarna
465045368f cql3: add proper setting of empty collections in INSERT JSON
Previously empty collections where incorrectly added as dead cells,
which resulted in serialization errors later.

Fixes #3664
Message-Id: <a9c90d66c6737641cafe40edb779df490ada0309.1534848313.git.sarna@scylladb.com>
2018-08-23 11:22:05 +03:00
Duarte Nunes
36a293bb23 cell_locking: Use xxhash instead of fnv1a
Being the single user of fnv1a, this allows us to get rid of it. As
the TODO inside fnv1a_hasher.hh indicates, and judging by any
independent benchmark, fnv1a is very slow. As we have added xx_hash
since then, and we know it to be fast, use it instead.

Tests: unit(release/cell_locker_test)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180823081715.26089-1-duarte@scylladb.com>
2018-08-23 11:21:00 +03:00
Piotr Jastrzebski
2997fda1b1 sstables: Add test for RT with non-full key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 18:28:11 +02:00
Piotr Jastrzebski
c50929233f sstables: Test reading range_tombstones
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 18:28:11 +02:00
Piotr Jastrzebski
7434be348c sstables: Support reading range_tombstones
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 18:27:41 +02:00
Piotr Jastrzebski
d19a108d87 sstables: Support null columns in ck
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 14:32:10 +02:00
Piotr Jastrzebski
3636697663 sstables: Add consumer_m::consume_range_tombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 12:53:15 +02:00
Vladimir Krivopalov
8acf4ddb8e keys: Add clustering_key_prefix::make_full helper.
This method fills non-full clustering key with trailing empty values to
make it full.
This can be used for clustering keys of rows in a compact table as,
unlike in regular tables, they can be non-full.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-22 12:13:23 +02:00
Amnon Heiman
ab207356a5 API: storage_service stream endpoints
This patch changes how list of tokens returned from the storage_service
API.

Instead of create a vector and construct a json object of it, use the
streaming capabilities of the http.

This is important for large cluster and prevent large allocations.

Fixes #3701

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180820195631.26792-1-amnon@scylladb.com>
2018-08-22 11:24:38 +03:00
Takuya ASADA
e4f38b7c22 dist/redhat: support package renaming on build script
To automatically rename packages on enterprise release, added package name
prefix as a variable on build_rpm.sh.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180822072105.9420-1-syuu@scylladb.com>
2018-08-22 11:03:39 +03:00
Piotr Sarna
4a274ee7e2 tests: add parsing varint from JSON string test
Refs #3666
Message-Id: <f4205e9484f5385796fade7986e3e38dcbc65bac.1534845398.git.sarna@scylladb.com>
2018-08-21 11:20:11 +01:00
Piotr Sarna
37a5c38471 types: enable deserializing varint from JSON string
Previously deserialization failed because the JSON string
representing a number was unnecessarily quoted.

Fixes #3666
Message-Id: <a0a100dbac7c151d627522174303657d1da05c27.1534845398.git.sarna@scylladb.com>
2018-08-21 11:20:11 +01:00
Tomasz Grabiec
6937cc2d1c Merge 'Fix multi-cell static list updates in the presence of ckeys' from Duarte
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

 std::experimental::optional<clustering_key> row_key;
 if (!column.is_static()) {
   row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
 }
 ...
 auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

* https://github.com/duarten/scylla cql-list-fixes/v1:
  tests/cql_query_test: Test multi-cell static list updates with ckeys
  cql3/lists: Fix multi-cell static list updates in the presence of ckeys
  keys: Add factory for an empty clustering_key_prefix_view
2018-08-21 12:14:30 +02:00
Vladimir Krivopalov
c8422c9a91 sstables: Add operator<< overload for bound_kind_m.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-20 16:22:53 -07:00
Duarte Nunes
ff7304b190 tests/cql_query_test: Test multi-cell static list updates with ckeys
Refs #3703

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-20 21:39:37 +01:00
Duarte Nunes
05731cb5ad cql3/lists: Fix multi-cell static list updates in the presence of ckeys
This patch fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
  row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
...
auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-20 21:39:37 +01:00
Duarte Nunes
ce461b06d7 keys: Add factory for an empty clustering_key_prefix_view
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-20 21:39:37 +01:00
Avi Kivity
231174cda9 build: auto-detect g++ -gz support
Older combinations of g++/binutils don't support -gz, so auto-detect its
presence.

Fixes #3697.
Message-Id: <20180817161113.2287-1-avi@scylladb.com>
2018-08-20 18:48:18 +02:00
Tomasz Grabiec
c31dff8211 Merge 'Skip inside wide partitions using index (rows only)' from Vladimir
This patchset adds support for skipping inside wide partitions using
index for sliced queries. This can significantly reduce disk I/O for
queries that only need to read a small amount of data from a wide
partition.

Other changes include general code clean-up and simplification.

 * github.com/argenet/scylla.git tree/projects/sstables-30/skip_using_index/v6:
  sstables: Support resetting data_consume_rows_context_m to
    indexable_element::cell.
  tests: Add tests to cover skipping with index through SSTables 3.x.
  sstables: Support skipping inside wide partitions using index.
  to_string: Add operator<< overload for std::optional.
  sstables: Use std::optional instead of std::experimental::optional.
2018-08-20 18:39:51 +02:00
Avi Kivity
e605cd4ff8 multishard_writer_test: reduce mutation count in release mode
We see occasional bad_alloc failures in release mode; this is due
to the random mutation generator generating large mutations.

Reduce the mutation count to 300. I tested 100 runs and all passed,
so it reduces the false positive rate to < 1%.
2018-08-20 16:53:05 +03:00
Gleb Natapov
7277ee2939 storage_proxy: do not fail read without speculation on connection error
After ac27d1c93b if a read executor has just enough targets to
achieve request's CL and a connection to one of them will be dropped
during execution ReadFailed error will be returned immediately and
client will not have a chance to issue speculative read (retry). The
patch changes the code to not return ReadFailed error immediately, but
wait for timeout instead and give a client chance to issue speculative
read in case read executor does not have additional targets to send
speculative reads to by itself.

Fixes #3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
2018-08-20 10:12:31 +03:00
Vladimir Krivopalov
f1b9f82ff5 sstables: Use std::optional instead of std::experimental::optional.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:20:05 -07:00
Vladimir Krivopalov
7b1d4915a1 to_string: Add operator<< overload for std::optional.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:20:05 -07:00
Vladimir Krivopalov
3e92434eed sstables: Support skipping inside wide partitions using index.
This fix adds proper support for skipping inside wide partitions using
index for sliced reads. This significantly reduces disk I/O for filtered
queries.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:20:04 -07:00
Vladimir Krivopalov
ec78fb9f13 tests: Add tests to cover skipping with index through SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:19:22 -07:00
Vladimir Krivopalov
4bf1e9de3f sstables: Support resetting data_consume_rows_context_m to indexable_element::cell.
Set the proper parsing state when resetting to indexable_element::cell.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 10:09:19 -07:00
Eliran Sinvani
f5f6cf2096 cql3: remove rejection of an IN relation if not on last partition KEY
The constraint is no longer relevant, since Casandra removed
it in version 2.2. In addition the mechanism for handling this
case is already implemented and is identical in case of
clustering keys with single column EQ,= and IN relations.
(Cartesian product of singular ranges).

A unit test for this test case was added.

Fixes #1735
Tests:
1. Unit Tests.
2. Manual testing with the case described in the issue.
3. dtest: ql_additional_tests.py:TestCQL.composite_row_key_test

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <83b43fdc1ca0e0cc287f66f11816fc71b8bd2925.1534430405.git.eliransin@scylladb.com>
2018-08-16 19:32:43 +01:00
Eliran Sinvani
d743ceae76 cql3: ignore LIMIT in select statement with aggregate
LIMIT should restrict the output result and not the query whose result
set is aggregated. when using aggregate the output is guarantied to
be only one row long. since LIMIT accepts only none negative numbers,
it has no effect and can be ignored.

Fixes #2028
Tests: The issue described Testcase ,  UnitTests.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <6c235376c81f052020e2ed23d0a3d071b36d4415.1534416997.git.eliransin@scylladb.com>
2018-08-16 19:31:56 +01:00
Nadav Har'El
8c604921ac Materialized Views: test that virtual columns are not visible
In the previous patches, we added "virtual columns" to materialized views
to solve row liveness issues (issue #3362). Here we add a test that confirms
that although these virtual columns exist in the view, they should not be
visible to the user. They cannot be explicitly SELECTed from the view table,
and a "SELECT *" will skip them.

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:51:46 +03:00
Nadav Har'El
5ca974547a Materialized Views: unit test reproducing fixed issue #3362
This patch includes several tests reproducing issue #3362 - the effect
of unselected columns on view-table row liveness - and confirming
that it was fixed.

We found two example scenarios to demonstrate the bug. One scenario,
test_3362_with_ttls(), involves an unselected column with a TTL. The other,
test_3362_no_ttls() demonstrates the same bug without using TTL, and using
explicit updates and deletions instead. These two tests are heavily
commented, to explain what they test, and why.

In addition to these two basic tests, we also include similar tests
involving multiple items in a collection column, instead of multiple
separate columns, which demonstrate the same problem exists there (and
why, unfortunately, the "virtual columns" we add in that case need to
be collections too).

We also test that the virtual columns - and the problems they fix -
work not only on columns originally created with the view, but also
with unselected columns added later with ALTER TABLE on the base table.

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:48:07 +03:00
Nadav Har'El
6c00341383 Materialized Views: no need for elaborate row marker calculations
Now that we have separate virtual cells to represent unselected columns
in a materialized view, we no longer need the elaborate row-marker liveness
calculations which aimed (but failed) to do the same thing. So that code
can be removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:45:41 +03:00
Nadav Har'El
30f721afab Materialized Views: add unselected columns as virtual columns
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness (existance or disappearance)
of a view-table row is tied to the liveness of the base table row - and
that depends not only on selected columns (base-table columns SELECTed to
also appear in the view) but also on unselected columns.

This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch we tried to build a single "row marker" in the view column which
summarizes the liveness information in all unselected columns, but this
proved unworkable, as explained in issue #3362 and as will be demonstrated
in unit tests in a later patch.

Because we can't replace several unselected cells by one row marker, what
we do in this patch is to add for each for the unselected cell a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.

This patch just adds the virtual columns to the view schema. Code in
the previous patch, when it notices the virtual columns in the view's
schema, added the appropriate content into these columns.

We may need to add virtual columns to a view when first created, but also
when an unselected column is added to the base table with "ALTER TABLE",
so both are supported in this patch.

Fixes #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:42:22 +03:00
Nadav Har'El
782baa44ef Materialized Views: fill virtual columns
The add_cells_to_view() function usually adds selected cells from the base
table to the view mutation. For issue #3362, we sometimes want to also
add unselected cells as "virtual" cells -  truncated versions of the
base-table cells just without the values.

This patch contains the code to fill the virtual columns' data using the
regular columns from the base table.

This patch does not yet actually *add* any virtual columns to the schema,
so until that is done (in the next patch), this patch will not yet cause
any behavior change. This is important for bisectability.

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:38:27 +03:00
Nadav Har'El
3f3a76aa8f Do not allow selecting a virtual column
For issue #3362, we will need to add to a materialized view also unselected
base-table columns as "virtual columns". We need these columns to exist
to keep view rows alive, but we don't want the user to be able to see
them.

In this patch we prevent SELECTing the virtual columns of the view,
and also exclude the virtual columns from a "SELECT *" on a view.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:34:22 +03:00
Nadav Har'El
36a657fc10 schema: persist "view virtual" columns to a separate system table
In the previous patch, we added a "view virtual" flag on columns. In this
patch we add persistance to this flag: I.e., writing it to the on-disk
schema table and reading it back on startup. But the implementation is
not as simple as adding a flag:

In the on-disk system tables, we have a "columns" table listing all the
columns in the database and their types. Cqlsh's "DESCRIBE MATERIALIZED
VIEW" works by reading this "columns" table, and listing all of the
requested view's columns. Therefore, we cannot add "virtual columns" -
which are columns not added by the user and not intended to be seen -
to this list.

We therefore need to create in this patch a separate list for virtual
columns, in a new table "view_virtual_columns". This table is essentially
identical to the existing "columns" table, just separate. We need to write
each column to the appropriate table (columns with the view_virtual flag to
"view_virtual_columns", columns without it to the old "columns"), read
from both on startup, and remember to delete columns from both when a table
is dropped.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:30:06 +03:00
Nadav Har'El
0a1d93138d schema: add "view virtual" flag to schema's column_definition
In this patch we add a flag, "view virtual", that we can mark on on a
column defined in a schema. In following patches, we will add such virtual
columns to materialized views to allow view rows to remain alive despite
having no data (refs #3362).

After this patch, the "view virtual" flag exists in our in-memory
representation of the schema, but not persisted to disk - we will
fix this in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:23:09 +03:00
Nadav Har'El
b4fc711903 Add "empty" type name to CQL parser, but only for internal parsing
Even before this patch, Scylla supported the "empty" type (a column with
no content) but only internally - i.e., in code but not in CQL syntax.
The "empty" type was used in dense tables without regular columns, and a
special optimization in db::cql_type_parser::parse() allowed this type
name to be parsed when reading the schema tables, without allowing the
"empty" type to be used by users in CQL statements.

However, parse() only supported "empty" itself, and more complex types
like list<empty> were not recognized by parse(). In the following patches,
we plan to add to virtual columns to materialized views, with types empty,
list<empty> or map<something, empty>. We need all these types to work, and
before this patch, they don't. But we want all of these types to only work
internally - when Scylla's code creates these hidden columns; we do not
want to add the "empty" type to CQL's syntax.

This is what we do in this patch: The CQL parser's comparator_type rule
now has a parameter, "internal", used to differenciate internal calls
via db::cql_type_parser::parse() from calls from CQL query parsing.
If a user tries something like:

    CREATE TABLE e (pk empty PRIMARY KEY);

He will get the error:

    Invalid (reserved) user type name empty

Note that here, as usual, unknown types are treated as "user types",
and "empty" is not allowed as a user type name - we "reserve" it in case
one day in the future we will want to allow users a direct syntax to
create empty columns. We already have, following Cassandra, a bunch of
other names reserved from being user type names, including "byte",
"complex", and others (see _reserved_type_names()), and using "empty"
as a type name will result in a similar error message.

Just like all other type names, the name "empty" is not a reserved
keyword in other senses: a user can create a table or a column with
the name "empty", just like he can create one with the name "int".

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:12:27 +03:00
Duarte Nunes
a4355fe7e7 cql3/query_options: Use _value_views in prepare()
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.

In query_options::prepare() we were, however, using _values to
associated values to the client-specified column names, and not
_value_views. Fix this by using _value_views instead.

As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.

Fixes #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
2018-08-15 10:38:09 +01:00
Duarte Nunes
8751a58a2b cql3/query_options: Preserve unset values when building value_views
A raw value can be in one of three states: a valid value, an unset
value, a null value. When translating raw_values to their views, we
were treating both unset and null values are null raw_value_views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814231051.14385-1-duarte@scylladb.com>
2018-08-15 10:37:29 +01:00
Duarte Nunes
805ce6e019 cql3/query_processor: Validate presence of statement values timeously
We need to validate before calling query_options::prepare() whether
the set of prepared statement values sent in the query matches the
amount of names we need to bind, otherwise we risk an out-of-bounds
access if the client also specified names together with the values.

Refs #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814225607.14215-1-duarte@scylladb.com>
2018-08-15 10:37:13 +01:00
Eliran Sinvani
d734d316a6 cql3: ensure repeated values in IN clauses don't return repeated rows
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated since the assumption
is that each value in the IN list corresponds to a different partition.
this results in the same row appearing in the result number times
corresponding to the duplication of the partition value.

Added queries for the in restriction unitest and fixed with a bad result check.

Fixes #2837
Tests: Queries as in the usecase from the GitHub issue in both forms ,
prepared and plain (using python driver),Unitest.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
2018-08-15 10:21:22 +01:00
Duarte Nunes
a025bf6a7d Merge seastar upstream
Seastar introduced a "compat" namespace, which conflicts with Scylla's
own "compat" namespaces. The merge thus includes changes to scope
uses of Scylla's "compat" namespaces.

* seastar 8ad870f...9bb1611  (5):
  > util/variant_utils: Ensure variant_cast behaves well with rvalues
  > util/std-compat: Fix infinite recursion
  > doc/tutorial: Undo namespace changes
  > util/variant_utils: Add cast_variant()
  > Add compatbility with C++17's library types

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-14 13:07:09 +01:00
Duarte Nunes
25a0a0f83d tests/cql_test_env: Increase eventually() attempts
The current value has proved to be insufficient for our CI
infrastructure.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112201.8595-1-duarte@scylladb.com>
2018-08-14 12:37:32 +01:00
Duarte Nunes
495a92c5b6 tests/gossip_test: Use RAII for orderly destruction
Change the test so that services are correctly teared down, by the
correct order (e.g., storage_service access the messaging_service when
stopping).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-2-duarte@scylladb.com>
2018-08-14 12:27:14 +01:00
Duarte Nunes
3956a77235 tests/gossip_test: Don't bind address to avoid conflicts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-1-duarte@scylladb.com>
2018-08-14 12:27:02 +01:00
Piotr Sarna
310d0a74b9 cql3: throw proper request exception for INSERT JSON
JSON code is amended in order to return proper
"Missing mandatory PRIMARY KEY part" message instead of generic
"Attempt to access value of a disengaged optional object".

Fixes #3665
Message-Id: <69157d659d51ce5a2d408614ce3ba7bf8e3a5d88.1534161127.git.sarna@scylladb.com>
2018-08-13 23:57:37 +01:00
Piotr Sarna
b73669c329 tests: add parsing numeric values from string
Numeric values (ints, doubles) should accept string representation
when passed in INSERT JSON statement.

Refs #3666
Message-Id: <586fea8fd08fe01f7a133f82f517e26d08d7cb76.1534153955.git.sarna@scylladb.com>
2018-08-13 23:57:37 +01:00
Piotr Sarna
b3f438bfec types: enable parsing numeric JSON values from string
In order to be Cassandra-compatible, JSON values passed in INSERT JSON
statement should accept string parameters for numeric types - int,
double, etc.

Fixes #3666
Message-Id: <4da9a2f68de31492a2e9432493663a62b138c2f2.1534153955.git.sarna@scylladb.com>
2018-08-13 23:57:37 +01:00
Duarte Nunes
5de02ab98c tracing: Pass string_view instead of string to add_query
This resulted in superfluous copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180812085326.6260-1-duarte@scylladb.com>
2018-08-13 23:57:37 +01:00
Jesse Haber-Kucharsky
b95bbb2e72 auth: Clean up implementation comments 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
9519a03351 auth: Remove unnecessary local variable
The variable could be declared `const`, but removing it outright seems
more clear and this way we don't have to come up with a name.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
52d3ff057a auth: Allow different random engines for salt
This makes the function useable in more contexts due to
flexibility (including in tests), since the state is not captured and
the characteristics of salt generation can be customized to the caller's
needs.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
836fd954e1 auth: Correct modulo bias in salt generation
Instead of reducing the large value via `%`, which can produce
non-uniformly distributed values when the range is small, we specify the
range in the distribution, which is uniform by construction.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
fe58a0b207 auth: Extract random byte generation for salt 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
fd60d61ebf auth: Split out test for best supported scheme
The `generate_salt` function invokes this function internally now.

This change means that `generate_salt` is now thread-safe and therefore
does not have to be invoked by a single thread only when starting the
`password_authenticator`.

This further means that `generate_salt` does not need to be part of the
public interface of the module, and can be moved to the implementation
file.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
adf058bd1f auth: Rename function to use full words 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
9b8cbb8542 auth: Add domain-specific exception for passwords 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
dbea3f5a01 auth: Document passwords interface 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
b272d622f8 auth: Move passsword stuff to its own namespace
For clarity and nicer function names.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
de01aaf181 auth: Identify password hashing errors correctly
See fce10f2c6e for reference.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
c10fcbf7a5 auth: Add unit tests for password handling
This will mean we can make changes more confidently.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
2a40bcb281 auth: Move password handling to its own files
While the `password_authenticator` is a complex component with lots of
dependencies, password hashing and checking itself is a process with
limited logical state and dependencies, which makes it easy to isolate
and test.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
03cf57db62 auth: Construct std::random_device instances once
`std::random_device` has a lot of implementation-specific behavior, and
as a result we cannot assume much about its performance characteristics.

We initialize thread-specific static instances of `std::random_device`
once so that we don't have the overhead of invoking the ctor during
every invocation of `gensalt`.
2018-08-13 13:24:45 -04:00
Duarte Nunes
f86811a3c9 Merge seastar upstream
* seastar d40faff...8ad870f (9):
  > reactor: switch indentation
  > properly configure I/O Scheduler when --max-io-requests is passed
  > IOTune: tell users that the evaluation will take a while
  > exceptions: fix compilation with static libstdc++
  > apps/iotune: print out which config file updated
  > foreign_ptr: allow waiting for the destruction of the managed ptr
  > Merge "Improve UX for backtraces read from stdin" from Botond

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-12 14:01:36 +01:00
Avi Kivity
183d5ba178 build: compress debug sections
Compressing debug section reduces build size by 30% with no
significant increase in build time.

Results on a 4-core system (ninja release, size in MB):

before:

18056	build

real	59m43.138s
user	229m3.180s
sys	6m49.460s

after:

12387	build

real	60m30.112s
user	232m8.962s
sys	6m49.364s

Presumably, the difference in debug mode is even greater.x
Message-Id: <20180811180444.30578-1-avi@scylladb.com>
2018-08-11 19:41:55 +01:00
Takuya ASADA
2ef1b094d7 dist/common/scripts/scylla_setup: don't proceed RAID setup until user type 'done'
Need to wait user confirmation before running RAID setup.

See #3659
Fixes #3681

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810194507.1115-1-syuu@scylladb.com>
2018-08-11 18:48:05 +03:00
Takuya ASADA
b7cf3d7472 dist/common/scripts/scylla_setup: don't mention about interactive mode prompt when running on non-interactive mode
Skip showing message when it's non-interactive mode.

Fixes #3674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810191945.32693-1-syuu@scylladb.com>
2018-08-11 18:48:03 +03:00
Takuya ASADA
ef9475dd3c dist/common/scripts/scylla_setup: check existance of housekeeping.cfg before asking to run version check
Skip asking to run version check when housekeeping.cfg is already
exists.
Fixes #3657

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180807232313.15525-1-syuu@scylladb.com>
2018-08-11 18:48:02 +03:00
Takuya ASADA
f30b701872 dist/debian: fix install scylla-server.service
On previous commit we moved debian/scylla-server.service to
debian/scylla-server.scylla-server.service to explicitly specify
subpackage name, but it doesn't work for dh_installinit without '--name'
option.

Result of that current scylla-server .deb package missing
scylla-server.service, so we need to rename the service to original
file name.

Fixes #3675

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810221944.24837-1-syuu@scylladb.com>
2018-08-11 15:07:37 +03:00
Duarte Nunes
1521dc56ae Merge 'Pass query options to restrictions filter' from Piotr
"
This miniseries fixes ALLOW FILTERING support for prepared statements
by passing correct query options to the filter instead of empty ones.
"

* 'pass_query_options_to_restrictions_filter' of https://github.com/psarna/scylla:
  tests: add testing prepared statements with ALLOW FILTERING
  cql3: pass query options to restrictions filter
2018-08-09 18:15:18 +01:00
Duarte Nunes
95677877c2 Merge 'JSON support fixes' from Piotr
"
This series addresses SELECT/INSERT JSON support issues, namely
handling null values properly and parsing decimals from strings.
It also comes with updated cql tests.

Tests: unit (release)
"

* 'json_fixes_3' of https://github.com/psarna/scylla:
  cql3: remove superfluous null conversions in to_json_string
  tests: update JSON cql tests
  cql3: enable parsing decimal JSON values from string
  cql3: add missing return for dead cells
  cql3: simplify parsing optional JSON values
  cql3: add handling null value in to_json
  cql3: provide to_json_string for optional bytes argument
2018-08-09 18:05:34 +01:00
Piotr Sarna
9ba218c161 cql3: remove superfluous null conversions in to_json_string
Some types checked when passed bytes argument was empty, and if so,
returned "null" as a JSON string. Now, with to_json_string(bytes_opt)
it's not needed anymore. Also, some types returned "null" instead
of signaling a deserialization error.
2018-08-09 18:07:12 +02:00
Piotr Sarna
fc187fa31e tests: update JSON cql tests
Tests are updated to check for recently fixed issues, i.e.
 * proper handling of null values
 * parsing decimal values from string

Refs #3664
Refs #3666
Refs #3667
2018-08-09 18:07:12 +02:00
Piotr Sarna
957cc712b6 cql3: enable parsing decimal JSON values from string
In order to be Cassandra-compatible, decimal type should be parsable
from both numeric values and strings.

Fixes #3666
2018-08-09 18:07:12 +02:00
Piotr Sarna
f962b85fa3 cql3: add missing return for dead cells
Fixes #3664
2018-08-09 18:07:12 +02:00
Piotr Sarna
cdbeed4e3b cql3: simplify parsing optional JSON values
With new to_json_string implementation that accepts bytes_opt,
parsing optional values can be simplified to remove explicit
branching.
2018-08-09 18:07:12 +02:00
Piotr Sarna
e4396e17cb cql3: add handling null value in to_json
Previously to_json function would fail with null passed as a parameter.

Fixes #3667
2018-08-09 18:07:12 +02:00
Piotr Sarna
52052b53a8 cql3: provide to_json_string for optional bytes argument
In order to handle optional arguments in a neat way, a wrapper
for to_json_string is provided.
2018-08-09 18:07:07 +02:00
Piotr Sarna
4a9014675f tests: add testing prepared statements with ALLOW FILTERING
ALLOW FILTERING support for prepared statements was buggy,
so a test case for prepared statements is added to cql test suite.
2018-08-09 18:06:09 +02:00
Piotr Sarna
8c18aaa511 cql3: pass query options to restrictions filter
Query options may contain bound values needed for checking filtering
restrictions. Previously, empty query_options{} were used, which
caused prepared statements to fail.

Fixes #3677
2018-08-09 17:44:45 +02:00
Eliran Sinvani
3f2bb07599 cql3: Count unpaged select queries
If the counter goes up this can be a possible reason for slowdown in
queries (since it means that potentially a large amount of data will
be sent to the client at once).

Fixes #2478
Tests: cqlsh with PAGING OFF and ON and validating with a print.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <01253cee0b8c1110aaee3da41d1f434ca798b430.1533817568.git.eliransin@scylladb.com>
2018-08-09 13:53:44 +01:00
Tomasz Grabiec
024b3c9fd9 mutation_partition: Fix exception safety of row::apply_monotonically()
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.

The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.

Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.

Introduced in 99a3e3a.

Fixes #3678.

Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
2018-08-09 15:29:10 +03:00
Tomasz Grabiec
fd543603dd tests: random_mutation_generator: Use collection_member::yes for collection cells
Caused assert failure when collection cells were so large as to
require fragmentation. Currently collection cells are not fragmented,
and deserialization asserts that.

Message-Id: <1533817077-27583-1-git-send-email-tgrabiec@scylladb.com>
2018-08-09 15:27:20 +03:00
Vladimir Krivopalov
55d2fdee9a clustering_key_filter_ranges: Fix move assignment to avoid undefined behaviour.
Get rid of the new(this) trick that results in undefined behaviour
because the class contains a const reference member.

Use std::reference_wrapper instead to ease the transition.

Refs #3032.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <5642bf79659231627dd7f8693c17cb46f274bcda.1533765105.git.vladimir@scylladb.com>
2018-08-09 00:53:17 +01:00
Takuya ASADA
ad7bc313f7 dist/common/scripts: pass format variables to colorprint()
When we use str.format() to pass variables on the message it will always
causes Exception like "KeyError: 'red'", since the message contains color
variables but it's not passed to str.format().
To avoid the error we need to pass all format variables to colorprint()
and run str.format() inside the function.

Fixes #3649

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803015216.14328-1-syuu@scylladb.com>
2018-08-08 18:37:50 +03:00
Avi Kivity
d6b0c4dda4 config: default murmur3_ignore_msb_bits to 12 even if not specified in scylla.yaml
When murmur3_ignore_msb_bits was introduced in 1.7, we set its default zero
(to avoid resharding on upgrade) and set it to 12 in the scylla.yaml template
(to make sure we get the right value for new clusters).

Now, however, things have changed:
 - clusters installed before 1.7 are a small minority
 - they should have resharded long ago
 - resharding is much better these days
 - we have more migrations from Cassandra compared to old clusters

To allow clusters that migrated using their cassandra.yaml, and to clean up
the default scylla.yaml, make the default 12.

Users upgrading from pre-1.7 clusters will need to update their scylla.yaml,
or to reshard (which is a good idea anyway).

Fixes #3670.
Message-Id: <20180808063003.26046-1-avi@scylladb.com>
2018-08-08 13:46:06 +02:00
Asias He
d47d46e1a8 streaming: Use streaming_write_priority for the sstable writer
Use the streaming io priority otherwise it uses the default io priority.

Message-Id: <e1836a9a93e7204d4bc9bba9c841d57c8b24aff8.1533715438.git.asias@scylladb.com>
2018-08-08 11:08:06 +03:00
Takuya ASADA
15825d8bf1 dist/common/scripts/scylla_setup: print message when EC2 instance is optimized for Scylla
Currently scylla_ec2_check exits silently when EC2 instance is optimized
for Scylla, it's not clear a result of the check, need to output
message.

Note that this change effects AMI login prompt too.

Fixes #3655

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808024256.9601-1-syuu@scylladb.com>
2018-08-08 10:17:52 +03:00
Takuya ASADA
652eb5ae0e dist/common/scripts/scylla_setup: fix typo on interactive setup
Scylls -> Scylla

Fixes #3656

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808002443.1374-1-syuu@scylladb.com>
2018-08-08 09:15:13 +03:00
Vladimir Krivopalov
7f77087caa tests: Add tests performing compaction on SSTables 3.x.
These tests check the correctness of resulting compacted SSTables based
on the files produced by compacting input files with Cassandra.

Note that output files are not identical to those generated by Cassandra
because Scylla compaction does not yet optimise delta-encoded values
using serialization header.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <3fa05ce72352292d1026ce80ac87552889d10d96.1533667535.git.vladimir@scylladb.com>
2018-08-08 08:50:41 +03:00
Rafi Einstein
c7f41c988f Add a counter to count large partition warning in compaction
Fixes #3562

Tests: dtest(compaction_test.py)
Message-Id: <20180807190324.82014-1-rafie@scylladb.com>
2018-08-07 20:15:09 +01:00
Avi Kivity
c9caaa8e6e docker: adjust for script conversion to Python
Since our scripts were converted to Python, we can no longer
source them from a shell. Execute them directly instead. Also,
we now need to import configuration variables ourselves, since
scylla_prepare, being an independent process, won't do it for
us.

Fixes #3647
Message-Id: <20180802153017.11112-1-avi@scylladb.com>
2018-08-07 15:34:03 +01:00
Takuya ASADA
a300926495 dist/common/scripts/scylla_setup: use specified NIC ifname correctly
Interactive NIC selection prompt always returns 'eth0' as selected NIC name
mistakenly, need to fix.

Fixes #3651

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803020724.15155-1-syuu@scylladb.com>
2018-08-06 20:59:19 +03:00
Amnon Heiman
80b1ef0f47 storage_service: Add nodes_status related metrics
This patch adds a metric for a node own operation mode, the
operation_mode metric represent the enum modes as gauge values according
to: UNKNOWN = 0, STARTING = 1, JOINING = 2, NORMAL = 3, LEAVING = 4, DECOMMISSIONED =
5, DRAINING = 6, DRAINED = 7, MOVING = 8

Fixes: #3482

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180806142706.23579-1-amnon@scylladb.com>
2018-08-06 18:19:56 +03:00
Tomasz Grabiec
88053b3bc9 tests: sstables: Replace sleep with accurate synchronzation
Message-Id: <1533545829-31109-1-git-send-email-tgrabiec@scylladb.com>
2018-08-06 10:09:39 +01:00
Avi Kivity
13b729bf71 Merge "tracing: store request and response sizes" from Vlad
"
Store sizes of the request and the response for each traces query.

In the example below I traced the cassandra-stress write workload with a default schema using the probabilistic tracing.

Here is an entry created for one of queries:

cassandra@cqlsh> SELECT parameters FROM system_traces.sessions where session_id=30c3a8ea-96bb-11e8-8a97-000000000000;

 parameters
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {'consistency_level': 'LOCAL_ONE', 'page_size': '5000', 'param[0]': 'f749eb03d6a995d8b3496075da8f20aa9228c5db12401e8a37000fa5baa13531...', 'param[1]': '845809b53a9aff7eef8f85308eaef79e03c696653ca23957f1ed5d539dc00463...', 'param[2]': 'd303585def93a5d40e41ceb12880ad3ede3d9f6308a1b1c5e42e911a191f1de1...', 'param[3]': 'be77c7da059d4b52687cd9b3eaa7d04cdfe7b5e38e84a8eea318299a01c7845f...', 'param[4]': '32faaaea1b3d73d9d628a4945b69a8531740348d49ee30c03f697dd2d63e8dee...', 'param[5]': '50503850374d34323330', 'query': 'UPDATE "standard1" SET "C0" = ?,"C1" = ?,"C2" = ?,"C3" = ?,"C4" = ? WHERE KEY=?', 'serial_consistency_level': 'SERIAL'}

(1 rows)
cassandra@cqlsh> SELECT request_size,response_size FROM system_traces.sessions where session_id=30c3a8ea-96bb-11e8-8a97-000000000000;

 request_size | response_size
--------------+---------------
          239 |             4

(1 rows)

Now let's try to read the same keyspace1.standard1 entry (based on the "key" value in "param[5]") from cqlsh and trace it using TRACING ON.

cassandra@cqlsh> TRACING ON
Now Tracing is enabled
cassandra@cqlsh> SELECT * from keyspace1.standard1 where key=0x50503850374d34323330;

 key                    | C0                                                                     | C1                                                                     | C2                                                                     | C3                                                                     |
C4
------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+-
-----------------------------------------------------------------------
 0x50503850374d34323330 | 0xf749eb03d6a995d8b3496075da8f20aa9228c5db12401e8a37000fa5baa135315430 | 0x845809b53a9aff7eef8f85308eaef79e03c696653ca23957f1ed5d539dc00463e10e | 0xd303585def93a5d40e41ceb12880ad3ede3d9f6308a1b1c5e42e911a191f1de12924 | 0xbe77c7da059d4b52687cd9b3eaa7d04cdfe7b5e38e84a8eea318299a01c7845fb8a2 |
0x32faaaea1b3d73d9d628a4945b69a8531740348d49ee30c03f697dd2d63e8dee5dde

(1 rows)

Tracing session: 639ca0a0-96bb-11e8-8a97-000000000000

 activity                                                                                                                                 | timestamp                  | source        | source_elapsed
------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
                                                                                                                       Execute CQL3 query | 2018-08-02 21:20:20.906000 | 192.168.1.138 |              0
                                                                                                            Parsing a statement [shard 0] | 2018-08-02 21:20:20.906358 | 192.168.1.138 |             --
                                                                                                         Processing a statement [shard 0] | 2018-08-02 21:20:20.906405 | 192.168.1.138 |             47
 Creating read executor for token -5698461774438220979 with all: {192.168.1.138} targets: {192.168.1.138} repair decision: NONE [shard 0] | 2018-08-02 21:20:20.906445 | 192.168.1.138 |             87
                                                                                                    read_data: querying locally [shard 0] | 2018-08-02 21:20:20.906448 | 192.168.1.138 |             90
                                                           Start querying the token range that starts with -5698461774438220979 [shard 0] | 2018-08-02 21:20:20.906452 | 192.168.1.138 |             94
                                                                                                               Querying is done [shard 0] | 2018-08-02 21:20:20.906509 | 192.168.1.138 |            151
                                                                                           Done processing - preparing a result [shard 0] | 2018-08-02 21:20:20.906533 | 192.168.1.138 |            175
                                                                                                                         Request complete | 2018-08-02 21:20:20.906186 | 192.168.1.138 |            186

cassandra@cqlsh> TRACING OFF
Disabled Tracing.

cassandra@cqlsh> SELECT request_size,response_size FROM system_traces.sessions where session_id=639ca0a0-96bb-11e8-8a97-000000000000;

 request_size | response_size
--------------+---------------
           82 |           369

(1 rows)
"

* 'tracing_request_response_size-v2' of https://github.com/vladzcloudius/scylla:
  tracing: move all tracing related API functions to a cold path
  tracing: store a query response size
  tracing: store request size
2018-08-05 18:26:29 +03:00
Jesse Haber-Kucharsky
fce10f2c6e auth: Don't use unsupported hashing algorithms
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.

This is consistent with the documentation of the function in its man
page.

As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".

The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):

    Some implementations return `NULL` on failure, and others return an
    _invalid_ hashed passphrase, which will begin with a `*` and will
    not be the same as SALT.

Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.

With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.

Fixes #3637.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
2018-08-05 08:57:36 +03:00
Vlad Zolotarov
896c1822b5 tracing: move all tracing related API functions to a cold path
This patch completes what was started in a4282c2c6e

Make trace_state_ptr to be a wrapper class around lw_shared_ptr<trace_state> that
hints that bool(trace_state_ptr) is likely to return FALSE.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-03 12:32:54 -04:00
Vlad Zolotarov
6db90a2e63 tracing: store a query response size
Add a new "response_size" column to system_traces.sessions and store a size of an uncompressed response
for a traced query.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-03 12:29:36 -04:00
Vlad Zolotarov
05020921bb tracing: store request size
Add a new column "request_size" to system_traces.sessions and store
the uncompressed request frame data size.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-03 12:29:36 -04:00
Avi Kivity
3b42fcfeb2 Merge "Fix exception safety in imr::utils::object" from Paweł
"

There is an exception safety problem in imr::utils::object. If multiple
memory allocations are needed and one of them fails the main object is
going to be freed (as expected). However, at this stage it is not
constructed yet, so  when LSA asks its migrator for the size it may get
a meaningless value. The solution is to remember the size until object
is fully created and use sized deallocation in case of failures.

Fixes #3618.

Tests: unit(release, debug/imr_test)
"
2018-08-02 12:10:24 +03:00
Takuya ASADA
1bb463f7e5 dist/debian: install *.service on correct subpackage
We mistakenly installing scylla-housekeeping-*.service to scylla-conf
package, all *.service should explicitly specified subpackage name.

Fixes #3642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180801233042.307-1-syuu@scylladb.com>
2018-08-02 11:39:52 +03:00
Paweł Dziepak
fd44d13145 tests/imr: add test for exception safety in imr::utils::object::make() 2018-08-01 16:50:58 +01:00
Paweł Dziepak
7ec906e657 imr: detect lsa migrator mismatch
Each IMR type needs its own LSA migrator. It is possible that user will
provide a migrator for a different type than the one which instance is
being created. This patch adds compile-time detection of that bug.
2018-08-01 16:50:58 +01:00
Benny Halevy
6b179b0183 HACKING.md: update ./install-dependencies.sh filename
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20180801150813.25408-1-bhalevy@scylladb.com>
2018-08-01 18:09:29 +03:00
Paweł Dziepak
6fbf2d72e9 imr::utils::object_context: fix context_for for backpointer
Each member of a structure may require different deserialisation
context. They are provided by context_for<Tag>() method of the context
used to deserialise the structure itself.

imr::utils::object needs to add backpointer to the structure it manages
so that it can be used in the LSA memory. This is done by creating a
structure that has two members: the backpointer and the actual structure
that imr::utils::object is to manage. imr::utils::object_context creates
approperiate deserialisation context for it.

context_for() is called for each member of a structure. object_context
implementation of context_for() always created a deserialisation context
for the underlying structure regardless which member that was, so it was
done also for backpointer. This is wrong since the context may read the
object on its creation.

The fix is to use no_context_t for the backpointer.
2018-08-01 15:17:25 +01:00
Paweł Dziepak
61749019cb imr::utils::object: fix exception safety if allocation fails
imr::utils::object::make() handles creation of IMR objects. They are
created in three phases:
  1. The size of the object and all additional needed memory allocations
     is determined
  2. All needed buffers are allocated
  3. Data is written to the allocated space

When IMR objects are deallocated LSA asks their migrator for the size.
Migrator may read some parts of the object to figure out its size. This
is a problem if there is allocation failure in make() at point 2.
If one of required allocations fails, the buffers that were already
acquired need to be freed. However, since the object hasn't been fully
created yet migrator won't return a valid value.

The solution for this is to remember object size until all allocations
are completed. This way the LSA won't need to ask migrators for it in
case of failure. imr::alloc::object_allocator already does that but
imr::utils::object doesn't. This patch fixes that.
2018-08-01 15:17:13 +01:00
Piotr Sarna
156888fb44 docs: fix system.large_partitions doc entry
For some reason the doc entry for large_partitions was outdated.
It contained incorrect ORDERING information and wrong usage example,
since large_partitions' schema changed multiple times during
the reviewing process.

Message-Id: <1910f270419536ebccffde163ec1bfc67d273306.1533128957.git.sarna@scylladb.
com>
2018-08-01 16:12:39 +03:00
Asias He
95849371aa range_streamer: Remove unordered_multimap usage
We need the mapping between dht::token_range to
std::vector<inet_address> and inet_address to dht::token_range_vector in
various places. Currently, we use std::unordered_multimap and convert to
std::unordered_map. It is better to use std::unordered_map in the first
place. The changes like below:

- Change from

  std::unordered_multimap<dht::token_range, inet_address>

to

  std::unordered_map<dht::token_range, std::vector<inet_address>>

- Change from

   std::unordered_multimap<inet_address, dht::token_range>

to

   std::unordered_map<inet_address, dht::token_range_vector>

Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>
2018-08-01 13:01:41 +03:00
Gleb Natapov
44a6afad8c cache_hitrate_calculator: fix race when new table is added during calculations
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.

Fixes #3636

Message-Id: <20180801093147.GD23569@scylladb.com>
2018-08-01 12:45:03 +03:00
Avi Kivity
620e950fc8 Merge "No infinite time-outs for internal distributed queries" from Jesse
"
This series replaces infinite time-outs in internal distributed
(non-local) CQL queries with finite ones.

The implementation of tracing, which also performs internal queries,
already has finite time-outs, so it is unchanged.

Fixes #3603.
"

* 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla:
  Use finite time-outs for internal auth. queries
  Use finite query time-outs for `system_distributed`
2018-08-01 11:23:42 +03:00
Asias He
4a0b561376 storage_service: Get rid of moving operation
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.

In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.

Removing it simplifies the cluster operation logic and code.

Fixes #3475

Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
2018-08-01 11:18:17 +03:00
Asias He
02befb6474 gossip: Log seeds seen
It is useful for debugging bootstap issue, especially for large
clusters.

Also do not use the `_seeds` as the set_seeds function parameter since
there is a class member called _seeds.

Refs #3417
Message-Id: <15e6bdf06376949ced1bdb845f810da09266783d.1532474820.git.asias@scylladb.com>
2018-08-01 10:57:56 +03:00
Takuya ASADA
2cd99d800b dist/common/scripts/scylla_ntp_setup: fix typo
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1533070539-2147-1-git-send-email-syuu@scylladb.com>
2018-08-01 10:31:07 +03:00
Avi Kivity
2c9b886b6d logalloc: reindent
No functional changes.
Message-Id: <20180731125116.32009-1-avi@scylladb.com>
2018-08-01 00:35:54 +01:00
Jesse Haber-Kucharsky
e664f9b0c6 Use finite time-outs for internal auth. queries 2018-07-31 11:38:16 -04:00
Jesse Haber-Kucharsky
ca44f4de3c Use finite query time-outs for system_distributed 2018-07-31 11:38:15 -04:00
Paweł Dziepak
b20a15bdda Merge "Prevent scheduling leaks when out of memtable space" from Avi
"
When we are out of memtable space (real of virtual), lsa will defer running
our mutation application and run it later when memory is in fact available.
However, it will run it in the main group, giving the write more shares than it
would otherwise get.

This patchset fixes the problem by running those deferred mutation applications
in the correct scheduling group.

Fixes #3638
"

* tag '3638/v2' of https://github.com/avikivity/scylla:
  database: tag dirty memory managers with scheduling groups
  logalloc: run releaser() in user-provided scheduling group
2018-07-31 11:55:19 +01:00
Avi Kivity
2d311c26b3 database: tag dirty memory managers with scheduling groups
dirty memory managers run code on behalf of their callers
in a background fiber, so provide that background fiber with
the scheduling group appropriate to their caller.

 - system: main (we want to let system writes through quickly)
 - dirty: statement (normal user writes)
 - streaming: streaming (streaming writes)
2018-07-31 13:18:21 +03:00
Paweł Dziepak
98217f0d66 Update seastar submodule
* seastar 6b97e00...d40faff (10):
  > tutorial: update build as needed for newer pandoc
  > core: fix __libc_free return type signature
  > future-utils: when_all: avoid calling member function on an uninitialized data member
  > future-util: reduce continuations in when_all (variadic version)
  > future-utils: remove allocation in when_all() if all futures are available
  > future: reduce allocations in when_all()
  > future: fill missing futurize::from_tuple() functions
  > future: expose more types in continuation_base
  > log: predict logger::is_enabled() as false
  > README: add Resources section with infomation about the mailing list etc.
2018-07-31 10:12:52 +01:00
Avi Kivity
0fc54aab98 logalloc: run releaser() in user-provided scheduling group
Let the user specify which scheduling group should run the
releaser, since it is running functions on the user's behalf.

Perhaps a cleaner interface is to require the user to call
a long-running function for the releaser, and so we'd just
inherit its scheduling group, but that's a much bigger change.
2018-07-31 11:57:58 +03:00
Avi Kivity
f258df099a Update ami submodule
* dist/ami/files/scylla-ami d53834f...c7e5a70 (1):
  > ds2_configure.py: uncomment 'cluster_name' when it's commented out
2018-07-31 09:34:33 +03:00
Avi Kivity
e7ae4beef0 main: run prometheus and API servers under streaming group
Both the Prometheus and the API servers are used for maintenance
operations, similarly to streaming. Run them under the streaming
scheduling group to prevent them from impacting normal operations,
and rename the streaming scheduling group to reflect the more
generic role.

This helps to prevent spikes from Prometheus or API requests from
interfering with the normal workload. Using an existing group is
preferable to creating a new group because in the worst case, all
the non-main-workload groups compete with the main workload.
Consolidating them allows us to give them significant shares in
total without increasing competition in the worst case.

The group's label is unchanged to preserve compatibility with
dashboards.

A nice side effect is that repair, which is initiated by API calls,
gets placed into the maintenance group naturally. Compaction tasks
which are run by compaction manager are not changed.
Message-Id: <20180714160723.23655-1-avi@scylladb.com>
2018-07-30 15:07:33 +01:00
Avi Kivity
a4282c2c6e tracing: move tracing code to cold path
Most queries run without tracing (and those that run with tracing
are not sensitive to a few cycles), so mark the tracing paths as
cold.
Message-Id: <20180723133000.30482-1-avi@scylladb.com>
2018-07-30 15:05:57 +01:00
Rafi Einstein
123f2c2a1c Add a counter for reverse queries
Fixes #3492

Tests: dtest(cql_additional_tests.py)
Message-Id: <20180729202615.22459-1-rafie@scylladb.com>
2018-07-30 12:34:43 +03:00
Takuya ASADA
032b26deeb dist/common/scripts/scylla_ntp_setup: fix typo
Comment on Python is "#" not "//".

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180730091022.4512-1-syuu@scylladb.com>
2018-07-30 12:30:53 +03:00
Avi Kivity
04d88e8ff7 scripts: add a script to compute optimal number of compile jobs
This will allow continuous integration to use the optimal number
of compiler jobs, without having to resort to complex calculations
from its scripting environment.

Message-Id: <20180722172050.13148-1-avi@scylladb.com>
2018-07-30 10:15:11 +03:00
Avi Kivity
a4c9330bfc Merge "Optimise paged queries" from Paweł
"
This series adds some optimisations to the paging logic, that attempt to
close the performance gap between paged and not paged queries. The
former are more complex so always are going to be slower, but the
performance loss was unacceptably large.

Fixes #3619.

Performance with paging:
        ./perf_paging_before  ./perf_paging_after   diff
 read              271246.13            312815.49  15.3%

Without paging:
        ./perf_nopaging_before  ./perf_nopaging_after   diff
 read                343732.17              342575.77  -0.3%

Tests: unit(release), dtests(paging_test.py, paging_additional_test.py)
"

* tag 'optimise-paging/v1' of https://github.com/pdziepak/scylla:
  cql3: select statement: don't copy metadata if not needed
  cql3: query_options: make simple getter inlineable
  cql3: metadata: avoid copying column information
  query_pager: avoid visiting result_view if not needed
  query::result_view: add get_last_partition_and_clustering_key()
  query::result_reader: fix const correctness
  tests/uuid: add more tests including make_randm_uuid()
  utils: uuid: don't use std::random_device()
2018-07-26 19:24:03 +03:00
Nadav Har'El
25bd139508 cross-tree: clean up use of std::random_device()
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).

In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object, instead of the
random_engine object, because both have the same API; This hurts performance.

This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for to uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.

To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:

   std::default_random_engine _engine{std::random_device{}()};

Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
2018-07-26 16:54:58 +01:00
Takuya ASADA
8e4d1350c9 dist/common/scripts/scylla_ntp_setup: ignore ntpdate error
Even ntpdate fails to adjust clock ntpd may able to recover it later,
ignore ntpdate error keep running the script.

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180726080206.28891-1-syuu@scylladb.com>
2018-07-26 14:44:53 +03:00
Paweł Dziepak
3e32245bb8 cql3: select statement: don't copy metadata if not needed 2018-07-26 12:37:20 +01:00
Paweł Dziepak
15775c958a cql3: query_options: make simple getter inlineable 2018-07-26 12:37:06 +01:00
Paweł Dziepak
ef0c999742 cql3: metadata: avoid copying column information
The column-related metadata is shared by all requests done with the same
perpared query. However, metadata class contains also some additional
flags and paging state which may differ. This patch allows sharing
column information among multiple instances of the metadata class.
2018-07-26 12:17:04 +01:00
Paweł Dziepak
757d9e3b5d query_pager: avoid visiting result_view if not needed
query::result_visitor provides get_last_partition_and_clustering_key()
which allows getting those without iterating through the whole result.
Moreover, row count may be precomputed in the result, if it isn't there
is query::result_view::count_partitions_and_rows() for getting it.
2018-07-26 12:14:48 +01:00
Paweł Dziepak
9b6dc52255 query::result_view: add get_last_partition_and_clustering_key()
Paging needs to get last partition and clustering key (if the latter
exists). Previously, this was done by result_view visitor but that is
suboptimal. Let's add a direct getter for those.
2018-07-26 12:12:08 +01:00
Paweł Dziepak
b5ed4c8806 query::result_reader: fix const correctness 2018-07-26 12:11:27 +01:00
Paweł Dziepak
495df277f9 tests/uuid: add more tests including make_randm_uuid() 2018-07-26 12:03:37 +01:00
Paweł Dziepak
b485deb124 utils: uuid: don't use std::random_device()
std::random_device() is extremely slow. This patch modifies
make_rand_uuid() so that it requires only two invocations of the PRNG.
2018-07-26 12:02:32 +01:00
Avi Kivity
b167647bf6 dist: redhat: fix up bad file ownership of rpms/srpms
mock outputs files owned by root. This causes attempts
by scripts that want to junk the working directory (typically
continuous integration) to fail on permission errors.

Fixup those permissions after the fact.
Message-Id: <20180719163553.5186-1-avi@scylladb.com>
2018-07-26 08:20:42 +03:00
Avi Kivity
bea1f715dc storage_proxy: count cross-shard operations
Count operations which were started on one shard and
were performed on another, due to non-shard-aware driver
and/or RPC.
Message-Id: <20180723155118.8545-1-avi@scylladb.com>
2018-07-25 16:21:04 +01:00
Avi Kivity
d6ef74fe36 Merge "Fix JSON string quoting" from Piotr
"

This mini-series covers a regression caused by newest versions
of jsoncpp library, which changed the way of quoting UTF-8 strings.

Tests: unit (release)
"

* 'add_json_quoting_3' of https://github.com/psarna/scylla:
  tests: add JSON unit test
  types: use value_to_quoted_string in JSON quoting
  json: add value_to_quoted_string helper function

Ref #3622.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2018-07-25 17:49:55 +03:00
Piotr Sarna
b367cff05d tests: add JSON unit test
Since value_to_quoted_string now has an internal implementation,
a unit test is provided to check if strings are quoted
and escaped properly.
2018-07-25 13:16:06 +02:00
Piotr Sarna
d307b5712c types: use value_to_quoted_string in JSON quoting
In order to avoid regressions caused by external libraries,
our own value_to_quoted_string implementation is used.

Fixes #3622
2018-07-25 13:16:06 +02:00
Piotr Sarna
783762a958 json: add value_to_quoted_string helper function
After open-source-parsers/jsoncpp@42a161f commit jsoncpp's version
of valueToQuotedString no longer fits our needs, because too many
UTF-8 characters are unnecessarily escaped. To remedy that,
this commit provides our own string quoting implementation.

Reported-by: Nadav Har'El <nyh@scylladb.com>

Refs #3622
2018-07-25 13:16:00 +02:00
Piotr Sarna
f66aace685 cql3: fix INSERT JSON grammar
Previously CQL grammar wrongfully required INSERT JSON queries
to provide a list of columns, even though they are already
present in JSON itself.
Unfortunately, tests were written with this false assumption as well,
so they're are updated.
Message-Id: <33b496cba523f0f27b6cbf5539a90b6feb20269e.1532514111.git.sarna@scylladb.com>
2018-07-25 11:36:59 +01:00
Avi Kivity
b443a9b930 compaction: demote compaction start/end messages to DEBUG level
Compactions start and end all the time, especially with many shards,
and don't contribute much to understanding what is going on these
days. Compaction throughput is available through the metrics and
other information is available via the compaction history table.

Demote compaction start and end messages to DEBUG level to keep
the log clean. Cleaning and resharding compactions are kept as
INFO, at least for now, since they are manual operations and
therefore rarer.
Message-Id: <20180724132859.14109-1-avi@scylladb.com>
2018-07-25 09:53:39 +01:00
Takuya ASADA
58f094e06d dist/debian: fix ImportError on pystache
Seems like pystache does not provides dependency, need to install it on
build_deb.sh.

Fixes #3627

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180724164852.16094-1-syuu@scylladb.com>
2018-07-25 07:42:19 +03:00
Avi Kivity
e2ad45c3db Merge "Add clustering prefix logic to indexes and filtering" from Piotr
"
This series follows up ALLOW FILTERING support series and depends on
this one: https://groups.google.com/d/msg/scylladb-dev/Qxt3_MP03jI/5ZhRTJ3gBwAJ

The following optimizations regarding clustering key prefix and filtering are
applied:
 * if clustering key restrictions require filtering, but they still
   contain any part of the prefix, this prefix can be used to narrow
   down the query by using it in computing clustering bounds
 * if an indexed query has partition key restrictions and any clustering
   key restrictions that form a prefix, then from now on this prefix
   will be used to narrow down the index query

"

Ref #3611.

* 'use_prefix_with_filtering_and_si_4' of https://github.com/psarna/scylla:
  tests: add prefix cases to indexed filtered queries tests
  cql3: use ck prefix in filtered queries
  cql3: use clustering key prefix in index queries
  cql3: add conversion to ck longest prefix restrictions
  cql3: add prefix_size method to ck restrictions
2018-07-23 15:28:50 +03:00
Piotr Sarna
517a5b66ba tests: add prefix cases to indexed filtered queries tests
More cases related to querying clustering key prefix in an indexed
query are added to secondary index test suite.
2018-07-23 14:10:52 +02:00
Piotr Sarna
8523c24576 cql3: use ck prefix in filtered queries
If a filtering query has restrictions that include any clustering
prefix, the longest prefix will be used to narrow down the query.

Fixes #3611
2018-07-23 14:10:52 +02:00
Piotr Sarna
6cc8ccc771 cql3: use clustering key prefix in index queries
If an indexed query has partition+clustering key restrictions as well
and at least some of these restrictions create a prefix, this prefix
is used in the index query to narrow down the number of rows read.

Refs #3611
2018-07-23 14:10:52 +02:00
Piotr Sarna
ab74f75727 cql3: add conversion to ck longest prefix restrictions
For optimization purposes it's sometimes useful to extract
the longest prefix of clustering key restrictions in order
to narrow down queries.
2018-07-23 14:10:52 +02:00
Piotr Sarna
2e4c493870 cql3: add prefix_size method to ck restrictions
Clustering key restrictions are usually set for at least part
of the clustering key prefix. A method of extracting the longest
prefix's size is added.
2018-07-23 14:10:52 +02:00
Vladimir Krivopalov
ec7f853f49 sstables: Do not pass liveness_info to consume_row_end().
The liveness_info is unconditionally added to the _in_progress_row as of
commit cbfc741d70 so no need to pass it to consume_row_end() and add
conditionally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <7cd3e599817cbd4b857c3295153602cd2b9a6ef1.1532311852.git.vladimir@scylladb.com>
2018-07-23 13:10:36 +03:00
Avi Kivity
bb79eccf55 tests: sstable_mutation_test: hack around leak during sstable close
sstable close is an asychronous operation launched in the background,
so we can't wait for it. If the test ends before all operations are
complete, the background operations are detected as leaks.

We need either a proper close(), or maybe a sstables::quiesce() that
waits until there are no sstables alive on the shard, but until then,
a hack.
2018-07-23 12:40:46 +03:00
Avi Kivity
af6ce47082 Merge "Support filtering and fast-forwarding with SSTables 3.x" from Piotr and Vladimir
"
This patchset authored by Piotr fixes ck filtering and fast forwarding in SSTables 3.x.
For now only clustering rows are supported and range tombstones will come next.

Test: unit {release}
"

* 'projects/sstables-30/filtering/v5' of https://github.com/argenet/scylla:
  sstables: Minor clean-up and renaming to clustering_ranges_walker.
  sstables: Add test for filtering and forwarding
  sstables: Fix schema for static row tests
  sstables: Fix ck filtering and fast forwarding
  sstables: Introduce mutation_fragment_filter
2018-07-22 21:11:51 +03:00
Avi Kivity
761931659a Merge "Do not linearise incoming CQL3 requests" from Paweł
"
This series changes the native CQL3 protocl layer so that it works with
fragmented buffers instead of a single temporary_buffer per request.
The main part is fragmented_temporary_buffer which represents a
fragmented buffer consisting of multiple temporary_buffers. It provides
helpers for reading fragmented buffer from an input_stream, interpreting
the data in the fragmented buffer as well as view that satisfy
FragmentRange concept.

There are still situations where a fragmented buffer is linearised. That
includes decompressing client requests (this uses reusable buffers in a
similar way to the code that sends compressed responses), CQL statement
restrictions and values that are hard-coded in prepared statements
(hopefully, the values in those cases will be small), value validation
in some cases (blobs are not validated, irrelevant for many fixed-size
small types, but may be a problem for large text cells) as well as
operations on collections.

Tests: unit(release), dtests(cql_prepared_test.py, cql_tests.py, cql_additional_tests.py)
"

* tag 'fragmented-cql3-receive/v1' of https://github.com/pdziepak/scylla: (23 commits)
  types: bytes_view: override fragmented validate()
  cql3: value_view: switch to fragmented_temporary_buffer::view
  types: add validate that accepts fragmented_temporary_buffer::view
  cql3 query_options: add linearize()
  cql3: query_options: use bytes_ostream for temporaries
  cql3: operation: make make_cell accept fragmented_temporary_buffer::view
  atomic_cell: accept fragmented_temporary_buffer::view values
  cql3: avoid ambiguity in a call to update_parameters::make_cell()
  transport: switch to fragmented_temporary_buffer
  transport: extract compression buffers from response class
  tests/reusable_buffer: test fragmented_temporary_buffer support
  utils: reusable_buffer: support fragmented_temporary_buffer
  tests: add test for fragmented_temporary_buffer
  util fragment_range: add general linearisation functions
  utils: add fragmented_temporary_buffer
  tests: add basic test for transport requests and responses
  tests/random-utils: print seed
  tests/random-utils: generate sstrings
  cql3: add value_view printer and equality comparison
  transport: move response outside of cql_server class
  ...
2018-07-22 19:40:37 +03:00
Avi Kivity
30cddd4531 Merge "Support reading promoted index from SSTables 3.x" from Vladimir and Piotr
"
This patchset adds support for reading Index.db files written in
SSTables 3.x ('mc') format.

Note that the offsets map introduced in SSTables 3.x is neither used nor
read yet. It is located last in promoted index and so current parsers
just ignore it for the time being.

Later it should be used to perform binary search of a desired promoted
index block in large partition, thus reducing the complexity from linear
to logarithmic.

Tests: unit {release}
"

* 'projects/sstables-30/index_reader/v5' of https://github.com/argenet/scylla:
  sstables: Add getter for end_open_marker to index_reader.
  tests: Add test reading index for a partition comprised of RT markers of boundary types.
  tests: Add test for reading index of a partition comprised of only range tombstones.
  tests: Use std::adjacent_find in index_reader_assertions::has_monotonic_positions()
  tests: Read rows only index
  sstables: Do not seek through the promoted index for static row positions.
  sstables: Read promoted index stored in SSTables 3.x ('mc') format.
  sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats.
  utils: Add overloaded_functor helper.
  position_in_partition: Add a constructor from range_tag_t{}, bound_kind and clustering_key_prefix.
  sstables: Support reading signed vints in continuous_data_consumer.
  sstables: Factor out the code building a vector of fixed clustering values lengths.
  sstables: Remove unused includes from index_entry.hh
  tests: Add test for reading SSTables 3.x index file with empty promoted index.
  tests: Rename sstable_assertions.hh -> tests/index_reader_assertions.hh
  sstables: Support parsing index entries from SSTables 3.x format.
  sstables: move bound_kind_m to header
2018-07-22 16:15:41 +03:00
Vladimir Krivopalov
df1a151f75 sstables: Minor clean-up and renaming to clustering_ranges_walker.
- Renamed _current to _current_range to better reflect its nature as
  there are other similarly named members (_current_start and
  _current_end).

- Don't use a temporary variable for incrementing the change counter.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
01611f2083 sstables: Add test for filtering and forwarding
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
3466dc2368 sstables: Fix schema for static row tests 2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
abf3fc1b98 sstables: Fix ck filtering and fast forwarding
Both were broken before this change.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
564bcfa4d0 sstables: Introduce mutation_fragment_filter
This class encapsulates the logic related to
clustering key filtering and fast forwarding.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 16:19:07 -07:00
Vladimir Krivopalov
4d3467d793 sstables: Add getter for end_open_marker to index_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
c7285abc9e tests: Add test reading index for a partition comprised of RT markers of boundary types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
91f96d7d2b tests: Add test for reading index of a partition comprised of only range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
fc051954c2 tests: Use std::adjacent_find in index_reader_assertions::has_monotonic_positions()
Not only this is easier to read and understand, but it also doesn't
force the promoted_index_block class to support copying which is
heavyweight and otherwise not needed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
d4e0fa96e3 tests: Read rows only index
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
5561c713d9 sstables: Do not seek through the promoted index for static row positions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
917528c427 sstables: Read promoted index stored in SSTables 3.x ('mc') format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
86d14f8166 sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats.
This is a pre-requisite for parsing promoted index blocks written in
SSTables 'mc' format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
79c2f0095c utils: Add overloaded_functor helper.
The overloaded_functor class template can be used to encompass multiple
lambdas accepting different types into a single callable object that can
be used with any of those types.

One application is visitors for std::variant where different handling is
required for different types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
593d8faf7d position_in_partition: Add a constructor from range_tag_t{}, bound_kind and clustering_key_prefix.
This facilitates position_in_partition creation when parsing range tombstones bounds from SSTables files.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
997ebaaa14 sstables: Support reading signed vints in continuous_data_consumer.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
540dfcc9bf sstables: Factor out the code building a vector of fixed clustering values lengths.
This code will be re-used in promoted_index_blocks_parser to parse
clustering key prefixes from SSTables 3.x format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
741d5f3b5d sstables: Remove unused includes from index_entry.hh
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
b29b948872 tests: Add test for reading SSTables 3.x index file with empty promoted index.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
054eb2df66 tests: Rename sstable_assertions.hh -> tests/index_reader_assertions.hh
The previous name of the file is moreover confusing as we have several
sstable_assertions classes throughout tests but this header only
contains a class for index reader assertions.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
f50ffa267f sstables: Support parsing index entries from SSTables 3.x format.
With this patch, index_reader is capable of reading index_entries from
both 'ka'/'la' and 'mc' formats.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Piotr Jastrzebski
d0f8c71e28 sstables: move bound_kind_m to header
and add helper methods.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-20 13:50:17 -07:00
Duarte Nunes
6bd087facb Merge 'Make indexed queries with pk restrictions non-filtering' from Piotr
"
Queries that use secondary index and have a full partition key restriction
or full primary key restriction should not require filtering - it's
sufficient to add these restrictions to the index query.
This also adds secondary index tests to cover this case.

Tests: unit (release)
"

* 'si_and_pk_restrictions_2' of https://github.com/psarna/scylla:
  tests: add index + partition key test
  cql3: make index+primary key restrictions filtering-independent
  cql3: use primary key restrictions in filtering index queries
  cql3: add is_all_eq to primary key restrictions
  cql3: add explicit conversion between key restrictions
  cql3: add apply_to() method to single column restriction
  cql3: make primary key restrictions' values unambiguous
2018-07-19 16:54:43 +01:00
Tomasz Grabiec
d5534d6a77 Merge "Improve categorization of messaging verbs into connections" from Avi
Now that verb categorizations also affect scheduling, getting them
correct is more important. The first three patches in this series
improve the infrastructure a little, and the forth fixes some
categorization errors wrt. repair/streaming verbs.

* https://github.com/avikivity/scylla msg-idx-sanity/v1:
  messaging: choose connection index via a look-up table
  messaging: convert do_get_rpc_client_idx into a switch
  messaging: remove default when computing rpc client index
  messaging: categorize more streaming/repair verbs as streaming
2018-07-19 15:03:15 +02:00
Tomasz Grabiec
ef4fb1f91d sstables: mp_row_consumer_m: Add trace-level logging
Very useful for debugging. The old mp_row_consumer_k_l had this.

Message-Id: <1532000326-28649-1-git-send-email-tgrabiec@scylladb.com>
2018-07-19 14:58:00 +03:00
Asias He
1f06ee3960 range_streamer: Limit nr of nodes to stream in parallel
For example, to bootstrap a 50th node in a cluster

 [shard 0] range_streamer - Bootstrap with
 [127.0.0.8, 127.0.0.2, 127.0.0.24, 127.0.0.21, 127.0.0.49, 127.0.0.44,
 127.0.0.9, 127.0.0.7, 127.0.0.47, 127.0.0.15, 127.0.0.5, 127.0.0.30,
 127.0.0.14, 127.0.0.12, 127.0.0.36, 127.0.0.11, 127.0.0.48, 127.0.0.28,
 127.0.0.33, 127.0.0.10, 127.0.0.41, 127.0.0.4, 127.0.0.40, 127.0.0.3,
 127.0.0.6, 127.0.0.43, 127.0.0.22, 127.0.0.26, 127.0.0.42, 127.0.0.25,
 127.0.0.17, 127.0.0.37, 127.0.0.23, 127.0.0.13, 127.0.0.38, 127.0.0.1,
 127.0.0.18, 127.0.0.20, 127.0.0.39, 127.0.0.27, 127.0.0.34, 127.0.0.32,
 127.0.0.19, 127.0.0.16, 127.0.0.31, 127.0.0.45, 127.0.0.29, 127.0.0.35,
 127.0.0.46]
 for keyspace=keyspace1 started, nodes_to_stream=49, nodes_in_parallel=49

the new node will get data from 49 existing nodes.

Currently, it will stream from all the 49 existing nodes at the same
time. It is not a good idea to stream from all the nodes in parallel
which can overwhelm the bootstrap node, i.e., 49 nodes sending, 1 node
receiving.

To fix this, limit the nr of nodes to stream in parallel. We should have
a better control over the memory usage and parallelism. But for now,
limit the nr of nodes to a maximum of 16 as a starter. With this limit,
each shard can work with as many as 16 remote nodes in parallel, I think
this has enough parallelism for streaming in terms of performance.

This change have effect on the bootstrap/decommission/removenode node
operations, and do not have effect on repair.

Refs #2782

Message-Id: <980610dc97490d4f16281a0c3203b9bee73e04e4.1531989557.git.asias@scylladb.com>
2018-07-19 11:44:05 +03:00
Avi Kivity
31d4d37161 Merge "Reduce continuous memory usage in gossip" from Asias"
"
Use chunked_vector instead of vector. It won't have compatibility issues
because chunked_vector and vector have the same on wire format.

Refs #278
"

* 'asias/gossip_memory_v2' of github.com:scylladb/seastar-dev:
  gossip: Reduce continuous memory usage
  to_string: Add std::list and utils::chunked_vector support
  serializer: Add chunked_vector support
2018-07-19 09:12:09 +03:00
Tomasz Grabiec
9a0548397c tests: row_cache: Add test for eviction from invalidated partitions
Message-Id: <1531933216-28026-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 21:06:36 +03:00
Piotr Sarna
82c049692b tests: add index + partition key test
Tests covering querying both index and partition keys are added
- it's checked that such queries do not require filtering.
2018-07-18 18:45:08 +02:00
Piotr Sarna
0c85bdcdc2 cql3: make index+primary key restrictions filtering-independent
If full partition key (or full primary key) is used in an indexed
query, it should not require filtering, because queries like that
can be efficiently narrowed down with stricter index restrictions.
2018-07-18 18:45:08 +02:00
Piotr Sarna
2542630a18 cql3: use primary key restrictions in filtering index queries
If both index and partition key is used in a query, it should not
require filtering, because indexed query can be narrowed down
with partition key information. This commit appends partition key
restrictions to index query.
2018-07-18 18:45:08 +02:00
Piotr Sarna
27590816f0 cql3: add is_all_eq to primary key restrictions
is_all_eq is later needed to decide if restrictions can be used
in an indexed query.
2018-07-18 18:45:08 +02:00
Piotr Sarna
20a349777e cql3: add explicit conversion between key restrictions
Partition and clustering key restrictions sometimes need to be converted
and this commit provides a way to do that.
2018-07-18 18:45:08 +02:00
Piotr Sarna
f1357defd6 cql3: add apply_to() method to single column restriction
This method allows copying single column restriction,
possibly with a new column definition.
2018-07-18 18:44:38 +02:00
Tomasz Grabiec
dc453d4f5d tests: flat_mutation_reader: Use fluent assertions for better error messages
Message-Id: <1531908313-29810-2-git-send-email-tgrabiec@scylladb.com>
2018-07-18 13:52:23 +01:00
Tomasz Grabiec
604c8baed8 tests: flat_mutation_reader_assertions: Introduce produces(mutation_fragment)
Message-Id: <1531908313-29810-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 13:52:23 +01:00
Tomasz Grabiec
c46813717c tests: sstables: Check that reading large index pages does not cause large allocations
Reproducer of #3597.

Message-Id: <1531914040-5427-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 14:56:41 +03:00
Piotr Sarna
30f9924ad5 cql3: make primary key restrictions' values unambiguous
using directive must be used to disambiguate the overridden method.
2018-07-18 13:28:37 +02:00
Paweł Dziepak
a0c1c0c921 types: bytes_view: override fragmented validate()
The default implementation linearises the buffer and calls
validate(bytes_view). This is bad and not needed for bytes_type which
doesn't do any validation anyway.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
0b9eed72f4 cql3: value_view: switch to fragmented_temporary_buffer::view 2018-07-18 12:28:06 +01:00
Paweł Dziepak
0551efee3b types: add validate that accepts fragmented_temporary_buffer::view 2018-07-18 12:28:06 +01:00
Paweł Dziepak
8f4cb36ef2 cql3 query_options: add linearize()
Some code in the CQL3 layer requires bytes_view and it is fairly
reasonable to assume that it won't deal with large buffers (e.g.
statement restrictions). query_options already has make_temporary()
which takes ownership of a cql3::raw_value so that the rest of the code
can use cql3::raw_value_view. This patch adds similar linearize()
function which, if necessary, linearises a cql3::raw_value_view and
returns a bytes_view with lifetime tied to the life or query_options.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
3810045f8f cql3: query_options: use bytes_ostream for temporaries
bytes_ostream is going to be more efficient than
std::vector<std::vector<char>> since it can put multiple small values in
a single buffer thus reducing the number of memory allocations.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
dff6cd3e2f cql3: operation: make make_cell accept fragmented_temporary_buffer::view 2018-07-18 12:28:06 +01:00
Paweł Dziepak
cc87263bd8 atomic_cell: accept fragmented_temporary_buffer::view values 2018-07-18 12:28:06 +01:00
Paweł Dziepak
7d7910aa4d cql3: avoid ambiguity in a call to update_parameters::make_cell()
Using initializer lists in calls like foo({}) is ambiguous if foo() has
multiple overloads with more than one accepting a type that is
default-constructible. update_parameters::make_cell() is about to get an
overload that accepts fragmented_temporary_buffer::view as a value, so
let's make sure its call site won't be ambiguous.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
8c6e544fec transport: switch to fragmented_temporary_buffer
The logic responsible for reading requests was operating on
temporary_buffer<char> and bytes_view. This required all request
messages to be linearised to a contiguous buffer, possibly causing large
allocations. Changing to fragmented_temporary_buffer mostly alleviates this
problem unless the reader code explicitly asks for a contiguous bytes_view.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
f95bb21d99 transport: extract compression buffers from response class
Both compression and decompression code is going to reuse the same pair
of reusable buffers.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
a8c4f41a0b tests/reusable_buffer: test fragmented_temporary_buffer support 2018-07-18 12:28:06 +01:00
Paweł Dziepak
32ba47fb87 utils: reusable_buffer: support fragmented_temporary_buffer
reusable_buffer already supports bytes_ostream which is often used for
handling data sent from Scylla. This patch adds support for
fragmented_temporary_buffer which is going to be mainly used for data
received by Scylla.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
166c9a3b8c tests: add test for fragmented_temporary_buffer 2018-07-18 12:28:06 +01:00
Paweł Dziepak
b152aafd67 util fragment_range: add general linearisation functions
All FragmentRange implementations can be linearised in the same way, so
let's provide linearized() and with_linearized() functions for all of
them.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
fc484f0819 utils: add fragmented_temporary_buffer
Seastar output_streams produce temporary_buffer<char>s.
fragmented_temporary_buffer represents a single fragmented buffer that
consists of, possibly multiple, temporary_buffer<char>s.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
b5a72a880b tests: add basic test for transport requests and responses 2018-07-18 12:28:06 +01:00
Paweł Dziepak
054d39b8f7 tests/random-utils: print seed
Knowning the seed will make it easier to investigate failures in
randomised tests.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
9445ce3f84 tests/random-utils: generate sstrings 2018-07-18 12:28:06 +01:00
Paweł Dziepak
46acd76cc8 cql3: add value_view printer and equality comparison
BOOST_CHECK_*() expect compared objcts to be equality-comparable and
printable.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
24929fd2ce transport: move response outside of cql_server class 2018-07-18 12:28:06 +01:00
Paweł Dziepak
5986e7a383 transport: drop request_reader::read_value() 2018-07-18 12:28:06 +01:00
Paweł Dziepak
72450e2f7f transport: extract request reading to request_reader 2018-07-18 12:28:06 +01:00
Paweł Dziepak
1eeef4383c transport: fix use-after-free in read_name_and_value_list() 2018-07-18 12:28:06 +01:00
Avi Kivity
31151cadd4 Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz
"
The problem happens under the following circumstances:

  - we have a partially populated partition in cache, with a gap in the middle

  - a read with no clustering restrictions trying to populate that gap

  - eviction of the entry for the lower bound of the gap concurrent with population

The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.

Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.

The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise, we're populating since _last_row and should
consult it.

Fixes #3608.
"

* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
  row_cache: Fix violation of continuity on concurrent eviction and population
  position_in_partition: Introduce is_before_all_clustered_rows()
2018-07-18 10:11:34 +03:00
Asias He
506eed325a dht: Fix typo in boot_strapper.cc
Eror -> Error

Message-Id: <ab1050c526f6e70c3a365595376acde7706d86e9.1531877929.git.asias@scylladb.com>
2018-07-18 10:00:27 +03:00
Tomasz Grabiec
894961006b Merge "db/view/view_builder: Fixes to bookkeeping" from Duarte
This series contains a couple of fixes to the bookkeeping of the view
build process, which could cause data to be left behind in the system
tables.

* git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1:

Duarte Nunes (3):
  db/system_keyspace: Add function to remove view build status of a
    shard
  db/view: Don't have shard 0 clear other shard's status on drop
  db/view: Restrict writes to the distributed system keyspace to shard 0
2018-07-17 18:01:28 +02:00
Tomasz Grabiec
25d09e51ac Merge "db/view/build_progress_virtual_reader: Fixes to clustering key adjusts" from Duarte
This series contains a couple of fixes to the adjusting of clustering
keys in the build_progress_virtual_reader, some of which could
potentially cause heap overflows when querying the legacy system table.

* git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1:

Duarte Nunes (3):
  db/view/build_progress_virtual_reader: Use correct schema to adjust ck
  db/view/build_progress_virtual_reader: Fix full ck detection
  db/view/build_progress_virtual_reader: Also adjust end RT bound
2018-07-17 18:00:30 +02:00
Avi Kivity
9ffa6b9ad6 Merge "Fix leaks and corruption of continuity in cache in case of bad_alloc from key linearization" from Tomasz
"
This series fixes two issues related to bad_allocs and keys which require
linearization (larger than 12.8 KiB). With such keys, comparators may throw if
memory allocation fails. This may cause lookups in partition and rows trees to
fail with bad_alloc.

The first issue (#3583) was that partition version merging
(mutation_partition::apply_monotonically()) was not taking into account that
lookups may fail. If we fail, the partition which is being applied may be
incorrectly left with the clustering range since the begging of the range up
to the current row marked as continuous, if the current row has the continuity
flag set, because we've moved all of the preceding rows into the target, and
the correct lower bound row is no longer there in the source. This may mark
some discontinuous ranges as continuous. Merging is retried by
allocating_section, and there will be no problem if it eventually succeeds,
original continuity will be reflected in the sum. The problem will persist if
it doesn't eventually succeed, when we're really out of memory.

The user-perceivable effect of this would be temporary loss of writes in the
clustering range which was marked as continuous but shouldn't. Introduced in
2.2-rc1.

The second issue (#3585) is that the code which inserts partitions in memtable
and cache will leak the entry if boost::intrusive_set::insert() throws. This
will also cause SIGSEGV when cache tries to evict from such a leaked entry.
"

* tag 'tgrabiec/fix-bad-continuity-on-oom-in-apply-v2' of github.com:tgrabiec/scylla:
  managed_bytes: Mark read_linearize() as an allocation point
  tests: Relax expectation about continuity after failed merging
  tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging
  tests: Switch to seastar's allocation failure injector
  mutation_partition: Introduce set_continuity()
  clustering_interval_set: Introduce contained_in()
  clustering_interval_set: Introduce add() overload accepting another interval set
  mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
  mutation_partition: Preserve continuity in case row merging with no tracker throws
  memtable, cache: Fix exception safety of partition entry insertions
2018-07-17 18:19:37 +03:00
Tomasz Grabiec
477d7b439b row_cache: Fix violation of continuity on concurrent eviction and population
ensure_population_lower_bound() returned true if current clustering
range covers all rows, which means that the populator has a right to
set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise we're populating since _last_row, and
should consult it.

The fix introduces a new flag, set when starting to populte, which
indicates if we're populating from the beginning of the range or
not. We cannot simply check if _last_row is set in
ensure_population_lower_bound() because _last_row can be set and then
become empty again.

Fixes #3608
2018-07-17 16:43:21 +02:00
Tomasz Grabiec
8d47d21149 position_in_partition: Introduce is_before_all_clustered_rows() 2018-07-17 16:43:21 +02:00
Tomasz Grabiec
612b223819 managed_bytes: Mark read_linearize() as an allocation point 2018-07-17 16:39:43 +02:00
Tomasz Grabiec
be678a81ee tests: Relax expectation about continuity after failed merging
Currently we check that the sum of continuities is exactly the same as
expected on failure. Relax this to require that continuity is not
broader, since in some bad_alloc scenarios, or preemption, we will
have to mark some ranges as discontinuous.
2018-07-17 16:39:43 +02:00
Tomasz Grabiec
f366ac76e8 tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
d9db79a85d tests: Switch to seastar's allocation failure injector
It catches more allocation sites.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
6b1fe6cbe5 mutation_partition: Introduce set_continuity() 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
ac772cbd81 clustering_interval_set: Introduce contained_in() 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
d24ebe8565 clustering_interval_set: Introduce add() overload accepting another interval set 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
c6c54021a8 mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
When clustering keys are larger than 12.8 KiB they may get fragmented
and key comparator will need to linearize them on comparison. This may
cause lookups in the rows tree to fail with bad_alloc. Partition
version merging (mutation_partition::apply_monotonically()) was not
taking this into account. If we fail on lookup, the partition which is
being applied may be incorrectly left with the clustering range since
the begging up to the current row marked as continuous, if the current
row has the continuity flag set, because we've moved all of the
preceding rows into the target, and the correct lower bound row is no
longer there in the source. This may mark some discontinuous ranges as
continuous.

Merging is retried by allocating_section, and there will be no problem
if it eventually suceeds, original continity will be reflected in the
sum. The problem will persist if it doesn't eventually succeed, when
we're really out of memory.

To protect against this, we could reset the continuity flag of the
current row in the source when exiting on exception.

Fixes #3583
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
de5c52f422 mutation_partition: Preserve continuity in case row merging with no tracker throws
Example:

 p:      row{key=A, cont=0} row{key=C, cont=1}
 this:                      row{key=C, cont=0}

When we get to processing key=C, key=A was already moved to this, so p
has stale continuity on key=C, which marks (-inf,C) as continuous,
whereas it should mark only (A, C). That's not a problem if merging
succeeds, but if exception happens at this point, we will violate the
invariant which says that the sum of p and this should yield the same
logical partition. It wouldn't because continuity of the sum is
calculated as a set union, and (-inf, A) would be incorrectly turned
into a continuous range.

This is not a problem currently because continuity is always full when
there is no tracker (memtables), so won't change anyway, and when
there is a tracker (cache) we never merge but overwrite instead, so
there is no memory allocation and thus no possibility for failure. But
better be safe.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
567da3e063 memtable, cache: Fix exception safety of partition entry insertions
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.

When this happens in cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.

Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that entry is deallocated on insert failure.

Fixes #3585.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
c82c0be0be tests: mutation_diff: Ignore differences in memory addresses
Differences in memory addresses are not necessarily differences in
values.

Refs #3571

Message-Id: <1531824919-12737-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 16:32:04 +03:00
Amos Kong
0fcdab8538 scylla_setup: nic setup dialog is only for interactive mode
Current code raises dialog even for non-interactive mode when we pass options
in executing scylla_setup. This blocked automatical artifact-test.

Fixes #3549

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <58f90e1e2837f31d9333d7e9fb68ce05208323da.1531824972.git.amos@scylladb.com>
2018-07-17 16:31:18 +03:00
Paweł Dziepak
422d1eaeb9 Merge "Improve usability of pkeys in system.large_partitions table" from Avi
"
Partition keys are currently stored in serialized form in the
system.large_partitions table. This is an obstacle to operators
who usually can't deserialize partition keys in their heads.

Improve the situation by deserializing the partition key for them.
"

* tag 'pkey-print/v1' of https://github.com/avikivity/scylla:
  large_partition_handler: output friendly partition key
  keys: schema-aware printing of a partition_key
2018-07-17 13:51:22 +01:00
Avi Kivity
002ac87aac Update seastar submodule
* seastar aac6cf1...6b97e00 (5):
  > Merge "changes to fix travis CI builds" from Kefu
  > tls.cc: Make "close" timeout delay exception proof
  > core/sharded: mark foreign_ptr::get_owner_shard() const
  > core/memory: Expose counter of large allocations
  > tests: add test for multi-fragmented net::packet

Fixes #3461.
Ref scylladb/seastar#474.
2018-07-17 15:43:01 +03:00
Tomasz Grabiec
3f509ee3a2 mutation_partition: Fix exception-safety of row copy constructor
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.

Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.

Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 13:21:21 +01:00
Asias He
fd71c5718f gossip: Reduce continuous memory usage
Gossip SYN and ACK uses std::vector to store a list of gossip_digest,
the larger the cluster, the more continuous memory is needed. To reduce
the memory pressure which might cause std::bad_alloc, switch the std::vector
to chunked_vector.

In addition, change add_local_application_state to use std::list instead
of std::vector.

Refs #2782
2018-07-17 20:15:32 +08:00
Avi Kivity
acb3163639 large_partition_handler: output friendly partition key
Use abstract_type::to_string() to prettify partition key components.

Manually tested by setting --compaction-large-partition-warning-threshold-mb
to zero and inspecting the output for compound and non-compound partition
keys.
2018-07-17 14:44:52 +03:00
Avi Kivity
bfd14b4123 keys: schema-aware printing of a partition_key
Add a with_schema() helper to decorate a partition key with its
schema for pretty-printing purposes, and matching operator<<.

This is useful to print partition keys where the operator, who
may not be familiar with the encoding, may see them.
2018-07-17 14:43:12 +03:00
Tomasz Grabiec
d94c7c07a3 lsa: Disable alloc failure injector inside the LSA sanitizer
Message-Id: <1531814822-30259-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 11:27:56 +01:00
Asias He
77018b7304 to_string: Add std::list and utils::chunked_vector support
It will be used by the gossip code.
2018-07-17 16:14:31 +08:00
Asias He
e4802d2fe3 serializer: Add chunked_vector support
It will be used by the gossip SYN and ACK message soon.
2018-07-17 16:12:50 +08:00
Botond Dénes
cc4acb6e26 storage_proxy: use the original row limits for the final results merging
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuations captures via value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications done to the read
command object done by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.

Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes appearantly reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.

Fixes #3605.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
2018-07-16 16:54:50 +03:00
Takuya ASADA
9479ff6b1e dist/common/scripts/scylla_prepare: fix error when /etc/scylla/ami_disabled exists
On this part shell command wasn't converted to python3, need to fix.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180715075015.13071-1-syuu@scylladb.com>
2018-07-16 09:29:38 +03:00
Avi Kivity
c4013f6fe1 messaging: categorize more streaming/repair verbs as streaming
Since the messaging service will assign a scheduling group based
on the client index, it's more important now to get the verbs categorized
correctly.

Re-categorize REPLICATION_FINISHED, REPAIR_CHECKSUM_RANGE, and most
importantly STREAM_MUTATION_FRAGMENTS to the repair/streaming oriented
connections so we get the correct scheduling.
2018-07-15 15:44:10 +03:00
Avi Kivity
ff3d7839ab messaging: remove default when computing rpc client index
A default means that when adding new verbs, we may forget to
categorize a verb correctly.  Without the default, the compiler
will complain due to -Wswitch.
2018-07-15 15:40:29 +03:00
Avi Kivity
fe2db68be8 messaging: convert do_get_rpc_client_idx into a switch
A switch is more readable for multiple choice with no
clearly preferred choice.
2018-07-15 15:26:50 +03:00
Avi Kivity
3b1e04091c messaging: choose connection index via a look-up table
Looking up is faster than a bunch of if()s.
2018-07-15 15:21:06 +03:00
Takuya ASADA
1511d92473 dist/redhat: drop scylla_lib.sh from .rpm
Since we dropped scylla_lib.sh at 58e6ad22b2,
we need remove it from RPM spec file too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712155129.17056-1-syuu@scylladb.com>
2018-07-15 14:46:22 +03:00
Avi Kivity
ef9b36376c Merge "database: support multiple data directories" from Glauber
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
  host snapshots for the tables we are examining
- snapshots are created with files in its own directories, but the
  manifest file goes to the main directory. For this one, note that in
  Cassandra the same thing happens, except that there is no "main"
  directory. Still the manifest file is still just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory

Despite the restrictions, one example of usage of this is recovery.  If
we have network attached devices for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.

Tests: unit (release)
"

* 'multi-data-file-v2' of github.com:glommer/scylla:
  database: change ident
  database: support multiple data directories
  database: allow resharing to specify a directory
  database: support multiple directories in get_snapshot_details
  database: move get_snapshot_info into a seastar::thread
  snapshots: always create the snapshot directory
  sstables: pass sstable dir with entry descriptor
  database: make nodetool listsnapshots print correct information
  sstables: correctly create descriptors for snapshots
2018-07-15 13:31:04 +03:00
Avi Kivity
8ee807321f Merge "scylla streaming with rpc streaming" from Asias
"
This work is on top of Gleb's rpc streaming which is merged recently.

What this series does is to replace scylla streaming service's data plane to
use the new rpc streaming instead of the old rpc verb to send the mutations for
scylla streaming. Other parts of scylla streaming, the control plane, are not
changed.

In my test, to bootstrap a new node to the existing one node cluster, smp 2,
scylla stores data on ramdisk to minimize disk io impact.

I saw x2 improvment in streaming bandwidth.

Before:
   [shard 0] stream_session - [Stream #2ae92320-5fc8-11e8-911a-000000000000]
   Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1570312 KiB, 109521.02 KiB/s
   [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 14.338 seconds

After:
   [shard 0] stream_session - [Stream #e5589ac0-5fc7-11e8-b463-000000000000]
   Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1546875 KiB, 220415.36 KiB/s
   [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 7.018 seconds

Tests: dtest update_cluster_layout_tests.py

Fixes: #3591
"

* tag 'asias/scylla_streaming_with_rpc_streaming_v8' of github.com:scylladb/seastar-dev:
  streaming: Add rpc streaming support
  storage_service: Introduce STREAM_WITH_RPC_STREAM feature
  streaming: Add estimate_partitions to send_info
  messaging_service: Add streaming with rpc streaming support
  messaging_service: Add streaming_domain
  database: Add add_sstable_and_update_cache
  database: Add make_streaming_sstable_for_write
2018-07-15 12:36:52 +03:00
Vlad Zolotarov
235520292e utils::loading_cache: hold a shared_value_ptr to the value when we reload
This allows to remove the requirement to hold the key value inside the
_load callback if its value is needed in the asynchronous continuation
inside the callback in the context of a reload.

This also resolves the use-after-free issue when a _load() callback removes
the item for a given key.

See a9b72db34d.1528794135.git.bdenes%40scylladb.com
for a discussion about this.

In addition this patch makes the loading_cache more robust for any existing
and potential situations when cached entries are being removed from inside the
callback. This is achieved by extending the idea implemented by Duarte in the
"utils/loading_cache: Avoid using invalidated iterators" by capturing timestamped_val_ptr
(which is essentially a lw_shared_ptr to an intrusive set entry which holds both the key
and the cached value) instead of a naked pointer.

Tests {debug, release}:
   - Unit tests:
      - loading_cache_test
      - view_build_test
      - auth_test
      - auth_resource_test

   - dtest:
      - auth_test.py

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-13 11:27:58 -04:00
Vlad Zolotarov
b44ad5677a utils::loading_cache::on_timer(): remove not needed capture of "this"
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-13 11:27:43 -04:00
Vlad Zolotarov
4aa0e5914b utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
The list of elements that needs to be reloaded may be rather large.
Use chunked_vector in order to make the allocator's life easier.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-13 09:53:59 -04:00
Avi Kivity
8c993e0728 messaging: tag RPC services with scheduling groups
Assign a scheduling_group for each RPC service. Assignement is
done by connection (get_rpc_client_idx()) - all verbs on the
same connection are assigned the same group. While this may seem
arbitrary, it avoids priority inversion; if two verbs on the same
connection have different scheduling groups, the verb with the low
shares may cause a backlog and stall the connection, including
following requests from verbs that ought to have higher shares.

The scheduling_group parameters are encapsulated in different
classes as they are passed around to avoid adding dependencies.
Message-Id: <20180708140433.6426-1-avi@scylladb.com>
2018-07-13 13:57:08 +02:00
Vladimir Krivopalov
cf7b42619d clustering_ranges_walker: Improve class consistency and readability.
This patch addresses several issues.
  1. The class no longer uses placement-new trick for move-assignment.
     It was incorrect to use because the class contains const refererences
     and re-initializing the same region of memory would result in undefined
     behaviour on accessing these members.

  2. Use boost::iterator_range for tracking the current range of
     cr_ranges. It is easier to deal with and avoids possible bugs like
     assigning only one of two iterators
Message-Id: <4096182c4ee2fb1157e135c487c41012b266ba69.1531440684.git.vladimir@scylladb.com>
2018-07-13 11:23:33 +02:00
Asias He
deff5e7d60 streaming: Add rpc streaming support
This patch changes scylla streaming to use the recently added rpc
streaming feature provided by seastar to send mutation fragments for
scylla streaming instead of the rpc verbs.

It also changes the receiver to write to the sstable file directly,
skipping writing to memtable.
2018-07-13 08:36:47 +08:00
Asias He
71e22fe981 storage_service: Introduce STREAM_WITH_RPC_STREAM feature
With this feature, the node supports scylla streaming using the rpc
streaming.
2018-07-13 08:36:47 +08:00
Asias He
faa6769cdb streaming: Add estimate_partitions to send_info
The sender needs to estimate the number of partitions to send, because
the receiver needs this to prepare the sstables.
2018-07-13 08:36:46 +08:00
Asias He
ddfb4590ce messaging_service: Add streaming with rpc streaming support
Preparation for adding rpc streaming in scylla streaming.

- register_stream_mutation_fragments is used to register the rpc
streaming verb

- make_sink_and_source_for_stream_mutation_fragments is used to get the
sink and source object for the sender

- make_sink_for_stream_mutation_fragments is used to get a sink object
for the receiver
2018-07-13 08:36:46 +08:00
Asias He
671e1b08fe messaging_service: Add streaming_domain
The rpc streaming needs a streaming_domain id for the same logical
server. Chose one for our messaging service.
2018-07-13 08:36:46 +08:00
Asias He
6540051f77 database: Add add_sstable_and_update_cache
Since we can write mutations to sstable directly in streaming, we need
to add those sstables to the system so it can be seen by the query.
Also we need to update the cache so the query refects the latest data.
2018-07-13 08:36:45 +08:00
Asias He
dfc2739625 database: Add make_streaming_sstable_for_write
This will be used to create sstable for streaming receiver to write the
mutations received from network to sstable file instead of writing to
memtable.
2018-07-13 08:36:45 +08:00
Takuya ASADA
ee61660b76 dist/common/scripts/scylla_ec2_check: support custom NIC ifname on EC2
Since some AMIs using consistent network device naming, primary NIC
ifname is not 'eth0'.
But we hardcoded NIC name as 'eth0' on scylla_ec2_check, we need to add
--nic option to specify custom NIC ifname.

Fixes #3584

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712142446.15909-1-syuu@scylladb.com>
2018-07-12 18:22:28 +03:00
Tomasz Grabiec
b17f7257a9 sstables: index_reader: Reduce size of index_entry by indirecting promoted_index
Reduces size of index_entry from 384 bytes to 64 bytes by using
indirection for the optional promoted index instead of embedding it.

Improves query time from 9ms to 4ms in a micro benchmark with a very
large index page.

Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>
2018-07-12 17:46:58 +03:00
Tomasz Grabiec
101dcdbb48 gdb: Fix scylla heapprof command
Type of _frames was chagned to static_vector<>

Message-Id: <1531233685-20786-2-git-send-email-tgrabiec@scylladb.com>
2018-07-12 16:51:30 +03:00
Tomasz Grabiec
059133ffa8 gdb: Introduce iteration wrapper for static_vector
Message-Id: <1531233685-20786-1-git-send-email-tgrabiec@scylladb.com>
2018-07-12 16:51:30 +03:00
Duarte Nunes
63b63b0461 utils/loading_cache: Avoid using invalidated iterators
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.

For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.

Fix this by using a temporary container to hold those entries to be
reloaded.

Spotted when reading the code.

Also use if constexpr and fix the comment in the function containing
the changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
2018-07-12 13:59:09 +01:00
Botond Dénes
2e7bf9c6f9 loading_cache::reload(): obtain key before calling _load()
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However if _load()
invalidates the entry the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
2018-07-12 13:42:42 +01:00
Avi Kivity
a4a2f743a8 Merge "Avoid large allocations when reading sstable index pages" from Tomasz
"
If there is a lot of partitions in the index page, index_list may grow large
and require large contiguous blocks of memory, because it's based on
std::vector. That puts pressure on the memory allocator, and if memory is
fragmented, may not be possible to satisfy without a lot of eviction. Switch
to chunked_vector to avoid this.

Refs #3597
"

* 'tgrabiec/avoid-large-alloc-in-index-reader' of github.com:tgrabiec/scylla:
  sstables: Switch index_list to chunked_vector to avoid large allocations
  utils: chunked_vector: Do not require T to be default-constructible for clear()
  utils: chunked_vector: Implement front()
2018-07-12 15:30:18 +03:00
Duarte Nunes
1fb3b924f4 utils/loading_cache: Remove superfluous continuation
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712122031.13424-1-duarte@scylladb.com>
2018-07-12 15:22:35 +03:00
Takuya ASADA
8f80d23b07 dist/common/scripts/scylla_util.py: fix typo
Fix typo, and rename get_mode_cpu_set() to get_mode_cpuset(), since a
term 'cpuset' is not included '_' on other places.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711141923.12675-1-syuu@scylladb.com>
2018-07-12 10:14:55 +03:00
Tomasz Grabiec
8c85b01ad3 gdb: Fix scylla lsa-segment on python 3
Referring to a function parameter via "global" no longer works on
python 3. We should be using "nonlocal", which is absent on python 2
though. To make the script work on both, inline next().

Message-Id: <1531317984-29224-1-git-send-email-tgrabiec@scylladb.com>
2018-07-12 10:14:22 +03:00
Duarte Nunes
a7fdf4fc49 Merge 'ALLOW FILTERING for indexed queries' from Piotr
"
Previous series on ALLOW FILTERING introduced it for regular queries,
but it's also possible to have an indexed query which requires
filtering. This series contains minor fixes that allow treating
indexed+filtered queries properly. The most important part is having
more selective approach of extracting values from restrictions
in read_posting_list() helper function. Before ALLOW FILTERING,
restrictions contained only a single entry that matched the indexed
column, but it's not the case with filtering (and it won't be the case
with multiple indexing support).

This series also comes with test cases for indexed+filtered queries.

Tests: unit (release)
"

* 'allow_filtering_and_si_3' of https://github.com/psarna/scylla:
  tests: add filtering indexed queries tests
  cql3: use single restriction value in index creation
  cql3: add secondary index condition to need_filtering
  cql3: add value_for method
  cql3: add missing inline declarations to restrictions
  cql3: make index detection more specific
  index: add target_column getter to index
2018-07-12 00:17:36 +01:00
Duarte Nunes
55caaec411 db/view/build_progress_virtual_reader: Also adjust end RT bound
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 23:28:31 +01:00
Duarte Nunes
eda6b88b0e db/view/build_progress_virtual_reader: Fix full ck detection
As an optimization, the virtual reader doesn't change the underlying
key if it is not full, and hence doesn't include the extra clustering
key. However, this detection is broken because it checked for 3
clustering columns, instead of 2.

This patch fixes that by obtaining the clustering key size from the
underlying schema instead of hardcoding the size.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 23:28:31 +01:00
Duarte Nunes
ff3a0d437a db/view/build_progress_virtual_reader: Use correct schema to adjust ck
The virtual reader adjusts clustering keys obtained from the
underlying, scylla-specific schema, and potentially sheds the extra
clustering key that's absent from the Cassandra-compatible schema.

This patches ensures we use the correct schema to iterator over the
key.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 23:28:31 +01:00
Duarte Nunes
df66d7db59 db/view: Restrict writes to the distributed system keyspace to shard 0
Writing to the distributed system keyspace should be confined to a
single shard of each host, namely shard 0. We were violating this
constraint by having all shards set the host status to "started". This
could be problematic when the build finishes quickly or there's a
concurrent view drop, such that a write done by shard 0 can have a
smaller timestamp than one done by some other shard.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 21:45:26 +01:00
Duarte Nunes
e683c1367f db/view: Don't have shard 0 clear other shard's status on drop
Shard 0 can clear the in-progress build status of all shards when a
view finishes building, because we are ensured all writes to the
system table have completed with earlier timestamps.

This is not the case when dropping a view. A drop can happen
concurrently with the build, in which case shard 0 may process the
notification before another shard receives it, and before that shard
writes to the system table.

Fix this by ensuring each shard clears its own status on drop.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 21:45:26 +01:00
Duarte Nunes
2fa7f10429 db/system_keyspace: Add function to remove view build status of a shard
This patch adds a function that clears the view build in-progress
status for the current shard, similar to the existing one that clears
it across all shards.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 21:27:39 +01:00
Piotr Sarna
fcfbc804e4 tests: add filtering indexed queries tests
Tests covering ALLOW FILTERING usage while using secondary indexes
as well are added to cql_query_test.
Tests are based on Cassandra's test suite for filtering secondary
indexes + some more simple cases.
2018-07-11 18:06:21 +02:00
Piotr Sarna
7d9715db27 cql3: use single restriction value in index creation
ALLOW FILTERING support caused index-related restrictions to possibly
have more values. In order to remain correct, only those restrictions
which match the indexed columns should be used.
2018-07-11 18:06:21 +02:00
Piotr Sarna
1d75035672 cql3: add secondary index condition to need_filtering
A query that restricts a partition key and an indexed column
needs filtering (after reading an index) and it wasn't
properly detected before.
2018-07-11 18:06:21 +02:00
Piotr Sarna
80ce9b72a1 cql3: add value_for method
In order to extract value from a restriction for just one column,
value_for(column_name, options) method is implemented.
It's needed because once ALLOW FILTERING support was introduced,
index-related restrictions may contain more than 1 value.
2018-07-11 18:06:21 +02:00
Piotr Sarna
c1ad28f28e cql3: add missing inline declarations to restrictions
In order to prevent future compilation errors, externally defined
class methods from single column primary key restrictions are explicitly
marked inline.
2018-07-11 18:06:21 +02:00
Piotr Sarna
02811d8996 cql3: make index detection more specific
Conditions that detect if restrictions need an indexed query weren't
specific enough to work properly with mixed index-filtering queries,
because they would overly eager assume that partition/clustering key
restrictions have a backing index.
2018-07-11 18:06:21 +02:00
Piotr Sarna
372644c909 index: add target_column getter to index
Target column for an index is later needed to find matching
restrictions.
2018-07-11 18:06:21 +02:00
Tomasz Grabiec
3b2890e1db sstables: Switch index_list to chunked_vector to avoid large allocations
If there is a lot of partitions in the index page, index_list may grow
large and require large contiguous blocks of memory. That puts
pressure on the memory allocator, and if memory is fragmented, may not
be possible to satisfy without a lot of eviction.
2018-07-11 16:55:20 +02:00
Tomasz Grabiec
b0f5df10d2 utils: chunked_vector: Do not require T to be default-constructible for clear()
resize(), used by clear(), requires T to be default-constructible in
case the vector is expanded. It's not actually needed for clearing,
and there will be users which use clear() with
non-default-constructible T, so implement clear() without using
resize().
2018-07-11 16:55:20 +02:00
Tomasz Grabiec
03832dab97 utils: chunked_vector: Implement front()
std::vector<> has it, so should this, for easy migration.
2018-07-11 16:55:20 +02:00
Piotr Sarna
dcdd8be59c cql3: make index-related tests less timing dependent
Indexes and materialized views take time to build, so checks
that rely on that are now wrapped with 'eventually' blocks.

Message-Id: <6d3def2bc49b76dda11d7a1c9974a8b3d221003f.1531312518.git.sarna@scylladb.com>
2018-07-11 15:45:52 +03:00
Takuya ASADA
58e6ad22b2 dist/common/scripts: drop scylla_lib.sh
Drop scylla_lib.sh since all bash scripts depends on the library is
already converted to python3, and all scylla_lib.sh features are
implemented on scylla_util.py.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711114756.21823-1-syuu@scylladb.com>
2018-07-11 14:54:56 +03:00
Avi Kivity
83d72f3755 Update scylla-ami submodule
* dist/ami/files/scylla-ami 5200f3f...d53834f (1):
  > Merge "AMI scripts python3 conversion" from Takuya
2018-07-11 13:16:08 +03:00
Avi Kivity
693cf77022 Merge "more conversion from bash to python3" from Takuya
"Converted more scripts to python3."

* 'script_python_conversion2_v2' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_util.py: make run()/out() functions shorter
  dist/ami: install python34 to run scylla_install_ami
  dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
  dist/common/scripts: drop class concolor, use colorprint()
  dist/ami/files/.bash_profile: convert almost all lines to python3
  dist/common/scripts: convert node_exporter_install to python3
  dist/common/scripts: convert scylla_stop to python3
  dist/common/scripts: convert scylla_prepare to python3
2018-07-11 13:14:23 +03:00
Tomasz Grabiec
1de5177175 tests: row_cache: Fix use-after-scope on partition_range passed to readers
The partition_range must outlive the reader.

Message-Id: <1531301583-15476-1-git-send-email-tgrabiec@scylladb.com>
2018-07-11 12:39:30 +03:00
Avi Kivity
28621066e6 observable: allow an observable to disconnect() twice without penalty
Message-Id: <20180711070754.13286-1-avi@scylladb.com>
2018-07-11 10:15:01 +01:00
Avi Kivity
1895483781 observable: add comments explaining the purpose and use of the mechanism
Message-Id: <20180710133706.8791-1-avi@scylladb.com>
2018-07-11 10:15:01 +01:00
Avi Kivity
99d3f0a1b1 tests: add obserable_test to test suite
Message-Id: <20180711071131.13702-1-avi@scylladb.com>
2018-07-11 10:15:01 +01:00
Tomasz Grabiec
fde4a312db gdb: Replace long() with int()
Python 3 doesn't have 'long' anymore, so commands using it fail with
newer GDB. long on python2 is the same as int on python3, both are
arbitrary-precision. On python2 int is fixed-precision, but seems to
be still enough (64 bit), so use that instead.

Message-Id: <1531215600-31899-1-git-send-email-tgrabiec@scylladb.com>
2018-07-10 15:05:02 +03:00
Nadav Har'El
5e47061438 repair: fix small error-handling logic mistake
As noticed by Tomasz Grabiec, we test a future's available() after
having already waited for it with when_all(), which is pointless.

The code after the wrong if() exchanges the contents of a token-range
between this node and several other live neighbors; We can't do this
exchange if either this node is broken or there is no other live neighbor.
So this is what we needed to test. so !available() should have been failed().

Also the test for live_neighbors_checksum.empty() added in commit 7c873f0d1f
is unnecessary - we build live_neighbors and live_neighbors_checksum
together, so if one of them is empty, so is the other.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180710114940.26027-1-nyh@scylladb.com>
2018-07-10 15:04:03 +03:00
Piotr Sarna
559439b6ea tests: add more ALLOW FILTERING tests
More test cases are added to cql_query_test in order to check
ALLOW FILTERING clauses more accurately.

Message-Id: <4c59c1f3eb01558be992d0596e5423c276087387.1531220558.git.sarna@scylladb.com>
2018-07-10 14:44:33 +03:00
Piotr Sarna
aadbfc6b84 cql3: throw instead of log for collection filtering
Original series that introduced filtering logged a warning
when collection restrictions appeared. Instead, an exception
should be thrown until collection restrictions are supported
for ALLOW FILTERING clauses.

Message-Id: <ddaf342d4d6766fadb756f66e5afa0b99ce054f8.1531220558.git.sarna@scylladb.com>
2018-07-10 14:44:29 +03:00
Avi Kivity
7db394ce50 observable: switch to noncopyable_function
std::function's move constructor is not noexcept, so observer's move
constructor and assignment operator also cannot be. Switch to Seastar's
noncopyable_function which provides better guarantees.

Tests: observer_tests (release)
Message-Id: <20180710073628.30702-1-avi@scylladb.com>
2018-07-10 09:42:49 +01:00
Avi Kivity
0a2c9387e8 Merge "Support reading deleted cells" from Piotr
"
Implement and test support for reading deleted cells in SSTables 3.
"

* 'haaawk/sstables3/read-deleted-cells-v2' of ssh://github.com/scylladb/seastar-dev:
  sstables: Test reading deleted cells from SST3
  sstables: Support deleted cells in reading SST3
  test_uncompressed_compound_ck_read: fix comment
  utils: add observer/observable templates
2018-07-10 11:21:00 +03:00
Piotr Jastrzebski
0abdd919c8 sstables: Test reading deleted cells from SST3
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-10 10:03:29 +02:00
Piotr Jastrzebski
54fc6dde35 sstables: Support deleted cells in reading SST3
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-10 10:03:29 +02:00
Piotr Jastrzebski
f64901fdac test_uncompressed_compound_ck_read: fix comment
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-10 10:03:14 +02:00
Avi Kivity
96737d140f utils: add observer/observable templates
An observable is used to decouple an information producer from a consumer
(in the same way as a callback), while allowing multiple consumers (called
observers) to coexist and to manage their lifetime separately.

Two classes are introduced:

 observable: a producer class; when an observable is invoked all observers
        receive the information
 observer: a consumer class; receives information from a observable

Modelled after boost::signals2, with the following changes
 - all signals return void; information is passed from the producer to
   the consumer but not back
 - thread-unsafe
 - modern C++ without preprocessor hacks
 - connection lifetime is always managed rather than leaked by default
 - renamed to avoid the funky "slot" name
Message-Id: <20180709172726.5079-1-avi@scylladb.com>
2018-07-09 18:48:44 +01:00
Paweł Dziepak
00a63663d6 bytes_ostream: increase max chunk size to 128 kB
128 kB is the size of the LSA segment and therefore the default size of
any kind of chunks, fragments and buffers.

Message-Id: <20180709155615.22500-1-pdziepak@scylladb.com>
2018-07-09 19:59:51 +03:00
Tomasz Grabiec
1336744a05 mutation_fragment: Fix clustering_row::equal() using incorrect column kind
Incorrect column_kind was passed, which may cause wrong type to be
used for comparison if schema contains static columns. Affects only
tests.

Spotted during code review.
Message-Id: <1531144991-2658-1-git-send-email-tgrabiec@scylladb.com>
2018-07-09 15:25:17 +01:00
Avi Kivity
ed7855a8a6 Update seastar submodule
* seastar 216d499...aac6cf1 (5):
  > reactor: pollable_fd: limit fragment count to IOV_MAX
  > tests: silence more "-Werror=sign-compare" warnings
  > reactor: include <boost/next_prior.hpp>
  > Use `#pragma once` everywhere
  > .gitignore: adds __pycache__ directory
2018-07-09 17:01:44 +03:00
Gleb Natapov
617666efb0 storage_proxy: use logger's exception printer to report read failure
Use existing exception pretty printer since it handles nested
exceptions.

Message-Id: <20180709122826.GT28899@scylladb.com>
2018-07-09 15:31:14 +03:00
Duarte Nunes
156817e00e db/size_estimates_virtual_reader: Use left-exclusive token ranges
We were considering the token ranges in the size_estimates system
table to be inclusive, which is incorrect and incompatible with
Cassandra.

While we ignore the inclusiveness of the partition_range bounds when
selecting sstables, we do take it into account in
estimated_keys_for_range(). We would thus select the correct sstables,
but could over-estimate the range size nonetheless.

Tests: virtual_reader_test(release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180709115919.5106-1-duarte@scylladb.com>
2018-07-09 15:26:32 +03:00
Takuya ASADA
1a5a40e5f6 dist/common/scripts/scylla_util.py: use os.open(O_EXCL) to verify disk is unused
To simplify is_unused_disk(), just try to open the disk instead of
checking multiple block subsystems.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180709102729.30066-1-syuu@scylladb.com>
2018-07-09 13:29:15 +03:00
Avi Kivity
7d0df2a06d Update scylla-ami submodule
* dist/ami/files/scylla-ami 67293ba...5200f3f (1):
  > Add custom script options to AMI user-data
2018-07-09 13:21:30 +03:00
Gleb Natapov
ac27d1c93b storage_proxy: fix rpc connection failure handling by read operation
Currently rpc::closed_error is not counted towards replica failure
during read and thus read operation waits for timeout even if one
of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.

Fixes #3590.

Message-Id: <20180708123522.GC28899@scylladb.com>
2018-07-09 10:05:31 +03:00
Avi Kivity
2f8537b178 database: demote "Setting compaction strategy" log message to debug level
It's not very helpful in normal operation, and generates much noise,
especially when there are many tables.
Message-Id: <20180708070051.8508-1-avi@scylladb.com>
2018-07-08 10:27:03 +01:00
Avi Kivity
512baf536f storage_proxy: implement write timeouts
Require a timeout parameter for storage_proxy::mutate_begin() and
all its callers (all the way to thrift and cql modification_statement
and batch_statement).

This should fix spurious debug-mode test failures, where overcommit
and general debug slowness result in the default timeouts being
exceeded. Since the tests use infinite timeouts, they should not
time out any more.

Tests: unit (release), with an extra patch that aborts
    when a non-infinite timeout is detected.
Message-Id: <20180707204424.17116-1-avi@scylladb.com>
2018-07-08 10:27:03 +01:00
Takuya ASADA
929ba016ed dist/common/scripts/scylla_util.py: strip double quote from sysconfig parameter
Current sysconfig_parser.get() returns parameter including double quote,
it will cause problem by append text using sysconfig_parser.set().

Fixes #3587

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180706172219.16859-1-syuu@scylladb.com>
2018-07-08 10:47:41 +03:00
Duarte Nunes
1beed0ca16 Merge 'hinted handoff: add rebalancing and unmark as experimental' from Vlad
"
This series adds the last missing part of the HH feature list (as in the design doc) - rebalancing;
and finally removes the "experimental" tag from the HH.
"

* 'hinted_handoff_rebalance-v3' of https://github.com/vladzcloudius/scylla:
  main: remove the "experimental" tag from the hinted handoff feature
  db::hints::manager: implement rebalance() method
2018-07-07 20:38:07 +01:00
Takuya ASADA
a98b4b705c dist/common/scripts/scylla_util.py: make run()/out() functions shorter
Refactored these functions to make them simpler.
2018-07-08 01:13:36 +09:00
Takuya ASADA
e2a032f7ea dist/ami: install python34 to run scylla_install_ami
Since we switched scylla_install_ami to python3, need to install python3
before launching the script.
2018-07-08 01:13:36 +09:00
Takuya ASADA
4e04fb7d68 dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
There is duplicated code on both scylla_ec2_check and class aws_instance
on scylla_util.py, so drop these code from scylla_ec2_check and use
class aws_instance.
2018-07-08 01:13:36 +09:00
Takuya ASADA
99d5ca03e7 dist/common/scripts: drop class concolor, use colorprint()
To print colored console output with simplar code, drop class concolor
and use colorprint() instead.
2018-07-08 01:13:36 +09:00
Takuya ASADA
14d117363b dist/ami/files/.bash_profile: convert almost all lines to python3
Since it's .bash_profile we cannot make the file to python3 script but
almost all lines are rewritten to python3, .bash_profile just launch it.
2018-07-08 01:13:35 +09:00
Takuya ASADA
25c3249d8d dist/common/scripts: convert node_exporter_install to python3
Convert bash script to python3.
2018-07-08 01:13:35 +09:00
Takuya ASADA
505fcc92f7 dist/common/scripts: convert scylla_stop to python3
Convert bash script to python3.
2018-07-08 01:13:35 +09:00
Takuya ASADA
eb369942bd dist/common/scripts: convert scylla_prepare to python3
Convert bash script to python3.
2018-07-08 01:13:35 +09:00
Vlad Zolotarov
7495c8e56d dist: scylla_lib.sh: get_mode_cpu_set: split the declaration and ssignment to the local variable
In bash local variable declaration is a separate operation with its own exit status
(always 0) therefore constructs like

local var=`cmd`

will always result in the 0 exit status ($? value) regardless of the actual
result of "cmd" invocation.

To overcome this we should split the declaration and the assignment to be like this:

local var
var=`cmd`

Fixes #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-3-git-send-email-vladz@scylladb.com>
2018-07-07 18:04:19 +03:00
Vlad Zolotarov
f3ca17b1a1 dist: scylla_lib.sh: get_mode_cpu_set: don't let the error messages out
References #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-2-git-send-email-vladz@scylladb.com>
2018-07-07 18:04:18 +03:00
Avi Kivity
e79fccdf7b Update seastar submodule
* seastar d7f35d7...216d499 (10):
  > temporary_buffer: Add clone method()
  > temporary_buffer: Make move-assignment operator noexcept.
  > deleter: Make move-assignment operator noexcept.
  > reactor: don't become inefficient when max_task_backlog is exceeded
  > reactor: switch cumulative time metrics resolution from nanoseconds to milliseconds
  > preempt: annotate for branch prediction
  > tests: silence "-Werror=sign-compare" warnings
  > Merge "Support one I/O Scheduler per device" from Glauber
  > rpc: make rpc server scheduling aware
  > Add SEASTAR_USER_CFLAGS and SEASTAR_ENABLE_WERROR
2018-07-07 17:48:25 +03:00
Vlad Zolotarov
c65a110839 main: remove the "experimental" tag from the hinted handoff feature
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-06 19:19:40 -04:00
Vlad Zolotarov
83ba6d84a1 db::hints::manager: implement rebalance() method
Rebalance hints segments that need to be sent among all present shards.

Ensure that after rebalancing the difference between the number of segments
of any two shards is not greater than 1.

Try to minimize the amount of "file rename" operations in order to achieve the needed result.

Note: "Resharding" is a particular case of rebalancing.

Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-06 19:18:46 -04:00
Piotr Sarna
77aa97f62a cql3: fix ALLOW FILTERING iterator
In original series cell iterator for regular cells
was erroneously taken by copy instead of by reference,
which will result in iterating over the first value indefinitely.
Also, the same iterator was not updated for collections,
which is fixed too.
Message-Id: <83297adf8121de4fd37257c87f250d61ea9ec80b.1530892191.git.sarna@scylladb.com>
2018-07-06 17:23:12 +01:00
Glauber Costa
82f7f7b36d database: change ident
Previous patches have used reviewer-oriented identation. Re-ident.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 17:11:01 -04:00
Glauber Costa
99c8a1917f database: support multiple data directories
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
 - We scan all data directories for existing data.
 - resharding only happens within a particular data directory.
 - snapshot details are accumulated with data for all directories that
   host snapshots for the tables we are examining
 - snapshots are created with files in its own directories, but the
   manifest file goes to the main directory. For this one, note that in
   Cassandra the same thing happens, except that there is no "main"
   directory. Still the manifest file is still just in one of them.
 - SSTables are flushed into the main directory.
 - Compactions write data into the main directory

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:39 -04:00
Glauber Costa
3b46984a1e database: allow resharing to specify a directory
resharding assumes that all SSTables will be in cf->dir(), but in
reality we will soon have tables in other places. We can specify a
directory in get_all_shared_sstables and specify that directory from the
resharding process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
c8b2d441a8 database: support multiple directories in get_snapshot_details
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
a8ccf4d1e6 database: move get_snapshot_info into a seastar::thread
I am about to add another level of identation and this code already
shifts right too much. It is not performance critical, so let's use a
thread for that. seastar::threads did not exist when this was first
written.

Also remove one unused continuation from inside the inner scan,
simplifying its code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
919c7d6bb9 snapshots: always create the snapshot directory
We currently don't always create the snapshot directory as an
optimization. We have a test at sync time handling this use case.

This works well when all SSTables are created in the same directory, but
if we have more than one data directory than it may not work if we don't
have SSTables in all data directories.

We can fix it by unconditionally creating the directory.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
86239e4e22 sstables: pass sstable dir with entry descriptor
We have been assuming that all SSTables for a table will be in the same
directory, and we pass the directory name to make_descriptor only
because that's the way in ka and la to find out the keyspace and table
names.

However, SSTables for a given column family could be spread into
multiple directories. So let's pass it down with the descriptor so we
can load from the right place.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:45:26 -04:00
Glauber Costa
25a02c61d6 database: make nodetool listsnapshots print correct information
nodetool listsnapshots is currently printing zero sizes for all snapshots
The reason for that is that we are moving the snapshot directory name in
the capture list, which can be evaluated by the compiler to happen
before we use it as the function parameter.

Fixes #3572

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:20:07 -04:00
Glauber Costa
4a62866104 sstables: correctly create descriptors for snapshots
Our regular expression for parsing SSTable files tests for the directory
for the la file format, since that file format does not include the
ks/cf pair in the file name itself.

However, the regular expression does not cover the case in which the
SSTable files are coming from snapshots. This patch extends the regex so
they are also covered.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:19:09 -04:00
755 changed files with 27361 additions and 9533 deletions

8
.gitmodules vendored
View File

@@ -1,14 +1,14 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui
url = ../scylla-swagger-ui
ignore = dirty
[submodule "dist/ami/files/scylla-ami"]
path = dist/ami/files/scylla-ami
url = ../scylla-ami
[submodule "xxHash"]
path = xxHash
url = ../xxHash
[submodule "libdeflate"]
path = libdeflate
url = ../libdeflate

View File

@@ -138,4 +138,5 @@ target_include_directories(scylla PUBLIC
${SEASTAR_INCLUDE_DIRS}
${Boost_INCLUDE_DIRS}
xxhash
libdeflate
build/release/gen)

View File

@@ -20,7 +20,7 @@ $ git submodule update --init --recursive
Scylla depends on the system package manager for its development dependencies.
Running `./install_dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
Running `./install-dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
### Build system

View File

@@ -50,12 +50,12 @@ Then, to build an RPM, run:
./dist/redhat/build_rpm.sh
```
The built RPM is stored in ``/var/lib/mock/<configuration>/result`` directory.
The built RPM is stored in the ``build/mock/<configuration>/result`` directory.
For example, on Fedora 21 mock reports the following:
```
INFO: Done(scylla-server-0.00-1.fc21.src.rpm) Config(default) 20 minutes 7 seconds
INFO: Results and/or logs in: /var/lib/mock/fedora-21-x86_64/result
INFO: Results and/or logs in: build/mock/fedora-21-x86_64/result
```
## Building Fedora-based Docker image

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=3.0.4
if test -f version
then

View File

@@ -2228,11 +2228,11 @@
"description":"The column family"
},
"total":{
"type":"int",
"type":"long",
"description":"The total snapshot size"
},
"live":{
"type":"int",
"type":"long",
"description":"The live snapshot size"
}
}

View File

@@ -78,15 +78,17 @@ void set_storage_service(http_context& ctx, routes& r) {
});
});
ss::get_tokens.set(r, [] (const_req req) {
auto tokens = service::get_local_storage_service().get_token_metadata().sorted_tokens();
return container_to_vec(tokens);
ss::get_tokens.set(r, [] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_metadata().sorted_tokens(), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
ss::get_node_tokens.set(r, [] (const_req req) {
gms::inet_address addr(req.param["endpoint"]);
auto tokens = service::get_local_storage_service().get_token_metadata().get_tokens(addr);
return container_to_vec(tokens);
ss::get_node_tokens.set(r, [] (std::unique_ptr<request> req) {
gms::inet_address addr(req->param["endpoint"]);
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_metadata().get_tokens(addr), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
ss::get_commitlog.set(r, [&ctx](const_req req) {
@@ -107,11 +109,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_moving_nodes.set(r, [](const_req req) {
auto points = service::get_local_storage_service().get_token_metadata().get_moving_endpoints();
std::unordered_set<sstring> addr;
for (auto i: points) {
addr.insert(boost::lexical_cast<std::string>(i.second));
}
return container_to_vec(addr);
});

View File

@@ -55,6 +55,15 @@ atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_typ
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
@@ -73,6 +82,16 @@ atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_typ
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(

View File

@@ -33,6 +33,7 @@
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
#include "utils/fragmented_temporary_buffer.hh"
#include "serializer.hh"
@@ -190,6 +191,8 @@ public:
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
collection_member cm = collection_member::no) {
return make_live(type, timestamp, bytes_view(value), cm);
@@ -199,6 +202,8 @@ public:
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm = collection_member::no)
{

View File

@@ -28,6 +28,7 @@
#include "database.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "timeout_config.hh"
namespace auth {
@@ -86,12 +87,24 @@ future<> create_metadata_table_if_missing(
return mm.announce_new_column_family(b.build(), false);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db) {
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db, seastar::abort_source& as) {
static const auto pause = [] { return sleep(std::chrono::milliseconds(500)); };
return do_until([&db] { return db.get_version() != database::empty_version; }, pause).then([&mm] {
return do_until([&mm] { return mm.have_schema_agreement(); }, pause);
return do_until([&db, &as] {
as.check();
return db.get_version() != database::empty_version;
}, pause).then([&mm, &as] {
return do_until([&mm, &as] {
as.check();
return mm.have_schema_agreement();
}, pause);
});
}
const timeout_config& internal_distributed_timeout_config() noexcept {
static const auto t = 5s;
static const timeout_config tc{t, t, t, t, t, t, t};
return tc;
}
}

View File

@@ -38,6 +38,7 @@
using namespace std::chrono_literals;
class database;
class timeout_config;
namespace service {
class migration_manager;
@@ -80,6 +81,11 @@ future<> create_metadata_table_if_missing(
stdx::string_view cql,
::service::migration_manager&);
future<> wait_for_schema_agreement(::service::migration_manager&, const database&);
future<> wait_for_schema_agreement(::service::migration_manager&, const database&, seastar::abort_source&);
///
/// Time-outs for internal, non-local CQL queries.
///
const timeout_config& internal_distributed_timeout_config() noexcept;
}

View File

@@ -160,7 +160,7 @@ future<> default_authorizer::start() {
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
wait_for_schema_agreement(_migration_manager, _qp.db().local(), _as).get0();
if (legacy_metadata_exists()) {
if (!any_granted().get0()) {
@@ -178,7 +178,7 @@ future<> default_authorizer::start() {
future<> default_authorizer::stop() {
_as.request_abort();
return _finished.handle_exception_type([](const sleep_aborted&) {});
return _finished.handle_exception_type([](const sleep_aborted&) {}).handle_exception_type([](const abort_requested_exception&) {});
}
future<permission_set>
@@ -228,7 +228,7 @@ default_authorizer::modify(
return _qp.process(
query,
db::consistency_level::ONE,
infinite_timeout_config,
internal_distributed_timeout_config(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
@@ -254,7 +254,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
return _qp.process(
query,
db::consistency_level::ONE,
infinite_timeout_config,
internal_distributed_timeout_config(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
@@ -282,7 +282,7 @@ future<> default_authorizer::revoke_all(stdx::string_view role_name) const {
return _qp.process(
query,
db::consistency_level::ONE,
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);

View File

@@ -41,11 +41,6 @@
#include "auth/password_authenticator.hh"
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
#include <algorithm>
#include <chrono>
#include <random>
@@ -55,6 +50,7 @@ extern "C" {
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/passwords.hh"
#include "auth/roles-metadata.hh"
#include "cql3/untyped_result_set.hh"
#include "log.hh"
@@ -82,6 +78,8 @@ static const class_registrator<
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
password_authenticator::~password_authenticator() {
}
@@ -91,76 +89,6 @@ password_authenticator::password_authenticator(cql3::query_processor& qp, ::serv
, _stopped(make_ready_future<>()) {
}
// TODO: blowfish
// Origin uses Java bcrypt library, i.e. blowfish salt
// generation and hashing, which is arguably a "better"
// password hash than sha/md5 versions usually available in
// crypt_r. Otoh, glibc 2.7+ uses a modified sha512 algo
// which should be the same order of safe, so the only
// real issue should be salted hash compatibility with
// origin if importing system tables from there.
//
// Since bcrypt/blowfish is _not_ (afaict) not available
// as a dev package/lib on most linux distros, we'd have to
// copy and compile for example OWL crypto
// (http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/glibc/crypt_blowfish/)
// to be fully bit-compatible.
//
// Until we decide this is needed, let's just use crypt_r,
// and some old-fashioned random salt generation.
static constexpr size_t rand_bytes = 16;
static thread_local crypt_data tlcrypt = { 0, };
static sstring hashpw(const sstring& pass, const sstring& salt) {
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (res == nullptr) {
throw std::system_error(errno, std::system_category());
}
return res;
}
static bool checkpw(const sstring& pass, const sstring& salted_hash) {
auto tmp = hashpw(pass, salted_hash);
return tmp == salted_hash;
}
static sstring gensalt() {
static sstring prefix;
std::random_device rd;
std::default_random_engine e1(rd());
std::uniform_int_distribution<char> dist;
sstring valid_salt = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789./";
sstring input(rand_bytes, 0);
for (char&c : input) {
c = valid_salt[dist(e1) % valid_salt.size()];
}
sstring salt;
if (!prefix.empty()) {
return prefix + input;
}
// Try in order:
// blowfish 2011 fix, blowfish, sha512, sha256, md5
for (sstring pfx : { "$2y$", "$2a$", "$6$", "$5$", "$1$" }) {
salt = pfx + input;
if (crypt_r("fisk", salt.c_str(), &tlcrypt)) {
prefix = pfx;
return salt;
}
}
throw std::runtime_error("Could not initialize hashing algorithm");
}
static sstring hashpw(const sstring& pass) {
return hashpw(pass, gensalt());
}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
return !row.get_or<sstring>(SALTED_HASH, "").empty();
}
@@ -184,7 +112,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.process(
query,
db::consistency_level::QUORUM,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
@@ -192,7 +120,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.process(
update_row_query,
consistency_for_user(username),
infinite_timeout_config,
internal_distributed_timeout_config(),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
@@ -209,8 +137,8 @@ future<> password_authenticator::create_default_if_missing() const {
return _qp.process(
update_row_query,
db::consistency_level::QUORUM,
infinite_timeout_config,
{hashpw(DEFAULT_USER_PASSWORD), DEFAULT_USER_NAME}).then([](auto&&) {
internal_distributed_timeout_config(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
}
@@ -221,8 +149,6 @@ future<> password_authenticator::create_default_if_missing() const {
future<> password_authenticator::start() {
return once_among_shards([this] {
gensalt(); // do this once to determine usable hashing
auto f = create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
@@ -231,7 +157,7 @@ future<> password_authenticator::start() {
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
wait_for_schema_agreement(_migration_manager, _qp.db().local(), _as).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash).get0()) {
if (legacy_metadata_exists()) {
@@ -256,7 +182,7 @@ future<> password_authenticator::start() {
future<> password_authenticator::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});
}
db::consistency_level password_authenticator::consistency_for_user(stdx::string_view role_name) {
@@ -309,13 +235,17 @@ future<authenticated_user> password_authenticator::authenticate(
return _qp.process(
query,
consistency_for_user(username),
infinite_timeout_config,
internal_distributed_timeout_config(),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
auto salted_hash = std::experimental::optional<sstring>();
if (!res->empty()) {
salted_hash = res->one().get_opt<sstring>(SALTED_HASH);
}
if (!salted_hash || !passwords::check(password, *salted_hash)) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
return make_ready_future<authenticated_user>(username);
@@ -337,8 +267,8 @@ future<> password_authenticator::create(stdx::string_view role_name, const authe
return _qp.process(
update_row_query,
consistency_for_user(role_name),
infinite_timeout_config,
{hashpw(*options.password), sstring(role_name)}).discard_result();
internal_distributed_timeout_config(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
future<> password_authenticator::alter(stdx::string_view role_name, const authentication_options& options) const {
@@ -355,8 +285,8 @@ future<> password_authenticator::alter(stdx::string_view role_name, const authen
return _qp.process(
query,
consistency_for_user(role_name),
infinite_timeout_config,
{hashpw(*options.password), sstring(role_name)}).discard_result();
internal_distributed_timeout_config(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
future<> password_authenticator::drop(stdx::string_view name) const {
@@ -366,7 +296,10 @@ future<> password_authenticator::drop(stdx::string_view name) const {
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(query, consistency_for_user(name), infinite_timeout_config, {sstring(name)}).discard_result();
return _qp.process(
query, consistency_for_user(name),
internal_distributed_timeout_config(),
{sstring(name)}).discard_result();
}
future<custom_options> password_authenticator::query_custom_options(stdx::string_view role_name) const {

84
auth/passwords.cc Normal file
View File

@@ -0,0 +1,84 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/passwords.hh"
#include <cerrno>
#include <optional>
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
namespace auth::passwords {
static thread_local crypt_data tlcrypt = { 0, };
namespace detail {
scheme identify_best_supported_scheme() {
const auto all_schemes = { scheme::bcrypt_y, scheme::bcrypt_a, scheme::sha_512, scheme::sha_256, scheme::md5 };
// "Random", for testing schemes.
const sstring random_part_of_salt = "aaaabbbbccccdddd";
for (scheme c : all_schemes) {
const sstring salt = sstring(prefix_for_scheme(c)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return c;
}
}
throw no_supported_schemes();
}
sstring hash_with_salt(const sstring& pass, const sstring& salt) {
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
return res;
}
const char* prefix_for_scheme(scheme c) noexcept {
switch (c) {
case scheme::bcrypt_y: return "$2y$";
case scheme::bcrypt_a: return "$2a$";
case scheme::sha_512: return "$6$";
case scheme::sha_256: return "$5$";
case scheme::md5: return "$1$";
default: return nullptr;
}
}
} // namespace detail
no_supported_schemes::no_supported_schemes()
: std::runtime_error("No allowed hashing schemes are supported on this system") {
}
bool check(const sstring& pass, const sstring& salted_hash) {
return detail::hash_with_salt(pass, salted_hash) == salted_hash;
}
} // namespace auth::paswords

125
auth/passwords.hh Normal file
View File

@@ -0,0 +1,125 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <random>
#include <stdexcept>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth::passwords {
class no_supported_schemes : public std::runtime_error {
public:
no_supported_schemes();
};
///
/// Apache Cassandra uses a library to provide the bcrypt scheme. Many Linux implementations do not support bcrypt, so
/// we support alternatives. The cost is loss of direct compatibility with Apache Cassandra system tables.
///
enum class scheme {
bcrypt_y,
bcrypt_a,
sha_512,
sha_256,
md5
};
namespace detail {
template <typename RandomNumberEngine>
sstring generate_random_salt_bytes(RandomNumberEngine& g) {
static const sstring valid_bytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789./";
static constexpr std::size_t num_bytes = 16;
std::uniform_int_distribution<std::size_t> dist(0, valid_bytes.size() - 1);
sstring result(num_bytes, 0);
for (char& c : result) {
c = valid_bytes[dist(g)];
}
return result;
}
///
/// Test each allowed hashing scheme and report the best supported one on the current system.
///
/// \throws \ref no_supported_schemes when none of the known schemes is supported.
///
scheme identify_best_supported_scheme();
const char* prefix_for_scheme(scheme) noexcept;
///
/// Generate a implementation-specific salt string for hashing passwords.
///
/// The `RandomNumberEngine` is used to generate the string, which is an implementation-specific length.
///
/// \throws \ref no_supported_schemes when no known hashing schemes are supported on the system.
///
template <typename RandomNumberEngine>
sstring generate_salt(RandomNumberEngine& g) {
static const scheme scheme = identify_best_supported_scheme();
static const sstring prefix = sstring(prefix_for_scheme(scheme));
return prefix + generate_random_salt_bytes(g);
}
///
/// Hash a password combined with an implementation-specific salt string.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
sstring hash_with_salt(const sstring& pass, const sstring& salt);
} // namespace detail
///
/// Run a one-way hashing function on cleartext to produce encrypted text.
///
/// Prior to applying the hashing function, random salt is amended to the cleartext. The random salt bytes are generated
/// according to the random number engine `g`.
///
/// The result is the encrypted cyphertext, and also the salt used but in a implementation-specific format.
///
/// \throws \ref std::system_error when the implementation-specific implementation fails to hash the cleartext.
///
template <typename RandomNumberEngine>
sstring hash(const sstring& pass, RandomNumberEngine& g) {
return detail::hash_with_salt(pass, detail::generate_salt(g));
}
///
/// Check that cleartext matches previously hashed cleartext with salt.
///
/// \ref salted_hash is the result of invoking \ref hash, which is the implementation-specific combination of the hashed
/// password and the salt that was generated for it.
///
/// \returns `true` if the cleartext matches the salted hash.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
bool check(const sstring& pass, const sstring& salted_hash);
} // namespace auth::passwords

View File

@@ -79,7 +79,7 @@ future<bool> default_role_row_satisfies(
return qp.process(
query,
db::consistency_level::QUORUM,
infinite_timeout_config,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -104,7 +104,7 @@ future<bool> any_nondefault_role_row_satisfies(
return qp.process(
query,
db::consistency_level::QUORUM,
infinite_timeout_config).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}

View File

@@ -196,6 +196,10 @@ future<> service::start() {
}
future<> service::stop() {
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
_migration_manager.unregister_listener(_migration_listener.get());
return _permissions_cache->stop().then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop());
});

View File

@@ -89,7 +89,7 @@ static future<stdx::optional<record>> find_record(cql3::query_processor& qp, std
return qp.process(
query,
consistency_for_role(role_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -174,7 +174,7 @@ future<> standard_role_manager::create_default_role_if_missing() const {
return _qp.process(
query,
db::consistency_level::QUORUM,
infinite_timeout_config,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
@@ -201,7 +201,7 @@ future<> standard_role_manager::migrate_legacy_metadata() const {
return _qp.process(
query,
db::consistency_level::QUORUM,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_as<bool>("super");
@@ -227,7 +227,7 @@ future<> standard_role_manager::start() {
return this->create_metadata_tables_if_missing().then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
wait_for_schema_agreement(_migration_manager, _qp.db().local(), _as).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
if (this->legacy_metadata_exists()) {
@@ -251,7 +251,7 @@ future<> standard_role_manager::start() {
future<> standard_role_manager::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});;
}
future<> standard_role_manager::create_or_replace(stdx::string_view role_name, const role_config& c) const {
@@ -263,7 +263,7 @@ future<> standard_role_manager::create_or_replace(stdx::string_view role_name, c
return _qp.process(
query,
consistency_for_role(role_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
@@ -307,7 +307,7 @@ standard_role_manager::alter(stdx::string_view role_name, const role_config_upda
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
});
}
@@ -327,7 +327,7 @@ future<> standard_role_manager::drop(stdx::string_view role_name) const {
return _qp.process(
query,
consistency_for_role(role_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
@@ -367,7 +367,7 @@ future<> standard_role_manager::drop(stdx::string_view role_name) const {
return _qp.process(
query,
consistency_for_role(role_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
};
@@ -394,7 +394,7 @@ standard_role_manager::modify_membership(
return _qp.process(
query,
consistency_for_role(grantee_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
@@ -406,7 +406,7 @@ standard_role_manager::modify_membership(
"INSERT INTO %s (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
@@ -415,7 +415,7 @@ standard_role_manager::modify_membership(
"DELETE FROM %s WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
infinite_timeout_config,
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
@@ -516,7 +516,10 @@ future<role_set> standard_role_manager::query_all() const {
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
return _qp.process(query, db::consistency_level::QUORUM, infinite_timeout_config).then([](::shared_ptr<cql3::untyped_result_set> results) {
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(

View File

@@ -77,7 +77,7 @@ protected:
, _io_priority(iop)
, _interval(interval)
, _update_timer([this] { adjust(); })
, _control_points({{0,0}})
, _control_points()
, _current_backlog(std::move(backlog))
, _inflight_update(make_ready_future<>())
{
@@ -125,7 +125,7 @@ public:
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{soft_limit, 10}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::vector<backlog_controller::control_point>({{0.0, 0.0}, {soft_limit, 10}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::move(current_dirty)
)
{}
@@ -139,7 +139,7 @@ public:
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, std::function<float()> current_backlog)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{0.5, 10}, {1.5, 100} , {normalization_factor, 1000}}),
std::vector<backlog_controller::control_point>({{0.0, 50}, {1.5, 100} , {normalization_factor, 1000}}),
std::move(current_backlog)
)
{}

View File

@@ -35,6 +35,10 @@ using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::experimental::optional<bytes>;
using sstring_view = std::experimental::string_view;
inline sstring_view to_sstring_view(bytes_view view) {
return {reinterpret_cast<const char*>(view.data()), view.size()};
}
namespace std {
template <>

View File

@@ -38,7 +38,7 @@ class bytes_ostream {
public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
static constexpr size_type max_chunk_size() { return 16 * 1024; }
static constexpr size_type max_chunk_size() { return 128 * 1024; }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -57,12 +57,12 @@ private:
value_type data[0];
void operator delete(void* ptr) { free(ptr); }
};
// FIXME: consider increasing chunk size as the buffer grows
static constexpr size_type chunk_size{512};
static constexpr size_type default_chunk_size{512};
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
size_type _size;
size_type _initial_chunk_size = default_chunk_size;
public:
class fragment_iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
chunk* _current = nullptr;
@@ -102,13 +102,13 @@ private:
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be at least chunk_size
// - must be at least _initial_chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: chunk_size;
: _initial_chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
return std::max<size_type>(next_size, data_size + sizeof(chunk));
@@ -116,13 +116,19 @@ private:
// Makes room for a contiguous region of given size.
// The region is accounted for as already written.
// size must not be zero.
[[gnu::always_inline]]
value_type* alloc(size_type size) {
if (size <= current_space_left()) {
if (__builtin_expect(size <= current_space_left(), true)) {
auto ret = _current->data + _current->offset;
_current->offset += size;
_size += size;
return ret;
} else {
return alloc_new(size);
}
}
[[gnu::noinline]]
value_type* alloc_new(size_type size) {
auto alloc_size = next_alloc_size(size);
auto space = malloc(alloc_size);
if (!space) {
@@ -140,19 +146,22 @@ private:
}
_size += size;
return _current->data;
};
}
public:
bytes_ostream() noexcept
explicit bytes_ostream(size_t initial_chunk_size) noexcept
: _begin()
, _current(nullptr)
, _size(0)
, _initial_chunk_size(initial_chunk_size)
{ }
bytes_ostream() noexcept : bytes_ostream(default_chunk_size) {}
bytes_ostream(bytes_ostream&& o) noexcept
: _begin(std::move(o._begin))
, _current(o._current)
, _size(o._size)
, _initial_chunk_size(o._initial_chunk_size)
{
o._current = nullptr;
o._size = 0;
@@ -162,6 +171,7 @@ public:
: _begin()
, _current(nullptr)
, _size(0)
, _initial_chunk_size(o._initial_chunk_size)
{
append(o);
}
@@ -199,18 +209,20 @@ public:
return place_holder<T>{alloc(sizeof(T))};
}
[[gnu::always_inline]]
value_type* write_place_holder(size_type size) {
return alloc(size);
}
// Writes given sequence of bytes
[[gnu::always_inline]]
inline void write(bytes_view v) {
if (v.empty()) {
return;
}
auto this_size = std::min(v.size(), size_t(current_space_left()));
if (this_size) {
if (__builtin_expect(this_size, true)) {
memcpy(_current->data + _current->offset, v.begin(), this_size);
_current->offset += this_size;
_size += this_size;
@@ -219,11 +231,12 @@ public:
while (!v.empty()) {
auto this_size = std::min(v.size(), size_t(max_chunk_size()));
std::copy_n(v.begin(), this_size, alloc(this_size));
std::copy_n(v.begin(), this_size, alloc_new(this_size));
v.remove_prefix(this_size);
}
}
[[gnu::always_inline]]
void write(const char* ptr, size_t size) {
write(bytes_view(reinterpret_cast<const signed char*>(ptr), size));
}
@@ -393,6 +406,21 @@ public:
bool operator!=(const bytes_ostream& other) const {
return !(*this == other);
}
// Makes this instance empty.
//
// The first buffer is not deallocated, so callers may rely on the
// fact that if they write less than the initial chunk size between
// the clear() calls then writes will not involve any memory allocations,
// except for the first write made on this instance.
void clear() {
if (_begin) {
_begin->offset = 0;
_size = 0;
_current = _begin.get();
_begin->next.reset();
}
}
};
template<>

View File

@@ -60,6 +60,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// - _next_row_in_range = _next.position() < _upper_bound
// - _last_row points at a direct predecessor of the next row which is going to be read.
// Used for populating continuity.
// - _population_range_starts_before_all_rows is set accordingly
reading_from_underlying,
end_of_stream
@@ -86,6 +87,13 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
// True iff current population interval, since the previous clustering row, starts before all clustered rows.
// We cannot just look at _lower_bound, because emission of range tombstones changes _lower_bound and
// because we mark clustering intervals as continuous when consuming a clustering_row, it would prevent
// us from marking the interval as continuous.
// Valid when _state == reading_from_underlying.
bool _population_range_starts_before_all_rows;
// Whether _lower_bound was changed within current fill_buffer().
// If it did not then we cannot break out of it (e.g. on preemption) because
// forward progress is not guaranteed in case iterators are getting constantly invalidated.
@@ -228,6 +236,7 @@ inline
future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
@@ -352,12 +361,12 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
}
});
return make_ready_future<>();
});
}, timeout);
}
inline
bool cache_flat_mutation_reader::ensure_population_lower_bound() {
if (!_ck_ranges_curr->start()) {
if (_population_range_starts_before_all_rows) {
return true;
}
if (!_last_row.refresh(*_snp)) {
@@ -412,6 +421,7 @@ inline
void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_last_row = nullptr;
_population_range_starts_before_all_rows = false;
_read_context->cache().on_mispopulate();
return;
}
@@ -445,6 +455,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
with_allocator(standard_allocator(), [&] {
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
});
_population_range_starts_before_all_rows = false;
});
}

View File

@@ -24,9 +24,9 @@
#include <boost/intrusive/unordered_set.hpp>
#include "utils/small_vector.hh"
#include "fnv1a_hasher.hh"
#include "mutation_fragment.hh"
#include "mutation_partition.hh"
#include "xx_hasher.hh"
#include "db/timeout_clock.hh"
@@ -194,10 +194,10 @@ private:
explicit hasher(const schema& s) : _schema(&s) { }
size_t operator()(const cell_address& ca) const {
fnv1a_hasher hasher;
xx_hasher hasher;
ca.position.feed_hash(hasher, *_schema);
::feed_hash(hasher, ca.id);
return hasher.finalize();
return static_cast<size_t>(hasher.finalize_uint64());
}
size_t operator()(const cell_entry& ce) const {
return operator()(ce._address);

View File

@@ -30,7 +30,7 @@ namespace query {
class clustering_key_filter_ranges {
clustering_row_ranges _storage;
const clustering_row_ranges& _ref;
std::reference_wrapper<const clustering_row_ranges> _ref;
public:
clustering_key_filter_ranges(const clustering_row_ranges& ranges) : _ref(ranges) { }
struct reversed { };
@@ -39,21 +39,21 @@ public:
clustering_key_filter_ranges(clustering_key_filter_ranges&& other) noexcept
: _storage(std::move(other._storage))
, _ref(&other._ref == &other._storage ? _storage : other._ref)
, _ref(&other._ref.get() == &other._storage ? _storage : other._ref.get())
{ }
clustering_key_filter_ranges& operator=(clustering_key_filter_ranges&& other) noexcept {
if (this != &other) {
this->~clustering_key_filter_ranges();
new (this) clustering_key_filter_ranges(std::move(other));
_storage = std::move(other._storage);
_ref = (&other._ref.get() == &other._storage) ? _storage : other._ref.get();
}
return *this;
}
auto begin() const { return _ref.begin(); }
auto end() const { return _ref.end(); }
bool empty() const { return _ref.empty(); }
size_t size() const { return _ref.size(); }
auto begin() const { return _ref.get().begin(); }
auto end() const { return _ref.get().end(); }
bool empty() const { return _ref.get().empty(); }
size_t size() const { return _ref.get().size(); }
const clustering_row_ranges& ranges() const { return _ref; }
static clustering_key_filter_ranges get_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {

View File

@@ -31,72 +31,61 @@
class clustering_ranges_walker {
const schema& _schema;
const query::clustering_row_ranges& _ranges;
query::clustering_row_ranges::const_iterator _current;
query::clustering_row_ranges::const_iterator _end;
boost::iterator_range<query::clustering_row_ranges::const_iterator> _current_range;
bool _in_current; // next position is known to be >= _current_start
bool _with_static_row;
position_in_partition_view _current_start;
position_in_partition_view _current_end;
stdx::optional<position_in_partition> _trim;
std::optional<position_in_partition> _trim;
size_t _change_counter = 1;
private:
bool advance_to_next_range() {
_in_current = false;
if (!_current_start.is_static_row()) {
if (_current == _end) {
if (!_current_range) {
return false;
}
++_current;
_current_range.advance_begin(1);
}
++_change_counter;
if (_current == _end) {
if (!_current_range) {
_current_end = _current_start = position_in_partition_view::after_all_clustered_rows();
return false;
}
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
_current_start = position_in_partition_view::for_range_start(_current_range.front());
_current_end = position_in_partition_view::for_range_end(_current_range.front());
return true;
}
public:
clustering_ranges_walker(const schema& s, const query::clustering_row_ranges& ranges, bool with_static_row = true)
: _schema(s)
, _ranges(ranges)
, _current(ranges.begin())
, _end(ranges.end())
, _in_current(with_static_row)
, _with_static_row(with_static_row)
, _current_start(position_in_partition_view::for_static_row())
, _current_end(position_in_partition_view::before_all_clustered_rows())
{
if (!with_static_row) {
if (_current == _end) {
void set_current_positions() {
if (!_with_static_row) {
if (!_current_range) {
_current_start = position_in_partition_view::before_all_clustered_rows();
} else {
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
_current_start = position_in_partition_view::for_range_start(_current_range.front());
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
}
}
clustering_ranges_walker(clustering_ranges_walker&& o) noexcept
: _schema(o._schema)
, _ranges(o._ranges)
, _current(o._current)
, _end(o._end)
, _in_current(o._in_current)
, _with_static_row(o._with_static_row)
, _current_start(o._current_start)
, _current_end(o._current_end)
, _trim(std::move(o._trim))
, _change_counter(o._change_counter)
{ }
clustering_ranges_walker& operator=(clustering_ranges_walker&& o) {
if (this != &o) {
this->~clustering_ranges_walker();
new (this) clustering_ranges_walker(std::move(o));
}
return *this;
public:
clustering_ranges_walker(const schema& s, const query::clustering_row_ranges& ranges, bool with_static_row = true)
: _schema(s)
, _ranges(ranges)
, _current_range(ranges)
, _in_current(with_static_row)
, _with_static_row(with_static_row)
, _current_start(position_in_partition_view::for_static_row())
, _current_end(position_in_partition_view::before_all_clustered_rows()) {
set_current_positions();
}
clustering_ranges_walker(const clustering_ranges_walker&) = delete;
clustering_ranges_walker(clustering_ranges_walker&&) = delete;
clustering_ranges_walker& operator=(const clustering_ranges_walker&) = delete;
clustering_ranges_walker& operator=(clustering_ranges_walker&&) = delete;
// Excludes positions smaller than pos from the ranges.
// pos should be monotonic.
// No constraints between pos and positions passed to advance_to().
@@ -173,17 +162,15 @@ public:
return false;
}
auto i = _current;
while (i != _end) {
auto range_start = position_in_partition_view::for_range_start(*i);
for (const auto& rng : _current_range) {
auto range_start = position_in_partition_view::for_range_start(rng);
if (!less(range_start, end)) {
return false;
}
auto range_end = position_in_partition_view::for_range_end(*i);
auto range_end = position_in_partition_view::for_range_end(rng);
if (less(start, range_end)) {
return true;
}
++i;
}
return false;
@@ -191,18 +178,20 @@ public:
// Returns true if advanced past all contained positions. Any later advance_to() until reset() will return false.
bool out_of_range() const {
return !_in_current && _current == _end;
return !_in_current && !_current_range;
}
// Resets the state of the walker so that advance_to() can be now called for new sequence of positions.
// Any range trimmings still hold after this.
void reset() {
auto trim = std::move(_trim);
auto ctr = _change_counter;
*this = clustering_ranges_walker(_schema, _ranges, _with_static_row);
_change_counter = ctr + 1;
if (trim) {
trim_front(std::move(*trim));
_current_range = _ranges;
_in_current = _with_static_row;
_current_start = position_in_partition_view::for_static_row();
_current_end = position_in_partition_view::before_all_clustered_rows();
set_current_positions();
++_change_counter;
if (_trim) {
trim_front(*std::exchange(_trim, {}));
}
}
@@ -211,6 +200,11 @@ public:
return _current_start;
}
// Returns the upper bound of the last range in provided ranges set
position_in_partition_view uppermost_bound() const {
return position_in_partition_view::for_range_end(_ranges.back());
}
// When lower_bound() changes, this also does
// Always > 0.
size_t lower_bound_change_counter() const {

View File

@@ -112,7 +112,7 @@ const sstring compression_parameters::CHUNK_LENGTH_KB = "chunk_length_kb";
const sstring compression_parameters::CRC_CHECK_CHANCE = "crc_check_chance";
compression_parameters::compression_parameters()
: compression_parameters(nullptr)
: compression_parameters(compressor::lz4)
{}
compression_parameters::~compression_parameters()

View File

@@ -118,6 +118,10 @@ public:
std::map<sstring, sstring> get_options() const;
bool operator==(const compression_parameters& other) const;
bool operator!=(const compression_parameters& other) const;
static compression_parameters no_compression() {
return compression_parameters(nullptr);
}
private:
void validate_options(const std::map<sstring, sstring>&);
};

View File

@@ -242,6 +242,9 @@ batch_size_fail_threshold_in_kb: 50
# The directory where hints files are stored if hinted handoff is enabled.
# hints_directory: /var/lib/scylla/hints
# The directory where hints files are stored for materialized-view updates
# view_hints_directory: /var/lib/scylla/view_hints
# See http://wiki.apache.org/cassandra/HintedHandoff
# May either be "true" or "false" to enable globally, or contain a list

File diff suppressed because it is too large Load Diff

View File

@@ -38,44 +38,44 @@ private:
static bool is_compatible(const column_definition& new_def, const data_type& old_type, column_kind kind) {
return ::is_compatible(new_def.kind, kind) && new_def.type->is_value_compatible_with(*old_type);
}
static atomic_cell upgrade_cell(const abstract_type& new_type, const abstract_type& old_type, atomic_cell_view cell,
atomic_cell::collection_member cm = atomic_cell::collection_member::no) {
if (cell.is_live() && !old_type.is_counter()) {
if (cell.is_live_and_has_ttl()) {
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl(), cm);
}
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cm);
} else {
return atomic_cell(new_type, cell);
}
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (!is_compatible(new_def, old_type, kind) || cell.timestamp() <= new_def.dropped_at()) {
return;
}
auto new_cell = [&] {
if (cell.is_live() && !old_type->is_counter()) {
if (cell.is_live_and_has_ttl()) {
return atomic_cell_or_collection(
atomic_cell::make_live(*new_def.type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl())
);
}
return atomic_cell_or_collection(
atomic_cell::make_live(*new_def.type, cell.timestamp(), cell.value().linearize())
);
} else {
return atomic_cell_or_collection(*new_def.type, cell);
}
}();
dst.apply(new_def, std::move(new_cell));
dst.apply(new_def, upgrade_cell(*new_def.type, *old_type, cell));
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
if (!is_compatible(new_def, old_type, kind)) {
return;
}
cell.data.with_linearized([&] (bytes_view cell_bv) {
auto&& ctype = static_pointer_cast<const collection_type_impl>(old_type);
auto old_view = ctype->deserialize_mutation_form(cell_bv);
auto new_ctype = static_pointer_cast<const collection_type_impl>(new_def.type);
auto old_ctype = static_pointer_cast<const collection_type_impl>(old_type);
auto old_view = old_ctype->deserialize_mutation_form(cell_bv);
collection_type_impl::mutation_view new_view;
collection_type_impl::mutation new_view;
if (old_view.tomb.timestamp > new_def.dropped_at()) {
new_view.tomb = old_view.tomb;
}
for (auto& c : old_view.cells) {
if (c.second.timestamp() > new_def.dropped_at()) {
new_view.cells.emplace_back(std::move(c));
new_view.cells.emplace_back(c.first, upgrade_cell(*new_ctype->value_comparator(), *old_ctype->value_comparator(), c.second, atomic_cell::collection_member::yes));
}
}
dst.apply(new_def, ctype->serialize_mutation_form(std::move(new_view)));
if (new_view.tomb || !new_view.cells.empty()) {
dst.apply(new_def, new_ctype->serialize_mutation_form(std::move(new_view)));
}
});
}
public:

View File

@@ -470,12 +470,13 @@ insertStatement returns [::shared_ptr<raw::modification_statement> expr]
std::vector<::shared_ptr<cql3::column_identifier::raw>> column_names;
std::vector<::shared_ptr<cql3::term::raw>> values;
bool if_not_exists = false;
bool default_unset = false;
::shared_ptr<cql3::term::raw> json_value;
}
: K_INSERT K_INTO cf=columnFamilyName
'(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
( K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
('(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
@@ -487,13 +488,15 @@ insertStatement returns [::shared_ptr<raw::modification_statement> expr]
}
| K_JSON
json_token=jsonValue { json_value = $json_token.value; }
( K_DEFAULT K_UNSET { default_unset = true; } | K_DEFAULT K_NULL )?
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_json_statement>(std::move(cf),
std::move(attrs),
std::move(json_value),
if_not_exists);
if_not_exists,
default_unset);
}
)
;
@@ -1531,12 +1534,22 @@ inMarkerForTuple returns [shared_ptr<cql3::tuples::in_raw> marker]
| ':' name=ident { $marker = new_tuple_in_bind_variables(name); }
;
comparatorType returns [shared_ptr<cql3_type::raw> t]
: n=native_type { $t = cql3_type::raw::from(n); }
| c=collection_type { $t = c; }
| tt=tuple_type { $t = tt; }
// The comparator_type rule is used for users' queries (internal=false)
// and for internal calls from db::cql_type_parser::parse() (internal=true).
// The latter is used for reading schemas stored in the system tables, and
// may support additional column types that cannot be created through CQL,
// but only internally through code. Today the only such type is "empty":
// Scylla code internally creates columns with type "empty" or collections
// "empty" to represent unselected columns in materialized views.
// If a user (internal=false) tries to use "empty" as a type, it is treated -
// as do all unknown types - as an attempt to use a user-defined type, and
// we report this name is reserved (as for _reserved_type_names()).
comparator_type [bool internal] returns [shared_ptr<cql3_type::raw> t]
: n=native_or_internal_type[internal] { $t = cql3_type::raw::from(n); }
| c=collection_type[internal] { $t = c; }
| tt=tuple_type[internal] { $t = tt; }
| id=userTypeName { $t = cql3::cql3_type::raw::user_type(id); }
| K_FROZEN '<' f=comparatorType '>'
| K_FROZEN '<' f=comparator_type[internal] '>'
{
try {
$t = cql3::cql3_type::raw::frozen(f);
@@ -1558,6 +1571,22 @@ comparatorType returns [shared_ptr<cql3_type::raw> t]
#endif
;
native_or_internal_type [bool internal] returns [shared_ptr<cql3_type> t]
: n=native_type { $t = n; }
// The "internal" types, only supported when internal==true:
| K_EMPTY {
if (internal) {
$t = cql3_type::empty;
} else {
add_recognition_error("Invalid (reserved) user type name empty");
}
}
;
comparatorType returns [shared_ptr<cql3_type::raw> t]
: tt=comparator_type[false] { $t = tt; }
;
native_type returns [shared_ptr<cql3_type> t]
: K_ASCII { $t = cql3_type::ascii; }
| K_BIGINT { $t = cql3_type::bigint; }
@@ -1582,24 +1611,24 @@ native_type returns [shared_ptr<cql3_type> t]
| K_TIME { $t = cql3_type::time; }
;
collection_type returns [shared_ptr<cql3::cql3_type::raw> pt]
: K_MAP '<' t1=comparatorType ',' t2=comparatorType '>'
collection_type [bool internal] returns [shared_ptr<cql3::cql3_type::raw> pt]
: K_MAP '<' t1=comparator_type[internal] ',' t2=comparator_type[internal] '>'
{
// if we can't parse either t1 or t2, antlr will "recover" and we may have t1 or t2 null.
if (t1 && t2) {
$pt = cql3::cql3_type::raw::map(t1, t2);
}
}
| K_LIST '<' t=comparatorType '>'
| K_LIST '<' t=comparator_type[internal] '>'
{ if (t) { $pt = cql3::cql3_type::raw::list(t); } }
| K_SET '<' t=comparatorType '>'
| K_SET '<' t=comparator_type[internal] '>'
{ if (t) { $pt = cql3::cql3_type::raw::set(t); } }
;
tuple_type returns [shared_ptr<cql3::cql3_type::raw> t]
tuple_type [bool internal] returns [shared_ptr<cql3::cql3_type::raw> t]
@init{ std::vector<shared_ptr<cql3::cql3_type::raw>> types; }
: K_TUPLE '<'
t1=comparatorType { types.push_back(t1); } (',' tn=comparatorType { types.push_back(tn); })*
t1=comparator_type[internal] { types.push_back(t1); } (',' tn=comparator_type[internal] { types.push_back(tn); })*
'>' { $t = cql3::cql3_type::raw::tuple(std::move(types)); }
;
@@ -1625,7 +1654,7 @@ unreserved_keyword returns [sstring str]
unreserved_function_keyword returns [sstring str]
: u=basic_unreserved_keyword { $str = u; }
| t=native_type { $str = t->to_string(); }
| t=native_or_internal_type[true] { $str = t->to_string(); }
;
basic_unreserved_keyword returns [sstring str]
@@ -1809,6 +1838,10 @@ K_OR: O R;
K_REPLACE: R E P L A C E;
K_DETERMINISTIC: D E T E R M I N I S T I C;
K_JSON: J S O N;
K_DEFAULT: D E F A U L T;
K_UNSET: U N S E T;
K_EMPTY: E M P T Y;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;

View File

@@ -77,12 +77,14 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
if (tval.is_unset_value()) {
return now;
}
return with_linearized(*tval, [] (bytes_view val) {
try {
data_type_for<int64_t>()->validate(*tval);
data_type_for<int64_t>()->validate(val);
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid timestamp value");
}
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(*tval));
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(val));
});
}
int32_t attributes::get_time_to_live(const query_options& options) {
@@ -96,14 +98,16 @@ int32_t attributes::get_time_to_live(const query_options& options) {
if (tval.is_unset_value()) {
return 0;
}
auto ttl = with_linearized(*tval, [] (bytes_view val) {
try {
data_type_for<int32_t>()->validate(*tval);
data_type_for<int32_t>()->validate(val);
}
catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid TTL value");
}
auto ttl = value_cast<int32_t>(data_type_for<int32_t>()->deserialize(*tval));
return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(val));
});
if (ttl < 0) {
throw exceptions::invalid_request_exception("A TTL must be greater or equal to 0");
}

View File

@@ -127,7 +127,11 @@ column_identifier::new_selector_factory(database& db, schema_ptr schema, std::ve
if (!def) {
throw exceptions::invalid_request_exception(sprint("Undefined name %s in selection clause", _text));
}
// Do not allow explicitly selecting hidden columns. We also skip them on
// "SELECT *" (see selection::wildcard()).
if (def->is_view_virtual()) {
throw exceptions::invalid_request_exception(sprint("Undefined name %s in selection clause", _text));
}
return selection::simple_selector::new_factory(def->name_as_text(), add_and_get_index(*def, defs), def->type);
}

View File

@@ -225,7 +225,9 @@ public:
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
auto increment = with_linearized(*value, [] (bytes_view value_view) {
return value_cast<int64_t>(long_type->deserialize_value(value_view));
});
m.set_cell(prefix, column, make_counter_update_cell(increment, params));
}
};
@@ -240,7 +242,9 @@ public:
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
auto increment = with_linearized(*value, [] (bytes_view value_view) {
return value_cast<int64_t>(long_type->deserialize_value(value_view));
});
if (increment == std::numeric_limits<int64_t>::min()) {
throw exceptions::invalid_request_exception(sprint("The negation of %d overflows supported counter precision (signed 8 bytes integer)", increment));
}

View File

@@ -67,6 +67,12 @@ class error_collector : public error_listener<RecognizerType, ExceptionBaseType>
*/
const sstring_view _query;
/**
* An empty bitset to be used as a workaround for AntLR null dereference
* bug.
*/
static typename ExceptionBaseType::BitsetListType _empty_bit_list;
public:
/**
@@ -144,6 +150,14 @@ private:
break;
}
default:
// AntLR Exception class has a bug of dereferencing a null
// pointer in the displayRecognitionError. The following
// if statement makes sure it will not be null before the
// call to that function (displayRecognitionError).
// bug reference: https://github.com/antlr/antlr3/issues/191
if (!ex->get_expectingSet()) {
ex->set_expectingSet(&_empty_bit_list);
}
ex->displayRecognitionError(token_names, msg);
}
return msg.str();
@@ -345,4 +359,8 @@ private:
#endif
};
template<typename RecognizerType, typename TokenType, typename ExceptionBaseType>
typename ExceptionBaseType::BitsetListType
error_collector<RecognizerType,TokenType,ExceptionBaseType>::_empty_bit_list = typename ExceptionBaseType::BitsetListType();
}

View File

@@ -177,7 +177,7 @@ shared_ptr<function>
make_to_json_function(data_type t) {
return make_native_scalar_function<true>("tojson", utf8_type, {t},
[t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
return utf8_type->decompose(t->to_json_string(parameters[0].value()));
return utf8_type->decompose(t->to_json_string(parameters[0]));
});
}
@@ -461,9 +461,9 @@ function_call::make_terminal(shared_ptr<function> fun, cql3::raw_value result, c
}
auto ctype = static_pointer_cast<const collection_type_impl>(fun->return_type());
bytes_view res;
fragmented_temporary_buffer::view res;
if (result) {
res = *result;
res = fragmented_temporary_buffer::view(bytes_view(*result));
}
if (&ctype->_kind == &collection_type_impl::kind::list) {
return make_shared(lists::value::from_serialized(std::move(res), static_pointer_cast<const list_type_impl>(ctype), sf));

View File

@@ -115,11 +115,12 @@ lists::literal::to_string() const {
}
lists::value
lists::value::from_serialized(bytes_view v, list_type type, cql_serialization_format sf) {
lists::value::from_serialized(const fragmented_temporary_buffer::view& val, list_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserializeForNativeProtocol()?!
return with_linearized(val, [&] (bytes_view v) {
auto l = value_cast<list_type_impl::native_type>(type->deserialize(v, sf));
std::vector<bytes_opt> elements;
elements.reserve(l.size());
@@ -128,6 +129,7 @@ lists::value::from_serialized(bytes_view v, list_type type, cql_serialization_fo
elements.push_back(element.is_null() ? bytes_opt() : bytes_opt(type->get_elements_type()->decompose(element)));
}
return value(std::move(elements));
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -285,7 +287,9 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
return;
}
auto idx = net::ntoh(int32_t(*unaligned_cast<int32_t>(index->begin())));
auto idx = with_linearized(*index, [] (bytes_view v) {
return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(v));
});
auto&& existing_list_opt = params.get_prefetched_list(m.key().view(), prefix.view(), column);
if (!existing_list_opt) {
throw exceptions::invalid_request_exception("Attempted to set an element on a list which is null");

View File

@@ -79,7 +79,7 @@ public:
explicit value(std::vector<bytes_opt> elements)
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, list_type type, cql_serialization_format sf);
static value from_serialized(const fragmented_temporary_buffer::view& v, list_type type, cql_serialization_format sf);
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(shared_ptr<list_type_impl> lt, const value& v);

View File

@@ -152,18 +152,20 @@ maps::literal::to_string() const {
}
maps::value
maps::value::from_serialized(bytes_view value, map_type type, cql_serialization_format sf) {
maps::value::from_serialized(const fragmented_temporary_buffer::view& fragmented_value, map_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserialize_for_native_protocol?!
return with_linearized(fragmented_value, [&] (bytes_view value) {
auto m = value_cast<map_type_impl::native_type>(type->deserialize(value, sf));
std::map<bytes, bytes, serialized_compare> map(type->get_keys_type()->as_less_comparator());
for (auto&& e : m) {
map.emplace(type->get_keys_type()->decompose(e.first),
type->get_values_type()->decompose(e.second));
}
return { std::move(map) };
return maps::value { std::move(map) };
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -233,10 +235,10 @@ maps::delayed_value::bind(const query_options& options) {
if (key_bytes.is_unset_value()) {
throw exceptions::invalid_request_exception("unset value is not supported inside collections");
}
if (key_bytes->size() > std::numeric_limits<uint16_t>::max()) {
if (key_bytes->size_bytes() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Map key is too long. Map keys are limited to %d bytes but %d bytes keys provided",
std::numeric_limits<uint16_t>::max(),
key_bytes->size()));
key_bytes->size_bytes()));
}
auto value_bytes = value->bind_and_get(options);
if (value_bytes.is_null()) {
@@ -331,7 +333,7 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
auto ctype = static_pointer_cast<const map_type_impl>(column.type);
for (auto&& e : map_value->map) {
mut.cells.emplace_back(e.first, params.make_cell(*ctype->get_values_type(), e.second, atomic_cell::collection_member::yes));
mut.cells.emplace_back(e.first, params.make_cell(*ctype->get_values_type(), fragmented_temporary_buffer::view(e.second), atomic_cell::collection_member::yes));
}
auto col_mut = ctype->serialize_mutation_form(std::move(mut));
m.set_cell(prefix, column, std::move(col_mut));
@@ -342,7 +344,7 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
} else {
auto v = map_type_impl::serialize_partially_deserialized_form({map_value->map.begin(), map_value->map.end()},
cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(*column.type, std::move(v)));
m.set_cell(prefix, column, params.make_cell(*column.type, fragmented_temporary_buffer::view(std::move(v))));
}
}
}

View File

@@ -81,7 +81,7 @@ public:
value(std::map<bytes, bytes, serialized_compare> map)
: map(std::move(map)) {
}
static value from_serialized(bytes_view value, map_type type, cql_serialization_format sf);
static value from_serialized(const fragmented_temporary_buffer::view& value, map_type type, cql_serialization_format sf);
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf);
bool equals(map_type mt, const value& v);

View File

@@ -92,6 +92,10 @@ public:
}
static atomic_cell make_cell(const abstract_type& type, bytes_view value, const update_parameters& params) {
return params.make_cell(type, fragmented_temporary_buffer::view(value));
}
static atomic_cell make_cell(const abstract_type& type, const fragmented_temporary_buffer::view& value, const update_parameters& params) {
return params.make_cell(type, value);
}

View File

@@ -135,79 +135,32 @@ query_options::query_options(std::vector<cql3::raw_value> values)
db::consistency_level::ONE, infinite_timeout_config, std::move(values))
{}
db::consistency_level query_options::get_consistency() const
{
return _consistency;
}
cql3::raw_value_view query_options::get_value_at(size_t idx) const
{
return _value_views.at(idx);
}
size_t query_options::get_values_count() const
{
return _value_views.size();
}
cql3::raw_value_view query_options::make_temporary(cql3::raw_value value) const
{
if (value) {
_temporaries.emplace_back(value->begin(), value->end());
auto& temporary = _temporaries.back();
return cql3::raw_value_view::make_value(bytes_view{temporary.data(), temporary.size()});
auto value_view = *value;
auto ptr = _temporaries.write_place_holder(value_view.size());
std::copy_n(value_view.data(), value_view.size(), ptr);
return cql3::raw_value_view::make_value(fragmented_temporary_buffer::view(bytes_view{ptr, value_view.size()}));
}
return cql3::raw_value_view::make_null();
}
bool query_options::skip_metadata() const
bytes_view query_options::linearize(fragmented_temporary_buffer::view view) const
{
return _skip_metadata;
}
int32_t query_options::get_page_size() const
{
return get_specific_options().page_size;
}
::shared_ptr<service::pager::paging_state> query_options::get_paging_state() const
{
return get_specific_options().state;
}
std::experimental::optional<db::consistency_level> query_options::get_serial_consistency() const
{
return get_specific_options().serial_consistency;
}
api::timestamp_type query_options::get_timestamp(service::query_state& state) const
{
auto tstamp = get_specific_options().timestamp;
return tstamp != api::missing_timestamp ? tstamp : state.get_timestamp();
}
int query_options::get_protocol_version() const
{
return _cql_serialization_format.protocol_version();
}
cql_serialization_format query_options::get_cql_serialization_format() const
{
return _cql_serialization_format;
}
const query_options::specific_options& query_options::get_specific_options() const
{
return _options;
}
const query_options& query_options::for_statement(size_t i) const
{
if (!_batch_options) {
// No per-statement options supplied, so use the "global" options
return *this;
if (view.empty()) {
return { };
} else if (std::next(view.begin()) == view.end()) {
return *view.begin();
} else {
auto ptr = _temporaries.write_place_holder(view.size_bytes());
auto dst = ptr;
using boost::range::for_each;
for_each(view, [&] (bytes_view bv) {
dst = std::copy(bv.begin(), bv.end(), dst);
});
return bytes_view(ptr, view.size_bytes());
}
return _batch_options->at(i);
}
void query_options::prepare(const std::vector<::shared_ptr<column_specification>>& specs)
@@ -217,29 +170,24 @@ void query_options::prepare(const std::vector<::shared_ptr<column_specification>
}
auto& names = *_names;
std::vector<cql3::raw_value> ordered_values;
std::vector<cql3::raw_value_view> ordered_values;
ordered_values.reserve(specs.size());
for (auto&& spec : specs) {
auto& spec_name = spec->name->text();
for (size_t j = 0; j < names.size(); j++) {
if (names[j] == spec_name) {
ordered_values.emplace_back(_values[j]);
ordered_values.emplace_back(_value_views[j]);
break;
}
}
}
_values = std::move(ordered_values);
fill_value_views();
_value_views = std::move(ordered_values);
}
void query_options::fill_value_views()
{
for (auto&& value : _values) {
if (value) {
_value_views.emplace_back(cql3::raw_value_view::make_value(bytes_view{*value}));
} else {
_value_views.emplace_back(cql3::raw_value_view::make_null());
}
_value_views.emplace_back(value.to_view());
}
}

View File

@@ -75,7 +75,7 @@ private:
const std::experimental::optional<std::vector<sstring_view>> _names;
std::vector<cql3::raw_value> _values;
std::vector<cql3::raw_value_view> _value_views;
mutable std::vector<std::vector<int8_t>> _temporaries;
mutable bytes_ostream _temporaries;
const bool _skip_metadata;
const specific_options _options;
cql_serialization_format _cql_serialization_format;
@@ -156,33 +156,76 @@ public:
std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(std::unique_ptr<query_options>, ::shared_ptr<service::pager::paging_state> paging_state);
db::consistency_level get_consistency() const;
const timeout_config& get_timeout_config() const { return _timeout_config; }
cql3::raw_value_view get_value_at(size_t idx) const;
db::consistency_level get_consistency() const {
return _consistency;
}
cql3::raw_value_view get_value_at(size_t idx) const {
return _value_views.at(idx);
}
size_t get_values_count() const {
return _value_views.size();
}
cql3::raw_value_view make_temporary(cql3::raw_value value) const;
size_t get_values_count() const;
bool skip_metadata() const;
/** The pageSize for this query. Will be <= 0 if not relevant for the query. */
int32_t get_page_size() const;
bytes_view linearize(fragmented_temporary_buffer::view) const;
bool skip_metadata() const {
return _skip_metadata;
}
int32_t get_page_size() const {
return get_specific_options().page_size;
}
/** The paging state for this query, or null if not relevant. */
::shared_ptr<service::pager::paging_state> get_paging_state() const;
::shared_ptr<service::pager::paging_state> get_paging_state() const {
return get_specific_options().state;
}
/** Serial consistency for conditional updates. */
std::experimental::optional<db::consistency_level> get_serial_consistency() const;
std::experimental::optional<db::consistency_level> get_serial_consistency() const {
return get_specific_options().serial_consistency;
}
api::timestamp_type get_timestamp(service::query_state& state) const {
auto tstamp = get_specific_options().timestamp;
return tstamp != api::missing_timestamp ? tstamp : state.get_timestamp();
}
/**
* The protocol version for the query. Will be 3 if the object don't come from
* a native protocol request (i.e. it's been allocated locally or by CQL-over-thrift).
*/
int get_protocol_version() const {
return _cql_serialization_format.protocol_version();
}
cql_serialization_format get_cql_serialization_format() const {
return _cql_serialization_format;
}
const query_options::specific_options& get_specific_options() const {
return _options;
}
// Mainly for the sake of BatchQueryOptions
const query_options& for_statement(size_t i) const {
if (!_batch_options) {
// No per-statement options supplied, so use the "global" options
return *this;
}
return _batch_options->at(i);
}
const std::experimental::optional<std::vector<sstring_view>>& get_names() const noexcept {
return _names;
}
api::timestamp_type get_timestamp(service::query_state& state) const;
/**
* The protocol version for the query. Will be 3 if the object don't come from
* a native protocol request (i.e. it's been allocated locally or by CQL-over-thrift).
*/
int get_protocol_version() const;
cql_serialization_format get_cql_serialization_format() const;
// Mainly for the sake of BatchQueryOptions
const specific_options& get_specific_options() const;
const query_options& for_statement(size_t i) const;
void prepare(const std::vector<::shared_ptr<column_specification>>& specs);
private:
void fill_value_views();

View File

@@ -243,7 +243,17 @@ query_processor::query_processor(service::storage_proxy& proxy, distributed<data
sm::make_gauge(
"user_prepared_auth_cache_footprint",
[this] { return _authorized_prepared_cache.memory_footprint(); },
sm::description("Size (in bytes) of the authenticated prepared statements cache."))
sm::description("Size (in bytes) of the authenticated prepared statements cache.")),
sm::make_counter(
"reverse_queries",
_cql_stats.reverse_queries,
sm::description("Counts number of CQL SELECT requests with ORDER BY DESC.")),
sm::make_counter(
"unpaged_select_queries",
_cql_stats.unpaged_select_queries,
sm::description("Counts number of unpaged CQL SELECT requests.")),
});
@@ -263,11 +273,11 @@ query_processor::process(const sstring_view& query_string, service::query_state&
log.trace("process: \"{}\"", query_string);
tracing::trace(query_state.get_trace_state(), "Parsing a statement");
auto p = get_statement(query_string, query_state.get_client_state());
options.prepare(p->bound_names);
auto cql_statement = p->statement;
if (cql_statement->get_bound_terms() != options.get_values_count()) {
throw exceptions::invalid_request_exception("Invalid amount of bind variables");
}
options.prepare(p->bound_names);
warn(unimplemented::cause::METRICS);
#if 0

View File

@@ -45,12 +45,16 @@
#include "cql3/statements/request_validations.hh"
#include "cql3/restrictions/primary_key_restrictions.hh"
#include "cql3/statements/request_validations.hh"
#include "cql3/restrictions/single_column_primary_key_restrictions.hh"
namespace cql3 {
namespace restrictions {
class multi_column_restriction : public primary_key_restrictions<clustering_key_prefix> {
private:
bool _has_only_asc_columns;
bool _has_only_desc_columns;
protected:
schema_ptr _schema;
std::vector<const column_definition*> _column_defs;
@@ -58,7 +62,9 @@ public:
multi_column_restriction(schema_ptr schema, std::vector<const column_definition*>&& defs)
: _schema(schema)
, _column_defs(std::move(defs))
{ }
{
update_asc_desc_existence();
}
virtual bool is_multi_column() const override {
return true;
@@ -84,6 +90,7 @@ public:
"Mixing single column relations and multi column relations on clustering columns is not allowed");
auto as_pkr = static_pointer_cast<primary_key_restrictions<clustering_key_prefix>>(other);
do_merge_with(as_pkr);
update_asc_desc_existence();
}
bool is_satisfied_by(const schema& schema,
@@ -140,6 +147,40 @@ protected:
virtual bool is_supported_by(const secondary_index::index& index) const = 0;
/**
* @return true if the restriction contains at least one column of each
* ordering, false otherwise.
*/
bool is_mixed_order() const {
return !is_desc_order() && !is_asc_order();
}
/**
* @return true if all the restricted columns ordered in descending
* order, false otherwise
*/
bool is_desc_order() const {
return _has_only_desc_columns;
}
/**
* @return true if all the restricted columns ordered in ascending
* order, false otherwise
*/
bool is_asc_order() const {
return _has_only_asc_columns;
}
private:
/**
* Updates the _has_only_asc_columns and _has_only_desc_columns fields.
*/
void update_asc_desc_existence() {
std::size_t num_of_desc =
std::count_if(_column_defs.begin(), _column_defs.end(), [] (const column_definition* cd) { return cd->type->is_reversed(); });
_has_only_asc_columns = num_of_desc == 0;
_has_only_desc_columns = num_of_desc == _column_defs.size();
}
#if 0
/**
* Check if this type of restriction is supported for the specified column by the specified index.
@@ -385,6 +426,7 @@ protected:
};
class multi_column_restriction::slice final : public multi_column_restriction {
using restriction_shared_ptr = ::shared_ptr<primary_key_restrictions<clustering_key_prefix>>;
private:
term_slice _slice;
@@ -422,24 +464,11 @@ public:
}
virtual std::vector<bounds_range_type> bounds_ranges(const query_options& options) const override {
// FIXME: doesn't work properly with mixed CLUSTERING ORDER (CASSANDRA-7281)
auto read_bound = [&] (statements::bound b) -> std::experimental::optional<bounds_range_type::bound> {
if (!has_bound(b)) {
return {};
}
auto vals = component_bounds(b, options);
for (unsigned i = 0; i < vals.size(); i++) {
statements::request_validations::check_not_null(vals[i], "Invalid null value in condition for column %s", _column_defs.at(i)->name_as_text());
}
auto prefix = clustering_key_prefix::from_optional_exploded(*_schema, vals);
return bounds_range_type::bound(prefix, is_inclusive(b));
};
auto range = wrapping_range<clustering_key_prefix>(read_bound(statements::bound::START), read_bound(statements::bound::END));
auto bounds = bound_view::from_range(range);
if (bound_view::compare(*_schema)(bounds.second, bounds.first)) {
return { };
if (!is_mixed_order()) {
return bounds_ranges_unified_order(options);
} else {
return bounds_ranges_mixed_order(options);
}
return { bounds_range_type(std::move(range)) };
}
#if 0
@Override
@@ -514,6 +543,221 @@ private:
auto value = static_pointer_cast<tuples::value>(_slice.bound(b)->bind(options));
return value->get_elements();
}
std::vector<bytes_opt> read_bound_components(const query_options& options, statements::bound b) const {
if (!has_bound(b)) {
return {};
}
auto vals = component_bounds(b, options);
for (unsigned i = 0; i < vals.size(); i++) {
statements::request_validations::check_not_null(vals[i], "Invalid null value in condition for column %s", _column_defs.at(i)->name_as_text());
}
return vals;
}
/**
* Retrieve the bounds for the case that all clustering columns have the same order.
* Having the same order implies we can do a prefix search on the data.
* @param options the query options
* @return the vector of ranges for the restriction
*/
std::vector<bounds_range_type> bounds_ranges_unified_order(const query_options& options) const {
auto start_prefix = clustering_key_prefix::from_optional_exploded(*_schema, read_bound_components(options, statements::bound::START));
auto start_bound = bounds_range_type::bound(std::move(start_prefix), is_inclusive(statements::bound::START));
auto end_prefix = clustering_key_prefix::from_optional_exploded(*_schema, read_bound_components(options, statements::bound::END));
auto end_bound = bounds_range_type::bound(std::move(end_prefix), is_inclusive(statements::bound::END));
auto make_range = [&] () {
if (is_asc_order()) {
return bounds_range_type::make(start_bound, end_bound);
} else {
return bounds_range_type::make(end_bound, start_bound);
}
};
auto range = make_range();
auto bounds = bound_view::from_range(range);
if (bound_view::compare(*_schema)(bounds.second, bounds.first)) {
return { };
}
return { std::move(range) };
}
/**
* Retrieve the bounds when clustering columns are mixed order
* (contains ASC and DESC together).
* Having mixed order implies that a prefix search can't take place,
* instead, the bounds have to be broken down to separate prefix serchable
* ranges such that their combination is equivalent to the original range.
* @param options the query options
* @return the vector of ranges for the restriction
*/
std::vector<bounds_range_type> bounds_ranges_mixed_order(const query_options& options) const {
std::vector<bounds_range_type> ret_ranges;
auto mixed_order_restrictions = build_mixed_order_restriction_set(options);
ret_ranges.reserve(mixed_order_restrictions.size());
for (auto r : mixed_order_restrictions) {
for (auto&& range : r->bounds_ranges(options)) {
ret_ranges.emplace_back(std::move(range));
}
}
return ret_ranges;
}
/**
* The function returns the first real inequality component.
* The first real inequality is the index of the first component in the
* tuple that will turn into a slice single column restriction.
* For example: (a, b, c) > (0, 1, 2) and (a, b, c) < (0, 1, 5) will be
* broken into one single column restriction set of the form:
* a = 0 and b = 1 and c > 2 and c < 5 , c is the first element that has
* inequality so for this case the function will return 2.
* @param start_components - the components of the starts tuple range.
* @param end_components - the components of the end tuple range.
* @return an empty value if not found and the index of the first index that
* will yield inequality
*/
std::optional<std::size_t> find_first_neq_component(std::vector<bytes_opt>& start_components,
std::vector<bytes_opt>& end_components) const {
size_t common_components_count = std::min(start_components.size(), end_components.size());
for (size_t i = 0; i < common_components_count ; i++) {
if (start_components[i].value() != end_components[i].value()) {
return i;
}
}
size_t max_components_count = std::max(start_components.size(), end_components.size());
if (common_components_count < max_components_count) {
return common_components_count;
} else {
return std::nullopt;
}
}
/**
* Creates a single column restriction which is either slice or equality.
* @param bound - if bound is empty this is an equality, if its either START or END ,
* this is the corresponding slice restriction.
* @param inclusive - is the slice inclusive (ignored for equality).
* @param column_pos - the column position to restrict
* @param value - the value to restrict the colum with.
* @return a shared pointer to the just created restriction.
*/
::shared_ptr<restriction> make_single_column_restriction(std::optional<cql3::statements::bound> bound, bool inclusive,
std::size_t column_pos,const bytes_opt& value) const {
::shared_ptr<cql3::term> term = ::make_shared<cql3::constants::value>(cql3::raw_value::make_value(value));
if (!bound){
return ::make_shared<cql3::restrictions::single_column_restriction::EQ>(*_column_defs[column_pos], term);
} else {
return ::make_shared<cql3::restrictions::single_column_restriction::slice>(*_column_defs[column_pos], bound.value(), inclusive, term);
}
}
/**
* A helper function to create a single column restrictions set from a tuple relation on
* clustering keys.
* i.e : (a,b,c) >= (0,1,2) will become:
* 1.a > 0
* 2. a = 0 and b > 1
* 3. a = 0 and b = 1 and c >=2
* @param bound - determines if the operator is '>' (START) or '<' (END)
* @param bound_inclusive - determines if to append equality to the operator i.e: if > becomes >=
* @param bound_values - the tuple values for the restriction
* @param first_neq_component - the first component that will have inequality.
* for the example above, if this parameter is 1, only restrictions 2 and 3 will be created.
* this parameter helps to facilitate the nuances of breaking more complex relations, for example when
* there is in existence a second condition limiting the other side of the bound
* i.e:(a,b,c) >= (0,1,2) and (a,b,c) < (5,6,7), this will require each bound to use the parameter.
* @return the single column restriction set built according to the above parameters.
*/
std::vector<restriction_shared_ptr> make_single_bound_restrictions(statements::bound bound, bool bound_inclusive,
std::vector<bytes_opt>& bound_values,
std::size_t first_neq_component) const{
std::vector<restriction_shared_ptr> ret;
std::size_t num_of_restrictions = bound_values.size() - first_neq_component;
ret.reserve(num_of_restrictions);
for (std::size_t i = 0;i < num_of_restrictions ; i++) {
ret.emplace_back(::make_shared<cql3::restrictions::single_column_primary_key_restrictions<clustering_key>>(_schema, false));
std::size_t neq_component_idx = first_neq_component + i;
for (std::size_t j = 0;j < neq_component_idx; j++) {
ret[i]->merge_with(make_single_column_restriction(std::nullopt, false, j, bound_values[j]));
}
bool inclusive = (i == (num_of_restrictions-1)) && bound_inclusive;
ret[i]->merge_with(make_single_column_restriction(bound, inclusive, neq_component_idx, bound_values[neq_component_idx]));
}
return ret;
}
/**
* Builds and returns a set of restrictions such that the union of their ranges (the restrictions OR-ed together)
* is logically identical to this restriction, with the additional property that it can execute
* correctly when the clustering columns are with "mixed order" - contains ASC and DESC orderings.
* for more information: https://github.com/scylladb/scylla/issues/2050
* @param options - the query options
* @return set of restrictions which their ranges union is logically identical to this restriction.
*/
std::vector<::shared_ptr<primary_key_restrictions<clustering_key_prefix>>>
build_mixed_order_restriction_set(const query_options& options) const {
std::vector<restriction_shared_ptr> ret;
auto start_components = read_bound_components(options, statements::bound::START);
auto end_components = read_bound_components(options, statements::bound::END);
bool start_inclusive = is_inclusive(statements::bound::START);
bool end_inclusive = is_inclusive(statements::bound::END);
std::optional<std::size_t> first_neq_component = std::nullopt;
// find the first index of the first component that is not equal between the tuples.
if (start_components.empty() || end_components.empty()) {
first_neq_component = 0;
} else {
auto tuple_mismatch = std::mismatch(start_components.begin(), start_components.end(),
end_components.begin(), end_components.end());
if ((tuple_mismatch.first != start_components.end()) ||
(tuple_mismatch.second != end_components.end())) {
first_neq_component = std::distance(start_components.begin(), tuple_mismatch.first);
}
}
// this is either a simple equality or a never fulfilled restriction
if (!first_neq_component && start_inclusive && end_inclusive) {
// This is a simple equality case
shared_ptr<cql3::term> term = ::make_shared<cql3::tuples::value>(start_components);
ret.emplace_back(::make_shared<cql3::restrictions::multi_column_restriction::EQ>(_schema, _column_defs, term));
return ret;
} else if (!first_neq_component) {
// This is a contradiction case
return {};
} else if ((*first_neq_component == end_components.size() && !end_inclusive ) ||
(*first_neq_component == start_components.size() && !start_inclusive )) {
// This is a case where one bound is a prefix of the other. If this prefix bound
// is not inclusive the result will be an empty set.
return {};
}
bool start_components_exists = (start_components.size() - first_neq_component.value()) > 0;
bool end_components_exists = (end_components.size() - first_neq_component.value()) > 0;
bool both_components_exists = start_components_exists && end_components_exists;
if (start_components_exists) {
auto restrictions =
make_single_bound_restrictions(statements::bound::START, start_inclusive, start_components, first_neq_component.value());
for (auto&& r : restrictions) {
ret.emplace_back(r);
}
}
if (end_components_exists) {
auto restrictions =
make_single_bound_restrictions(statements::bound::END, end_inclusive,
end_components, first_neq_component.value() + both_components_exists);
for (auto&& r : restrictions) {
ret.emplace_back(r);
}
}
if (both_components_exists) {
bool inclusive = end_inclusive && ((end_components.size() - first_neq_component.value()) == 1);
ret[0]->merge_with(make_single_column_restriction(statements::bound::END, inclusive, first_neq_component.value(),
end_components[first_neq_component.value()]));
}
return ret;
}
};
}

View File

@@ -88,6 +88,7 @@ public:
using restrictions::uses_function;
using restrictions::has_supporting_index;
using restrictions::values;
bool empty() const override {
return get_column_defs().empty();
@@ -99,6 +100,28 @@ public:
bool has_unrestricted_components(const schema& schema) const;
virtual bool needs_filtering(const schema& schema) const;
// How long a prefix of the restrictions could have resulted in
// need_filtering() == false. These restrictions do not need to be
// applied during filtering.
// For example, if we have the filter "c1 < 3 and c2 > 3", c1 does
// not need filtering (just a read stopping at c1=3) but c2 does,
// so num_prefix_columns_that_need_not_be_filtered() will be 1.
virtual unsigned int num_prefix_columns_that_need_not_be_filtered() const {
return 0;
}
virtual bool is_all_eq() const {
return false;
}
virtual size_t prefix_size() const {
return 0;
}
size_t prefix_size(const schema_ptr schema) const {
return 0;
}
};
template<>
@@ -122,5 +145,23 @@ inline bool primary_key_restrictions<clustering_key>::needs_filtering(const sche
return false;
}
template<>
inline size_t primary_key_restrictions<clustering_key>::prefix_size(const schema_ptr schema) const {
size_t count = 0;
if (schema->clustering_key_columns().empty()) {
return count;
}
auto column_defs = get_column_defs();
column_id expected_column_id = schema->clustering_key_columns().begin()->id;
for (auto&& cdef : column_defs) {
if (schema->position(*cdef) != expected_column_id) {
return count;
}
expected_column_id++;
count++;
}
return count;
}
}
}

View File

@@ -68,6 +68,10 @@ public:
virtual std::vector<bytes_opt> values(const query_options& options) const = 0;
virtual bytes_opt value_for(const column_definition& cdef, const query_options& options) const {
throw exceptions::invalid_request_exception("Single value can be obtained from single-column restrictions only");
}
/**
* Returns <code>true</code> if one of the restrictions use the specified function.
*

View File

@@ -49,6 +49,7 @@
#include <boost/algorithm/cxx11/all_of.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/range/adaptor/map.hpp>
namespace cql3 {
@@ -62,6 +63,8 @@ class single_column_primary_key_restrictions : public primary_key_restrictions<V
using range_type = query::range<ValueType>;
using range_bound = typename range_type::bound;
using bounds_range_type = typename primary_key_restrictions<ValueType>::bounds_range_type;
template<typename OtherValueType>
friend class single_column_primary_key_restrictions;
private:
schema_ptr _schema;
bool _allow_filtering;
@@ -79,6 +82,27 @@ public:
, _in(false)
{ }
// Convert another primary key restrictions type into this type, possibly using different schema
template<typename OtherValueType>
explicit single_column_primary_key_restrictions(schema_ptr schema, const single_column_primary_key_restrictions<OtherValueType>& other)
: _schema(schema)
, _allow_filtering(other._allow_filtering)
, _restrictions(::make_shared<single_column_restrictions>(schema))
, _slice(other._slice)
, _contains(other._contains)
, _in(other._in)
{
for (const auto& entry : other._restrictions->restrictions()) {
const column_definition* other_cdef = entry.first;
const column_definition* this_cdef = _schema->get_column_definition(other_cdef->name());
if (!this_cdef) {
throw exceptions::invalid_request_exception(sprint("Base column %s not found in view index schema", other_cdef->name_as_text()));
}
::shared_ptr<single_column_restriction> restriction = entry.second;
_restrictions->add_restriction(restriction->apply_to(*this_cdef));
}
}
virtual bool is_on_token() const override {
return false;
}
@@ -99,6 +123,10 @@ public:
return _in;
}
virtual bool is_all_eq() const override {
return _restrictions->is_all_eq();
}
virtual bool has_bound(statements::bound b) const override {
return boost::algorithm::all_of(_restrictions->restrictions(), [b] (auto&& r) { return r.second->has_bound(b); });
}
@@ -137,6 +165,25 @@ public:
_restrictions->add_restriction(restriction);
}
virtual size_t prefix_size() const override {
return primary_key_restrictions<ValueType>::prefix_size(_schema);
}
::shared_ptr<single_column_primary_key_restrictions<clustering_key>> get_longest_prefix_restrictions() {
static_assert(std::is_same_v<ValueType, clustering_key>, "Only clustering key can produce longest prefix restrictions");
size_t current_prefix_size = prefix_size();
if (current_prefix_size == _restrictions->restrictions().size()) {
return dynamic_pointer_cast<single_column_primary_key_restrictions<clustering_key>>(this->shared_from_this());
}
auto longest_prefix_restrictions = ::make_shared<single_column_primary_key_restrictions<clustering_key>>(_schema, _allow_filtering);
auto restriction_it = _restrictions->restrictions().begin();
for (size_t i = 0; i < current_prefix_size; ++i) {
longest_prefix_restrictions->merge_with((restriction_it++)->second);
}
return longest_prefix_restrictions;
}
virtual void merge_with(::shared_ptr<restriction> restriction) override {
if (restriction->is_multi_column()) {
throw exceptions::invalid_request_exception(
@@ -309,6 +356,11 @@ public:
}
return res;
}
virtual bytes_opt value_for(const column_definition& cdef, const query_options& options) const override {
return _restrictions->value_for(cdef, options);
}
std::vector<bytes_opt> bounds(statements::bound b, const query_options& options) const override {
// TODO: if this proved to be required.
fail(unimplemented::cause::LEGACY_COMPOSITE_KEYS); // not 100% correct...
@@ -355,10 +407,11 @@ public:
}
virtual bool needs_filtering(const schema& schema) const override;
virtual unsigned int num_prefix_columns_that_need_not_be_filtered() const override;
};
template<>
dht::partition_range_vector
inline dht::partition_range_vector
single_column_primary_key_restrictions<partition_key>::bounds_ranges(const query_options& options) const {
dht::partition_range_vector ranges;
ranges.reserve(size());
@@ -376,7 +429,7 @@ single_column_primary_key_restrictions<partition_key>::bounds_ranges(const query
}
template<>
std::vector<query::clustering_range>
inline std::vector<query::clustering_range>
single_column_primary_key_restrictions<clustering_key_prefix>::bounds_ranges(const query_options& options) const {
auto wrapping_bounds = compute_bounds(options);
auto bounds = boost::copy_range<query::clustering_row_ranges>(wrapping_bounds
@@ -413,12 +466,12 @@ single_column_primary_key_restrictions<clustering_key_prefix>::bounds_ranges(con
}
template<>
bool single_column_primary_key_restrictions<partition_key>::needs_filtering(const schema& schema) const {
inline bool single_column_primary_key_restrictions<partition_key>::needs_filtering(const schema& schema) const {
return primary_key_restrictions<partition_key>::needs_filtering(schema);
}
template<>
bool single_column_primary_key_restrictions<clustering_key>::needs_filtering(const schema& schema) const {
inline bool single_column_primary_key_restrictions<clustering_key>::needs_filtering(const schema& schema) const {
// Restrictions currently need filtering in three cases:
// 1. any of them is a CONTAINS restriction
// 2. restrictions do not form a contiguous prefix (i.e. there are gaps in it)
@@ -435,6 +488,39 @@ bool single_column_primary_key_restrictions<clustering_key>::needs_filtering(con
return false;
}
// How many of the restrictions (in column order) do not need filtering
// because they are implemented as a slice (potentially, a contiguous disk
// read). For example, if we have the filter "c1 < 3 and c2 > 3", c1 does not
// need filtering but c2 does so num_prefix_columns_that_need_not_be_filtered
// will be 1.
// The implementation of num_prefix_columns_that_need_not_be_filtered() is
// closely tied to that of needs_filtering() above - basically, if only the
// first num_prefix_columns_that_need_not_be_filtered() restrictions existed,
// then needs_filtering() would have returned false.
template<>
inline unsigned single_column_primary_key_restrictions<clustering_key>::num_prefix_columns_that_need_not_be_filtered() const {
column_id position = 0;
unsigned int count = 0;
for (const auto& restriction : _restrictions->restrictions() | boost::adaptors::map_values) {
if (restriction->is_contains() || position != restriction->get_column_def().id) {
return count;
}
if (!restriction->is_slice()) {
position = restriction->get_column_def().id + 1;
}
count++;
}
return count;
}
template<>
inline unsigned single_column_primary_key_restrictions<partition_key>::num_prefix_columns_that_need_not_be_filtered() const {
// skip_filtering() is currently called only for clustering key
// restrictions, so it doesn't matter what we return here.
return 0;
}
}
}

View File

@@ -95,6 +95,7 @@ public:
virtual bool is_supported_by(const secondary_index::index& index) const = 0;
using abstract_restriction::is_satisfied_by;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const = 0;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) = 0;
#if 0
/**
* Check if this type of restriction is supported by the specified index.
@@ -169,6 +170,9 @@ public:
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<EQ>(cdef, _value);
}
#if 0
@Override
@@ -205,7 +209,18 @@ public:
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
throw std::logic_error("IN superclass should never be cloned directly");
}
virtual std::vector<bytes_opt> values_raw(const query_options& options) const = 0;
virtual std::vector<bytes_opt> values(const query_options& options) const override {
std::vector<bytes_opt> ret = values_raw(options);
std::sort(ret.begin(),ret.end());
ret.erase(std::unique(ret.begin(),ret.end()),ret.end());
return ret;
}
#if 0
@Override
protected final boolean isSupportedBy(SecondaryIndex index)
@@ -228,7 +243,7 @@ public:
return abstract_restriction::term_uses_function(_values, ks_name, function_name);
}
virtual std::vector<bytes_opt> values(const query_options& options) const override {
virtual std::vector<bytes_opt> values_raw(const query_options& options) const override {
std::vector<bytes_opt> ret;
for (auto&& v : _values) {
ret.emplace_back(to_bytes_opt(v->bind_and_get(options)));
@@ -239,6 +254,10 @@ public:
virtual sstring to_string() const override {
return sprint("IN(%s)", std::to_string(_values));
}
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<IN_with_values>(cdef, _values);
}
};
class single_column_restriction::IN_with_marker : public IN {
@@ -253,7 +272,7 @@ public:
return false;
}
virtual std::vector<bytes_opt> values(const query_options& options) const override {
virtual std::vector<bytes_opt> values_raw(const query_options& options) const override {
auto&& lval = dynamic_pointer_cast<multi_item_terminal>(_marker->bind(options));
if (!lval) {
throw exceptions::invalid_request_exception("Invalid null value for IN restriction");
@@ -264,6 +283,10 @@ public:
virtual sstring to_string() const override {
return "IN ?";
}
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<IN_with_marker>(cdef, _marker);
}
};
class single_column_restriction::slice : public single_column_restriction {
@@ -275,6 +298,11 @@ public:
, _slice(term_slice::new_instance(bound, inclusive, std::move(term)))
{ }
slice(const column_definition& column_def, term_slice slice)
: single_column_restriction(column_def)
, _slice(slice)
{ }
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return (_slice.has_bound(statements::bound::START) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::START), ks_name, function_name))
|| (_slice.has_bound(statements::bound::END) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::END), ks_name, function_name));
@@ -361,6 +389,9 @@ public:
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<slice>(cdef, _slice);
}
};
// This holds CONTAINS, CONTAINS_KEY, and map[key] = value restrictions because we might want to have any combination of them.
@@ -483,6 +514,9 @@ public:
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
throw std::logic_error("Cloning 'contains' restriction is not implemented.");
}
#if 0
private List<ByteBuffer> keys(const query_options& options) {

View File

@@ -111,6 +111,11 @@ public:
return r;
}
virtual bytes_opt value_for(const column_definition& cdef, const query_options& options) const override {
auto it = _restrictions.find(std::addressof(cdef));
return (it != _restrictions.end()) ? it->second->value(options) : bytes_opt{};
}
/**
* Returns the restriction associated to the specified column.
*

View File

@@ -72,6 +72,9 @@ public:
// throw? should not reach?
return {};
}
bytes_opt value_for(const column_definition& cdef, const query_options& options) const override {
return {};
}
std::vector<T> values_as_keys(const query_options& options) const override {
// throw? should not reach?
return {};
@@ -212,12 +215,13 @@ statement_restrictions::statement_restrictions(database& db,
auto& cf = db.find_column_family(schema);
auto& sim = cf.get_index_manager();
bool has_queriable_clustering_column_index = _clustering_columns_restrictions->has_supporting_index(sim);
bool has_queriable_pk_index = _partition_key_restrictions->has_supporting_index(sim);
bool has_queriable_index = has_queriable_clustering_column_index
|| _partition_key_restrictions->has_supporting_index(sim)
|| has_queriable_pk_index
|| _nonprimary_key_restrictions->has_supporting_index(sim);
// At this point, the select statement if fully constructed, but we still have a few things to validate
process_partition_key_restrictions(has_queriable_index, for_view, allow_filtering);
process_partition_key_restrictions(has_queriable_pk_index, for_view, allow_filtering);
// Some but not all of the partition key columns have been specified;
// hence we need turn these restrictions into index expressions.
@@ -237,7 +241,7 @@ statement_restrictions::statement_restrictions(database& db,
}
}
process_clustering_columns_restrictions(has_queriable_index, select_a_collection, for_view, allow_filtering);
process_clustering_columns_restrictions(has_queriable_clustering_column_index, select_a_collection, for_view, allow_filtering);
// Covers indexes on the first clustering column (among others).
if (_is_key_range && has_queriable_clustering_column_index) {
@@ -333,6 +337,52 @@ const std::vector<::shared_ptr<restrictions>>& statement_restrictions::index_res
return _index_restrictions;
}
std::optional<secondary_index::index> statement_restrictions::find_idx(secondary_index::secondary_index_manager& sim) const {
for (::shared_ptr<cql3::restrictions::restrictions> restriction : index_restrictions()) {
for (const auto& cdef : restriction->get_column_defs()) {
for (auto index : sim.list_indexes()) {
if (index.depends_on(*cdef)) {
return std::make_optional<secondary_index::index>(std::move(index));
}
}
}
}
return std::nullopt;
}
std::vector<const column_definition*> statement_restrictions::get_column_defs_for_filtering(database& db) const {
std::vector<const column_definition*> column_defs_for_filtering;
if (need_filtering()) {
auto& sim = db.find_column_family(_schema).get_index_manager();
std::optional<secondary_index::index> opt_idx = find_idx(sim);
auto column_uses_indexing = [&opt_idx] (const column_definition* cdef) {
return opt_idx && opt_idx->depends_on(*cdef);
};
if (_partition_key_restrictions->needs_filtering(*_schema)) {
for (auto&& cdef : _partition_key_restrictions->get_column_defs()) {
if (!column_uses_indexing(cdef)) {
column_defs_for_filtering.emplace_back(cdef);
}
}
}
if (_clustering_columns_restrictions->needs_filtering(*_schema)) {
column_id first_filtering_id = _schema->clustering_key_columns().begin()->id +
_clustering_columns_restrictions->num_prefix_columns_that_need_not_be_filtered();
for (auto&& cdef : _clustering_columns_restrictions->get_column_defs()) {
if (cdef->id >= first_filtering_id && !column_uses_indexing(cdef)) {
column_defs_for_filtering.emplace_back(cdef);
}
}
}
for (auto&& cdef : _nonprimary_key_restrictions->get_column_defs()) {
if (!column_uses_indexing(cdef)) {
column_defs_for_filtering.emplace_back(cdef);
}
}
}
return column_defs_for_filtering;
}
void statement_restrictions::process_partition_key_restrictions(bool has_queriable_index, bool for_view, bool allow_filtering) {
// If there is a queriable index, no special condition are required on the other restrictions.
// But we still need to know 2 things:
@@ -413,18 +463,28 @@ std::vector<query::clustering_range> statement_restrictions::get_clustering_boun
if (_clustering_columns_restrictions->empty()) {
return {query::clustering_range::make_open_ended_both_sides()};
}
// TODO(sarna): For filtering to work, clustering range is not bounded at all. For filtering to work faster,
// the biggest clustering prefix restriction should be used here.
if (_clustering_columns_restrictions->needs_filtering(*_schema)) {
if (auto single_ck_restrictions = dynamic_pointer_cast<single_column_primary_key_restrictions<clustering_key>>(_clustering_columns_restrictions)) {
return single_ck_restrictions->get_longest_prefix_restrictions()->bounds_ranges(options);
}
return {query::clustering_range::make_open_ended_both_sides()};
}
return _clustering_columns_restrictions->bounds_ranges(options);
}
bool statement_restrictions::need_filtering() const {
uint32_t number_of_restricted_columns = 0;
uint32_t number_of_restricted_columns_for_indexing = 0;
for (auto&& restrictions : _index_restrictions) {
number_of_restricted_columns += restrictions->size();
number_of_restricted_columns_for_indexing += restrictions->size();
}
int number_of_filtering_restrictions = _nonprimary_key_restrictions->size();
// If the whole partition key is restricted, it does not imply filtering
if (_partition_key_restrictions->has_unrestricted_components(*_schema) || !_partition_key_restrictions->is_all_eq()) {
number_of_filtering_restrictions += _partition_key_restrictions->size();
if (_clustering_columns_restrictions->has_unrestricted_components(*_schema)) {
number_of_filtering_restrictions += _clustering_columns_restrictions->size() - _clustering_columns_restrictions->prefix_size();
}
}
if (_partition_key_restrictions->is_multi_column() || _clustering_columns_restrictions->is_multi_column()) {
@@ -433,10 +493,11 @@ bool statement_restrictions::need_filtering() const {
return false;
}
return number_of_restricted_columns > 1
|| (number_of_restricted_columns == 0 && _partition_key_restrictions->empty() && !_clustering_columns_restrictions->empty())
|| (number_of_restricted_columns != 0 && _nonprimary_key_restrictions->has_multiple_contains())
|| (number_of_restricted_columns != 0 && !_uses_secondary_indexing);
return number_of_restricted_columns_for_indexing > 1
|| (number_of_restricted_columns_for_indexing == 0 && _partition_key_restrictions->empty() && !_clustering_columns_restrictions->empty())
|| (number_of_restricted_columns_for_indexing != 0 && _nonprimary_key_restrictions->has_multiple_contains())
|| (number_of_restricted_columns_for_indexing != 0 && !_uses_secondary_indexing)
|| (_uses_secondary_indexing && number_of_filtering_restrictions > 1);
}
void statement_restrictions::validate_secondary_index_selections(bool selects_only_static_columns) {
@@ -582,7 +643,8 @@ static query::range<bytes_view> to_range(const term_slice& slice, const query_op
if (!value) {
return { };
}
return { range_type::bound(*value, slice.is_inclusive(bound)) };
auto value_view = options.linearize(*value);
return { range_type::bound(value_view, slice.is_inclusive(bound)) };
};
return range_type(
extract_bound(statements::bound::START),
@@ -611,7 +673,7 @@ bool single_column_restriction::slice::is_satisfied_by(bytes_view data, const qu
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
return to_range(_slice, options).contains(data, _column_def.type->as_tri_comparator());
return to_range(_slice, options).contains(data, _column_def.type->underlying_type()->as_tri_comparator());
}
bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
@@ -647,10 +709,12 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!val) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
auto found = with_linearized(*val, [&] (bytes_view bv) {
return std::find_if(elements.begin(), end, [&] (auto&& element) {
return element.second.value().with_linearized([&] (bytes_view value_bv) {
return element_type->compare(value_bv, *val) == 0;
return element_type->compare(value_bv, bv) == 0;
});
});
});
if (found == end) {
return false;
@@ -661,8 +725,10 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!k) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, *k) == 0;
auto found = with_linearized(*k, [&] (bytes_view bv) {
return std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, bv) == 0;
});
});
if (found == end) {
return false;
@@ -674,14 +740,18 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!map_key || !map_value) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, *map_key) == 0;
auto found = with_linearized(*map_key, [&] (bytes_view map_key_bv) {
return std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, map_key_bv) == 0;
});
});
if (found == end) {
return false;
}
auto cmp = found->second.value().with_linearized([&] (bytes_view value_bv) {
return element_type->compare(value_bv, *map_value);
auto cmp = with_linearized(*map_value, [&] (bytes_view map_value_bv) {
return found->second.value().with_linearized([&] (bytes_view value_bv) {
return element_type->compare(value_bv, map_value_bv);
});
});
if (cmp != 0) {
return false;
@@ -698,13 +768,14 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
return _column_def.type->deserialize(cell_value_bv);
});
for (auto&& value : _values) {
auto val = value->bind_and_get(options);
if (!val) {
auto fragmented_val = value->bind_and_get(options);
if (!fragmented_val) {
continue;
}
return with_linearized(*fragmented_val, [&] (bytes_view val) {
auto exists_in = [&](auto&& range) {
auto found = std::find_if(range.begin(), range.end(), [&] (auto&& element) {
return element_type->compare(element.serialize(), *val) == 0;
return element_type->compare(element.serialize(), val) == 0;
});
return found != range.end();
};
@@ -722,6 +793,8 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
return false;
}
}
return true;
});
}
if (col_type->is_map()) {
auto& data_map = value_cast<map_type_impl::native_type>(deserialized);
@@ -730,8 +803,10 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!k) {
continue;
}
auto found = std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), *k) == 0;
auto found = with_linearized(*k, [&] (bytes_view k_bv) {
return std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), k_bv) == 0;
});
});
if (found == data_map.end()) {
return false;
@@ -743,10 +818,15 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!map_key || !map_value) {
continue;
}
auto found = std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), *map_key) == 0;
auto found = with_linearized(*map_key, [&] (bytes_view map_key_bv) {
return std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), map_key_bv) == 0;
});
});
if (found == data_map.end() || element_type->compare(found->second.serialize(), *map_value) != 0) {
if (found == data_map.end()
|| with_linearized(*map_value, [&] (bytes_view map_value_bv) {
return element_type->compare(found->second.serialize(), map_value_bv);
}) != 0) {
return false;
}
}

View File

@@ -163,6 +163,20 @@ public:
return _clustering_columns_restrictions;
}
/**
* Builds a possibly empty collection of column definitions that will be used for filtering
* @param db - the database context
* @return A list with the column definitions needed for filtering.
*/
std::vector<const column_definition*> get_column_defs_for_filtering(database& db) const;
/**
* Determines the index to be used with the restriction.
* @param db - the database context (for extracting index manager)
* @return If an index can be used, an optional containing this index, otherwise an empty optional.
*/
std::optional<secondary_index::index> find_idx(secondary_index::secondary_index_manager& sim) const;
/**
* Checks if the partition key has some unrestricted components.
* @return <code>true</code> if the partition key has some unrestricted components, <code>false</code> otherwise.

View File

@@ -45,27 +45,25 @@ namespace cql3 {
metadata::metadata(std::vector<::shared_ptr<column_specification>> names_)
: _flags(flag_enum_set())
, names(std::move(names_)) {
_column_count = names.size();
}
, _column_info(make_lw_shared<column_info>(std::move(names_)))
{ }
metadata::metadata(flag_enum_set flags, std::vector<::shared_ptr<column_specification>> names_, uint32_t column_count,
::shared_ptr<const service::pager::paging_state> paging_state)
: _flags(flags)
, names(std::move(names_))
, _column_count(column_count)
, _column_info(make_lw_shared<column_info>(std::move(names_), column_count))
, _paging_state(std::move(paging_state))
{ }
// The maximum number of values that the ResultSet can hold. This can be bigger than columnCount due to CASSANDRA-4911
uint32_t metadata::value_count() const {
return _flags.contains<flag::NO_METADATA>() ? _column_count : names.size();
return _flags.contains<flag::NO_METADATA>() ? _column_info->_column_count : _column_info->_names.size();
}
void metadata::add_non_serialized_column(::shared_ptr<column_specification> name) {
// See comment above. Because columnCount doesn't account the newly added name, it
// won't be serialized.
names.emplace_back(std::move(name));
_column_info->_names.emplace_back(std::move(name));
}
bool metadata::all_in_same_cf() const {
@@ -73,18 +71,21 @@ bool metadata::all_in_same_cf() const {
return false;
}
return column_specification::all_in_same_table(names);
return column_specification::all_in_same_table(_column_info->_names);
}
void metadata::set_has_more_pages(::shared_ptr<const service::pager::paging_state> paging_state) {
if (!paging_state) {
return;
}
void metadata::set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state) {
_flags.set<flag::HAS_MORE_PAGES>();
_paging_state = std::move(paging_state);
}
void metadata::maybe_set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state) {
assert(paging_state);
if (paging_state->get_remaining() > 0) {
set_paging_state(std::move(paging_state));
}
}
void metadata::set_skip_metadata() {
_flags.set<flag::NO_METADATA>();
}
@@ -93,18 +94,10 @@ metadata::flag_enum_set metadata::flags() const {
return _flags;
}
uint32_t metadata::column_count() const {
return _column_count;
}
::shared_ptr<const service::pager::paging_state> metadata::paging_state() const {
return _paging_state;
}
const std::vector<::shared_ptr<column_specification>>& metadata::get_names() const {
return names;
}
prepared_metadata::prepared_metadata(const std::vector<::shared_ptr<column_specification>>& names,
const std::vector<uint16_t>& partition_key_bind_indices)
: _names{names}

View File

@@ -70,18 +70,29 @@ public:
using flag_enum_set = enum_set<flag_enum>;
private:
flag_enum_set _flags;
public:
struct column_info {
// Please note that columnCount can actually be smaller than names, even if names is not null. This is
// used to include columns in the resultSet that we need to do post-query re-orderings
// (SelectStatement.orderResults) but that shouldn't be sent to the user as they haven't been requested
// (CASSANDRA-4911). So the serialization code will exclude any columns in name whose index is >= columnCount.
std::vector<::shared_ptr<column_specification>> names;
std::vector<::shared_ptr<column_specification>> _names;
uint32_t _column_count;
column_info(std::vector<::shared_ptr<column_specification>> names, uint32_t column_count)
: _names(std::move(names))
, _column_count(column_count)
{ }
explicit column_info(std::vector<::shared_ptr<column_specification>> names)
: _names(std::move(names))
, _column_count(_names.size())
{ }
};
private:
flag_enum_set _flags;
private:
uint32_t _column_count;
lw_shared_ptr<column_info> _column_info;
::shared_ptr<const service::pager::paging_state> _paging_state;
public:
@@ -99,17 +110,20 @@ private:
bool all_in_same_cf() const;
public:
void set_has_more_pages(::shared_ptr<const service::pager::paging_state> paging_state);
void set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state);
void maybe_set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state);
void set_skip_metadata();
flag_enum_set flags() const;
uint32_t column_count() const;
uint32_t column_count() const { return _column_info->_column_count; }
::shared_ptr<const service::pager::paging_state> paging_state() const;
const std::vector<::shared_ptr<column_specification>>& get_names() const;
const std::vector<::shared_ptr<column_specification>>& get_names() const {
return _column_info->_names;
}
};
::shared_ptr<const cql3::metadata> make_empty_metadata();
@@ -223,14 +237,14 @@ public:
class result {
std::unique_ptr<cql3::result_set> _result_set;
result_generator _result_generator;
shared_ptr<cql3::metadata> _metadata;
shared_ptr<const cql3::metadata> _metadata;
public:
explicit result(std::unique_ptr<cql3::result_set> rs)
: _result_set(std::move(rs))
, _metadata(_result_set->_metadata)
{ }
explicit result(result_generator generator, shared_ptr<metadata> m)
explicit result(result_generator generator, shared_ptr<const metadata> m)
: _result_generator(std::move(generator))
, _metadata(std::move(m))
{ }
@@ -240,7 +254,7 @@ public:
if (_result_set) {
return *_result_set;
} else {
auto builder = result_set::builder(_metadata);
auto builder = result_set::builder(make_shared<cql3::metadata>(*_metadata));
_result_generator.visit(builder);
return std::move(builder).get_result_set();
}

View File

@@ -40,6 +40,7 @@
*/
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include "cql3/selection/selection.hh"
#include "cql3/selection/selector_factories.hh"
@@ -155,9 +156,9 @@ public:
return _factories->uses_function(ks_name, function_name);
}
virtual uint32_t add_column_for_ordering(const column_definition& c) override {
uint32_t index = selection::add_column_for_ordering(c);
_factories->add_selector_for_ordering(c, index);
virtual uint32_t add_column_for_post_processing(const column_definition& c) override {
uint32_t index = selection::add_column_for_post_processing(c);
_factories->add_selector_for_post_processing(c, index);
return index;
}
@@ -208,9 +209,17 @@ protected:
::shared_ptr<selection> selection::wildcard(schema_ptr schema) {
auto columns = schema->all_columns_in_select_order();
auto cds = boost::copy_range<std::vector<const column_definition*>>(columns | boost::adaptors::transformed([](const column_definition& c) {
return &c;
}));
// filter out hidden columns, which should not be seen by the
// user when doing "SELECT *". We also disallow selecting them
// individually (see column_identifier::new_selector_factory()).
auto cds = boost::copy_range<std::vector<const column_definition*>>(
columns |
boost::adaptors::filtered([](const column_definition& c) {
return !c.is_view_virtual();
}) |
boost::adaptors::transformed([](const column_definition& c) {
return &c;
}));
return simple_selection::make(schema, std::move(cds), true);
}
@@ -218,7 +227,7 @@ protected:
return simple_selection::make(schema, std::move(columns), false);
}
uint32_t selection::add_column_for_ordering(const column_definition& c) {
uint32_t selection::add_column_for_post_processing(const column_definition& c) {
_columns.push_back(&c);
_metadata->add_non_serialized_column(c.column_specification);
return _columns.size() - 1;
@@ -330,14 +339,14 @@ std::unique_ptr<result_set> result_set_builder::build() {
return std::move(_result_set);
}
bool result_set_builder::restrictions_filter::operator()(const selection& selection,
bool result_set_builder::restrictions_filter::do_filter(const selection& selection,
const std::vector<bytes>& partition_key,
const std::vector<bytes>& clustering_key,
const query::result_row_view& static_row,
const query::result_row_view& row) const {
static logging::logger rlogger("restrictions_filter");
if (_current_pratition_key_does_not_match || _current_static_row_does_not_match) {
if (_current_partition_key_does_not_match || _current_static_row_does_not_match || _remaining == 0) {
return false;
}
@@ -350,11 +359,16 @@ bool result_set_builder::restrictions_filter::operator()(const selection& select
switch (cdef->kind) {
case column_kind::static_column:
// fallthrough
case column_kind::regular_column:
case column_kind::regular_column: {
auto& cell_iterator = (cdef->kind == column_kind::static_column) ? static_row_iterator : row_iterator;
if (cdef->type->is_multi_cell()) {
rlogger.debug("Multi-cell filtering is not implemented yet", cdef->name_as_text());
cell_iterator.next_collection_cell();
auto restr_it = non_pk_restrictions_map.find(cdef);
if (restr_it == non_pk_restrictions_map.end()) {
continue;
}
throw exceptions::invalid_request_exception("Collection filtering is not supported yet");
} else {
auto cell_iterator = (cdef->kind == column_kind::static_column) ? static_row_iterator : row_iterator;
auto cell = cell_iterator.next_atomic_cell();
auto restr_it = non_pk_restrictions_map.find(cdef);
@@ -365,17 +379,18 @@ bool result_set_builder::restrictions_filter::operator()(const selection& select
bool regular_restriction_matches;
if (cell) {
regular_restriction_matches = cell->value().with_linearized([&restriction](bytes_view data) {
return restriction.is_satisfied_by(data, cql3::query_options({ }));
regular_restriction_matches = cell->value().with_linearized([&restriction, this](bytes_view data) {
return restriction.is_satisfied_by(data, _options);
});
} else {
regular_restriction_matches = restriction.is_satisfied_by(bytes(), cql3::query_options({ }));
regular_restriction_matches = restriction.is_satisfied_by(bytes(), _options);
}
if (!regular_restriction_matches) {
_current_static_row_does_not_match = (cdef->kind == column_kind::static_column);
return false;
}
}
}
break;
case column_kind::partition_key: {
@@ -385,9 +400,9 @@ bool result_set_builder::restrictions_filter::operator()(const selection& select
}
restrictions::single_column_restriction& restriction = *restr_it->second;
const bytes& value_to_check = partition_key[cdef->id];
bool pk_restriction_matches = restriction.is_satisfied_by(value_to_check, cql3::query_options({ }));
bool pk_restriction_matches = restriction.is_satisfied_by(value_to_check, _options);
if (!pk_restriction_matches) {
_current_pratition_key_does_not_match = true;
_current_partition_key_does_not_match = true;
return false;
}
}
@@ -399,7 +414,7 @@ bool result_set_builder::restrictions_filter::operator()(const selection& select
}
restrictions::single_column_restriction& restriction = *restr_it->second;
const bytes& value_to_check = clustering_key[cdef->id];
bool pk_restriction_matches = restriction.is_satisfied_by(value_to_check, cql3::query_options({ }));
bool pk_restriction_matches = restriction.is_satisfied_by(value_to_check, _options);
if (!pk_restriction_matches) {
return false;
}
@@ -412,6 +427,20 @@ bool result_set_builder::restrictions_filter::operator()(const selection& select
return true;
}
bool result_set_builder::restrictions_filter::operator()(const selection& selection,
const std::vector<bytes>& partition_key,
const std::vector<bytes>& clustering_key,
const query::result_row_view& static_row,
const query::result_row_view& row) const {
const bool accepted = do_filter(selection, partition_key, clustering_key, static_row, row);
if (!accepted) {
++_rows_dropped;
} else if (_remaining > 0) {
--_remaining;
}
return accepted;
}
api::timestamp_type result_set_builder::timestamp_of(size_t idx) {
return _timestamps[idx];
}

View File

@@ -169,10 +169,14 @@ public:
return _metadata;
}
::shared_ptr<metadata> get_result_metadata() {
return _metadata;
}
static ::shared_ptr<selection> wildcard(schema_ptr schema);
static ::shared_ptr<selection> for_columns(schema_ptr schema, std::vector<const column_definition*> columns);
virtual uint32_t add_column_for_ordering(const column_definition& c);
virtual uint32_t add_column_for_post_processing(const column_definition& c);
virtual bool uses_function(const sstring &ks_name, const sstring& function_name) const {
return false;
@@ -255,19 +259,31 @@ public:
}
void reset() {
}
uint32_t get_rows_dropped() const {
return 0;
}
};
class restrictions_filter {
::shared_ptr<restrictions::statement_restrictions> _restrictions;
mutable bool _current_pratition_key_does_not_match = false;
const query_options& _options;
mutable bool _current_partition_key_does_not_match = false;
mutable bool _current_static_row_does_not_match = false;
mutable uint32_t _rows_dropped = 0;
mutable uint32_t _remaining = 0;
public:
restrictions_filter() = default;
explicit restrictions_filter(::shared_ptr<restrictions::statement_restrictions> restrictions) : _restrictions(restrictions) {}
explicit restrictions_filter(::shared_ptr<restrictions::statement_restrictions> restrictions, const query_options& options, uint32_t remaining) : _restrictions(restrictions), _options(options), _remaining(remaining) {}
bool operator()(const selection& selection, const std::vector<bytes>& pk, const std::vector<bytes>& ck, const query::result_row_view& static_row, const query::result_row_view& row) const;
void reset() {
_current_pratition_key_does_not_match = false;
_current_partition_key_does_not_match = false;
_current_static_row_does_not_match = false;
_rows_dropped = 0;
}
uint32_t get_rows_dropped() const {
return _rows_dropped;
}
private:
bool do_filter(const selection& selection, const std::vector<bytes>& pk, const std::vector<bytes>& ck, const query::result_row_view& static_row, const query::result_row_view& row) const;
};
result_set_builder(const selection& s, gc_clock::time_point now, cql_serialization_format sf);
@@ -367,7 +383,7 @@ public:
}
}
void accept_partition_end(const query::result_row_view& static_row) {
uint32_t accept_partition_end(const query::result_row_view& static_row) {
if (_row_count == 0) {
_builder.new_row();
auto static_row_iterator = static_row.iterator();
@@ -381,6 +397,7 @@ public:
}
}
}
return _filter.get_rows_dropped();
}
};

View File

@@ -105,9 +105,11 @@ public:
virtual void reset() = 0;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver) override {
if (receiver->type == get_type()) {
auto t1 = receiver->type->underlying_type();
auto t2 = get_type()->underlying_type();
if (t1 == t2) {
return assignment_testable::test_result::EXACT_MATCH;
} else if (receiver->type->is_value_compatible_with(*get_type())) {
} else if (t1->is_value_compatible_with(*t2)) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
} else {
return assignment_testable::test_result::NOT_ASSIGNABLE;

View File

@@ -53,6 +53,7 @@ selector_factories::selector_factories(std::vector<::shared_ptr<selectable>> sel
: _contains_write_time_factory(false)
, _contains_ttl_factory(false)
, _number_of_aggregate_factories(0)
, _number_of_factories_for_post_processing(0)
{
_factories.reserve(selectables.size());
@@ -76,8 +77,9 @@ bool selector_factories::uses_function(const sstring& ks_name, const sstring& fu
return false;
}
void selector_factories::add_selector_for_ordering(const column_definition& def, uint32_t index) {
void selector_factories::add_selector_for_post_processing(const column_definition& def, uint32_t index) {
_factories.emplace_back(simple_selector::new_factory(def.name_as_text(), index, def.type));
++_number_of_factories_for_post_processing;
}
std::vector<::shared_ptr<selector>> selector_factories::new_instances() const {

View File

@@ -74,6 +74,11 @@ private:
*/
uint32_t _number_of_aggregate_factories;
/**
* The number of factories that are only for post processing.
*/
uint32_t _number_of_factories_for_post_processing;
public:
/**
* Creates a new <code>SelectorFactories</code> instance and collect the column definitions.
@@ -97,11 +102,12 @@ public:
bool uses_function(const sstring& ks_name, const sstring& function_name) const;
/**
* Adds a new <code>Selector.Factory</code> for a column that is needed only for ORDER BY purposes.
* Adds a new <code>Selector.Factory</code> for a column that is needed only for ORDER BY or post
* processing purposes.
* @param def the column that is needed for ordering
* @param index the index of the column definition in the Selection's list of columns
*/
void add_selector_for_ordering(const column_definition& def, uint32_t index);
void add_selector_for_post_processing(const column_definition& def, uint32_t index);
/**
* Checks if this <code>SelectorFactories</code> contains only factories for aggregates.
@@ -111,7 +117,7 @@ public:
*/
bool contains_only_aggregate_functions() const {
auto size = _factories.size();
return size != 0 && _number_of_aggregate_factories == size;
return size != 0 && _number_of_aggregate_factories == (size - _number_of_factories_for_post_processing);
}
/**

View File

@@ -120,17 +120,19 @@ sets::literal::to_string() const {
}
sets::value
sets::value::from_serialized(bytes_view v, set_type type, cql_serialization_format sf) {
sets::value::from_serialized(const fragmented_temporary_buffer::view& val, set_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserializeForNativeProtocol?!
return with_linearized(val, [&] (bytes_view v) {
auto s = value_cast<set_type_impl::native_type>(type->deserialize(v, sf));
std::set<bytes, serialized_compare> elements(type->get_elements_type()->as_less_comparator());
for (auto&& element : s) {
elements.insert(elements.end(), type->get_elements_type()->decompose(element));
}
return value(std::move(elements));
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -198,10 +200,10 @@ sets::delayed_value::bind(const query_options& options) {
return constants::UNSET_VALUE;
}
// We don't support value > 64K because the serialization format encode the length as an unsigned short.
if (b->size() > std::numeric_limits<uint16_t>::max()) {
if (b->size_bytes() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Set value is too long. Set values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(),
b->size()));
b->size_bytes()));
}
buffers.insert(buffers.end(), std::move(to_bytes(*b)));
@@ -269,7 +271,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
}
for (auto&& e : set_value->_elements) {
mut.cells.emplace_back(e, params.make_cell(*set_type->value_comparator(), {}, atomic_cell::collection_member::yes));
mut.cells.emplace_back(e, params.make_cell(*set_type->value_comparator(), bytes_view(), atomic_cell::collection_member::yes));
}
auto smut = set_type->serialize_mutation_form(mut);
@@ -279,7 +281,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
auto v = set_type->serialize_partially_deserialized_form(
{set_value->_elements.begin(), set_value->_elements.end()},
cql_serialization_format::internal());
m.set_cell(row_key, column, params.make_cell(*column.type, std::move(v)));
m.set_cell(row_key, column, params.make_cell(*column.type, fragmented_temporary_buffer::view(v)));
} else {
m.set_cell(row_key, column, params.make_dead_cell());
}

View File

@@ -78,7 +78,7 @@ public:
value(std::set<bytes, serialized_compare> elements)
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, set_type type, cql_serialization_format sf);
static value from_serialized(const fragmented_temporary_buffer::view& v, set_type type, cql_serialization_format sf);
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(set_type st, const value& v);

View File

@@ -101,13 +101,6 @@ single_column_relation::to_receivers(schema_ptr schema, const column_definition&
}
if (is_IN()) {
// For partition keys we only support IN for the last name so far
if (column_def.is_partition_key() && !schema->is_last_partition_key(column_def)) {
throw exceptions::invalid_request_exception(sprint(
"Partition KEY part %s cannot be restricted by IN relation (only the last part of the partition key can)",
column_def.name_as_text()));
}
// We only allow IN on the row key and the clustering key so far, never on non-PK columns, and this even if
// there's an index
// Note: for backward compatibility reason, we conside a IN of 1 value the same as a EQ, so we let that

View File

@@ -246,18 +246,22 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
cfm.with_column(column_name->name(), type, _is_static ? column_kind::static_column : column_kind::regular_column);
// Adding a column to a table which has an include all view requires the column to be added to the view
// as well. If the view has a regular base column in its PK, then the column ID needs to be updated in
// view_info; for that, rebuild the schema.
// Adding a column to a base table always requires updating the view
// schemas: If the view includes all columns it should include the new
// column, but if it doesn't, it may need to include the new
// unselected column as a virtual column. The case when it we
// shouldn't add a virtual column is when the view has in its PK one
// of the base's regular columns - but even in this case we need to
// rebuild the view schema, to update the column ID.
if (!_is_static) {
for (auto&& view : cf.views()) {
if (view->view_info()->include_all_columns() || view->view_info()->base_non_pk_column_in_view_pk()) {
schema_builder builder(view);
if (view->view_info()->include_all_columns()) {
builder.with_column(column_name->name(), type);
}
view_updates.push_back(view_ptr(builder.build()));
schema_builder builder(view);
if (view->view_info()->include_all_columns()) {
builder.with_column(column_name->name(), type);
} else if (!view->view_info()->base_non_pk_column_in_view_pk()) {
db::view::create_virtual_column(builder, column_name->name(), type);
}
view_updates.push_back(view_ptr(builder.build()));
}
}
@@ -272,7 +276,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
auto type = validate_alter(schema, *def, *validator);
// In any case, we update the column definition
cfm.with_altered_column_type(column_name->name(), type);
cfm.alter_column_type(column_name->name(), type);
// We also have to validate the view types here. If we have a view which includes a column as part of
// the clustering key, we need to make sure that it is indeed compatible.
@@ -281,7 +285,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
if (view_def) {
schema_builder builder(view);
auto view_type = validate_alter(view, *view_def, *validator);
builder.with_altered_column_type(column_name->name(), std::move(view_type));
builder.alter_column_type(column_name->name(), std::move(view_type));
view_updates.push_back(view_ptr(builder.build()));
}
}
@@ -302,7 +306,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
} else {
for (auto&& column_def : boost::range::join(schema->static_columns(), schema->regular_columns())) { // find
if (column_def.name() == column_name->name()) {
cfm.without_column(column_name->name());
cfm.remove_column(column_name->name());
break;
}
}
@@ -345,9 +349,10 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
auto to = entry.second->prepare_column_identifier(schema);
validate_column_rename(db, *schema, *from, *to);
cfm.with_column_rename(from->name(), to->name());
cfm.rename_column(from->name(), to->name());
// If the view includes a renamed column, it must be renamed in the view table and the definition.
// If the view includes a renamed column, it must be renamed in
// the view table and the definition.
for (auto&& view : cf.views()) {
if (view->get_column_definition(from->name())) {
schema_builder builder(view);
@@ -355,7 +360,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
auto view_from = entry.first->prepare_column_identifier(view);
auto view_to = entry.second->prepare_column_identifier(view);
validate_column_rename(db, *view, *view_from, *view_to);
builder.with_column_rename(view_from->name(), view_to->name());
builder.rename_column(view_from->name(), view_to->name());
auto new_where = util::rename_column_in_where_clause(
view->view_info()->where_clause(),

View File

@@ -110,7 +110,7 @@ void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, b
if (t_opt) {
modified = true;
// We need to update this column
cfm.with_altered_column_type(column.name(), *t_opt);
cfm.alter_column_type(column.name(), *t_opt);
}
}
if (modified) {

View File

@@ -191,20 +191,20 @@ const std::vector<batch_statement::single_statement>& batch_statement::get_state
return _statements;
}
future<std::vector<mutation>> batch_statement::get_mutations(service::storage_proxy& storage, const query_options& options, bool local, api::timestamp_type now, tracing::trace_state_ptr trace_state) {
future<std::vector<mutation>> batch_statement::get_mutations(service::storage_proxy& storage, const query_options& options, db::timeout_clock::time_point timeout, bool local, api::timestamp_type now, tracing::trace_state_ptr trace_state) {
// Do not process in parallel because operations like list append/prepend depend on execution order.
using mutation_set_type = std::unordered_set<mutation, mutation_hash_by_key, mutation_equals_by_key>;
return do_with(mutation_set_type(), [this, &storage, &options, now, local, trace_state] (auto& result) {
return do_with(mutation_set_type(), [this, &storage, &options, timeout, now, local, trace_state] (auto& result) {
result.reserve(_statements.size());
_stats.statements_in_batches += _statements.size();
return do_for_each(boost::make_counting_iterator<size_t>(0),
boost::make_counting_iterator<size_t>(_statements.size()),
[this, &storage, &options, now, local, &result, trace_state] (size_t i) {
[this, &storage, &options, now, local, &result, timeout, trace_state] (size_t i) {
auto&& statement = _statements[i].statement;
statement->inc_cql_stats();
auto&& statement_options = options.for_statement(i);
auto timestamp = _attrs->get_timestamp(now, statement_options);
return statement->get_mutations(storage, statement_options, local, timestamp, trace_state).then([&result] (auto&& more) {
return statement->get_mutations(storage, statement_options, timeout, local, timestamp, trace_state).then([&result] (auto&& more) {
for (auto&& m : more) {
// We want unordered_set::try_emplace(), but we don't have it
auto pos = result.find(m);
@@ -293,8 +293,9 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
return execute_with_conditions(storage, options, query_state);
}
return get_mutations(storage, options, local, now, query_state.get_trace_state()).then([this, &storage, &options, tr_state = query_state.get_trace_state()] (std::vector<mutation> ms) mutable {
return execute_without_conditions(storage, std::move(ms), options.get_consistency(), std::move(tr_state));
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
return get_mutations(storage, options, timeout, local, now, query_state.get_trace_state()).then([this, &storage, &options, timeout, tr_state = query_state.get_trace_state()] (std::vector<mutation> ms) mutable {
return execute_without_conditions(storage, std::move(ms), options.get_consistency(), timeout, std::move(tr_state));
}).then([] {
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(
make_shared<cql_transport::messages::result_message::void_message>());
@@ -305,6 +306,7 @@ future<> batch_statement::execute_without_conditions(
service::storage_proxy& storage,
std::vector<mutation> mutations,
db::consistency_level cl,
db::timeout_clock::time_point timeout,
tracing::trace_state_ptr tr_state)
{
// FIXME: do we need to do this?
@@ -332,7 +334,7 @@ future<> batch_statement::execute_without_conditions(
mutate_atomic = false;
}
}
return storage.mutate_with_triggers(std::move(mutations), cl, mutate_atomic, std::move(tr_state));
return storage.mutate_with_triggers(std::move(mutations), cl, timeout, mutate_atomic, std::move(tr_state));
}
future<shared_ptr<cql_transport::messages::result_message>> batch_statement::execute_with_conditions(

View File

@@ -125,7 +125,7 @@ public:
const std::vector<single_statement>& get_statements();
private:
future<std::vector<mutation>> get_mutations(service::storage_proxy& storage, const query_options& options, bool local, api::timestamp_type now, tracing::trace_state_ptr trace_state);
future<std::vector<mutation>> get_mutations(service::storage_proxy& storage, const query_options& options, db::timeout_clock::time_point timeout, bool local, api::timestamp_type now, tracing::trace_state_ptr trace_state);
public:
/**
@@ -147,6 +147,7 @@ private:
service::storage_proxy& storage,
std::vector<mutation> mutations,
db::consistency_level cl,
db::timeout_clock::time_point timeout,
tracing::trace_state_ptr tr_state);
future<shared_ptr<cql_transport::messages::result_message>> execute_with_conditions(

View File

@@ -88,6 +88,11 @@ create_index_statement::validate(service::storage_proxy& proxy, const service::c
throw exceptions::invalid_request_exception("Secondary indexes are not supported on materialized views");
}
if (schema->is_dense()) {
throw exceptions::invalid_request_exception(
"Secondary indexes are not supported on COMPACT STORAGE tables that have clustering columns");
}
std::vector<::shared_ptr<index_target>> targets;
for (auto& raw_target : _raw_targets) {
targets.emplace_back(raw_target->prepare(schema));
@@ -109,6 +114,11 @@ create_index_statement::validate(service::storage_proxy& proxy, const service::c
sprint("No column definition found for column %s", *target->column));
}
//NOTICE(sarna): Should be lifted after resolving issue #2963
if (cd->is_static()) {
throw exceptions::invalid_request_exception("Indexing static columns is not implemented yet.");
}
if (cd->type->references_duration()) {
using request_validations::check_false;
const auto& ty = *cd->type;
@@ -122,8 +132,7 @@ create_index_statement::validate(service::storage_proxy& proxy, const service::c
}
// Origin TODO: we could lift that limitation
if ((schema->is_dense() || !schema->thrift().has_compound_comparator()) &&
cd->kind != column_kind::regular_column) {
if ((schema->is_dense() || !schema->thrift().has_compound_comparator()) && cd->is_primary_key()) {
throw exceptions::invalid_request_exception(
"Secondary indexes are not supported on PRIMARY KEY columns in COMPACT STORAGE tables");
}
@@ -137,10 +146,15 @@ create_index_statement::validate(service::storage_proxy& proxy, const service::c
bool is_map = dynamic_cast<const collection_type_impl *>(cd->type.get()) != nullptr
&& dynamic_cast<const collection_type_impl *>(cd->type.get())->is_map();
bool is_frozen_collection = cd->type->is_collection() && !cd->type->is_multi_cell();
bool is_collection = cd->type->is_collection();
bool is_frozen_collection = is_collection && !cd->type->is_multi_cell();
if (is_frozen_collection) {
validate_for_frozen_collection(target);
} else if (is_collection) {
// NOTICE(sarna): should be lifted after #2962 (indexes on non-frozen collections) is implemented
throw exceptions::invalid_request_exception(
sprint("Cannot create secondary index on non-frozen collection column %s", cd->name_as_text()));
} else {
validate_not_full_index(target);
validate_is_values_index_if_target_column_not_collection(cd, target);

View File

@@ -84,7 +84,6 @@ create_view_statement::create_view_statement(
, _clustering_keys{clustering_keys}
, _if_not_exists{if_not_exists}
{
service::get_local_storage_proxy().get_db().local().get_config().check_experimental("Creating materialized views");
if (!service::get_local_storage_service().cluster_supports_materialized_views()) {
throw exceptions::invalid_request_exception("Can't create materialized views until the whole cluster has been upgraded");
}
@@ -275,6 +274,7 @@ future<shared_ptr<cql_transport::event::schema_change>> create_view_statement::a
std::vector<const column_definition*> missing_pk_columns;
std::vector<const column_definition*> target_non_pk_columns;
std::vector<const column_definition*> unselected_columns;
// We need to include all of the primary key columns from the base table in order to make sure that we do not
// overwrite values in the view. We cannot support "collapsing" the base table into a smaller number of rows in
@@ -292,6 +292,9 @@ future<shared_ptr<cql_transport::event::schema_change>> create_view_statement::a
if (included_def && !def_in_target_pk) {
target_non_pk_columns.push_back(&def);
}
if (!included_def && !def_in_target_pk && !def.is_static()) {
unselected_columns.push_back(&def);
}
if (def.is_primary_key() && !def_in_target_pk) {
missing_pk_columns.push_back(&def);
}
@@ -311,6 +314,27 @@ future<shared_ptr<cql_transport::event::schema_change>> create_view_statement::a
throw exceptions::invalid_request_exception(sprint("No columns are defined for Materialized View other than primary key"));
}
// The unique feature of a filter by a non-key column is that the
// value of such column can be updated - and also be expired with TTL
// and cause the view row to appear and disappear. We don't currently
// support support this case - see issue #3430, and neither does
// Cassandra - see see CASSANDRA-13798 and CASSANDRA-13832.
// Actually, as CASSANDRA-13798 explains, the problem is "the liveness of
// view row is now depending on multiple base columns (multiple filtered
// non-pk base column + base column used in view pk)". When the filtered
// column *is* the base column added to the view pk, we don't have this
// problem. And this case actually works correctly.
auto non_pk_restrictions = restrictions->get_non_pk_restriction();
if (non_pk_restrictions.size() == 1 && has_non_pk_column &&
std::find(target_primary_keys.begin(), target_primary_keys.end(), non_pk_restrictions.cbegin()->first) != target_primary_keys.end()) {
// This case (filter by new PK column of the view) works, as explained above
} else if (!non_pk_restrictions.empty()) {
auto column_names = ::join(", ", non_pk_restrictions | boost::adaptors::map_keys | boost::adaptors::transformed(std::mem_fn(&column_definition::name_as_text)));
throw exceptions::invalid_request_exception(sprint(
"Non-primary key columns cannot be restricted in the SELECT statement used for materialized view %s creation (got restrictions on: %s)",
column_family(), column_names));
}
schema_builder builder{keyspace(), column_family()};
auto add_columns = [this, &builder] (std::vector<const column_definition*>& defs, column_kind kind) mutable {
for (auto* def : defs) {
@@ -321,6 +345,19 @@ future<shared_ptr<cql_transport::event::schema_change>> create_view_statement::a
add_columns(target_partition_keys, column_kind::partition_key);
add_columns(target_clustering_keys, column_kind::clustering_key);
add_columns(target_non_pk_columns, column_kind::regular_column);
// Add all unselected columns (base-table columns which are not selected
// in the view) as "virtual columns" - columns which have timestamp and
// ttl information, but an empty value. These are needed to keep view
// rows alive when the base row is alive, even if the view row has no
// data, just a key (see issue #3362). The virtual columns are not needed
// when the view pk adds a regular base column (i.e., has_non_pk_column)
// because in that case, the liveness of that base column is what
// determines the liveness of the view row.
if (!has_non_pk_column) {
for (auto* def : unselected_columns) {
db::view::create_virtual_column(builder, def->name(), def->type);
}
}
_properties.properties()->apply_to_builder(builder, proxy.get_db().local().get_config().extensions());
if (builder.default_time_to_live().count() > 0) {

View File

@@ -49,7 +49,7 @@ void cql3::statements::index_prop_defs::validate() {
property_definitions::validate(keywords);
if (is_custom && !custom_class) {
throw exceptions::invalid_request_exception("CUSTOM index requires specifiying the index class");
throw exceptions::invalid_request_exception("CUSTOM index requires specifying the index class");
}
if (!is_custom && custom_class) {
@@ -64,6 +64,16 @@ void cql3::statements::index_prop_defs::validate() {
sprint("Cannot specify %s as a CUSTOM option",
db::index::secondary_index::custom_index_option_name));
}
// Currently, Scylla does not support *any* class of custom index
// implementation. If in the future we do (e.g., SASI, or something
// new), we'll need to check for valid values here.
if (is_custom && custom_class) {
throw exceptions::invalid_request_exception(
format("Unsupported CUSTOM INDEX class {}. Note that currently, Scylla does not support SASI or any other CUSTOM INDEX class.",
*custom_class));
}
}
index_options_map

View File

@@ -160,11 +160,11 @@ future<> modification_statement::check_access(const service::client_state& state
}
future<std::vector<mutation>>
modification_statement::get_mutations(service::storage_proxy& proxy, const query_options& options, bool local, int64_t now, tracing::trace_state_ptr trace_state) {
modification_statement::get_mutations(service::storage_proxy& proxy, const query_options& options, db::timeout_clock::time_point timeout, bool local, int64_t now, tracing::trace_state_ptr trace_state) {
auto json_cache = maybe_prepare_json_cache(options);
auto keys = make_lw_shared(build_partition_keys(options, json_cache));
auto ranges = make_lw_shared(create_clustering_ranges(options, json_cache));
return make_update_parameters(proxy, keys, ranges, options, local, now, std::move(trace_state)).then(
return make_update_parameters(proxy, keys, ranges, options, timeout, local, now, std::move(trace_state)).then(
[this, keys, ranges, now, json_cache = std::move(json_cache)] (auto params_ptr) {
std::vector<mutation> mutations;
mutations.reserve(keys->size());
@@ -186,10 +186,11 @@ modification_statement::make_update_parameters(
lw_shared_ptr<dht::partition_range_vector> keys,
lw_shared_ptr<query::clustering_row_ranges> ranges,
const query_options& options,
db::timeout_clock::time_point timeout,
bool local,
int64_t now,
tracing::trace_state_ptr trace_state) {
return read_required_rows(proxy, *keys, std::move(ranges), local, options, std::move(trace_state)).then(
return read_required_rows(proxy, *keys, std::move(ranges), local, options, timeout, std::move(trace_state)).then(
[this, &options, now] (auto rows) {
return make_ready_future<std::unique_ptr<update_parameters>>(
std::make_unique<update_parameters>(s, options,
@@ -275,6 +276,7 @@ modification_statement::read_required_rows(
lw_shared_ptr<query::clustering_row_ranges> ranges,
bool local,
const query_options& options,
db::timeout_clock::time_point timeout,
tracing::trace_state_ptr trace_state) {
if (!requires_read()) {
return make_ready_future<update_parameters::prefetched_rows_type>(
@@ -308,7 +310,6 @@ modification_statement::read_required_rows(
query::partition_slice::option::collections_as_maps>());
query::read_command cmd(s->id(), s->version(), ps, std::numeric_limits<uint32_t>::max());
// FIXME: ignoring "local"
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
return proxy.query(s, make_lw_shared(std::move(cmd)), std::move(keys),
cl, {timeout, std::move(trace_state)}).then([this, ps] (auto qr) {
return query::result_view::do_with(*qr.query_result, [&] (query::result_view v) {
@@ -408,12 +409,13 @@ modification_statement::execute_without_condition(service::storage_proxy& proxy,
db::validate_for_write(s->ks_name(), cl);
}
return get_mutations(proxy, options, false, options.get_timestamp(qs), qs.get_trace_state()).then([this, cl, &proxy, &qs] (auto mutations) {
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
return get_mutations(proxy, options, timeout, false, options.get_timestamp(qs), qs.get_trace_state()).then([this, cl, timeout, &proxy, &qs] (auto mutations) {
if (mutations.empty()) {
return now();
}
return proxy.mutate_with_triggers(std::move(mutations), cl, false, qs.get_trace_state(), this->is_raw_counter_shard_write());
return proxy.mutate_with_triggers(std::move(mutations), cl, timeout, false, qs.get_trace_state(), this->is_raw_counter_shard_write());
});
}

View File

@@ -206,6 +206,7 @@ protected:
lw_shared_ptr<query::clustering_row_ranges> ranges,
bool local,
const query_options& options,
db::timeout_clock::time_point now,
tracing::trace_state_ptr trace_state);
private:
future<::shared_ptr<cql_transport::messages::result_message>>
@@ -349,7 +350,7 @@ public:
* @return vector of the mutations
* @throws invalid_request_exception on invalid requests
*/
future<std::vector<mutation>> get_mutations(service::storage_proxy& proxy, const query_options& options, bool local, int64_t now, tracing::trace_state_ptr trace_state);
future<std::vector<mutation>> get_mutations(service::storage_proxy& proxy, const query_options& options, db::timeout_clock::time_point timeout, bool local, int64_t now, tracing::trace_state_ptr trace_state);
public:
future<std::unique_ptr<update_parameters>> make_update_parameters(
@@ -357,6 +358,7 @@ public:
lw_shared_ptr<dht::partition_range_vector> keys,
lw_shared_ptr<query::clustering_row_ranges> ranges,
const query_options& options,
db::timeout_clock::time_point timeout,
bool local,
int64_t now,
tracing::trace_state_ptr trace_state);

View File

@@ -87,6 +87,7 @@ private:
::shared_ptr<attributes::raw> _attrs;
::shared_ptr<term::raw> _json_value;
bool _if_not_exists;
bool _default_unset;
public:
/**
* A parsed <code>INSERT JSON</code> statement.
@@ -95,7 +96,7 @@ public:
* @param json_value JSON string representing names and values
* @param attrs additional attributes for statement (CL, timestamp, timeToLive)
*/
insert_json_statement(::shared_ptr<cf_name> name, ::shared_ptr<attributes::raw> attrs, ::shared_ptr<term::raw> json_value, bool if_not_exists);
insert_json_statement(::shared_ptr<cf_name> name, ::shared_ptr<attributes::raw> attrs, ::shared_ptr<term::raw> json_value, bool if_not_exists, bool default_unset);
virtual ::shared_ptr<cql3::statements::modification_statement> prepare_internal(database& db, schema_ptr schema,
::shared_ptr<variable_specifications> bound_names, std::unique_ptr<attributes> attrs, cql_stats& stats) override;

View File

@@ -141,6 +141,10 @@ private:
/** If ALLOW FILTERING was not specified, this verifies that it is not needed */
void check_needs_filtering(::shared_ptr<restrictions::statement_restrictions> restrictions);
void ensure_filtering_columns_retrieval(database& db,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions);
bool contains_alias(::shared_ptr<column_identifier> name);
::shared_ptr<column_specification> limit_receiver();

View File

@@ -45,6 +45,7 @@
#include "transport/messages/result_message.hh"
#include "cql3/selection/selection.hh"
#include "cql3/util.hh"
#include "cql3/restrictions/single_column_primary_key_restrictions.hh"
#include "core/shared_ptr.hh"
#include "query-result-reader.hh"
#include "query_result_merger.hh"
@@ -96,12 +97,8 @@ public:
encoded_row.write("\\\"", 2);
}
encoded_row.write("\": ", 3);
if (parameters[i]) {
sstring row_sstring = _selector_types[i]->to_json_string(parameters[i].value());
encoded_row.write(row_sstring.c_str(), row_sstring.size());
} else {
encoded_row.write("null", 4);
}
sstring row_sstring = _selector_types[i]->to_json_string(parameters[i]);
encoded_row.write(row_sstring.c_str(), row_sstring.size());
}
encoded_row.write("}", 1);
return encoded_row.linearize().to_string();
@@ -316,13 +313,14 @@ select_statement::make_partition_slice(const query_options& options)
if (_is_reversed) {
_opts.set(query::partition_slice::option::reversed);
std::reverse(bounds.begin(), bounds.end());
++_stats.reverse_queries;
}
return query::partition_slice(std::move(bounds),
std::move(static_columns), std::move(regular_columns), _opts, nullptr, options.get_cql_serialization_format());
}
int32_t select_statement::get_limit(const query_options& options) const {
if (!_limit) {
if (!_limit || _selection->is_aggregate()) {
return std::numeric_limits<int32_t>::max();
}
@@ -333,9 +331,10 @@ int32_t select_statement::get_limit(const query_options& options) const {
if (val.is_unset_value()) {
return std::numeric_limits<int32_t>::max();
}
return with_linearized(*val, [&] (bytes_view bv) {
try {
int32_type->validate(*val);
auto l = value_cast<int32_t>(int32_type->deserialize(*val));
int32_type->validate(bv);
auto l = value_cast<int32_t>(int32_type->deserialize(bv));
if (l <= 0) {
throw exceptions::invalid_request_exception("LIMIT must be strictly positive");
}
@@ -343,6 +342,7 @@ int32_t select_statement::get_limit(const query_options& options) const {
} catch (const marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid limit value");
}
});
}
bool select_statement::needs_post_query_ordering() const {
@@ -383,48 +383,55 @@ select_statement::do_execute(service::storage_proxy& proxy,
int32_t limit = get_limit(options);
auto now = gc_clock::now();
const bool restrictions_need_filtering = _restrictions->need_filtering();
++_stats.reads;
_stats.filtered_reads += _restrictions->need_filtering();
_stats.filtered_reads += restrictions_need_filtering;
auto command = ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(),
make_partition_slice(options), limit, now, tracing::make_trace_info(state.get_trace_state()), query::max_partitions, utils::UUID(), options.get_timestamp(state));
int32_t page_size = options.get_page_size();
_stats.unpaged_select_queries += page_size <= 0;
// An aggregation query will never be paged for the user, but we always page it internally to avoid OOM.
// If we user provided a page_size we'll use that to page internally (because why not), otherwise we use our default
// Note that if there are some nodes in the cluster with a version less than 2.0, we can't use paging (CASSANDRA-6707).
auto aggregate = _selection->is_aggregate();
if (aggregate && page_size <= 0) {
const bool aggregate = _selection->is_aggregate();
const bool nonpaged_filtering = restrictions_need_filtering && page_size <= 0;
if (aggregate || nonpaged_filtering) {
page_size = DEFAULT_COUNT_PAGE_SIZE;
}
auto key_ranges = _restrictions->get_partition_key_ranges(options);
if (!aggregate && (page_size <= 0
|| !service::pager::query_pagers::may_need_paging(page_size,
if (!aggregate && !restrictions_need_filtering && (page_size <= 0
|| !service::pager::query_pagers::may_need_paging(*_schema, page_size,
*command, key_ranges))) {
return execute(proxy, command, std::move(key_ranges), state, options, now);
}
command->slice.options.set<query::partition_slice::option::allow_short_read>();
auto timeout = options.get_timeout_config().*get_timeout_config_selector();
auto timeout_duration = options.get_timeout_config().*get_timeout_config_selector();
auto p = service::pager::query_pagers::pager(_schema, _selection,
state, options, timeout, command, std::move(key_ranges), _stats, _restrictions->need_filtering() ? _restrictions : nullptr);
state, options, command, std::move(key_ranges), _stats, restrictions_need_filtering ? _restrictions : nullptr);
if (aggregate) {
if (aggregate || nonpaged_filtering) {
return do_with(
cql3::selection::result_set_builder(*_selection, now,
options.get_cql_serialization_format()),
[this, p, page_size, now](auto& builder) {
[this, p, page_size, now, timeout_duration, restrictions_need_filtering](auto& builder) {
return do_until([p] {return p->is_exhausted();},
[p, &builder, page_size, now] {
return p->fetch_page(builder, page_size, now);
[p, &builder, page_size, now, timeout_duration] {
auto timeout = db::timeout_clock::now() + timeout_duration;
return p->fetch_page(builder, page_size, now, timeout);
}
).then([this, &builder] {
).then([this, &builder, restrictions_need_filtering] {
auto rs = builder.build();
if (restrictions_need_filtering) {
_stats.filtered_rows_matched_total += rs->size();
}
update_stats_rows_read(rs->size());
_stats.filtered_rows_matched_total += _restrictions->need_filtering() ? rs->size() : 0;
auto msg = ::make_shared<cql_transport::messages::result_message::rows>(result(std::move(rs)));
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(std::move(msg));
});
@@ -437,12 +444,18 @@ select_statement::do_execute(service::storage_proxy& proxy,
" you must either remove the ORDER BY or the IN and sort client side, or disable paging for this query");
}
if (_selection->is_trivial() && !_restrictions->need_filtering()) {
return p->fetch_page_generator(page_size, now, _stats).then([this, p, limit] (result_generator generator) {
auto meta = make_shared<metadata>(*_selection->get_result_metadata());
if (!p->is_exhausted()) {
meta->set_has_more_pages(p->state());
}
auto timeout = db::timeout_clock::now() + timeout_duration;
if (_selection->is_trivial() && !restrictions_need_filtering) {
return p->fetch_page_generator(page_size, now, timeout, _stats).then([this, p, limit] (result_generator generator) {
auto meta = [&] () -> shared_ptr<const cql3::metadata> {
if (!p->is_exhausted()) {
auto meta = make_shared<metadata>(*_selection->get_result_metadata());
meta->set_paging_state(p->state());
return meta;
} else {
return _selection->get_result_metadata();
}
}();
return shared_ptr<cql_transport::messages::result_message>(
make_shared<cql_transport::messages::result_message::rows>(result(std::move(generator), std::move(meta)))
@@ -450,20 +463,201 @@ select_statement::do_execute(service::storage_proxy& proxy,
});
}
return p->fetch_page(page_size, now).then(
[this, p, &options, limit, now](std::unique_ptr<cql3::result_set> rs) {
return p->fetch_page(page_size, now, timeout).then(
[this, p, &options, now, restrictions_need_filtering](std::unique_ptr<cql3::result_set> rs) {
if (!p->is_exhausted()) {
rs->get_metadata().set_has_more_pages(p->state());
rs->get_metadata().set_paging_state(p->state());
}
if (restrictions_need_filtering) {
_stats.filtered_rows_matched_total += rs->size();
}
update_stats_rows_read(rs->size());
_stats.filtered_rows_matched_total += _restrictions->need_filtering() ? rs->size() : 0;
auto msg = ::make_shared<cql_transport::messages::result_message::rows>(result(std::move(rs)));
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(std::move(msg));
});
}
template<typename KeyType>
GCC6_CONCEPT(
requires (std::is_same_v<KeyType, partition_key> || std::is_same_v<KeyType, clustering_key_prefix>)
)
static KeyType
generate_base_key_from_index_pk(const partition_key& index_pk, const clustering_key& index_ck, const schema& base_schema, const schema& view_schema) {
const auto& base_columns = std::is_same_v<KeyType, partition_key> ? base_schema.partition_key_columns() : base_schema.clustering_key_columns();
std::vector<bytes_view> exploded_base_key;
exploded_base_key.reserve(base_columns.size());
for (const column_definition& base_col : base_columns) {
const column_definition* view_col = view_schema.view_info()->view_column(base_col);
if (view_col->is_partition_key()) {
exploded_base_key.push_back(index_pk.get_component(view_schema, view_col->id));
} else {
exploded_base_key.push_back(index_ck.get_component(view_schema, view_col->id));
}
}
return KeyType::from_range(exploded_base_key);
}
lw_shared_ptr<query::read_command>
indexed_table_select_statement::prepare_command_for_base_query(const query_options& options, service::query_state& state, gc_clock::time_point now, bool use_paging) {
lw_shared_ptr<query::read_command> cmd = ::make_lw_shared<query::read_command>(
_schema->id(),
_schema->version(),
make_partition_slice(options),
get_limit(options),
now,
tracing::make_trace_info(state.get_trace_state()),
query::max_partitions,
utils::UUID(),
options.get_timestamp(state));
if (use_paging) {
cmd->slice.options.set<query::partition_slice::option::allow_short_read>();
cmd->slice.options.set<query::partition_slice::option::send_partition_key>();
if (_schema->clustering_key_size() > 0) {
cmd->slice.options.set<query::partition_slice::option::send_clustering_key>();
}
}
return cmd;
}
future<shared_ptr<cql_transport::messages::result_message>>
indexed_table_select_statement::execute_base_query(
service::storage_proxy& proxy,
dht::partition_range_vector&& partition_ranges,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state) {
auto cmd = prepare_command_for_base_query(options, state, now, bool(paging_state));
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
dht::partition_range_vector per_vnode_ranges;
per_vnode_ranges.reserve(partition_ranges.size());
for (auto& pr : partition_ranges) {
auto restricted_ranges = proxy.get_restricted_ranges(*_schema, pr);
std::move(restricted_ranges.begin(), restricted_ranges.end(), std::back_inserter(per_vnode_ranges));
}
struct base_query_state {
query::result_merger merger;
dht::partition_range_vector per_vnode_ranges;
dht::partition_range_vector::iterator current_partition_range;
base_query_state(uint32_t row_limit, dht::partition_range_vector&& ranges)
: merger(row_limit * ranges.size(), query::max_partitions)
, per_vnode_ranges(std::move(ranges))
, current_partition_range(per_vnode_ranges.begin())
{}
base_query_state(base_query_state&&) = default;
base_query_state(const base_query_state&) = delete;
};
base_query_state query_state{cmd->row_limit, std::move(per_vnode_ranges)};
return do_with(std::move(query_state), [this, &proxy, &state, &options, cmd, timeout] (auto&& query_state) {
auto &merger = query_state.merger;
auto &ranges = query_state.per_vnode_ranges;
auto &range_it = query_state.current_partition_range;
return repeat([this, &ranges, &range_it, &merger, &proxy, &state, &options, cmd, timeout]() {
// Starting with 1 range, we check if the result was a short read, and if not,
// we continue exponentially, asking for 2x more ranges than before
auto range_it_end = std::min(range_it + std::distance(ranges.begin(), range_it) + 1, ranges.end());
dht::partition_range_vector prange(range_it, range_it_end);
auto command = ::make_lw_shared<query::read_command>(*cmd);
auto old_paging_state = options.get_paging_state();
if (old_paging_state && range_it == ranges.begin()) {
auto base_pk = generate_base_key_from_index_pk<partition_key>(old_paging_state->get_partition_key(),
*old_paging_state->get_clustering_key(), *_schema, *_view_schema);
auto base_ck = generate_base_key_from_index_pk<clustering_key>(old_paging_state->get_partition_key(),
*old_paging_state->get_clustering_key(), *_schema, *_view_schema);
command->slice.set_range(*_schema, base_pk,
std::vector<query::clustering_range>{query::clustering_range::make_starting_with(range_bound<clustering_key>(base_ck, false))});
}
return proxy.query(_schema, command, std::move(prange), options.get_consistency(), {timeout, state.get_trace_state()})
.then([&range_it, range_it_end = std::move(range_it_end), &ranges, &merger] (service::storage_proxy::coordinator_query_result qr) {
bool is_short_read = qr.query_result->is_short_read();
merger(std::move(qr.query_result));
range_it = range_it_end;
return stop_iteration(is_short_read || range_it == ranges.end());
});
}).then([&merger]() {
return merger.get();
});
}).then([this, &proxy, &state, &options, now, cmd, paging_state = std::move(paging_state)] (foreign_ptr<lw_shared_ptr<query::result>> result) mutable {
return this->process_base_query_results(std::move(result), cmd, proxy, state, options, now, std::move(paging_state));
});
}
// Function for fetching the selected columns from a list of clustering rows.
// It is currently used only in our Secondary Index implementation - ordinary
// CQL SELECT statements do not have the syntax to request a list of rows.
// FIXME: The current implementation is very inefficient - it requests each
// row separately (and, incrementally, in parallel). Even multiple rows from a single
// partition are requested separately. This last case can be easily improved,
// but to implement the general case (multiple rows from multiple partitions)
// efficiently, we will need more support from other layers.
// Keys are ordered in token order (see #3423)
future<shared_ptr<cql_transport::messages::result_message>>
indexed_table_select_statement::execute_base_query(
service::storage_proxy& proxy,
std::vector<primary_key>&& primary_keys,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state) {
auto cmd = prepare_command_for_base_query(options, state, now, bool(paging_state));
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
struct base_query_state {
query::result_merger merger;
std::vector<primary_key> primary_keys;
std::vector<primary_key>::iterator current_primary_key;
base_query_state(uint32_t row_limit, std::vector<primary_key>&& keys)
: merger(row_limit, query::max_partitions)
, primary_keys(std::move(keys))
, current_primary_key(primary_keys.begin())
{}
base_query_state(base_query_state&&) = default;
base_query_state(const base_query_state&) = delete;
};
base_query_state query_state{cmd->row_limit, std::move(primary_keys)};
return do_with(std::move(query_state), [this, &proxy, &state, &options, cmd, timeout] (auto&& query_state) {
auto &merger = query_state.merger;
auto &keys = query_state.primary_keys;
auto &key_it = query_state.current_primary_key;
return repeat([this, &keys, &key_it, &merger, &proxy, &state, &options, cmd, timeout]() {
// Starting with 1 key, we check if the result was a short read, and if not,
// we continue exponentially, asking for 2x more key than before
auto key_it_end = std::min(key_it + std::distance(keys.begin(), key_it) + 1, keys.end());
auto command = ::make_lw_shared<query::read_command>(*cmd);
query::result_merger oneshot_merger(cmd->row_limit, query::max_partitions);
return map_reduce(key_it, key_it_end, [this, &proxy, &state, &options, cmd, timeout] (auto& key) {
auto command = ::make_lw_shared<query::read_command>(*cmd);
// for each partition, read just one clustering row (TODO: can
// get all needed rows of one partition at once.)
command->slice._row_ranges.clear();
if (key.clustering) {
command->slice._row_ranges.push_back(query::clustering_range::make_singular(key.clustering));
}
return proxy.query(_schema, command, {dht::partition_range::make_singular(key.partition)}, options.get_consistency(), {timeout, state.get_trace_state()})
.then([] (service::storage_proxy::coordinator_query_result qr) {
return std::move(qr.query_result);
});
}, std::move(oneshot_merger)).then([&key_it, key_it_end = std::move(key_it_end), &keys, &merger] (foreign_ptr<lw_shared_ptr<query::result>> result) {
bool is_short_read = result->is_short_read();
merger(std::move(result));
key_it = key_it_end;
return stop_iteration(is_short_read || key_it == keys.end());
});
}).then([&merger] () {
return merger.get();
});
}).then([this, &proxy, &state, &options, now, cmd, paging_state = std::move(paging_state)] (foreign_ptr<lw_shared_ptr<query::result>> result) mutable {
return this->process_base_query_results(std::move(result), cmd, proxy, state, options, now, std::move(paging_state));
});
}
future<shared_ptr<cql_transport::messages::result_message>>
select_statement::execute(service::storage_proxy& proxy,
lw_shared_ptr<query::read_command> cmd,
@@ -503,52 +697,21 @@ select_statement::execute(service::storage_proxy& proxy,
}
}
// Function for fetching the selected columns from a list of clustering rows.
// It is currently used only in our Secondary Index implementation - ordinary
// CQL SELECT statements do not have the syntax to request a list of rows.
// FIXME: The current implementation is very inefficient - it requests each
// row separately (and all in parallel). Even multiple rows from a single
// partition are requested separately. This last case can be easily improved,
// but to implement the general case (multiple rows from multiple partitions)
// efficiently, we will need more support from other layers.
// Keys are ordered in token order (see #3423)
future<shared_ptr<cql_transport::messages::result_message>>
select_statement::execute(service::storage_proxy& proxy,
lw_shared_ptr<query::read_command> cmd,
std::vector<primary_key>&& primary_keys,
service::query_state& state,
const query_options& options,
gc_clock::time_point now)
shared_ptr<cql_transport::messages::result_message>
indexed_table_select_statement::process_base_query_results(
foreign_ptr<lw_shared_ptr<query::result>> results,
lw_shared_ptr<query::read_command> cmd,
service::storage_proxy& proxy,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state)
{
// FIXME: pass the timeout from caller. The query has already started
// earlier (with read_posting_list()), not now.
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
return do_with(std::move(primary_keys), [this, &proxy, &state, &options, cmd, timeout] (auto& keys) {
assert(cmd->partition_limit == query::max_partitions);
query::result_merger merger(cmd->row_limit, query::max_partitions);
// there is no point to produce rows beyond the first row_limit:
auto end = keys.size() <= cmd->row_limit ? keys.end() : keys.begin() + cmd->row_limit;
return map_reduce(keys.begin(), end, [this, &proxy, &state, &options, cmd, timeout] (auto& key) {
auto command = ::make_lw_shared<query::read_command>(*cmd);
// for each partition, read just one clustering row (TODO: can
// get all needed rows of one partition at once.)
command->slice._row_ranges.clear();
if (key.clustering) {
command->slice._row_ranges.push_back(query::clustering_range::make_singular(key.clustering));
}
return proxy.query(_schema,
command,
{dht::partition_range::make_singular(key.partition)},
options.get_consistency(),
{timeout, state.get_trace_state()}).then([] (service::storage_proxy::coordinator_query_result qr) {
return std::move(qr.query_result);
});
}, std::move(merger));
}).then([this, &options, now, cmd] (auto result) {
// note that cmd here still has the garbage clustering range in slice,
// but process_results() ignores this part of the slice setting.
return this->process_results(std::move(result), cmd, options, now);
});
if (paging_state) {
paging_state = generate_view_paging_state_from_base_query_results(paging_state, results, proxy, state, options);
_selection->get_result_metadata()->maybe_set_paging_state(std::move(paging_state));
}
return process_results(std::move(results), std::move(cmd), options, now);
}
shared_ptr<cql_transport::messages::result_message>
@@ -557,7 +720,8 @@ select_statement::process_results(foreign_ptr<lw_shared_ptr<query::result>> resu
const query_options& options,
gc_clock::time_point now)
{
bool fast_path = !needs_post_query_ordering() && _selection->is_trivial() && !_restrictions->need_filtering();
const bool restrictions_need_filtering = _restrictions->need_filtering();
const bool fast_path = !needs_post_query_ordering() && _selection->is_trivial() && !restrictions_need_filtering;
if (fast_path) {
return make_shared<cql_transport::messages::result_message::rows>(result(
result_generator(_schema, std::move(results), std::move(cmd), _selection, _stats),
@@ -567,12 +731,12 @@ select_statement::process_results(foreign_ptr<lw_shared_ptr<query::result>> resu
cql3::selection::result_set_builder builder(*_selection, now,
options.get_cql_serialization_format());
if (_restrictions->need_filtering()) {
if (restrictions_need_filtering) {
results->ensure_counts();
_stats.filtered_rows_read_total += *results->row_count();
query::result_view::consume(*results, cmd->slice,
cql3::selection::result_set_builder::visitor(builder, *_schema,
*_selection, cql3::selection::result_set_builder::restrictions_filter(_restrictions)));
*_selection, cql3::selection::result_set_builder::restrictions_filter(_restrictions, options, cmd->row_limit)));
} else {
query::result_view::consume(*results, cmd->slice,
cql3::selection::result_set_builder::visitor(builder, *_schema,
@@ -588,7 +752,7 @@ select_statement::process_results(foreign_ptr<lw_shared_ptr<query::result>> resu
rs->trim(cmd->row_limit);
}
update_stats_rows_read(rs->size());
_stats.filtered_rows_matched_total += _restrictions->need_filtering() ? rs->size() : 0;
_stats.filtered_rows_matched_total += restrictions_need_filtering ? rs->size() : 0;
return ::make_shared<cql_transport::messages::result_message::rows>(result(std::move(rs)));
}
@@ -617,10 +781,16 @@ indexed_table_select_statement::prepare(database& db,
ordering_comparator_type ordering_comparator,
::shared_ptr<term> limit, cql_stats &stats)
{
auto index_opt = find_idx(db, schema, restrictions);
auto& sim = db.find_column_family(schema).get_index_manager();
auto index_opt = restrictions->find_idx(sim);
if (!index_opt) {
throw std::runtime_error("No index found.");
}
const auto& im = index_opt->metadata();
sstring index_table_name = im.name() + "_index";
schema_ptr view_schema = db.find_schema(schema->ks_name(), index_table_name);
return ::make_shared<cql3::statements::indexed_table_select_statement>(
schema,
bound_terms,
@@ -631,28 +801,11 @@ indexed_table_select_statement::prepare(database& db,
std::move(ordering_comparator),
limit,
stats,
*index_opt);
*index_opt,
view_schema);
}
stdx::optional<secondary_index::index> indexed_table_select_statement::find_idx(database& db,
schema_ptr schema,
::shared_ptr<restrictions::statement_restrictions> restrictions)
{
auto& sim = db.find_column_family(schema).get_index_manager();
for (::shared_ptr<cql3::restrictions::restrictions> restriction : restrictions->index_restrictions()) {
for (const auto& cdef : restriction->get_column_defs()) {
for (auto index : sim.list_indexes()) {
if (index.depends_on(*cdef)) {
return stdx::make_optional<secondary_index::index>(std::move(index));
}
}
}
}
return stdx::nullopt;
}
indexed_table_select_statement::indexed_table_select_statement(schema_ptr schema, uint32_t bound_terms,
::shared_ptr<parameters> parameters,
::shared_ptr<selection::selection> selection,
@@ -660,16 +813,74 @@ indexed_table_select_statement::indexed_table_select_statement(schema_ptr schema
bool is_reversed,
ordering_comparator_type ordering_comparator,
::shared_ptr<term> limit, cql_stats &stats,
const secondary_index::index& index)
const secondary_index::index& index,
schema_ptr view_schema)
: select_statement{schema, bound_terms, parameters, selection, restrictions, is_reversed, ordering_comparator, limit, stats}
, _index{index}
, _view_schema(view_schema)
{}
template<typename KeyType>
GCC6_CONCEPT(
requires (std::is_same_v<KeyType, partition_key> || std::is_same_v<KeyType, clustering_key_prefix>)
)
static void append_base_key_to_index_ck(std::vector<bytes_view>& exploded_index_ck, const KeyType& base_key, const column_definition& index_cdef) {
auto key_view = base_key.view();
auto begin = key_view.begin();
if ((std::is_same_v<KeyType, partition_key> && index_cdef.is_partition_key())
|| (std::is_same_v<KeyType, clustering_key_prefix> && index_cdef.is_clustering_key())) {
auto key_position = std::next(begin, index_cdef.id);
std::move(begin, key_position, std::back_inserter(exploded_index_ck));
begin = std::next(key_position);
}
std::move(begin, key_view.end(), std::back_inserter(exploded_index_ck));
}
::shared_ptr<const service::pager::paging_state> indexed_table_select_statement::generate_view_paging_state_from_base_query_results(::shared_ptr<const service::pager::paging_state> paging_state,
const foreign_ptr<lw_shared_ptr<query::result>>& results, service::storage_proxy& proxy, service::query_state& state, const query_options& options) const {
const column_definition* cdef = _schema->get_column_definition(to_bytes(_index.target_column()));
if (!cdef) {
throw exceptions::invalid_request_exception("Indexed column not found in schema");
}
//NOTICE(sarna): Executing indexed_table branch implies there was at least 1 index restriction present
bytes_opt index_pk_value = _restrictions->index_restrictions().front()->value_for(*cdef, options);
auto index_pk = partition_key::from_single_value(*_view_schema, *index_pk_value);
auto result_view = query::result_view(*results);
if (!results->row_count() || *results->row_count() == 0) {
return std::move(paging_state);
}
auto [last_base_pk, last_base_ck] = result_view.get_last_partition_and_clustering_key();
std::vector<bytes_view> exploded_index_ck;
exploded_index_ck.reserve(_view_schema->clustering_key_size());
dht::i_partitioner& partitioner = dht::global_partitioner();
bytes token_bytes = partitioner.token_to_bytes(partitioner.get_token(*_schema, last_base_pk));
exploded_index_ck.push_back(bytes_view(token_bytes));
append_base_key_to_index_ck<partition_key>(exploded_index_ck, last_base_pk, *cdef);
if (last_base_ck) {
append_base_key_to_index_ck<clustering_key>(exploded_index_ck, *last_base_ck, *cdef);
}
auto index_ck = clustering_key::from_range(std::move(exploded_index_ck));
if (partition_key::tri_compare(*_view_schema)(paging_state->get_partition_key(), index_pk) == 0
&& (!paging_state->get_clustering_key() || clustering_key::prefix_equal_tri_compare(*_view_schema)(*paging_state->get_clustering_key(), index_ck) == 0)) {
return std::move(paging_state);
}
auto paging_state_copy = ::make_shared<service::pager::paging_state>(service::pager::paging_state(*paging_state));
paging_state_copy->set_partition_key(std::move(index_pk));
paging_state_copy->set_clustering_key(std::move(index_ck));
return std::move(paging_state_copy);
}
future<shared_ptr<cql_transport::messages::result_message>>
indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
service::query_state& state,
const query_options& options)
{
tracing::add_table_name(state.get_trace_state(), _view_schema->ks_name(), _view_schema->cf_name());
tracing::add_table_name(state.get_trace_state(), keyspace(), column_family());
auto cl = options.get_consistency();
@@ -684,6 +895,8 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
assert(_restrictions->uses_secondary_indexing());
_stats.unpaged_select_queries += options.get_page_size() <= 0;
// Secondary index search has two steps: 1. use the index table to find a
// list of primary keys matching the query. 2. read the rows matching
// these primary keys from the base table and return the selected columns.
@@ -719,120 +932,142 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
if (whole_partitions || partition_slices) {
// In this case, can use our normal query machinery, which retrieves
// entire partitions or the same slice for many partitions.
return find_index_partition_ranges(proxy, state, options).then([limit, now, &state, &options, &proxy, this] (dht::partition_range_vector partition_ranges) {
auto command = ::make_lw_shared<query::read_command>(
_schema->id(),
_schema->version(),
make_partition_slice(options),
limit,
now,
tracing::make_trace_info(state.get_trace_state()),
query::max_partitions,
utils::UUID(),
options.get_timestamp(state));
return this->execute(proxy, command, std::move(partition_ranges), state, options, now);
return find_index_partition_ranges(proxy, state, options).then([limit, now, &state, &options, &proxy, this] (dht::partition_range_vector partition_ranges, ::shared_ptr<const service::pager::paging_state> paging_state) {
return this->execute_base_query(proxy, std::move(partition_ranges), state, options, now, std::move(paging_state));
});
} else {
// In this case, we need to retrieve a list of rows (not entire
// partitions) and then retrieve those specific rows.
return find_index_clustering_rows(proxy, state, options).then([limit, now, &state, &options, &proxy, this] (std::vector<primary_key> primary_keys) {
auto command = ::make_lw_shared<query::read_command>(
_schema->id(),
_schema->version(),
// Note: the "clustering bounds" set in make_partition_slice()
// here is garbage, and will be overridden by execute() anyway
make_partition_slice(options),
limit,
now,
tracing::make_trace_info(state.get_trace_state()),
query::max_partitions,
utils::UUID(),
options.get_timestamp(state));
return this->execute(proxy, command, std::move(primary_keys), state, options, now);
return find_index_clustering_rows(proxy, state, options).then([limit, now, &state, &options, &proxy, this] (std::vector<primary_key> primary_keys, ::shared_ptr<const service::pager::paging_state> paging_state) {
return this->execute_base_query(proxy, std::move(primary_keys), state, options, now, std::move(paging_state));
});
}
}
// Utility function for getting the schema of the materialized view used for
// the secondary index implementation.
static schema_ptr
get_index_schema(service::storage_proxy& proxy,
const secondary_index::index& index,
const schema_ptr& schema,
tracing::trace_state_ptr& trace_state)
{
const auto& im = index.metadata();
sstring index_table_name = im.name() + "_index";
tracing::add_table_name(trace_state, schema->ks_name(), index_table_name);
return proxy.get_db().local().find_schema(schema->ks_name(), index_table_name);
}
// Utility function for reading from the index view (get_index_view()))
// the posting-list for a particular value of the indexed column.
// Remember a secondary index can only be created on a single column.
static future<service::storage_proxy::coordinator_query_result>
template<typename KeyType>
GCC6_CONCEPT(
requires (std::is_same_v<KeyType, partition_key> || std::is_same_v<KeyType, clustering_key>)
)
static future<::shared_ptr<cql_transport::messages::result_message::rows>>
read_posting_list(service::storage_proxy& proxy,
schema_ptr view_schema,
const std::vector<::shared_ptr<restrictions::restrictions>>& index_restrictions,
schema_ptr base_schema,
const secondary_index::index& index,
::shared_ptr<restrictions::statement_restrictions> base_restrictions,
const query_options& options,
int32_t limit,
service::query_state& state,
gc_clock::time_point now,
db::timeout_clock::time_point timeout)
db::timeout_clock::time_point timeout,
cql3::cql_stats& stats)
{
dht::partition_range_vector partition_ranges;
// FIXME: there should be only one index restriction for this index!
// Perhaps even one index restriction entirely (do we support
// intersection queries?).
for (const auto& restriction : index_restrictions) {
auto pk = partition_key::from_optional_exploded(*view_schema, restriction->values(options));
auto dk = dht::global_partitioner().decorate_key(*view_schema, pk);
auto range = dht::partition_range::make_singular(dk);
partition_ranges.emplace_back(range);
for (const auto& restrictions : base_restrictions->index_restrictions()) {
const column_definition* cdef = base_schema->get_column_definition(to_bytes(index.target_column()));
if (!cdef) {
throw exceptions::invalid_request_exception("Indexed column not found in schema");
}
bytes_opt value = restrictions->value_for(*cdef, options);
if (value) {
auto pk = partition_key::from_single_value(*view_schema, *value);
auto dk = dht::global_partitioner().decorate_key(*view_schema, pk);
auto range = dht::partition_range::make_singular(dk);
partition_ranges.emplace_back(range);
}
}
partition_slice_builder partition_slice_builder{*view_schema};
if (!base_restrictions->has_partition_key_unrestricted_components()) {
auto single_pk_restrictions = dynamic_pointer_cast<restrictions::single_column_primary_key_restrictions<partition_key>>(base_restrictions->get_partition_key_restrictions());
// Only EQ restrictions on base partition key can be used in an index view query
if (single_pk_restrictions && single_pk_restrictions->is_all_eq()) {
auto clustering_restrictions = ::make_shared<restrictions::single_column_primary_key_restrictions<clustering_key_prefix>>(view_schema, *single_pk_restrictions);
// Computed token column needs to be added to index view restrictions
const column_definition& token_cdef = *view_schema->clustering_key_columns().begin();
auto base_pk = partition_key::from_optional_exploded(*base_schema, base_restrictions->get_partition_key_restrictions()->values(options));
bytes token_value = dht::global_partitioner().token_to_bytes(dht::global_partitioner().get_token(*base_schema, base_pk));
auto token_restriction = ::make_shared<restrictions::single_column_restriction::EQ>(token_cdef, ::make_shared<cql3::constants::value>(cql3::raw_value::make_value(token_value)));
clustering_restrictions->merge_with(token_restriction);
if (base_restrictions->get_clustering_columns_restrictions()->prefix_size() > 0) {
auto single_ck_restrictions = dynamic_pointer_cast<restrictions::single_column_primary_key_restrictions<clustering_key>>(base_restrictions->get_clustering_columns_restrictions());
if (single_ck_restrictions) {
auto prefix_restrictions = single_ck_restrictions->get_longest_prefix_restrictions();
auto clustering_restrictions_from_base = ::make_shared<restrictions::single_column_primary_key_restrictions<clustering_key_prefix>>(view_schema, *prefix_restrictions);
for (auto restriction_it : clustering_restrictions_from_base->restrictions()) {
clustering_restrictions->merge_with(restriction_it.second);
}
}
}
partition_slice_builder.with_ranges(clustering_restrictions->bounds_ranges(options));
}
}
auto partition_slice = partition_slice_builder.build();
auto cmd = ::make_lw_shared<query::read_command>(
view_schema->id(),
view_schema->version(),
partition_slice_builder.build(),
partition_slice,
limit,
now,
tracing::make_trace_info(state.get_trace_state()),
query::max_partitions,
utils::UUID(),
options.get_timestamp(state));
return proxy.query(view_schema,
cmd,
std::move(partition_ranges),
options.get_consistency(),
{timeout, state.get_trace_state()});
std::vector<const column_definition*> columns;
for (const column_definition& cdef : base_schema->partition_key_columns()) {
columns.emplace_back(view_schema->get_column_definition(cdef.name()));
}
if constexpr (std::is_same_v<KeyType, clustering_key>) {
for (const column_definition& cdef : base_schema->clustering_key_columns()) {
columns.emplace_back(view_schema->get_column_definition(cdef.name()));
}
}
auto selection = selection::selection::for_columns(view_schema, columns);
int32_t page_size = options.get_page_size();
if (page_size <= 0 || !service::pager::query_pagers::may_need_paging(*view_schema, page_size, *cmd, partition_ranges)) {
stats.unpaged_select_queries += 1;
return proxy.query(view_schema, cmd, std::move(partition_ranges), options.get_consistency(), {timeout, state.get_trace_state()})
.then([base_schema, view_schema, now, &options, selection = std::move(selection), partition_slice = std::move(partition_slice)] (service::storage_proxy::coordinator_query_result qr) {
cql3::selection::result_set_builder builder(*selection, now, options.get_cql_serialization_format());
query::result_view::consume(*qr.query_result,
std::move(partition_slice),
cql3::selection::result_set_builder::visitor(builder, *view_schema, *selection));
return ::make_shared<cql_transport::messages::result_message::rows>(std::move(result(builder.build())));
});
}
auto p = service::pager::query_pagers::pager(view_schema, selection,
state, options, cmd, std::move(partition_ranges), stats, nullptr);
return p->fetch_page(options.get_page_size(), now, timeout).then([p, &options, limit, now] (std::unique_ptr<cql3::result_set> rs) {
rs->get_metadata().set_paging_state(p->state());
return ::make_shared<cql_transport::messages::result_message::rows>(result(std::move(rs)));
});
}
// Note: the partitions keys returned by this function are sorted
// in token order. See issue #3423.
future<dht::partition_range_vector>
future<dht::partition_range_vector, ::shared_ptr<const service::pager::paging_state>>
indexed_table_select_statement::find_index_partition_ranges(service::storage_proxy& proxy,
service::query_state& state,
const query_options& options)
{
schema_ptr view = get_index_schema(proxy, _index, _schema, state.get_trace_state());
auto now = gc_clock::now();
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
return read_posting_list(proxy, view, _restrictions->index_restrictions(), options, get_limit(options), state, now, timeout).then(
[this, now, &options, view] (service::storage_proxy::coordinator_query_result qr) {
std::vector<const column_definition*> columns;
for (const column_definition& cdef : _schema->partition_key_columns()) {
columns.emplace_back(view->get_column_definition(cdef.name()));
}
auto selection = selection::selection::for_columns(view, columns);
cql3::selection::result_set_builder builder(*selection, now, options.get_cql_serialization_format());
// FIXME: read_posting_list already asks to read primary keys only.
// why do we need to specify this again?
auto slice = partition_slice_builder(*view).build();
query::result_view::consume(*qr.query_result,
slice,
cql3::selection::result_set_builder::visitor(builder, *view, *selection));
auto rs = cql3::untyped_result_set(::make_shared<cql_transport::messages::result_message::rows>(std::move(result(builder.build()))));
return read_posting_list<partition_key>(proxy, _view_schema, _schema, _index, _restrictions, options, get_limit(options), state, now, timeout, _stats).then(
[this, now, &options] (::shared_ptr<cql_transport::messages::result_message::rows> rows) {
auto rs = cql3::untyped_result_set(rows);
dht::partition_range_vector partition_ranges;
partition_ranges.reserve(rs.size());
// We are reading the list of primary keys as rows of a single
@@ -858,36 +1093,22 @@ indexed_table_select_statement::find_index_partition_ranges(service::storage_pro
auto range = dht::partition_range::make_singular(dk);
partition_ranges.emplace_back(range);
}
return partition_ranges;
auto paging_state = rows->rs().get_metadata().paging_state();
return make_ready_future<dht::partition_range_vector, ::shared_ptr<const service::pager::paging_state>>(std::move(partition_ranges), std::move(paging_state));
});
}
// Note: the partitions keys returned by this function are sorted
// in token order. See issue #3423.
future<std::vector<indexed_table_select_statement::primary_key>>
future<std::vector<indexed_table_select_statement::primary_key>, ::shared_ptr<const service::pager::paging_state>>
indexed_table_select_statement::find_index_clustering_rows(service::storage_proxy& proxy, service::query_state& state, const query_options& options)
{
schema_ptr view = get_index_schema(proxy, _index, _schema, state.get_trace_state());
auto now = gc_clock::now();
auto timeout = db::timeout_clock::now() + options.get_timeout_config().*get_timeout_config_selector();
return read_posting_list(proxy, view, _restrictions->index_restrictions(), options, get_limit(options), state, now, timeout).then(
[this, now, &options, view] (service::storage_proxy::coordinator_query_result qr) {
std::vector<const column_definition*> columns;
for (const column_definition& cdef : _schema->partition_key_columns()) {
columns.emplace_back(view->get_column_definition(cdef.name()));
}
for (const column_definition& cdef : _schema->clustering_key_columns()) {
columns.emplace_back(view->get_column_definition(cdef.name()));
}
auto selection = selection::selection::for_columns(view, columns);
cql3::selection::result_set_builder builder(*selection, now, options.get_cql_serialization_format());
// FIXME: read_posting_list already asks to read primary keys only.
// why do we need to specify this again?
auto slice = partition_slice_builder(*view).build();
query::result_view::consume(*qr.query_result,
slice,
cql3::selection::result_set_builder::visitor(builder, *view, *selection));
auto rs = cql3::untyped_result_set(::make_shared<cql_transport::messages::result_message::rows>(result(builder.build())));
return read_posting_list<clustering_key>(proxy, _view_schema, _schema, _index, _restrictions, options, get_limit(options), state, now, timeout, _stats).then(
[this, now, &options] (::shared_ptr<cql_transport::messages::result_message::rows> rows) {
auto rs = cql3::untyped_result_set(rows);
std::vector<primary_key> primary_keys;
primary_keys.reserve(rs.size());
for (size_t i = 0; i < rs.size(); i++) {
@@ -903,7 +1124,8 @@ indexed_table_select_statement::find_index_clustering_rows(service::storage_prox
auto ck = clustering_key::from_range(ck_columns);
primary_keys.emplace_back(primary_key{std::move(dk), std::move(ck)});
}
return primary_keys;
auto paging_state = rows->rs().get_metadata().paging_state();
return make_ready_future<std::vector<indexed_table_select_statement::primary_key>, ::shared_ptr<const service::pager::paging_state>>(std::move(primary_keys), std::move(paging_state));
});
}
@@ -986,6 +1208,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(database& db, cql_
}
check_needs_filtering(restrictions);
ensure_filtering_columns_retrieval(db, selection, restrictions);
::shared_ptr<cql3::statements::select_statement> stmt;
if (restrictions->uses_secondary_indexing()) {
@@ -1124,7 +1347,7 @@ select_statement::get_ordering_comparator(schema_ptr schema,
}
auto index = selection->index_of(*def);
if (index < 0) {
index = selection->add_column_for_ordering(*def);
index = selection->add_column_for_post_processing(*def);
}
sorters.emplace_back(index, def->type);
@@ -1211,6 +1434,23 @@ void select_statement::check_needs_filtering(::shared_ptr<restrictions::statemen
}
}
/**
* Adds columns that are needed for the purpose of filtering to the selection.
* The columns that are added to the selection are columns that
* are needed for filtering on the coordinator but are not part of the selection.
* The columns are added with a meta-data indicating they are not to be returned
* to the user.
*/
void select_statement::ensure_filtering_columns_retrieval(database& db,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions) {
for (auto&& cdef : restrictions->get_column_defs_for_filtering(db)) {
if (!selection->has_column(*cdef)) {
selection->add_column_for_post_processing(*cdef);
}
}
}
bool select_statement::contains_alias(::shared_ptr<column_identifier> name) {
return std::any_of(_select_clause.begin(), _select_clause.end(), [name] (auto raw) {
return raw->alias && *name == *raw->alias;

View File

@@ -126,14 +126,6 @@ public:
clustering_key_prefix clustering;
};
future<::shared_ptr<cql_transport::messages::result_message>> execute(
service::storage_proxy& proxy,
lw_shared_ptr<query::read_command> cmd,
std::vector<primary_key>&& primary_keys,
service::query_state& state,
const query_options& options,
gc_clock::time_point now);
shared_ptr<cql_transport::messages::result_message> process_results(foreign_ptr<lw_shared_ptr<query::result>> results,
lw_shared_ptr<query::read_command> cmd, const query_options& options, gc_clock::time_point now);
@@ -168,6 +160,7 @@ public:
class indexed_table_select_statement : public select_statement {
secondary_index::index _index;
schema_ptr _view_schema;
public:
static ::shared_ptr<cql3::statements::select_statement> prepare(database& db,
schema_ptr schema,
@@ -189,24 +182,55 @@ public:
ordering_comparator_type ordering_comparator,
::shared_ptr<term> limit,
cql_stats &stats,
const secondary_index::index& index);
const secondary_index::index& index,
schema_ptr view_schema);
private:
static stdx::optional<secondary_index::index> find_idx(database& db,
schema_ptr schema,
::shared_ptr<restrictions::statement_restrictions> restrictions);
virtual future<::shared_ptr<cql_transport::messages::result_message>> do_execute(service::storage_proxy& proxy,
service::query_state& state, const query_options& options) override;
future<dht::partition_range_vector> find_index_partition_ranges(service::storage_proxy& proxy,
::shared_ptr<const service::pager::paging_state> generate_view_paging_state_from_base_query_results(::shared_ptr<const service::pager::paging_state> paging_state,
const foreign_ptr<lw_shared_ptr<query::result>>& results, service::storage_proxy& proxy, service::query_state& state, const query_options& options) const;
future<dht::partition_range_vector, ::shared_ptr<const service::pager::paging_state>> find_index_partition_ranges(service::storage_proxy& proxy,
service::query_state& state,
const query_options& options);
future<std::vector<primary_key>> find_index_clustering_rows(service::storage_proxy& proxy,
future<std::vector<primary_key>, ::shared_ptr<const service::pager::paging_state>> find_index_clustering_rows(service::storage_proxy& proxy,
service::query_state& state,
const query_options& options);
shared_ptr<cql_transport::messages::result_message>
process_base_query_results(
foreign_ptr<lw_shared_ptr<query::result>> results,
lw_shared_ptr<query::read_command> cmd,
service::storage_proxy& proxy,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state);
lw_shared_ptr<query::read_command>
prepare_command_for_base_query(const query_options& options, service::query_state& state, gc_clock::time_point now, bool use_paging);
future<shared_ptr<cql_transport::messages::result_message>>
execute_base_query(
service::storage_proxy& proxy,
dht::partition_range_vector&& partition_ranges,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state);
future<shared_ptr<cql_transport::messages::result_message>>
execute_base_query(
service::storage_proxy& proxy,
std::vector<primary_key>&& primary_keys,
service::query_state& state,
const query_options& options,
gc_clock::time_point now,
::shared_ptr<const service::pager::paging_state> paging_state);
virtual void update_stats_rows_read(int64_t rows_read) override {
_stats.rows_read += rows_read;
_stats.secondary_index_rows_read += rows_read;

View File

@@ -84,8 +84,11 @@ parse(const sstring& json_string, const std::vector<column_definition>& expected
for (const auto& def : expected_receivers) {
sstring cql_name = def.name_as_text();
auto value_it = prepared_map.find(cql_name);
if (value_it == prepared_map.end() || value_it->second.isNull()) {
if (value_it == prepared_map.end()) {
continue;
} else if (value_it->second.isNull()) {
json_map.emplace(std::move(cql_name), bytes_opt{});
prepared_map.erase(value_it);
} else {
json_map.emplace(std::move(cql_name), def.type->from_json_object(value_it->second, sf));
prepared_map.erase(value_it);
@@ -172,29 +175,45 @@ void update_statement::add_update_for_key(mutation& m, const query::clustering_r
}
modification_statement::json_cache_opt insert_prepared_json_statement::maybe_prepare_json_cache(const query_options& options) {
sstring json_string = utf8_type->to_string(_term->bind_and_get(options).data().value().to_string());
sstring json_string = with_linearized(_term->bind_and_get(options).data().value(), [&] (bytes_view value) {
return utf8_type->to_string(value.to_string());
});
return json_helpers::parse(std::move(json_string), s->all_columns(), options.get_cql_serialization_format());
}
void
insert_prepared_json_statement::execute_set_value(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, const bytes_opt& value) {
if (!value) {
if (column.type->is_collection()) {
auto& k = static_pointer_cast<const collection_type_impl>(column.type)->_kind;
if (&k == &collection_type_impl::kind::list) {
lists::setter::execute(m, prefix, params, column, make_shared<lists::value>(lists::value(std::vector<bytes_opt>())));
} else if (&k == &collection_type_impl::kind::set) {
sets::setter::execute(m, prefix, params, column, make_shared<sets::value>(sets::value(std::set<bytes, serialized_compare>(serialized_compare(empty_type)))));
} else if (&k == &collection_type_impl::kind::map) {
maps::setter::execute(m, prefix, params, column, make_shared<maps::value>(maps::value(std::map<bytes, bytes, serialized_compare>(serialized_compare(empty_type)))));
} else {
throw exceptions::invalid_request_exception("Incorrect value kind in JSON INSERT statement");
}
return;
}
m.set_cell(prefix, column, std::move(operation::make_dead_cell(params)));
return;
} else if (!column.type->is_collection()) {
constants::setter::execute(m, prefix, params, column, raw_value_view::make_value(bytes_view(*value)));
constants::setter::execute(m, prefix, params, column, raw_value_view::make_value(fragmented_temporary_buffer::view(*value)));
return;
}
auto& k = static_pointer_cast<const collection_type_impl>(column.type)->_kind;
cql_serialization_format sf = params._options.get_cql_serialization_format();
if (&k == &collection_type_impl::kind::list) {
auto list_terminal = make_shared<lists::value>(lists::value::from_serialized(*value, dynamic_pointer_cast<const list_type_impl>(column.type), sf));
auto list_terminal = make_shared<lists::value>(lists::value::from_serialized(fragmented_temporary_buffer::view(*value), dynamic_pointer_cast<const list_type_impl>(column.type), sf));
lists::setter::execute(m, prefix, params, column, std::move(list_terminal));
} else if (&k == &collection_type_impl::kind::set) {
auto set_terminal = make_shared<sets::value>(sets::value::from_serialized(*value, dynamic_pointer_cast<const set_type_impl>(column.type), sf));
auto set_terminal = make_shared<sets::value>(sets::value::from_serialized(fragmented_temporary_buffer::view(*value), dynamic_pointer_cast<const set_type_impl>(column.type), sf));
sets::setter::execute(m, prefix, params, column, std::move(set_terminal));
} else if (&k == &collection_type_impl::kind::map) {
auto map_terminal = make_shared<maps::value>(maps::value::from_serialized(*value, dynamic_pointer_cast<const map_type_impl>(column.type), sf));
auto map_terminal = make_shared<maps::value>(maps::value::from_serialized(fragmented_temporary_buffer::view(*value), dynamic_pointer_cast<const map_type_impl>(column.type), sf));
maps::setter::execute(m, prefix, params, column, std::move(map_terminal));
} else {
throw exceptions::invalid_request_exception("Incorrect value kind in JSON INSERT statement");
@@ -204,15 +223,17 @@ insert_prepared_json_statement::execute_set_value(mutation& m, const clustering_
dht::partition_range_vector
insert_prepared_json_statement::build_partition_keys(const query_options& options, const json_cache_opt& json_cache) {
dht::partition_range_vector ranges;
std::vector<bytes_opt> exploded;
for (const auto& def : s->partition_key_columns()) {
auto json_value = json_cache->at(def.name_as_text());
auto k = query::range<partition_key>::make_singular(partition_key::from_single_value(*s, json_value.value()));
ranges.emplace_back(std::move(k).transform(
[this] (partition_key&& k) -> query::ring_position {
auto token = dht::global_partitioner().get_token(*s, k);
return { std::move(token), std::move(k) };
}));
if (!json_value) {
throw exceptions::invalid_request_exception(sprint("Missing mandatory PRIMARY KEY part %s", def.name_as_text()));
}
exploded.emplace_back(*json_value);
}
auto pkey = partition_key::from_optional_exploded(*s, std::move(exploded));
auto k = query::range<query::ring_position>::make_singular(dht::global_partitioner().decorate_key(*s, std::move(pkey)));
ranges.emplace_back(std::move(k));
return ranges;
}
@@ -221,7 +242,10 @@ query::clustering_row_ranges insert_prepared_json_statement::create_clustering_r
std::vector<bytes_opt> exploded;
for (const auto& def : s->clustering_key_columns()) {
auto json_value = json_cache->at(def.name_as_text());
exploded.emplace_back(json_value.value());
if (!json_value) {
throw exceptions::invalid_request_exception(sprint("Missing mandatory PRIMARY KEY part %s", def.name_as_text()));
}
exploded.emplace_back(*json_value);
}
auto k = query::range<clustering_key_prefix>::make_singular(clustering_key_prefix::from_optional_exploded(*s, std::move(exploded)));
ranges.emplace_back(query::clustering_range(std::move(k)));
@@ -234,8 +258,12 @@ void insert_prepared_json_statement::execute_operations_for_key(mutation& m, con
throw exceptions::invalid_request_exception(sprint("Cannot set the value of counter column %s in JSON", def.name_as_text()));
}
auto value = json_cache->at(def.name_as_text());
execute_set_value(m, prefix, params, def, value);
auto it = json_cache->find(def.name_as_text());
if (it != json_cache->end()) {
execute_set_value(m, prefix, params, def, it->second);
} else if (!_default_unset) {
execute_set_value(m, prefix, params, def, bytes_opt{});
}
}
}
@@ -301,12 +329,14 @@ insert_statement::prepare_internal(database& db, schema_ptr schema,
insert_json_statement::insert_json_statement( ::shared_ptr<cf_name> name,
::shared_ptr<attributes::raw> attrs,
::shared_ptr<term::raw> json_value,
bool if_not_exists)
bool if_not_exists,
bool default_unset)
: raw::modification_statement{name, attrs, conditions_vector{}, if_not_exists, false}
, _name(name)
, _attrs(attrs)
, _json_value(json_value)
, _if_not_exists(if_not_exists) { }
, _if_not_exists(if_not_exists)
, _default_unset(default_unset) { }
::shared_ptr<cql3::statements::modification_statement>
insert_json_statement::prepare_internal(database& db, schema_ptr schema,
@@ -316,7 +346,7 @@ insert_json_statement::prepare_internal(database& db, schema_ptr schema,
auto json_column_placeholder = ::make_shared<column_identifier>("", true);
auto prepared_json_value = _json_value->prepare(db, "", ::make_shared<column_specification>("", "", json_column_placeholder, utf8_type));
prepared_json_value->collect_marker_specification(bound_names);
return ::make_shared<cql3::statements::insert_prepared_json_statement>(bound_names->size(), schema, std::move(attrs), &stats.inserts, std::move(prepared_json_value));
return ::make_shared<cql3::statements::insert_prepared_json_statement>(bound_names->size(), schema, std::move(attrs), &stats.inserts, std::move(prepared_json_value), _default_unset);
}
update_statement::update_statement( ::shared_ptr<cf_name> name,

View File

@@ -82,9 +82,10 @@ private:
*/
class insert_prepared_json_statement : public update_statement {
::shared_ptr<term> _term;
bool _default_unset;
public:
insert_prepared_json_statement(uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, uint64_t* cql_stats_counter_ptr, ::shared_ptr<term> t)
: update_statement(statement_type::INSERT, bound_terms, s, std::move(attrs), cql_stats_counter_ptr), _term(t) {
insert_prepared_json_statement(uint32_t bound_terms, schema_ptr s, std::unique_ptr<attributes> attrs, uint64_t* cql_stats_counter_ptr, ::shared_ptr<term> t, bool default_unset)
: update_statement(statement_type::INSERT, bound_terms, s, std::move(attrs), cql_stats_counter_ptr), _term(t), _default_unset(default_unset) {
_restrictions = ::make_shared<restrictions::statement_restrictions>(s, false);
}
private:

View File

@@ -36,6 +36,8 @@ struct cql_stats {
uint64_t batches_pure_unlogged = 0;
uint64_t batches_unlogged_from_logged = 0;
uint64_t rows_read = 0;
uint64_t reverse_queries = 0;
uint64_t unpaged_select_queries = 0;
int64_t secondary_index_creates = 0;
int64_t secondary_index_drops = 0;

View File

@@ -159,8 +159,10 @@ public:
_elements.push_back(e ? bytes_opt(bytes(e->begin(), e->size())) : bytes_opt());
}
}
static value from_serialized(bytes_view buffer, tuple_type type) {
return value(type->split(buffer));
static value from_serialized(const fragmented_temporary_buffer::view& buffer, tuple_type type) {
return with_linearized(buffer, [&] (bytes_view view) {
return value(type->split(view));
});
}
virtual cql3::raw_value get(const query_options& options) override {
return cql3::raw_value::make_value(tuple_type_impl::build_value(_elements));
@@ -251,20 +253,29 @@ public:
}
}
static in_value from_serialized(bytes_view value, list_type type, const query_options& options) {
static in_value from_serialized(const fragmented_temporary_buffer::view& value_view, list_type type, const query_options& options) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but the deserialization does the validation (so we're fine).
return with_linearized(value_view, [&] (bytes_view value) {
auto l = value_cast<list_type_impl::native_type>(type->deserialize(value, options.get_cql_serialization_format()));
auto ttype = dynamic_pointer_cast<const tuple_type_impl>(type->get_elements_type());
assert(ttype);
std::vector<std::vector<bytes_view_opt>> elements;
std::vector<std::vector<bytes_opt>> elements;
elements.reserve(l.size());
for (auto&& element : l) {
elements.emplace_back(ttype->split(ttype->decompose(element)));
auto tuple_buff = ttype->decompose(element);
auto tuple = ttype->split(tuple_buff);
std::vector<bytes_opt> elems;
elems.reserve(tuple.size());
for (auto&& e : tuple) {
elems.emplace_back(to_bytes_opt(e));
}
elements.emplace_back(std::move(elems));
}
return in_value(elements);
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -405,7 +416,7 @@ public:
in_marker(int32_t bind_index, ::shared_ptr<column_specification> receiver)
: abstract_marker(bind_index, std::move(receiver))
{
assert(dynamic_pointer_cast<const list_type_impl>(receiver->type));
assert(dynamic_pointer_cast<const list_type_impl>(_receiver->type));
}
virtual shared_ptr<terminal> bind(const query_options& options) override {

View File

@@ -53,6 +53,9 @@ update_parameters::get_prefetched_list(
return {};
}
if (column.is_static()) {
ckey = clustering_key_view::make_empty();
}
auto i = _prefetched->rows.find(std::make_pair(std::move(pkey), std::move(ckey)));
if (i == _prefetched->rows.end()) {
return {};

View File

@@ -142,7 +142,7 @@ public:
return atomic_cell::make_dead(_timestamp, _local_deletion_time);
}
atomic_cell make_cell(const abstract_type& type, bytes_view value, atomic_cell::collection_member cm = atomic_cell::collection_member::no) const {
atomic_cell make_cell(const abstract_type& type, const fragmented_temporary_buffer::view& value, atomic_cell::collection_member cm = atomic_cell::collection_member::no) const {
auto ttl = _ttl;
if (ttl.count() <= 0) {
@@ -156,6 +156,10 @@ public:
}
};
atomic_cell make_cell(const abstract_type& type, bytes_view value, atomic_cell::collection_member cm = atomic_cell::collection_member::no) const {
return make_cell(type, fragmented_temporary_buffer::view(value), cm);
}
atomic_cell make_counter_update_cell(int64_t delta) const {
return atomic_cell::make_live_counter_update(_timestamp, delta);
}

View File

@@ -28,6 +28,10 @@
#include <experimental/optional>
#include <seastar/util/variant_utils.hh>
#include "utils/fragmented_temporary_buffer.hh"
namespace cql3 {
struct null_value {
@@ -40,7 +44,7 @@ struct unset_value {
///
/// \see raw_value
struct raw_value_view {
boost::variant<bytes_view, null_value, unset_value> _data;
boost::variant<fragmented_temporary_buffer::view, null_value, unset_value> _data;
raw_value_view(null_value&& data)
: _data{std::move(data)}
@@ -48,10 +52,7 @@ struct raw_value_view {
raw_value_view(unset_value&& data)
: _data{std::move(data)}
{}
raw_value_view(bytes_view&& data)
: _data{std::move(data)}
{}
raw_value_view(const bytes_view& data)
raw_value_view(fragmented_temporary_buffer::view data)
: _data{data}
{}
public:
@@ -61,10 +62,7 @@ public:
static raw_value_view make_unset_value() {
return raw_value_view{std::move(unset_value{})};
}
static raw_value_view make_value(bytes_view &&view) {
return raw_value_view{std::move(view)};
}
static raw_value_view make_value(const bytes_view& view) {
static raw_value_view make_value(fragmented_temporary_buffer::view view) {
return raw_value_view{view};
}
bool is_null() const {
@@ -76,20 +74,47 @@ public:
bool is_value() const {
return _data.which() == 0;
}
bytes_view_opt data() const {
std::optional<fragmented_temporary_buffer::view> data() const {
if (_data.which() == 0) {
return boost::get<bytes_view>(_data);
return boost::get<fragmented_temporary_buffer::view>(_data);
}
return {};
}
explicit operator bool() const {
return _data.which() == 0;
}
const bytes_view* operator->() const {
return &boost::get<bytes_view>(_data);
const fragmented_temporary_buffer::view* operator->() const {
return &boost::get<fragmented_temporary_buffer::view>(_data);
}
const bytes_view& operator*() const {
return boost::get<bytes_view>(_data);
const fragmented_temporary_buffer::view& operator*() const {
return boost::get<fragmented_temporary_buffer::view>(_data);
}
bool operator==(const raw_value_view& other) const {
if (_data.which() != other._data.which()) {
return false;
}
if (is_value() && **this != *other) {
return false;
}
return true;
}
bool operator!=(const raw_value_view& other) const {
return !(*this == other);
}
friend std::ostream& operator<<(std::ostream& os, const raw_value_view& value) {
seastar::visit(value._data, [&] (fragmented_temporary_buffer::view v) {
os << "{ value: ";
using boost::range::for_each;
for_each(v, [&os] (bytes_view bv) { os << bv; });
os << " }";
}, [&] (null_value) {
os << "{ null }";
}, [&] (unset_value) {
os << "{ unset }";
});
return os;
}
};
@@ -127,7 +152,7 @@ public:
if (view.is_unset_value()) {
return make_unset_value();
}
return make_value(to_bytes(*view));
return make_value(linearized(*view));
}
static raw_value make_value(bytes&& bytes) {
return raw_value{std::move(bytes)};
@@ -167,7 +192,7 @@ public:
}
raw_value_view to_view() const {
switch (_data.which()) {
case 0: return raw_value_view::make_value(bytes_view{boost::get<bytes>(_data)});
case 0: return raw_value_view::make_value(fragmented_temporary_buffer::view(bytes_view{boost::get<bytes>(_data)}));
case 1: return raw_value_view::make_null();
default: return raw_value_view::make_unset_value();
}
@@ -176,10 +201,19 @@ public:
}
inline bytes to_bytes(const cql3::raw_value_view& view)
{
return linearized(*view);
}
inline bytes_opt to_bytes_opt(const cql3::raw_value_view& view) {
return to_bytes_opt(view.data());
auto buffer_view = view.data();
if (buffer_view) {
return bytes_opt(linearized(*buffer_view));
}
return bytes_opt();
}
inline bytes_opt to_bytes_opt(const cql3::raw_value& value) {
return value.data();
return to_bytes_opt(value.to_view());
}

File diff suppressed because it is too large Load Diff

View File

@@ -77,6 +77,7 @@
#include <seastar/core/metrics_registration.hh>
#include "tracing/trace_state.hh"
#include "db/view/view.hh"
#include "db/view/view_update_backlog.hh"
#include "db/view/row_locking.hh"
#include "lister.hh"
#include "utils/phased_barrier.hh"
@@ -279,6 +280,9 @@ struct cf_stats {
int64_t clustering_filter_fast_path_count = 0;
// how many sstables survived the clustering key checks
int64_t surviving_sstables_after_clustering_filter = 0;
// How many view updates were dropped due to overload.
int64_t dropped_view_updates = 0;
};
class cache_temperature {
@@ -298,9 +302,12 @@ public:
class table;
using column_family = table;
class database_sstable_write_monitor;
class table : public enable_lw_shared_from_this<table> {
public:
struct config {
std::vector<sstring> all_datadirs;
sstring datadir;
bool enable_disk_writes = true;
bool enable_disk_reads = true;
@@ -322,6 +329,7 @@ public:
bool enable_metrics_reporting = false;
db::large_partition_handler* large_partition_handler;
db::timeout_semaphore* view_update_concurrency_semaphore;
size_t view_update_concurrency_semaphore_limit;
};
struct no_commitlog {};
struct stats {
@@ -394,7 +402,7 @@ private:
// plan memtables and the resulting sstables are not made visible until
// the streaming is complete.
struct monitored_sstable {
std::unique_ptr<sstables::write_monitor> monitor;
std::unique_ptr<database_sstable_write_monitor> monitor;
sstables::shared_sstable sstable;
};
@@ -431,8 +439,15 @@ private:
// but for correct compaction we need to start the compaction only after
// reading all sstables.
std::unordered_map<uint64_t, sstables::shared_sstable> _sstables_need_rewrite;
// sstables that should not be compacted (e.g. because they need to be used
// to generate view updates later)
std::unordered_map<uint64_t, sstables::shared_sstable> _sstables_staging;
// Control background fibers waiting for sstables to be deleted
seastar::gate _sstable_deletion_gate;
// This semaphore ensures that an operation like snapshot won't have its selected
// sstables deleted by compaction in parallel, a race condition which could
// easily result in failure.
seastar::semaphore _sstable_deletion_sem = {1};
// There are situations in which we need to stop writing sstables. Flushers will take
// the read lock, and the ones that wish to stop that process will take the write lock.
rwlock _sstables_lock;
@@ -482,6 +497,13 @@ private:
utils::phased_barrier _pending_writes_phaser;
// Corresponding phaser for in-progress reads.
utils::phased_barrier _pending_reads_phaser;
public:
future<> add_sstable_and_update_cache(sstables::shared_sstable sst);
void move_sstable_from_staging_in_thread(sstables::shared_sstable sst);
sstables::shared_sstable get_staging_sstable(uint64_t generation) {
auto it = _sstables_staging.find(generation);
return it != _sstables_staging.end() ? it->second : nullptr;
}
private:
void update_stats_for_new_sstable(uint64_t disk_space_used_by_sstable, const std::vector<unsigned>& shards_for_the_sstable) noexcept;
// Adds new sstable to the set of sstables
@@ -534,7 +556,7 @@ private:
void rebuild_statistics();
// This function replaces new sstables by their ancestors, which are sstables that needed resharding.
void replace_ancestors_needed_rewrite(std::vector<sstables::shared_sstable> new_sstables);
void replace_ancestors_needed_rewrite(std::unordered_set<uint64_t> ancestors, std::vector<sstables::shared_sstable> new_sstables);
void remove_ancestors_needed_rewrite(std::unordered_set<uint64_t> ancestors);
private:
mutation_source_opt _virtual_reader;
@@ -615,6 +637,14 @@ public:
tracing::trace_state_ptr trace_state = nullptr,
streamed_mutation::forwarding fwd = streamed_mutation::forwarding::no,
mutation_reader::forwarding fwd_mr = mutation_reader::forwarding::yes) const;
flat_mutation_reader make_reader_excluding_sstable(schema_ptr schema,
sstables::shared_sstable sst,
const dht::partition_range& range,
const query::partition_slice& slice,
const io_priority_class& pc = default_priority_class(),
tracing::trace_state_ptr trace_state = nullptr,
streamed_mutation::forwarding fwd = streamed_mutation::forwarding::no,
mutation_reader::forwarding fwd_mr = mutation_reader::forwarding::yes) const;
flat_mutation_reader make_reader(schema_ptr schema, const dht::partition_range& range = query::full_partition_range) const {
auto& full_slice = schema->full_slice();
@@ -629,7 +659,13 @@ public:
flat_mutation_reader make_streaming_reader(schema_ptr schema,
const dht::partition_range_vector& ranges) const;
sstables::shared_sstable make_streaming_sstable_for_write(std::optional<sstring> subdir = {});
sstables::shared_sstable make_streaming_staging_sstable() {
return make_streaming_sstable_for_write("staging");
}
mutation_source as_mutation_source() const;
mutation_source as_mutation_source_excluding(sstables::shared_sstable sst) const;
void set_virtual_reader(mutation_source virtual_reader) {
_virtual_reader = std::move(virtual_reader);
@@ -683,7 +719,7 @@ public:
query::result_memory_limiter& memory_limiter,
uint64_t max_result_size,
db::timeout_clock::time_point timeout = db::no_timeout,
querier_cache_context cache_ctx = { });
query::querier_cache_context cache_ctx = { });
void start();
future<> stop();
@@ -837,6 +873,8 @@ public:
void clear_views();
const std::vector<view_ptr>& views() const;
future<row_locker::lock_holder> push_view_replica_updates(const schema_ptr& s, const frozen_mutation& fm, db::timeout_clock::time_point timeout) const;
future<row_locker::lock_holder> push_view_replica_updates(const schema_ptr& s, mutation&& m, db::timeout_clock::time_point timeout) const;
future<row_locker::lock_holder> stream_view_replica_updates(const schema_ptr& s, mutation&& m, db::timeout_clock::time_point timeout, sstables::shared_sstable excluded_sstable) const;
void add_coordinator_read_latency(utils::estimated_histogram::duration latency);
std::chrono::milliseconds get_coordinator_read_latency_percentile(double percentile);
@@ -854,13 +892,17 @@ public:
dht::token base_token,
flat_mutation_reader&&);
reader_concurrency_semaphore& read_concurrency_semaphore() {
return *_config.read_concurrency_semaphore;
}
private:
future<row_locker::lock_holder> do_push_view_replica_updates(const schema_ptr& s, mutation&& m, db::timeout_clock::time_point timeout, mutation_source&& source) const;
std::vector<view_ptr> affected_views(const schema_ptr& base, const mutation& update) const;
future<> generate_and_propagate_view_updates(const schema_ptr& base,
std::vector<view_ptr>&& views,
mutation&& m,
flat_mutation_reader_opt existings,
db::timeout_clock::time_point timeout) const;
flat_mutation_reader_opt existings) const;
mutable row_locker _row_locker;
future<row_locker::lock_holder> local_base_lock(
@@ -1029,6 +1071,7 @@ public:
class keyspace {
public:
struct config {
std::vector<sstring> all_datadirs;
sstring datadir;
bool enable_commitlog = true;
bool enable_disk_reads = true;
@@ -1049,6 +1092,7 @@ public:
seastar::scheduling_group streaming_scheduling_group;
bool enable_metrics_reporting = false;
db::timeout_semaphore* view_update_concurrency_semaphore = nullptr;
size_t view_update_concurrency_semaphore_limit;
};
private:
std::unique_ptr<locator::abstract_replication_strategy> _replication_strategy;
@@ -1106,6 +1150,7 @@ public:
return _config.datadir;
}
sstring column_family_directory(const sstring& base_path, const sstring& name, utils::UUID uuid) const;
sstring column_family_directory(const sstring& name, utils::UUID uuid) const;
};
@@ -1149,6 +1194,7 @@ private:
static const size_t max_count_system_concurrent_reads{10};
size_t max_memory_system_concurrent_reads() { return _dbcfg.available_memory * 0.02; };
static constexpr size_t max_concurrent_sstable_loads() { return 3; }
size_t max_memory_pending_view_updates() const { return _dbcfg.available_memory * 0.1; }
struct db_stats {
uint64_t total_writes = 0;
@@ -1160,6 +1206,11 @@ private:
uint64_t short_data_queries = 0;
uint64_t short_mutation_queries = 0;
uint64_t multishard_query_unpopped_fragments = 0;
uint64_t multishard_query_unpopped_bytes = 0;
uint64_t multishard_query_failed_reader_stops = 0;
uint64_t multishard_query_failed_reader_saves = 0;
};
lw_shared_ptr<db_stats> _stats;
@@ -1180,11 +1231,11 @@ private:
semaphore _sstable_load_concurrency_sem{max_concurrent_sstable_loads()};
db::timeout_semaphore _view_update_concurrency_sem{100}; // Stand-in hack for #2538
db::timeout_semaphore _view_update_concurrency_sem{max_memory_pending_view_updates()};
cache_tracker _row_cache_tracker;
concrete_execution_stage<future<lw_shared_ptr<query::result>>,
inheriting_concrete_execution_stage<future<lw_shared_ptr<query::result>>,
column_family*,
schema_ptr,
const query::read_command&,
@@ -1194,10 +1245,17 @@ private:
query::result_memory_limiter&,
uint64_t,
db::timeout_clock::time_point,
querier_cache_context> _data_query_stage;
query::querier_cache_context> _data_query_stage;
mutation_query_stage _mutation_query_stage;
inheriting_concrete_execution_stage<
future<>,
database*,
schema_ptr,
const frozen_mutation&,
db::timeout_clock::time_point> _apply_stage;
std::unordered_map<sstring, keyspace> _keyspaces;
std::unordered_map<utils::UUID, lw_shared_ptr<column_family>> _column_families;
std::unordered_map<std::pair<sstring, sstring>, utils::UUID, utils::tuple_hash> _ks_cf_to_uuid;
@@ -1208,7 +1266,7 @@ private:
seastar::metrics::metric_groups _metrics;
bool _enable_incremental_backups = false;
querier_cache _querier_cache;
query::querier_cache _querier_cache;
std::unique_ptr<db::large_partition_handler> _large_partition_handler;
@@ -1380,6 +1438,12 @@ public:
std::unordered_set<sstring> get_initial_tokens();
std::experimental::optional<gms::inet_address> get_replace_address();
bool is_replacing();
reader_concurrency_semaphore& user_read_concurrency_sem() {
return _read_concurrency_sem;
}
reader_concurrency_semaphore& streaming_read_concurrency_sem() {
return _streaming_concurrency_sem;
}
reader_concurrency_semaphore& system_keyspace_read_concurrency_sem() {
return _system_read_concurrency_sem;
}
@@ -1396,15 +1460,25 @@ public:
_querier_cache.set_entry_ttl(entry_ttl);
}
const querier_cache::stats& get_querier_cache_stats() const {
const query::querier_cache::stats& get_querier_cache_stats() const {
return _querier_cache.get_stats();
}
query::querier_cache& get_querier_cache() {
return _querier_cache;
}
db::view::update_backlog get_view_update_backlog() const {
return {max_memory_pending_view_updates() - _view_update_concurrency_sem.current(), max_memory_pending_view_updates()};
}
friend class distributed_loader;
};
future<> update_schema_version_and_announce(distributed<service::storage_proxy>& proxy);
bool is_internal_keyspace(const sstring& name);
class distributed_loader {
public:
static void reshard(distributed<database>& db, sstring ks_name, sstring cf_name);

View File

@@ -76,8 +76,7 @@ const uint32_t db::batchlog_manager::replay_interval;
const uint32_t db::batchlog_manager::page_size;
db::batchlog_manager::batchlog_manager(cql3::query_processor& qp)
: _qp(qp)
, _e1(_rd()) {
: _qp(qp) {
namespace sm = seastar::metrics;
_metrics.add_group("batchlog_manager", {
@@ -117,7 +116,7 @@ future<> db::batchlog_manager::start() {
// round-robin scheduling.
if (engine().cpu_id() == 0) {
_timer.set_callback([this] {
return do_batch_log_replay().handle_exception([] (auto ep) {
do_batch_log_replay().handle_exception([] (auto ep) {
blogger.error("Exception in batch replay: {}", ep);
}).finally([this] {
_timer.arm(lowres_clock::now() + std::chrono::milliseconds(replay_interval));
@@ -268,7 +267,7 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
// send to partially or wholly fail in actually sending stuff. Since we don't
// have hints (yet), send with CL=ALL, and hope we can re-do this soon.
// See below, we use retry on write failure.
return _qp.proxy().mutate(mutations, db::consistency_level::ALL, nullptr);
return _qp.proxy().mutate(mutations, db::consistency_level::ALL, db::no_timeout, nullptr);
});
}).then_wrapped([this, id](future<> batch_result) {
try {

View File

@@ -75,8 +75,7 @@ private:
unsigned _cpu = 0;
bool _stop = false;
std::random_device _rd;
std::default_random_engine _e1;
std::default_random_engine _e1{std::random_device{}()};
future<> replay_all_failed_batches();
public:

View File

@@ -107,6 +107,11 @@ public:
void process_bytes(const char* data, size_t size) {
return _c.process(reinterpret_cast<const uint8_t*>(data), size);
}
template<typename FragmentedBuffer>
GCC6_CONCEPT(requires FragmentRange<FragmentedBuffer>)
void process_fragmented(const FragmentedBuffer& buffer) {
return _c.process_fragmented(buffer);
}
};
class db::cf_holder {
@@ -308,10 +313,9 @@ public:
uint64_t get_num_dirty_segments() const;
uint64_t get_num_active_segments() const;
using buffer_type = temporary_buffer<char>;
using buffer_type = fragmented_temporary_buffer;
buffer_type acquire_buffer(size_t s);
void release_buffer(buffer_type&&);
future<std::vector<descriptor>> list_descriptors(sstring dir);
@@ -333,7 +337,6 @@ private:
segment_id_type _ids = 0;
std::vector<sseg_ptr> _segments;
queue<sseg_ptr> _reserve_segments;
std::vector<buffer_type> _temp_buffers;
std::unordered_map<flush_handler_id, flush_handler> _flush_handlers;
flush_handler_id _flush_ids = 0;
replay_position _flush_position;
@@ -344,6 +347,12 @@ private:
uint64_t _new_counter = 0;
};
template<typename T, typename Output>
static void write(Output& out, T value) {
auto v = net::hton(value);
out.write(reinterpret_cast<const char*>(&v), sizeof(v));
}
/*
* A single commit log file on disk. Manages creation of the file and writing mutations to disk,
* as well as tracking the last mutation position of any "dirty" CFs covered by the segment file. Segment
@@ -398,7 +407,6 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
uint64_t _file_pos = 0;
uint64_t _flush_pos = 0;
uint64_t _buf_pos = 0;
bool _closed = false;
using buffer_type = segment_manager::buffer_type;
@@ -407,6 +415,7 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
using time_point = segment_manager::time_point;
buffer_type _buffer;
fragmented_temporary_buffer::ostream _buffer_ostream;
std::unordered_map<cf_id_type, uint64_t> _cf_dirty;
time_point _sync_time;
seastar::gate _gate;
@@ -420,6 +429,10 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
friend std::ostream& operator<<(std::ostream&, const segment&);
friend class segment_manager;
size_t buffer_position() const {
return _buffer.size_bytes() - _buffer_ostream.size();
}
future<> begin_flush() {
// This is maintaining the semantica of only using the write-lock
// as a gate for flushing, i.e. once we've begun a flush for position X
@@ -466,7 +479,7 @@ public:
clogger.debug("Segment {} is no longer active and will submitted for delete now", *this);
++_segment_manager->totals.segments_destroyed;
_segment_manager->totals.total_size_on_disk -= size_on_disk();
_segment_manager->totals.total_size -= (size_on_disk() + _buffer.size());
_segment_manager->totals.total_size -= (size_on_disk() + _buffer.size_bytes());
_segment_manager->add_file_to_delete(_file_name, _desc);
} else {
clogger.warn("Segment {} is dirty and is left on disk.", *this);
@@ -607,29 +620,16 @@ public:
auto a = align_up(s + overhead, alignment);
auto k = std::max(a, default_size);
for (;;) {
try {
_buffer = _segment_manager->acquire_buffer(k);
break;
} catch (std::bad_alloc&) {
clogger.warn("Could not allocate {} k bytes output buffer ({} k required)", k / 1024, a / 1024);
if (k > a) {
k = std::max(a, k / 2);
clogger.debug("Trying reduced size: {} k", k / 1024);
continue;
}
throw;
}
}
_buf_pos = overhead;
auto * p = reinterpret_cast<uint32_t *>(_buffer.get_write());
std::fill(p, p + overhead, 0);
_buffer = _segment_manager->acquire_buffer(k);
_buffer_ostream = _buffer.get_ostream();
auto out = _buffer_ostream.write_substream(overhead);
out.fill('\0', overhead);
_segment_manager->totals.total_size += k;
}
bool buffer_is_empty() const {
return _buf_pos <= segment_overhead_size
|| (_file_pos == 0 && _buf_pos <= (segment_overhead_size + descriptor_header_size));
return buffer_position() <= segment_overhead_size
|| (_file_pos == 0 && buffer_position() <= (segment_overhead_size + descriptor_header_size));
}
/**
* Send any buffer contents to disk and get a new tmp buffer
@@ -641,35 +641,32 @@ public:
}
auto size = clear_buffer_slack();
auto buf = std::move(_buffer);
auto buf = std::exchange(_buffer, { });
auto off = _file_pos;
auto top = off + size;
auto num = _num_allocs;
_file_pos = top;
_buf_pos = 0;
_buffer_ostream = { };
_num_allocs = 0;
auto me = shared_from_this();
assert(me.use_count() > 1);
auto * p = buf.get_write();
assert(std::count(p, p + 2 * sizeof(uint32_t), 0) == 2 * sizeof(uint32_t));
data_output out(p, p + buf.size());
auto out = buf.get_ostream();
auto header_size = 0;
if (off == 0) {
// first block. write file header.
out.write(segment_magic);
out.write(_desc.ver);
out.write(_desc.id);
write(out, segment_magic);
write(out, _desc.ver);
write(out, _desc.id);
crc32_nbo crc;
crc.process(_desc.ver);
crc.process<int32_t>(_desc.id & 0xffffffff);
crc.process<int32_t>(_desc.id >> 32);
out.write(crc.checksum());
write(out, crc.checksum());
header_size = descriptor_header_size;
}
@@ -679,8 +676,8 @@ public:
crc.process<int32_t>(_desc.id >> 32);
crc.process(uint32_t(off + header_size));
out.write(uint32_t(_file_pos));
out.write(crc.checksum());
write(out, uint32_t(_file_pos));
write(out, crc.checksum());
forget_schema_versions();
@@ -690,25 +687,32 @@ public:
// The write will be allowed to start now, but flush (below) must wait for not only this,
// but all previous write/flush pairs.
return _pending_ops.run_with_ordered_post_op(rp, [this, size, off, buf = std::move(buf)]() mutable {
auto written = make_lw_shared<size_t>(0);
auto p = buf.get();
return repeat([this, size, off, written, p]() mutable {
return _pending_ops.run_with_ordered_post_op(rp, [this, size, off, buf = std::move(buf)]() mutable { ///////////////////////////////////////////////////
auto view = fragmented_temporary_buffer::view(buf);
view.remove_suffix(buf.size_bytes() - size);
assert(size == view.size_bytes());
return do_with(off, view, [&] (uint64_t& off, fragmented_temporary_buffer::view& view) {
if (view.empty()) {
return make_ready_future<>();
}
return repeat([this, size, &off, &view] {
auto&& priority_class = service::get_local_commitlog_priority();
return _file.dma_write(off + *written, p + *written, size - *written, priority_class).then_wrapped([this, size, written](future<size_t>&& f) {
auto current = *view.begin();
return _file.dma_write(off, current.data(), current.size(), priority_class).then_wrapped([this, size, &off, &view](future<size_t>&& f) {
try {
auto bytes = std::get<0>(f.get());
*written += bytes;
_segment_manager->totals.bytes_written += bytes;
_segment_manager->totals.total_size_on_disk += bytes;
++_segment_manager->totals.cycle_count;
if (*written == size) {
if (bytes == view.size_bytes()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
// gah, partial write. should always get here with dma chunk sized
// "bytes", but lets make sure...
clogger.debug("Partial write {}: {}/{} bytes", *this, *written, size);
*written = align_down(*written, alignment);
bytes = align_down(bytes, alignment);
off += bytes;
view.remove_prefix(bytes);
clogger.debug("Partial write {}: {}/{} bytes", *this, size - view.size_bytes(), size);
return make_ready_future<stop_iteration>(stop_iteration::no);
// TODO: retry/ignore/fail/stop - optional behaviour in origin.
// we fast-fail the whole commit.
@@ -717,10 +721,10 @@ public:
throw;
}
});
}).finally([this, buf = std::move(buf), size]() mutable {
_segment_manager->release_buffer(std::move(buf));
_segment_manager->notify_memory_written(size);
});
}).finally([this, buf = std::move(buf), size] {
_segment_manager->notify_memory_written(size);
});
}, [me, flush_after, top, rp] { // lambda instead of bind, so we keep "me" alive.
assert(me->_pending_ops.has_operation(rp));
return flush_after ? me->do_flush(top) : make_ready_future<sseg_ptr>(me);
@@ -786,7 +790,7 @@ public:
return finish_and_get_new(timeout).then([id, writer = std::move(writer), permit = std::move(permit), timeout] (auto new_seg) mutable {
return new_seg->allocate(id, std::move(writer), std::move(permit), timeout);
});
} else if (!_buffer.empty() && (s > (_buffer.size() - _buf_pos))) { // enough data?
} else if (!_buffer.empty() && (s > _buffer_ostream.size())) { // enough data?
if (_segment_manager->cfg.mode == sync_mode::BATCH) {
// TODO: this could cause starvation if we're really unlucky.
// If we run batch mode and find ourselves not fit in a non-empty
@@ -805,7 +809,7 @@ public:
size_t buf_memory = s;
if (_buffer.empty()) {
new_buffer(s);
buf_memory += _buf_pos;
buf_memory += buffer_position();
}
_gate.enter(); // this might throw. I guess we accept this?
@@ -813,29 +817,26 @@ public:
_segment_manager->account_memory_usage(buf_memory);
replay_position rp(_desc.id, position());
auto pos = _buf_pos;
_buf_pos += s;
_cf_dirty[id]++; // increase use count for cf.
rp_handle h(static_pointer_cast<cf_holder>(shared_from_this()), std::move(id), rp);
auto * p = _buffer.get_write() + pos;
auto * e = _buffer.get_write() + pos + s - sizeof(uint32_t);
data_output out(p, e);
auto out = _buffer_ostream.write_substream(s);
crc32_nbo crc;
out.write(uint32_t(s));
write<uint32_t>(out, s);
crc.process(uint32_t(s));
out.write(crc.checksum());
write<uint32_t>(out, crc.checksum());
// actual data
writer->write(*this, out);
auto entry_out = out.write_substream(size);
auto entry_data = entry_out.to_input_stream();
writer->write(*this, entry_out);
entry_data.with_stream([&] (auto data_str) {
crc.process_fragmented(ser::buffer_view<typename std::vector<temporary_buffer<char>>::iterator>(data_str));
});
crc.process_bytes(p + 2 * sizeof(uint32_t), size);
out = data_output(e, sizeof(uint32_t));
out.write(crc.checksum());
write<uint32_t>(out, crc.checksum());
++_segment_manager->totals.allocation_count;
++_num_allocs;
@@ -850,7 +851,7 @@ public:
// If this buffer alone is too big, potentially bigger than the maximum allowed size,
// then no other request will be allowed in to force the cycle()ing of this buffer. We
// have to do it ourselves.
if ((_buf_pos >= (db::commitlog::segment::default_size))) {
if ((buffer_position() >= (db::commitlog::segment::default_size))) {
cycle().discard_result().handle_exception([] (auto ex) {
clogger.error("Failed to flush commits to disk: {}", ex);
});
@@ -860,7 +861,7 @@ public:
}
position_type position() const {
return position_type(_file_pos + _buf_pos);
return position_type(_file_pos + buffer_position());
}
size_t size_on_disk() const {
@@ -870,11 +871,12 @@ public:
// ensures no more of this segment is writeable, by allocating any unused section at the end and marking it discarded
// a.k.a. zero the tail.
size_t clear_buffer_slack() {
auto size = align_up(_buf_pos, alignment);
std::fill(_buffer.get_write() + _buf_pos, _buffer.get_write() + size,
0);
_segment_manager->totals.bytes_slack += (size - _buf_pos);
_segment_manager->account_memory_usage(size - _buf_pos);
auto buf_pos = buffer_position();
auto size = align_up(buf_pos, alignment);
auto fill_size = size - buf_pos;
_buffer_ostream.fill('\0', fill_size);
_segment_manager->totals.bytes_slack += fill_size;
_segment_manager->account_memory_usage(fill_size);
return size;
}
void mark_clean(const cf_id_type& id, uint64_t count) {
@@ -1514,41 +1516,20 @@ uint64_t db::commitlog::segment_manager::get_num_active_segments() const {
db::commitlog::segment_manager::buffer_type db::commitlog::segment_manager::acquire_buffer(size_t s) {
auto i = _temp_buffers.begin();
auto e = _temp_buffers.end();
s = align_up(s, segment::default_size);
auto fragment_count = s / segment::default_size;
while (i != e) {
if (i->size() >= s) {
auto r = std::move(*i);
_temp_buffers.erase(i);
totals.buffer_list_bytes -= r.size();
return r;
std::vector<temporary_buffer<char>> buffers;
buffers.reserve(fragment_count);
while (buffers.size() < fragment_count) {
auto a = ::memalign(segment::alignment, segment::default_size);
if (a == nullptr) {
throw std::bad_alloc();
}
++i;
}
auto a = ::memalign(segment::alignment, s);
if (a == nullptr) {
throw std::bad_alloc();
buffers.emplace_back(static_cast<char*>(a), segment::default_size, make_free_deleter(a));
}
clogger.trace("Allocated {} k buffer", s / 1024);
return buffer_type(reinterpret_cast<char *>(a), s, make_free_deleter(a));
}
void db::commitlog::segment_manager::release_buffer(buffer_type&& b) {
_temp_buffers.emplace_back(std::move(b));
std::sort(_temp_buffers.begin(), _temp_buffers.end(), [](const buffer_type& b1, const buffer_type& b2) {
return b1.size() < b2.size();
});
constexpr const size_t max_temp_buffers = 4;
if (_temp_buffers.size() > max_temp_buffers) {
clogger.trace("Deleting {} buffers", _temp_buffers.size() - max_temp_buffers);
_temp_buffers.erase(_temp_buffers.begin() + max_temp_buffers, _temp_buffers.end());
}
totals.buffer_list_bytes = boost::accumulate(
_temp_buffers | boost::adaptors::transformed(std::mem_fn(&buffer_type::size)),
size_t(0), std::plus<size_t>());
return fragmented_temporary_buffer(std::move(buffers), s);
}
/**
@@ -1694,14 +1675,14 @@ const db::commitlog::config& db::commitlog::active_config() const {
// No commit_io_check needed in the log reader since the database will fail
// on error at startup if required
future<std::unique_ptr<subscription<temporary_buffer<char>, db::replay_position>>>
db::commitlog::read_log_file(const sstring& filename, commit_load_reader_func next, position_type off, const db::extensions* exts) {
db::commitlog::read_log_file(const sstring& filename, seastar::io_priority_class read_io_prio_class, commit_load_reader_func next, position_type off, const db::extensions* exts) {
struct work {
private:
file_input_stream_options make_file_input_stream_options() {
file_input_stream_options make_file_input_stream_options(seastar::io_priority_class read_io_prio_class) {
file_input_stream_options fo;
fo.buffer_size = db::commitlog::segment::default_size;
fo.read_ahead = 10;
fo.io_priority_class = service::get_local_commitlog_priority();
fo.io_priority_class = read_io_prio_class;
return fo;
}
public:
@@ -1720,8 +1701,8 @@ db::commitlog::read_log_file(const sstring& filename, commit_load_reader_func ne
bool header = true;
bool failed = false;
work(file f, position_type o = 0)
: f(f), fin(make_file_input_stream(f, 0, make_file_input_stream_options())), start_off(o) {
work(file f, seastar::io_priority_class read_io_prio_class, position_type o = 0)
: f(f), fin(make_file_input_stream(f, 0, make_file_input_stream_options(read_io_prio_class))), start_off(o) {
}
work(work&&) = default;
@@ -1939,9 +1920,9 @@ db::commitlog::read_log_file(const sstring& filename, commit_load_reader_func ne
return fut;
});
return fut.then([off, next](file f) {
return fut.then([off, next, read_io_prio_class] (file f) {
f = make_checked_file(commit_error_handler, std::move(f));
auto w = make_lw_shared<work>(std::move(f), off);
auto w = make_lw_shared<work>(std::move(f), read_io_prio_class, off);
auto ret = w->s.listen(next);
w->s.started().then(std::bind(&work::read_file, w.get())).then([w] {

View File

@@ -42,7 +42,6 @@
#include <memory>
#include "utils/data_output.hh"
#include "core/future.hh"
#include "core/shared_ptr.hh"
#include "core/stream.hh"
@@ -176,7 +175,7 @@ public:
* of data to be written. (See add).
* Don't write less, absolutely don't write more...
*/
using output = data_output;
using output = fragmented_temporary_buffer::ostream;
using serializer_func = std::function<void(output&)>;
/**
@@ -356,7 +355,7 @@ public:
};
static future<std::unique_ptr<subscription<temporary_buffer<char>, replay_position>>> read_log_file(
const sstring&, commit_load_reader_func, position_type = 0, const db::extensions* = nullptr);
const sstring&, seastar::io_priority_class read_io_prio_class, commit_load_reader_func, position_type = 0, const db::extensions* = nullptr);
private:
commitlog(config);

View File

@@ -51,9 +51,8 @@ void commitlog_entry_writer::compute_size() {
_size = ms.size();
}
void commitlog_entry_writer::write(data_output& out) const {
seastar::simple_output_stream str(out.reserve(size()), size());
serialize(str);
void commitlog_entry_writer::write(typename seastar::memory_output_stream<std::vector<temporary_buffer<char>>::iterator>& out) const {
serialize(out);
}
commitlog_entry_reader::commitlog_entry_reader(const temporary_buffer<char>& buffer)

View File

@@ -25,7 +25,6 @@
#include "frozen_mutation.hh"
#include "schema.hh"
#include "utils/data_output.hh"
#include "stdx.hh"
class commitlog_entry {
@@ -35,7 +34,8 @@ public:
commitlog_entry(stdx::optional<column_mapping> mapping, frozen_mutation&& mutation)
: _mapping(std::move(mapping)), _mutation(std::move(mutation)) { }
const stdx::optional<column_mapping>& mapping() const { return _mapping; }
const frozen_mutation& mutation() const { return _mutation; }
const frozen_mutation& mutation() const & { return _mutation; }
frozen_mutation&& mutation() && { return std::move(_mutation); }
};
class commitlog_entry_writer {
@@ -72,7 +72,7 @@ public:
return _mutation.representation().size();
}
void write(data_output& out) const;
void write(typename seastar::memory_output_stream<std::vector<temporary_buffer<char>>::iterator>& out) const;
};
class commitlog_entry_reader {
@@ -81,5 +81,6 @@ public:
commitlog_entry_reader(const temporary_buffer<char>& buffer);
const stdx::optional<column_mapping>& get_column_mapping() const { return _ce.mapping(); }
const frozen_mutation& mutation() const { return _ce.mutation(); }
const frozen_mutation& mutation() const & { return _ce.mutation(); }
frozen_mutation&& mutation() && { return std::move(_ce).mutation(); }
};

View File

@@ -58,6 +58,7 @@
#include "converting_mutation_partition_applier.hh"
#include "schema_registry.hh"
#include "commitlog_entry.hh"
#include "service/priority_manager.hh"
static logging::logger rlogger("commitlog_replayer");
@@ -163,7 +164,7 @@ future<> db::commitlog_replayer::impl::init() {
// Get all truncation records for the CF and initialize max rps if
// present. Cannot do this on demand, as there may be no sstables to
// mark the CF as "needed".
return db::system_keyspace::get_truncated_position(uuid).then([&map, &uuid](std::vector<db::replay_position> tpps) {
return db::system_keyspace::get_truncated_position(uuid).then([&map, uuid](std::vector<db::replay_position> tpps) {
for (auto& p : tpps) {
rlogger.trace("CF {} truncated at {}", uuid, p);
auto& pp = map[p.shard_id()][uuid];
@@ -223,7 +224,7 @@ db::commitlog_replayer::impl::recover(sstring file, const sstring& fname_prefix)
auto s = make_lw_shared<stats>();
auto& exts = _qp.local().db().local().get_config().extensions();
return db::commitlog::read_log_file(file,
return db::commitlog::read_log_file(file, service::get_local_commitlog_priority(),
std::bind(&impl::process, this, s.get(), std::placeholders::_1,
std::placeholders::_2), p, &exts).then([](auto s) {
auto f = s->done();

View File

@@ -102,6 +102,8 @@ db::config::config()
db::config::~config()
{}
const sstring db::config::default_tls_priority("SECURE128:-VERS-TLS1.0");
namespace utils {
template<>

View File

@@ -155,6 +155,9 @@ public:
val(hints_directory, sstring, "/var/lib/scylla/hints", Used, \
"The directory where hints files are stored if hinted handoff is enabled." \
) \
val(view_hints_directory, sstring, "/var/lib/scylla/view_hints", Used, \
"The directory where materialized-view updates are stored while a view replica is unreachable." \
) \
val(saved_caches_directory, sstring, "/var/lib/scylla/saved_caches", Unused, \
"The directory location where table key and row caches are stored." \
) \
@@ -453,7 +456,7 @@ public:
"The maximum number of tombstones a query can scan before aborting." \
) \
/* Network timeout settings */ \
val(range_request_timeout_in_ms, uint32_t, 10000, Unused, \
val(range_request_timeout_in_ms, uint32_t, 10000, Used, \
"The time in milliseconds that the coordinator waits for sequential or index scans to complete." \
) \
val(read_request_timeout_in_ms, uint32_t, 5000, Used, \
@@ -472,7 +475,7 @@ public:
"The time in milliseconds that the coordinator waits for write operations to complete.\n" \
"Related information: About hinted handoff writes" \
) \
val(request_timeout_in_ms, uint32_t, 10000, Unused, \
val(request_timeout_in_ms, uint32_t, 10000, Used, \
"The default timeout for other, miscellaneous operations.\n" \
"Related information: About hinted handoff writes" \
) \
@@ -578,8 +581,8 @@ public:
val(dynamic_snitch_update_interval_in_ms, uint32_t, 100, Unused, \
"The time interval for how often the snitch calculates node scores. Because score calculation is CPU intensive, be careful when reducing this interval." \
) \
val(hinted_handoff_enabled, sstring, "false", Used, \
"Experimental: enable or disable hinted handoff. To enable per data center, add data center list. For example: hinted_handoff_enabled: DC1,DC2. A hint indicates that the write needs to be replayed to an unavailable node. " \
val(hinted_handoff_enabled, sstring, "true", Used, \
"Enable or disable hinted handoff. To enable per data center, add data center list. For example: hinted_handoff_enabled: DC1,DC2. A hint indicates that the write needs to be replayed to an unavailable node. " \
"Related information: About hinted handoff writes" \
) \
val(hinted_handoff_throttle_in_kb, uint32_t, 1024, Unused, \
@@ -621,7 +624,7 @@ public:
val(thrift_framed_transport_size_in_mb, uint32_t, 15, Unused, \
"Frame size (maximum field length) for Thrift. The frame is the row or part of the row the application is inserting." \
) \
val(thrift_max_message_length_in_mb, uint32_t, 16, Unused, \
val(thrift_max_message_length_in_mb, uint32_t, 16, Used, \
"The maximum length of a Thrift message in megabytes, including all fields and internal Thrift overhead (1 byte of overhead for each frame). Message length is usually used in conjunction with batches. A frame length greater than or equal to 24 accommodates a batch with four inserts, each of which is 24 bytes. The required message length is greater than or equal to 24+24+24+24+4 (number of frames)." \
) \
/* Security properties */ \
@@ -728,7 +731,7 @@ public:
val(prometheus_address, sstring, "0.0.0.0", Used, "Prometheus listening address") \
val(prometheus_prefix, sstring, "scylla", Used, "Set the prefix of the exported Prometheus metrics. Changing this will break Scylla's dashboard compatibility, do not change unless you know what you are doing.") \
val(abort_on_lsa_bad_alloc, bool, false, Used, "Abort when allocation in LSA region fails") \
val(murmur3_partitioner_ignore_msb_bits, unsigned, 0, Used, "Number of most siginificant token bits to ignore in murmur3 partitioner; increase for very large clusters") \
val(murmur3_partitioner_ignore_msb_bits, unsigned, 12, Used, "Number of most siginificant token bits to ignore in murmur3 partitioner; increase for very large clusters") \
val(virtual_dirty_soft_limit, double, 0.6, Used, "Soft limit of virtual dirty memory expressed as a portion of the hard limit") \
val(sstable_summary_ratio, double, 0.0005, Used, "Enforces that 1 byte of summary is written for every N (2000 by default) " \
"bytes written to data file. Value must be between 0 and 1.") \
@@ -739,6 +742,7 @@ public:
" Performance is affected to some extent as a result. Useful to help debugging problems that may arise at another layers.") \
val(cpu_scheduler, bool, true, Used, "Enable cpu scheduling") \
val(view_building, bool, true, Used, "Enable view building; should only be set to false when the node is experience issues due to view building") \
val(enable_sstables_mc_format, bool, false, Used, "Enable SSTables 'mc' format to be used as the default file format") \
/* done! */
#define _make_value_member(name, type, deflt, status, desc, ...) \
@@ -752,6 +756,8 @@ public:
add_options(boost::program_options::options_description_easy_init&);
const db::extensions& extensions() const;
static const sstring default_tls_priority;
private:
template<typename T>
struct log_legacy_value : public named_value<T, value_status::Used> {

View File

@@ -253,8 +253,12 @@ filter_for_query(consistency_level cl,
return selected_endpoints;
}
std::vector<gms::inet_address> filter_for_query(consistency_level cl, keyspace& ks, std::vector<gms::inet_address>& live_endpoints, column_family* cf) {
return filter_for_query(cl, ks, live_endpoints, {}, read_repair_decision::NONE, nullptr, cf);
std::vector<gms::inet_address> filter_for_query(consistency_level cl,
keyspace& ks,
std::vector<gms::inet_address>& live_endpoints,
const std::vector<gms::inet_address>& preferred_endpoints,
column_family* cf) {
return filter_for_query(cl, ks, live_endpoints, preferred_endpoints, read_repair_decision::NONE, nullptr, cf);
}
bool

View File

@@ -84,7 +84,11 @@ filter_for_query(consistency_level cl,
gms::inet_address* extra,
column_family* cf);
std::vector<gms::inet_address> filter_for_query(consistency_level cl, keyspace& ks, std::vector<gms::inet_address>& live_endpoints, column_family* cf);
std::vector<gms::inet_address> filter_for_query(consistency_level cl,
keyspace& ks,
std::vector<gms::inet_address>& live_endpoints,
const std::vector<gms::inet_address>& preferred_endpoints,
column_family* cf);
struct dc_node_count {
size_t live = 0;

View File

@@ -49,7 +49,10 @@
#include "types.hh"
static ::shared_ptr<cql3::cql3_type::raw> parse_raw(const sstring& str) {
return cql3::util::do_with_parser(str, std::mem_fn(&cql3_parser::CqlParser::comparatorType));
return cql3::util::do_with_parser(str,
[] (cql3_parser::CqlParser& parser) {
return parser.comparator_type(true);
});
}
data_type db::cql_type_parser::parse(const sstring& keyspace, const sstring& str, lw_shared_ptr<user_types_metadata> user_types) {

View File

@@ -28,8 +28,7 @@ logging::logger hr_logger("heat_load_balance");
// Return a uniformly-distributed random number in [0,1)
// We use per-thread state for thread safety. We seed the random number generator
// once with a real random value, if available,
static thread_local std::random_device r;
static thread_local std::default_random_engine random_engine(r());
static thread_local std::default_random_engine random_engine{std::random_device{}()};
float
rand_float() {
static thread_local std::uniform_real_distribution<float> u(0, 1);

View File

@@ -20,9 +20,11 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <algorithm>
#include <seastar/core/future.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/gate.hh>
#include <boost/range/adaptors.hpp>
#include "service/storage_service.hh"
#include "utils/div_ceil.hh"
#include "db/config.hh"
@@ -33,6 +35,9 @@
#include "disk-error-handler.hh"
#include "lister.hh"
#include "db/timeout_clock.hh"
#include "service/priority_manager.hh"
using namespace std::literals::chrono_literals;
namespace db {
namespace hints {
@@ -74,6 +79,9 @@ void manager::register_metrics(const sstring& group_name) {
sm::make_derive("sent", _stats.sent,
sm::description("Number of sent hints.")),
sm::make_derive("discarded", _stats.discarded,
sm::description("Number of hints that were discarded during sending (too old, schema changed, etc.).")),
});
}
@@ -91,6 +99,7 @@ future<> manager::start(shared_ptr<service::storage_proxy> proxy_ptr, shared_ptr
return compute_hints_dir_device_id();
}).then([this] {
_strorage_service_anchor->register_subscriber(this);
set_started();
});
}
@@ -101,7 +110,7 @@ future<> manager::stop() {
_strorage_service_anchor->unregister_subscriber(this);
}
_stopping = true;
set_stopping();
return _draining_eps_gate.close().finally([this] {
return parallel_for_each(_ep_managers, [] (auto& pair) {
@@ -273,7 +282,7 @@ inline bool manager::have_ep_manager(ep_key_type ep) const noexcept {
}
bool manager::store_hint(ep_key_type ep, schema_ptr s, lw_shared_ptr<const frozen_mutation> fm, tracing::trace_state_ptr tr_state) noexcept {
if (_stopping || !can_hint_for(ep)) {
if (stopping() || !started() || !can_hint_for(ep)) {
manager_logger.trace("Can't store a hint to {}", ep);
++_stats.dropped;
return false;
@@ -376,7 +385,7 @@ future<timespec> manager::end_point_hints_manager::sender::get_last_file_modific
});
}
future<> manager::end_point_hints_manager::sender::do_send_one_mutation(mutation m, const std::vector<gms::inet_address>& natural_endpoints) noexcept {
future<> manager::end_point_hints_manager::sender::do_send_one_mutation(frozen_mutation_and_schema m, const std::vector<gms::inet_address>& natural_endpoints) noexcept {
return futurize_apply([this, m = std::move(m), &natural_endpoints] () mutable -> future<> {
// The fact that we send with CL::ALL in both cases below ensures that new hints are not going
// to be generated as a result of hints sending.
@@ -385,7 +394,11 @@ future<> manager::end_point_hints_manager::sender::do_send_one_mutation(mutation
return _proxy.send_to_endpoint(std::move(m), end_point_key(), { }, write_type::SIMPLE);
} else {
manager_logger.trace("Endpoints set has changed and {} is no longer a replica. Mutating from scratch...", end_point_key());
return _proxy.mutate({std::move(m)}, consistency_level::ALL, nullptr);
// FIXME: using 1h as infinite timeout. If a node is down, we should get an
// unavailable exception.
auto timeout = db::timeout_clock::now() + 1h;
//FIXME: Add required frozen_mutation overloads
return _proxy.mutate({m.fm.unfreeze(m.s)}, consistency_level::ALL, timeout, nullptr);
}
});
}
@@ -411,21 +424,19 @@ bool manager::end_point_hints_manager::sender::can_send() noexcept {
}
}
mutation manager::end_point_hints_manager::sender::get_mutation(lw_shared_ptr<send_one_file_ctx> ctx_ptr, temporary_buffer<char>& buf) {
frozen_mutation_and_schema manager::end_point_hints_manager::sender::get_mutation(lw_shared_ptr<send_one_file_ctx> ctx_ptr, temporary_buffer<char>& buf) {
hint_entry_reader hr(buf);
auto& fm = hr.mutation();
auto& cm = get_column_mapping(std::move(ctx_ptr), fm, hr);
auto& cf = _db.find_column_family(fm.column_family_id());
auto schema = _db.find_schema(fm.column_family_id());
if (cf.schema()->version() != fm.schema_version()) {
mutation m(cf.schema(), fm.decorated_key(*cf.schema()));
converting_mutation_partition_applier v(cm, *cf.schema(), m.partition());
if (schema->version() != fm.schema_version()) {
mutation m(schema, fm.decorated_key(*schema));
converting_mutation_partition_applier v(cm, *schema, m.partition());
fm.partition().accept(cm, v);
return std::move(m);
} else {
return fm.unfreeze(cf.schema());
return {freeze(m), std::move(schema)};
}
return {std::move(hr).mutation(), std::move(schema)};
}
const column_mapping& manager::end_point_hints_manager::sender::get_column_mapping(lw_shared_ptr<send_one_file_ctx> ctx_ptr, const frozen_mutation& fm, const hint_entry_reader& hr) {
@@ -495,7 +506,7 @@ bool manager::check_dc_for(ep_key_type ep) const noexcept {
}
void manager::drain_for(gms::inet_address endpoint) {
if (_stopping) {
if (stopping()) {
return;
}
@@ -536,6 +547,7 @@ manager::end_point_hints_manager::sender::sender(end_point_hints_manager& parent
, _resource_manager(_shard_manager._resource_manager)
, _proxy(local_storage_proxy)
, _db(local_db)
, _hints_cpu_sched_group(_db.get_streaming_scheduling_group())
, _gossiper(local_gossiper)
, _file_update_mutex(_ep_manager.file_update_mutex())
{}
@@ -548,6 +560,7 @@ manager::end_point_hints_manager::sender::sender(const sender& other, end_point_
, _resource_manager(_shard_manager._resource_manager)
, _proxy(other._proxy)
, _db(other._db)
, _hints_cpu_sched_group(other._hints_cpu_sched_group)
, _gossiper(other._gossiper)
, _file_update_mutex(_ep_manager.file_update_mutex())
{}
@@ -603,7 +616,10 @@ manager::end_point_hints_manager::sender::clock::duration manager::end_point_hin
}
void manager::end_point_hints_manager::sender::start() {
_stopped = seastar::async([this] {
seastar::thread_attributes attr;
attr.sched_group = _hints_cpu_sched_group;
_stopped = seastar::async(std::move(attr), [this] {
manager_logger.trace("ep_manager({})::sender: started", end_point_key());
while (!stopping()) {
try {
@@ -623,10 +639,11 @@ void manager::end_point_hints_manager::sender::start() {
});
}
future<> manager::end_point_hints_manager::sender::send_one_mutation(mutation m) {
keyspace& ks = _db.find_keyspace(m.schema()->ks_name());
future<> manager::end_point_hints_manager::sender::send_one_mutation(frozen_mutation_and_schema m) {
keyspace& ks = _db.find_keyspace(m.s->ks_name());
auto& rs = ks.get_replication_strategy();
std::vector<gms::inet_address> natural_endpoints = rs.get_natural_endpoints(m.token());
auto token = dht::global_partitioner().get_token(*m.s, m.fm.key(*m.s));
std::vector<gms::inet_address> natural_endpoints = rs.get_natural_endpoints(std::move(token));
return do_send_one_mutation(std::move(m), natural_endpoints);
}
@@ -644,8 +661,8 @@ future<> manager::end_point_hints_manager::sender::send_one_hint(lw_shared_ptr<s
return make_ready_future<>();
}
mutation m = this->get_mutation(ctx_ptr, buf);
gc_clock::duration gc_grace_sec = m.schema()->gc_grace_seconds();
auto m = this->get_mutation(ctx_ptr, buf);
gc_clock::duration gc_grace_sec = m.s->gc_grace_seconds();
// The hint is too old - drop it.
//
@@ -666,10 +683,13 @@ future<> manager::end_point_hints_manager::sender::send_one_hint(lw_shared_ptr<s
// ignore these errors and move on - probably this hint is too old and the KS/CF has been deleted...
} catch (no_such_column_family& e) {
manager_logger.debug("send_hints(): no_such_column_family: {}", e.what());
++this->shard_stats().discarded;
} catch (no_such_keyspace& e) {
manager_logger.debug("send_hints(): no_such_keyspace: {}", e.what());
++this->shard_stats().discarded;
} catch (no_column_mapping& e) {
manager_logger.debug("send_hints(): {}: {}", fname, e.what());
manager_logger.debug("send_hints(): {} at {}: {}", fname, rp, e.what());
++this->shard_stats().discarded;
}
return make_ready_future<>();
}).finally([units = std::move(units), ctx_ptr] {});
@@ -683,10 +703,10 @@ future<> manager::end_point_hints_manager::sender::send_one_hint(lw_shared_ptr<s
bool manager::end_point_hints_manager::sender::send_one_file(const sstring& fname) {
timespec last_mod = get_last_file_modification(fname).get0();
gc_clock::duration secs_since_file_mod = std::chrono::seconds(last_mod.tv_sec);
lw_shared_ptr<send_one_file_ctx> ctx_ptr = make_lw_shared<send_one_file_ctx>();
lw_shared_ptr<send_one_file_ctx> ctx_ptr = make_lw_shared<send_one_file_ctx>(_last_schema_ver_to_column_mapping);
try {
auto s = commitlog::read_log_file(fname, [this, secs_since_file_mod, &fname, ctx_ptr] (temporary_buffer<char> buf, db::replay_position rp) mutable {
auto s = commitlog::read_log_file(fname, service::get_local_streaming_read_priority(), [this, secs_since_file_mod, &fname, ctx_ptr] (temporary_buffer<char> buf, db::replay_position rp) mutable {
// Check that we can still send the next hint. Don't try to send it if the destination host
// is DOWN or if we have already failed to send some of the previous hints.
if (!draining() && ctx_ptr->state.contains(send_state::segment_replay_failed)) {
@@ -740,6 +760,7 @@ bool manager::end_point_hints_manager::sender::send_one_file(const sstring& fnam
// clear the replay position - we are going to send the next segment...
_last_not_complete_rp = replay_position();
_last_schema_ver_to_column_mapping.clear();
manager_logger.trace("send_one_file(): segment {} was sent in full and deleted", fname);
return true;
}
@@ -752,7 +773,7 @@ void manager::end_point_hints_manager::sender::send_hints_maybe() noexcept {
int replayed_segments_count = 0;
try {
while (have_segments()) {
while (replay_allowed() && have_segments()) {
if (!send_one_file(*_segments_to_replay.begin())) {
break;
}
@@ -777,5 +798,175 @@ void manager::end_point_hints_manager::sender::send_hints_maybe() noexcept {
manager_logger.trace("send_hints(): we handled {} segments", replayed_segments_count);
}
template<typename Func>
static future<> scan_for_hints_dirs(const sstring& hints_directory, Func&& f) {
return lister::scan_dir(hints_directory, { directory_entry_type::directory }, [f = std::forward<Func>(f)] (lister::path dir, directory_entry de) {
try {
return f(std::move(dir), std::move(de), std::stoi(de.name.c_str()));
} catch (std::invalid_argument& ex) {
manager_logger.debug("Ignore invalid directory {}", de.name);
return make_ready_future<>();
}
});
}
// runs in seastar::async context
manager::hints_segments_map manager::get_current_hints_segments(const sstring& hints_directory) {
hints_segments_map current_hints_segments;
// shards level
scan_for_hints_dirs(hints_directory, [&current_hints_segments] (lister::path dir, directory_entry de, unsigned shard_id) {
manager_logger.trace("shard_id = {}", shard_id);
// IPs level
return lister::scan_dir(dir / de.name.c_str(), { directory_entry_type::directory }, [&current_hints_segments, shard_id] (lister::path dir, directory_entry de) {
manager_logger.trace("\tIP: {}", de.name);
// hints files
return lister::scan_dir(dir / de.name.c_str(), { directory_entry_type::regular }, [&current_hints_segments, shard_id, ep_addr = de.name] (lister::path dir, directory_entry de) {
manager_logger.trace("\t\tfile: {}", de.name);
current_hints_segments[ep_addr][shard_id].emplace_back(dir / de.name.c_str());
return make_ready_future<>();
});
});
}).get();
return current_hints_segments;
}
// runs in seastar::async context
void manager::rebalance_segments(const sstring& hints_directory, hints_segments_map& segments_map) {
// Count how many hints segments to each destination we have.
std::unordered_map<sstring, size_t> per_ep_hints;
for (auto& ep_info : segments_map) {
per_ep_hints[ep_info.first] = boost::accumulate(ep_info.second | boost::adaptors::map_values | boost::adaptors::transformed(std::mem_fn(&std::list<lister::path>::size)), 0);
manager_logger.trace("{}: total files: {}", ep_info.first, per_ep_hints[ep_info.first]);
}
// Create a map of lists of segments that we will move (for each destination end point): if a shard has segments
// then we will NOT move q = int(N/S) segments out of them, where N is a total number of segments to the current
// destination and S is a current number of shards.
std::unordered_map<sstring, std::list<lister::path>> segments_to_move;
for (auto& [ep, ep_segments] : segments_map) {
size_t q = per_ep_hints[ep] / smp::count;
auto& current_segments_to_move = segments_to_move[ep];
for (auto& [shard_id, shard_segments] : ep_segments) {
// Move all segments from the shards that are no longer relevant (re-sharding to the lower number of shards)
if (shard_id >= smp::count) {
current_segments_to_move.splice(current_segments_to_move.end(), shard_segments);
} else if (shard_segments.size() > q) {
current_segments_to_move.splice(current_segments_to_move.end(), shard_segments, std::next(shard_segments.begin(), q), shard_segments.end());
}
}
}
// Since N (a total number of segments to a specific destination) may be not a multiple of S (a current number of
// shards) we will distribute files in two passes:
// * if N = S * q + r, then
// * one pass for segments_per_shard = q
// * another one for segments_per_shard = q + 1.
//
// This way we will ensure as close to the perfect distribution as possible.
//
// Right till this point we haven't moved any segments. However we have created a logical separation of segments
// into two groups:
// * Segments that are not going to be moved: segments in the segments_map.
// * Segments that are going to be moved: segments in the segments_to_move.
//
// rebalance_segments_for() is going to consume segments from segments_to_move and move them to corresponding
// lists in the segments_map AND actually move segments to the corresponding shard's sub-directory till the requested
// segments_per_shard level is reached (see more details in the description of rebalance_segments_for()).
for (auto& [ep, N] : per_ep_hints) {
size_t q = N / smp::count;
size_t r = N - q * smp::count;
auto& current_segments_to_move = segments_to_move[ep];
auto& current_segments_map = segments_map[ep];
if (q) {
rebalance_segments_for(ep, q, hints_directory, current_segments_map, current_segments_to_move);
}
if (r) {
rebalance_segments_for(ep, q + 1, hints_directory, current_segments_map, current_segments_to_move);
}
}
}
// runs in seastar::async context
void manager::rebalance_segments_for(
const sstring& ep,
size_t segments_per_shard,
const sstring& hints_directory,
hints_ep_segments_map& ep_segments,
std::list<lister::path>& segments_to_move)
{
manager_logger.trace("{}: segments_per_shard: {}, total number of segments to move: {}", ep, segments_per_shard, segments_to_move.size());
// sanity check
if (segments_to_move.empty() || !segments_per_shard) {
return;
}
for (unsigned i = 0; i < smp::count && !segments_to_move.empty(); ++i) {
lister::path shard_path_dir(lister::path(hints_directory.c_str()) / seastar::format("{:d}", i).c_str() / ep.c_str());
std::list<lister::path>& current_shard_segments = ep_segments[i];
// Make sure that the shard_path_dir exists and if not - create it
io_check(recursive_touch_directory, shard_path_dir.c_str()).get();
while (current_shard_segments.size() < segments_per_shard && !segments_to_move.empty()) {
auto seg_path_it = segments_to_move.begin();
lister::path new_path(shard_path_dir / seg_path_it->filename());
// Don't move the file to the same location - it's pointless.
if (*seg_path_it != new_path) {
manager_logger.trace("going to move: {} -> {}", *seg_path_it, new_path);
io_check(rename_file, seg_path_it->native(), new_path.native()).get();
} else {
manager_logger.trace("skipping: {}", *seg_path_it);
}
current_shard_segments.splice(current_shard_segments.end(), segments_to_move, seg_path_it, std::next(seg_path_it));
}
}
}
// runs in seastar::async context
void manager::remove_irrelevant_shards_directories(const sstring& hints_directory) {
// shards level
scan_for_hints_dirs(hints_directory, [] (lister::path dir, directory_entry de, unsigned shard_id) {
if (shard_id >= smp::count) {
// IPs level
return lister::scan_dir(dir / de.name.c_str(), { directory_entry_type::directory, directory_entry_type::regular }, lister::show_hidden::yes, [] (lister::path dir, directory_entry de) {
return io_check(remove_file, (dir / de.name.c_str()).native());
}).then([shard_base_dir = dir, shard_entry = de] {
return io_check(remove_file, (shard_base_dir / shard_entry.name.c_str()).native());
});
}
return make_ready_future<>();
}).get();
}
future<> manager::rebalance(sstring hints_directory) {
return seastar::async([hints_directory = std::move(hints_directory)] {
// Scan currently present hints segments.
hints_segments_map current_hints_segments = get_current_hints_segments(hints_directory);
// Move segments to achieve an even distribution of files among all present shards.
rebalance_segments(hints_directory, current_hints_segments);
// Remove the directories of shards that are not present anymore - they should not have any segments by now
remove_irrelevant_shards_directories(hints_directory);
});
}
void manager::update_backlog(size_t backlog, size_t max_backlog) {
_backlog_size = backlog;
_max_backlog_size = max_backlog;
if (backlog < max_backlog) {
allow_hints();
} else {
forbid_hints_for_eps_with_pending_hints();
}
}
}
}

View File

@@ -31,6 +31,7 @@
#include <seastar/core/timer.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/shared_mutex.hh>
#include "lister.hh"
#include "gms/gossiper.hh"
#include "locator/snitch_base.hh"
#include "service/endpoint_lifecycle_subscriber.hh"
@@ -58,11 +59,19 @@ private:
uint64_t errors = 0;
uint64_t dropped = 0;
uint64_t sent = 0;
uint64_t discarded = 0;
};
// map: shard -> segments
using hints_ep_segments_map = std::unordered_map<unsigned, std::list<lister::path>>;
// map: IP -> map: shard -> segments
using hints_segments_map = std::unordered_map<sstring, hints_ep_segments_map>;
class drain_tag {};
using drain = seastar::bool_class<drain_tag>;
friend class space_watchdog;
public:
class end_point_hints_manager {
public:
@@ -94,7 +103,10 @@ public:
send_state::restart_segment>>;
struct send_one_file_ctx {
std::unordered_map<table_schema_version, column_mapping> schema_ver_to_column_mapping;
send_one_file_ctx(std::unordered_map<table_schema_version, column_mapping>& last_schema_ver_to_column_mapping)
: schema_ver_to_column_mapping(last_schema_ver_to_column_mapping)
{}
std::unordered_map<table_schema_version, column_mapping>& schema_ver_to_column_mapping;
seastar::gate file_send_gate;
std::unordered_set<db::replay_position> rps_set; // number of elements in this set is never going to be greater than the maximum send queue length
send_state_set state;
@@ -103,6 +115,7 @@ public:
private:
std::list<sstring> _segments_to_replay;
replay_position _last_not_complete_rp;
std::unordered_map<table_schema_version, column_mapping> _last_schema_ver_to_column_mapping;
state_set _state;
future<> _stopped;
clock::time_point _next_flush_tp;
@@ -113,6 +126,7 @@ public:
resource_manager& _resource_manager;
service::storage_proxy& _proxy;
database& _db;
seastar::scheduling_group _hints_cpu_sched_group;
gms::gossiper& _gossiper;
seastar::shared_mutex& _file_update_mutex;
@@ -173,6 +187,10 @@ public:
return _state.contains(state::stopping);
}
bool replay_allowed() const noexcept {
return _ep_manager.replay_allowed();
}
/// \brief Try to send one hint read from the file.
/// - Limit the maximum memory size of hints "in the air" and the maximum total number of hints "in the air".
/// - Discard the hints that are older than the grace seconds value of the corresponding table.
@@ -204,7 +222,7 @@ public:
/// \param ctx_ptr pointer to the send context
/// \param buf hints file entry
/// \return The mutation object representing the original mutation stored in the hints file.
mutation get_mutation(lw_shared_ptr<send_one_file_ctx> ctx_ptr, temporary_buffer<char>& buf);
frozen_mutation_and_schema get_mutation(lw_shared_ptr<send_one_file_ctx> ctx_ptr, temporary_buffer<char>& buf);
/// \brief Get a reference to the column_mapping object for a given frozen mutation.
/// \param ctx_ptr pointer to the send context
@@ -221,13 +239,13 @@ public:
/// \param m mutation to send
/// \param natural_endpoints current replicas for the given mutation
/// \return future that resolves when the operation is complete
future<> do_send_one_mutation(mutation m, const std::vector<gms::inet_address>& natural_endpoints) noexcept;
future<> do_send_one_mutation(frozen_mutation_and_schema m, const std::vector<gms::inet_address>& natural_endpoints) noexcept;
/// \brief Send one mutation out.
///
/// \param m mutation to send
/// \return future that resolves when the mutation sending processing is complete.
future<> send_one_mutation(mutation m);
future<> send_one_mutation(frozen_mutation_and_schema m);
/// \brief Get the last modification time stamp for a given file.
/// \param fname File name
@@ -322,6 +340,10 @@ public:
return _hints_in_progress;
}
bool replay_allowed() const noexcept {
return _shard_manager.replay_allowed();
}
bool can_hint() const noexcept {
return _state.contains(state::can_hint);
}
@@ -387,6 +409,17 @@ public:
}
};
enum class state {
started, // hinting is currently allowed (start() call is complete)
replay_allowed, // replaying (hints sending) is allowed
stopping // hinting is not allowed - stopping is in progress (stop() method has been called)
};
using state_set = enum_set<super_enum<state,
state::started,
state::replay_allowed,
state::stopping>>;
private:
using ep_key_type = typename end_point_hints_manager::key_type;
using ep_managers_map_type = std::unordered_map<ep_key_type, end_point_hints_manager>;
@@ -397,6 +430,7 @@ public:
static const std::chrono::seconds hint_file_write_timeout;
private:
state_set _state;
const boost::filesystem::path _hints_dir;
dev_t _hints_dir_device_id = 0;
@@ -408,7 +442,7 @@ private:
locator::snitch_ptr& _local_snitch_ptr;
int64_t _max_hint_window_us = 0;
database& _local_db;
bool _stopping = false;
seastar::gate _draining_eps_gate; // gate used to control the progress of ep_managers stopping not in the context of manager::stop() call
resource_manager& _resource_manager;
@@ -418,9 +452,14 @@ private:
seastar::metrics::metric_groups _metrics;
std::unordered_set<ep_key_type> _eps_with_pending_hints;
size_t _max_backlog_size = 1;
size_t _backlog_size = 0;
public:
manager(sstring hints_directory, std::vector<sstring> hinted_dcs, int64_t max_hint_window_ms, resource_manager&res_manager, distributed<database>& db);
virtual ~manager();
manager(manager&&) = delete;
manager& operator=(manager&&) = delete;
void register_metrics(const sstring& group_name);
future<> start(shared_ptr<service::storage_proxy> proxy_ptr, shared_ptr<gms::gossiper> gossiper_ptr, shared_ptr<service::storage_service> ss_ptr);
future<> stop();
@@ -497,23 +536,101 @@ public:
void forbid_hints();
void forbid_hints_for_eps_with_pending_hints();
static future<> rebalance() {
// TODO
return make_ready_future<>();
size_t max_backlog_size() const {
return _max_backlog_size;
}
size_t backlog_size() const {
return _backlog_size;
}
void allow_replaying() noexcept {
_state.set(state::replay_allowed);
}
/// \brief Rebalance hints segments among all present shards.
///
/// The difference between the number of segments on every two shard will be not greater than 1 after the
/// rebalancing.
///
/// Removes the sub-directories of \ref hints_directory that correspond to shards that are not relevant any more
/// (re-sharding to a lower shards number case).
///
/// Complexity: O(N+K), where N is a total number of present hints' segments and
/// K = <number of shards during the previous boot> * <number of end points for which hints where ever created>
///
/// \param hints_directory A hints directory to rebalance
/// \return A future that resolves when the operation is complete.
static future<> rebalance(sstring hints_directory);
virtual void on_join_cluster(const gms::inet_address& endpoint) override {}
virtual void on_leave_cluster(const gms::inet_address& endpoint) override {
drain_for(endpoint);
};
virtual void on_up(const gms::inet_address& endpoint) override {}
virtual void on_down(const gms::inet_address& endpoint) override {}
virtual void on_move(const gms::inet_address& endpoint) override {}
private:
future<> compute_hints_dir_device_id();
/// \brief Scan the given hints directory and build the map of all present hints segments.
///
/// Complexity: O(N+K), where N is a total number of present hints' segments and
/// K = <number of shards during the previous boot> * <number of end points for which hints where ever created>
///
/// \note Should be called from a seastar::thread context.
///
/// \param hints_directory directory to scan
/// \return a map: ep -> map: shard -> segments (full paths)
static hints_segments_map get_current_hints_segments(const sstring& hints_directory);
/// \brief Rebalance hints segments for a given (destination) end point
///
/// This method is going to consume files from the \ref segments_to_move and distribute them between the present
/// shards (taking into an account the \ref ep_segments state - there may be zero or more segments that belong to a
/// particular shard in it) until we either achieve the requested \ref segments_per_shard level on each shard
/// or until we are out of files to move.
///
/// As a result (in addition to the actual state on the disk) both \ref ep_segments and \ref segments_to_move are going
/// to be modified.
///
/// Complexity: O(N), where N is a total number of present hints' segments for the \ref ep end point (as a destination).
///
/// \note Should be called from a seastar::thread context.
///
/// \param ep destination end point ID (a string with its IP address)
/// \param segments_per_shard number of hints segments per-shard we want to achieve
/// \param hints_directory a root hints directory
/// \param ep_segments a map that was originally built by get_current_hints_segments() for this end point
/// \param segments_to_move a list of segments we are allowed to move
static void rebalance_segments_for(
const sstring& ep,
size_t segments_per_shard,
const sstring& hints_directory,
hints_ep_segments_map& ep_segments,
std::list<lister::path>& segments_to_move);
/// \brief Rebalance all present hints segments.
///
/// The difference between the number of segments on every two shard will be not greater than 1 after the
/// rebalancing.
///
/// Complexity: O(N), where N is a total number of present hints' segments.
///
/// \note Should be called from a seastar::thread context.
///
/// \param hints_directory a root hints directory
/// \param segments_map a map that was built by get_current_hints_segments()
static void rebalance_segments(const sstring& hints_directory, hints_segments_map& segments_map);
/// \brief Remove sub-directories of shards that are not relevant any more (re-sharding to a lower number of shards case).
///
/// Complexity: O(S*E), where S is a number of shards during the previous boot and
/// E is a number of end points for which hints where ever created.
///
/// \param hints_directory a root hints directory
static void remove_irrelevant_shards_directories(const sstring& hints_directory);
node_to_hint_store_factory_type& store_factory() noexcept {
return _store_factory;
}
@@ -544,6 +661,28 @@ private:
/// \param endpoint node that left the cluster
void drain_for(gms::inet_address endpoint);
void update_backlog(size_t backlog, size_t max_backlog);
bool stopping() const noexcept {
return _state.contains(state::stopping);
}
void set_stopping() noexcept {
_state.set(state::stopping);
}
bool started() const noexcept {
return _state.contains(state::started);
}
void set_started() noexcept {
_state.set(state::started);
}
bool replay_allowed() const noexcept {
return _state.contains(state::replay_allowed);
}
public:
ep_managers_map_type::iterator find_ep_manager(ep_key_type ep_key) noexcept {
return _ep_managers.find(ep_key);

View File

@@ -27,6 +27,7 @@
#include "lister.hh"
#include "disk-error-handler.hh"
#include "seastarx.hh"
#include <seastar/core/sleep.hh>
namespace db {
namespace hints {
@@ -65,19 +66,28 @@ const std::chrono::seconds space_watchdog::_watchdog_period = std::chrono::secon
space_watchdog::space_watchdog(shard_managers_set& managers, per_device_limits_map& per_device_limits_map)
: _shard_managers(managers)
, _per_device_limits_map(per_device_limits_map)
, _timer([this] { on_timer(); })
{}
void space_watchdog::start() {
_timer.arm(timer_clock_type::now());
_started = seastar::async([this] {
while (!_as.abort_requested()) {
try {
on_timer();
} catch (...) {
resource_manager_logger.trace("space_watchdog: unexpected exception - stop all hints generators");
// Stop all hint generators if space_watchdog callback failed
for (manager& shard_manager : _shard_managers) {
shard_manager.forbid_hints();
}
}
seastar::sleep_abortable(_watchdog_period, _as).get();
}
}).handle_exception_type([] (const seastar::sleep_aborted& ignored) { });
}
future<> space_watchdog::stop() noexcept {
try {
return _gate.close().finally([this] { _timer.cancel(); });
} catch (...) {
return make_exception_future<>(std::current_exception());
}
_as.request_abort();
return std::move(_started);
}
future<> space_watchdog::scan_one_ep_dir(boost::filesystem::path path, manager& shard_manager, ep_key_type ep_key) {
@@ -94,83 +104,62 @@ future<> space_watchdog::scan_one_ep_dir(boost::filesystem::path path, manager&
});
}
// Called from the context of a seastar::thread.
void space_watchdog::on_timer() {
with_gate(_gate, [this] {
return futurize_apply([this] {
_total_size = 0;
// The hints directories are organized as follows:
// <hints root>
// |- <shard1 ID>
// | |- <EP1 address>
// | |- <hints file1>
// | |- <hints file2>
// | |- ...
// | |- <EP2 address>
// | |- ...
// | |-...
// |- <shard2 ID>
// | |- ...
// ...
// |- <shardN ID>
// | |- ...
//
return do_for_each(_shard_managers, [this] (manager& shard_manager) {
shard_manager.clear_eps_with_pending_hints();
// The hints directories are organized as follows:
// <hints root>
// |- <shard1 ID>
// | |- <EP1 address>
// | |- <hints file1>
// | |- <hints file2>
// | |- ...
// | |- <EP2 address>
// | |- ...
// | |-...
// |- <shard2 ID>
// | |- ...
// ...
// |- <shardN ID>
// | |- ...
for (auto& per_device_limits : _per_device_limits_map | boost::adaptors::map_values) {
_total_size = 0;
for (manager& shard_manager : per_device_limits.managers) {
shard_manager.clear_eps_with_pending_hints();
lister::scan_dir(shard_manager.hints_dir(), {directory_entry_type::directory}, [this, &shard_manager] (lister::path dir, directory_entry de) {
_files_count = 0;
// Let's scan per-end-point directories and enumerate hints files...
//
return lister::scan_dir(shard_manager.hints_dir(), {directory_entry_type::directory}, [this, &shard_manager] (lister::path dir, directory_entry de) {
_files_count = 0;
// Let's scan per-end-point directories and enumerate hints files...
//
// Let's check if there is a corresponding end point manager (may not exist if the corresponding DC is
// not hintable).
// If exists - let's take a file update lock so that files are not changed under our feet. Otherwise, simply
// continue to enumeration - there is no one to change them.
auto it = shard_manager.find_ep_manager(de.name);
if (it != shard_manager.ep_managers_end()) {
return with_lock(it->second.file_update_mutex(), [this, &shard_manager, dir = std::move(dir), ep_name = std::move(de.name)]() mutable {
return scan_one_ep_dir(dir / ep_name.c_str(), shard_manager, ep_key_type(ep_name));
});
} else {
return scan_one_ep_dir(dir / de.name.c_str(), shard_manager, ep_key_type(de.name));
}
});
}).then([this] {
return do_for_each(_per_device_limits_map, [this](per_device_limits_map::value_type& per_device_limits_entry) {
space_watchdog::per_device_limits& per_device_limits = per_device_limits_entry.second;
size_t adjusted_quota = 0;
size_t delta = boost::accumulate(per_device_limits.managers, 0, [] (size_t sum, manager& shard_manager) {
return sum + shard_manager.ep_managers_size() * resource_manager::hint_segment_size_in_mb * 1024 * 1024;
// Let's check if there is a corresponding end point manager (may not exist if the corresponding DC is
// not hintable).
// If exists - let's take a file update lock so that files are not changed under our feet. Otherwise, simply
// continue to enumeration - there is no one to change them.
auto it = shard_manager.find_ep_manager(de.name);
if (it != shard_manager.ep_managers_end()) {
return with_lock(it->second.file_update_mutex(), [this, &shard_manager, dir = std::move(dir), ep_name = std::move(de.name)]() mutable {
return scan_one_ep_dir(dir / ep_name.c_str(), shard_manager, ep_key_type(ep_name));
});
if (per_device_limits.max_shard_disk_space_size > delta) {
adjusted_quota = per_device_limits.max_shard_disk_space_size - delta;
}
} else {
return scan_one_ep_dir(dir / de.name.c_str(), shard_manager, ep_key_type(de.name));
}
}).get();
}
bool can_hint = _total_size < adjusted_quota;
resource_manager_logger.trace("space_watchdog: total_size ({}) {} max_shard_disk_space_size ({})", _total_size, can_hint ? "<" : ">=", adjusted_quota);
if (!can_hint) {
for (manager& shard_manager : per_device_limits.managers) {
shard_manager.forbid_hints_for_eps_with_pending_hints();
}
} else {
for (manager& shard_manager : per_device_limits.managers) {
shard_manager.allow_hints();
}
}
});
});
}).handle_exception([this] (auto eptr) {
resource_manager_logger.trace("space_watchdog: unexpected exception - stop all hints generators");
// Stop all hint generators if space_watchdog callback failed
for (manager& shard_manager : _shard_managers) {
shard_manager.forbid_hints();
}
}).finally([this] {
_timer.arm(_watchdog_period);
// Adjust the quota to take into account the space we guarantee to every end point manager
size_t adjusted_quota = 0;
size_t delta = boost::accumulate(per_device_limits.managers, 0, [] (size_t sum, manager& shard_manager) {
return sum + shard_manager.ep_managers_size() * resource_manager::hint_segment_size_in_mb * 1024 * 1024;
});
});
if (per_device_limits.max_shard_disk_space_size > delta) {
adjusted_quota = per_device_limits.max_shard_disk_space_size - delta;
}
resource_manager_logger.trace("space_watchdog: consuming {}/{} bytes", _total_size, adjusted_quota);
for (manager& shard_manager : per_device_limits.managers) {
shard_manager.update_backlog(_total_size, adjusted_quota);
}
}
}
future<> resource_manager::start(shared_ptr<service::storage_proxy> proxy_ptr, shared_ptr<gms::gossiper> gossiper_ptr, shared_ptr<service::storage_service> ss_ptr) {
@@ -183,6 +172,10 @@ future<> resource_manager::start(shared_ptr<service::storage_proxy> proxy_ptr, s
});
}
void resource_manager::allow_replaying() noexcept {
boost::for_each(_shard_managers, [] (manager& m) { m.allow_replaying(); });
}
future<> resource_manager::stop() noexcept {
return parallel_for_each(_shard_managers, [](manager& m) {
return m.stop();
@@ -201,14 +194,18 @@ future<> resource_manager::prepare_per_device_limits() {
auto it = _per_device_limits_map.find(device_id);
if (it == _per_device_limits_map.end()) {
return is_mountpoint(shard_manager.hints_dir().parent_path()).then([this, device_id, &shard_manager](bool is_mountpoint) {
// By default, give each group of managers 10% of the available disk space. Give each shard an equal share of the available space.
size_t max_size = boost::filesystem::space(shard_manager.hints_dir().c_str()).capacity / (10 * smp::count);
// If hints directory is a mountpoint, we assume it's on dedicated (i.e. not shared with data/commitlog/etc) storage.
// Then, reserve 90% of all space instead of 10% above.
if (is_mountpoint) {
max_size *= 9;
auto [it, inserted] = _per_device_limits_map.emplace(device_id, space_watchdog::per_device_limits{});
// Since we possibly deferred, we need to recheck the _per_device_limits_map.
if (inserted) {
// By default, give each group of managers 10% of the available disk space. Give each shard an equal share of the available space.
it->second.max_shard_disk_space_size = boost::filesystem::space(shard_manager.hints_dir().c_str()).capacity / (10 * smp::count);
// If hints directory is a mountpoint, we assume it's on dedicated (i.e. not shared with data/commitlog/etc) storage.
// Then, reserve 90% of all space instead of 10% above.
if (is_mountpoint) {
it->second.max_shard_disk_space_size *= 9;
}
}
_per_device_limits_map.emplace(device_id, space_watchdog::per_device_limits{{std::ref(shard_manager)}, max_size});
it->second.managers.emplace_back(std::ref(shard_manager));
});
} else {
it->second.managers.emplace_back(std::ref(shard_manager));

Some files were not shown because too many files have changed in this diff Show More