Compare commits

2524 Commits

Author SHA1 Message Date
Jenkins
d27eb734a7 release: prepare for 2.2.2 by hagitsegev 2019-01-12 18:28:25 +02:00
Avi Kivity
e6aeb490b5 Update seastar submodule
* seastar 6f61d74...88cb58c (2):
  > reactor: disable nowait aio due to a kernel bug
  > configure.py: Enhance detection for gcc -fvisibility=hidden bug

Fixes #3996.
2018-12-17 15:57:58 +02:00
Vladimir Krivopalov
2e3b09b593 database: Capture io_priority_class by reference to avoid dangling ref.
The original reference points to a thread-local storage object that is
guaranteed to outlive the continuation, but copying it makes the
subsequent calls point to a local object and introduces a use-after-free
bug.

Fixes #3948

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
(cherry picked from commit 68458148e7)
2018-12-02 13:32:59 +02:00
Tomasz Grabiec
92c74f4e0b utils: phased_barrier: Make advance_and_await() have strong exception guarantees
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.

One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.

This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>

Fixes #3931.

(cherry picked from commit 57e25fa0f8)
2018-11-21 12:18:25 +02:00
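The rearrangement described in 92c74f4e0b can be illustrated with a minimal, synchronous sketch. It is not the actual utils::phased_barrier code: `toy_gate`, `toy_phased_barrier` and the `make_gate` parameter are hypothetical names introduced here. The idea shown is only the ordering: do everything that may throw first, then commit with operations that cannot throw.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

// Hypothetical stand-in for a gate; this is not seastar::gate.
struct toy_gate {};

// Hypothetical, synchronous simplification of a phased barrier.
class toy_phased_barrier {
    std::unique_ptr<toy_gate> _gate = std::make_unique<toy_gate>();
    unsigned long _phase = 0;
public:
    unsigned long phase() const { return _phase; }
    bool engaged() const { return static_cast<bool>(_gate); }

    // make_gate may throw (for example std::bad_alloc). It runs before any
    // member is modified, so a failure leaves the barrier exactly as it was.
    template <typename MakeGate>
    std::unique_ptr<toy_gate> advance(MakeGate make_gate) {
        std::unique_ptr<toy_gate> new_gate = make_gate(); // may throw
        ++_phase;                                         // cannot throw
        std::swap(_gate, new_gate);                       // cannot throw
        return new_gate; // the previous gate, for the caller to drain
    }
};
```

If `make_gate` throws, `phase()` is unchanged and `engaged()` still returns true, so a later call to `advance()` remains safe.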
Avi Kivity
89d835e9e3 tests: fix network_topology_test timing out in debug mode
In 2.2, SEASTAR_DEBUG is just DEBUG.
2018-10-21 19:04:08 +03:00
Takuya ASADA
263a740084 dist/debian: use --configfile to specify pbuilderrc
Use --configfile to specify pbuilderrc, instead of copying it to home directory.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180420024624.9661-1-syuu@scylladb.com>
(cherry picked from commit 01c36556bf)
2018-10-21 18:21:18 +03:00
Avi Kivity
7f24b5319e release: prepare for 2.2.1 2018-10-19 21:16:14 +03:00
Avi Kivity
fe16c0e985 locator: fix abstract_replication_strategy::get_ranges() and friends violating sort order
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). This broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.

Fixes #3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>

(cherry picked from commit 1ce52d5432)
2018-10-19 21:16:12 +03:00
Glauber Costa
f85badaaac api: use longs instead of ints for snapshot sizes
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.

Without this patch, listsnapshots is broken in all versions.

Fixes: #3845

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
(cherry picked from commit 98332de268)
2018-10-12 22:02:56 +03:00
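The size problem behind f85badaaac can be checked with a small sketch. The helper `fits_in` and the 5 GiB figure are illustrative assumptions, not taken from the patch; the point is only that a 32-bit integer cannot represent typical snapshot sizes while a 64-bit one can.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Returns true if `bytes` can be stored in integer type T without overflow.
template <typename T>
constexpr bool fits_in(std::uint64_t bytes) {
    return bytes <= static_cast<std::uint64_t>(std::numeric_limits<T>::max());
}

// Hypothetical snapshot size of 5 GiB.
constexpr std::uint64_t five_gib = 5ull * 1024 * 1024 * 1024;

static_assert(!fits_in<std::int32_t>(five_gib), "a 32-bit int cannot hold 5 GiB");
static_assert(fits_in<std::int64_t>(five_gib), "a 64-bit long can hold 5 GiB");
```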
Eliran Sinvani
2193d41683 cql3 : add workaround to antlr3 null dereference bug
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs the CqlParser throws an exception which in
turn is processed for some special cases in scylla to generate a custom
message. The default case however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside scylla
in antlr headers that reside on the host machine's OS.

Tested manually with 2 test cases: a typo causing scylla to crash and
a cql comment without a newline at the end also causing scylla to crash.
Ran unit tests (release).

Fixes #3740
Fixes #3764

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
(cherry picked from commit 20f49566a2)
2018-10-08 11:02:16 +03:00
Avi Kivity
1e1f0c29bf utils: crc32: mark power crc32 assembly as not requiring an executable stack
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).

However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.

Fix by adding the correct incantation to the file.

Fixes #3799.

Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
(cherry picked from commit aaab8a3f46)
2018-10-08 11:02:16 +03:00
Calle Wilund
84d4588b5f storage_proxy: Add missing re-throw in truncate_blocking
If truncation times out, we want to log it, but the exception should
not be swallowed; it should be re-thrown.

Fixes #3796.

Message-Id: <20181001112325.17809-1-calle@scylladb.com>
(cherry picked from commit 2996b8154f)
2018-10-08 11:02:16 +03:00
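The log-and-re-throw pattern from 84d4588b5f looks roughly like the sketch below. `run_and_log_timeouts` and `log_lines` are hypothetical names, and std::runtime_error stands in for whatever timeout exception the real code catches.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical log sink standing in for a real logger.
std::vector<std::string> log_lines;

// Runs `op`; on failure, records the error and re-throws the same exception.
void run_and_log_timeouts(const std::function<void()>& op) {
    try {
        op();
    } catch (const std::runtime_error& e) {
        log_lines.push_back(std::string("truncate timed out: ") + e.what());
        throw; // without this line the caller would see success
    }
}
```

A bare `throw;` re-throws the exception currently being handled, so the caller receives the original exception object rather than a copy.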
Duarte Nunes
7b43b26709 tests/aggregate_fcts_test: Add test case for wrapped types
Provide a test case which checks a type being wrapped in a
reverse_type plays no role in assignment.

Refs #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-2-duarte@scylladb.com>
(cherry picked from commit 17578c3579)
2018-10-08 11:02:16 +03:00
Duarte Nunes
0ed01acf15 cql3/selection/selector: Unwrap types when validating assignment
When validating assignment between two types, it's possible one of
them is wrapped in a reverse_type, if it comes, for example, from the
type associated with a clustering column. When checking for weak
assignment the types are correctly unwrapped, but not when checking
for an exact match, which this patch fixes.

Technically, the receiver is never a reversed_type for the current
callers, but this is the morally correct implementation, as the type
being reversed or not plays no role in assignment.

Tests: unit(release)

Fixes #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-1-duarte@scylladb.com>
(cherry picked from commit 5e7bb20c8a)
2018-10-08 11:02:16 +03:00
Gleb Natapov
7ce160f408 mutation_query_test: add test for result size calculation
Check that digest only and digest+data query calculate result size to be
the same.

Message-Id: <20180906153800.GK2326@scylladb.com>
(cherry picked from commit 9e438933a2)

Message-Id: <20181008075901.GC2380@scylladb.com>
2018-10-08 11:02:09 +03:00
Gleb Natapov
5017d9b46a mutation_partition: accurately account for result size in digest only queries
When measuring_output_stream is used to calculate result's element size
it incorrectly takes into account not only serialized element size, but
a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells
constructors puts in the beginning. Fix it by taking starting point in a
stream before element serialization and subtracting it afterwards.

Fixes #3755

Message-Id: <20180906153609.GJ2326@scylladb.com>
(cherry picked from commit d7674288a9)
2018-10-07 18:16:19 +03:00
Gleb Natapov
50b6ab3552 mutation_partition: correctly measure static row size when doing digest calculation
The code uses incorrect output stream in case only digest is requested
and thus getting incorrect data size. Failing to correctly account
for static row size while calculating digest may cause digest mismatch
between digest and data query.

Fixes #3753.

Message-Id: <20180905131219.GD2326@scylladb.com>
(cherry picked from commit 98092353df)
2018-09-06 16:51:19 +03:00
Eliran Sinvani
b1652823aa cql3: ensure repeated values in IN clauses don't return repeated rows
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated since the assumption
is that each value in the IN list corresponds to a different partition.
This results in the same row appearing in the result a number of times
corresponding to the duplication of the partition value.

Added queries for the IN restriction unit test and fixed with a bad result check.

Fixes #2837
Tests: Queries as in the use case from the GitHub issue in both forms,
prepared and plain (using python driver), unit test.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
2018-08-26 15:51:17 +03:00
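The commit message for b1652823aa does not say how the duplicates are removed. One possible approach, shown here purely as an assumption, is to drop repeated IN values before one executor per value is started. `unique_in_values` is a hypothetical helper and plain ints stand in for partition key values.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Removes repeated values while keeping the first occurrence of each,
// so every distinct partition value is queried exactly once.
std::vector<int> unique_in_values(const std::vector<int>& values) {
    std::unordered_set<int> seen;
    std::vector<int> out;
    for (int v : values) {
        if (seen.insert(v).second) { // true only the first time v is seen
            out.push_back(v);
        }
    }
    return out;
}
```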
Tomasz Grabiec
02b24aec34 Merge 'Fix multi-cell static list updates in the presence of ckeys' from Duarte
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

 std::experimental::optional<clustering_key> row_key;
 if (!column.is_static()) {
   row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
 }
 ...
 auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

* https://github.com/duarten/scylla cql-list-fixes/v1:
  tests/cql_query_test: Test multi-cell static list updates with ckeys
  cql3/lists: Fix multi-cell static list updates in the presence of ckeys
  keys: Add factory for an empty clustering_key_prefix_view

(cherry picked from commit 6937cc2d1c)
2018-08-21 21:39:22 +01:00
Duarte Nunes
22eea4d8cf cql3/query_options: Use _value_views in prepare()
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.

In query_options::prepare() we were, however, using _values to
associated values to the client-specified column names, and not
_value_views. Fix this by using _value_views instead.

As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.

Fixes #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
(cherry picked from commit a4355fe7e7)
2018-08-21 21:39:22 +01:00
Tomasz Grabiec
d257f6d57c mutation_partition: Fix exception safety of row::apply_monotonically()
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.

The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.

Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.

Introduced in 99a3e3a.

Fixes #3678.

Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 024b3c9fd9)
2018-08-21 21:39:18 +01:00
Takuya ASADA
6fca92ac3c dist/common/scripts/scylla_ec2_check: support custom NIC ifname on EC2
This is bash version of commit 88fe3c2694.

Since some AMIs use consistent network device naming, the primary NIC
ifname is not 'eth0'.
But we hardcoded the NIC name as 'eth0' in scylla_ec2_check, so we need to add
a --nic option to specify a custom NIC ifname.

Fixes #3658

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180807231650.13697-1-syuu@scylladb.com>
2018-08-08 09:16:57 +03:00
Jesse Haber-Kucharsky
26e3917046 auth: Don't use unsupported hashing algorithms
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.

This is consistent with the documentation of the function in its man
page.

As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".

The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):

    Some implementations return `NULL` on failure, and others return an
    _invalid_ hashed passphrase, which will begin with a `*` and will
    not be the same as SALT.

Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.

With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.

Fixes #3637.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
(cherry picked from commit fce10f2c6e)
2018-08-05 10:30:47 +03:00
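Based on the info-page text quoted in 26e3917046, a failure check has to accept both conventions. The sketch below is an assumption about how such a check could look, not the code from the patch; `crypt_failed` is a hypothetical helper and it does not call `crypt_r` itself.

```cpp
#include <cassert>

// Returns true if `hashed`, as returned by a crypt-like function, signals
// failure: either a null pointer or an invalid hash that begins with '*'.
bool crypt_failed(const char* hashed) {
    return hashed == nullptr || hashed[0] == '*';
}
```

With this check, the "*0" result described above is treated the same way as `nullptr`, so an unsupported algorithm is rejected on both old and new systems.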
Gleb Natapov
3892594a93 cache_hitrate_calculator: fix race when new table is added during calculations
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.

Fixes #3636

Message-Id: <20180801093147.GD23569@scylladb.com>
(cherry picked from commit 44a6afad8c)
2018-08-01 14:34:08 +03:00
Amos Kong
4b24439841 scylla_setup: fix conditional statement of silent mode
Commit 300af65555 introduced a problem in the
conditional statement: the script will always abort in silent mode, regardless
of the return value.

Fixes #3485

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <1c12ab04651352964a176368f8ee28f19ae43c68.1528077114.git.amos@scylladb.com>
(cherry picked from commit 364c2551c8)
2018-07-25 09:36:32 +03:00
Takuya ASADA
a02a4592d8 dist/common/scripts/scylla_setup: abort running script when one of setup failed in silent mode
The current script silently continues even if one of the setup steps fails;
it needs to abort.

Fixes #3433

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180522180355.1648-1-syuu@scylladb.com>
(cherry picked from commit 300af65555)
2018-07-25 09:36:29 +03:00
Avi Kivity
b6e1c08451 Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz
"
The problem happens under the following circumstances:

  - we have a partially populated partition in cache, with a gap in the middle

  - a read with no clustering restrictions trying to populate that gap

  - eviction of the entry for the lower bound of the gap concurrent with population

The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.

Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.

The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise, we're populating since _last_row and should
consult it.

Fixes #3608.
"

* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
  row_cache: Fix violation of continuity on concurrent eviction and population
  position_in_partition: Introduce is_before_all_clustered_rows()

(cherry picked from commit 31151cadd4)
2018-07-18 12:07:01 +02:00
Botond Dénes
9469afcd27 storage_proxy: use the original row limits for the final results merging
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuation captures by value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications done to the read
command object done by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.

Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes apparently reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.

Fixes #3605.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
(cherry picked from commit cc4acb6e26)
2018-07-16 17:51:06 +03:00
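The ordering issue in 9469afcd27 can be reproduced in a synchronous toy model. All names below (`toy_read_command`, `toy_query_concurrent`, `rows_sent_buggy`, `rows_sent_fixed`) are hypothetical, and the model assumes the query never defers, which is the case in which the bug surfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, reduced stand-in for query::read_command.
struct toy_read_command {
    std::uint32_t row_limit;
};

// Stand-in for the concurrent range query: it returns the number of rows
// fetched and decrements the command's limit in place, without deferring.
std::uint32_t toy_query_concurrent(toy_read_command& cmd, std::uint32_t available) {
    const std::uint32_t fetched = std::min(available, cmd.row_limit);
    cmd.row_limit -= fetched;
    return fetched;
}

// Buggy ordering: the limit is read after the call has already modified it.
std::uint32_t rows_sent_buggy(toy_read_command cmd, std::uint32_t available) {
    const std::uint32_t fetched = toy_query_concurrent(cmd, available);
    return std::min(fetched, cmd.row_limit); // trims to the decremented limit
}

// Fixed ordering: copy the limit first, then trim against the copy.
std::uint32_t rows_sent_fixed(toy_read_command cmd, std::uint32_t available) {
    const std::uint32_t original_limit = cmd.row_limit;
    const std::uint32_t fetched = toy_query_concurrent(cmd, available);
    return std::min(fetched, original_limit);
}
```

With a limit of 100 and 100 rows available, the buggy version trims the page to 0 rows while the fixed version returns the full page.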
Avi Kivity
240b9f122b Merge "Backport empty partition range scan fixes" from Botond
"
This mini-series lumps together the fix for the empty partition range
scan crash (#3564) and the two follow-up patches.
"

* 'paging-fix-backport-2.2/v1' of https://github.com/denesb/scylla:
  query_pager: use query::is_single_partition() to check for singular range
  tests/cql_query_tess: add unit test for querying empty ranges test
  query_pager: be prepared to _ranges being empty
2018-07-05 10:29:31 +03:00
Botond Dénes
cb16cd7724 query_pager: use query::is_single_partition() to check for singular range
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.

Found during code-review.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
(cherry picked from commit 8084ce3a8e)
2018-07-04 12:57:45 +03:00
Botond Dénes
c864d198fc tests/cql_query_tess: add unit test for querying empty ranges test
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect as there can be valid (if rare) queries that can result in the
ranges list to be empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.

Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
(cherry picked from commit c236a96d7d)
2018-07-04 09:52:54 +03:00
Botond Dénes
25125e9c4f query_pager: be prepared to _ranges being empty
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not, it
checks whether the query is for a single partition or a range scan
to decide whether to enable the stateful queries or not. This check
assumed that there is at least one range in _ranges which will not hold
under some circumstances. Add a check for _ranges being empty.

Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
(cherry picked from commit 59a30f0684)
2018-07-04 09:52:54 +03:00
Shlomi Livne
faf10fe6aa release: prepare for 2.2.0
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-07-01 22:40:42 +03:00
Calle Wilund
f76269cdcf sstables::compress: Ensure unqualified compressor name if possible
Fixes #3546

Both older origin and scylla writes "known" compressor names (i.e. those
in origin namespace) unqualified (i.e. LZ4Compressor).

This behaviour was not preserved in the virtualization change. But
probably should be.

Message-Id: <20180627110930.1619-1-calle@scylladb.com>
(cherry picked from commit 054514a47a)
2018-06-28 18:55:15 +03:00
Avi Kivity
a9b0ccf116 Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost, until node restart or eviction of the
partition entry.

There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.

Disable until a more elaborate fix is implemented.

Fixes #3552
Fixes #3553
"

* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
  tests: Add test for slicing a mutation source with date tiered compaction strategy
  tests: Check that database conforms to mutation source
  database: Disable sstable filtering based on min/max clustering key components

(cherry picked from commit e1efda8b0c)
2018-06-28 18:55:15 +03:00
Tomasz Grabiec
abc5941f87 flat_mutation_reader: Move field initialization to initializer list
This works around a problem of std::terminate() being called in debug
mode build if initialization of _current throws.

Backtrace:

Thread 2 "row_cache_test_" received signal SIGABRT, Aborted.
0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
(gdb) bt
  #0  0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
  #1  0x00007ffff17d077d in abort () from /lib64/libc.so.6
  #2  0x00007ffff5773025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
  #3  0x00007ffff5770c16 in ?? () from /lib64/libstdc++.so.6
  #4  0x00007ffff576fb19 in ?? () from /lib64/libstdc++.so.6
  #5  0x00007ffff5770508 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #6  0x00007ffff3ce4ee3 in ?? () from /lib64/libgcc_s.so.1
  #7  0x00007ffff3ce570e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #8  0x0000000003633602 in reader::reader (this=0x60e0001160c0, r=...) at flat_mutation_reader.cc:214
  #9  0x0000000003655864 in std::make_unique<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (__args#0=...)
    at /usr/include/c++/7/bits/unique_ptr.h:825
  #10 0x0000000003649a63 in make_flat_mutation_reader<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (args#0=...)
    at flat_mutation_reader.hh:440
  #11 0x000000000363565d in make_forwardable (m=...) at flat_mutation_reader.cc:270
  #12 0x000000000303f962 in memtable::make_flat_reader (this=0x61300001d540, s=..., range=..., slice=..., pc=..., trace_state_ptr=..., fwd=..., fwd_mr=...)
    at memtable.cc:592

Message-Id: <1528792447-13336-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 6d6b93d1e7)
2018-06-28 18:55:15 +03:00
Asias He
a152ac12af gossip: Fix tokens assignment in assassinate_endpoint
The tokens vector is defined a few lines above and is needed outside the
if block.

Do not redefine it again in the if block, otherwise the tokens will be empty.

Found by code inspection.

Fixes #3551.

Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com>
(cherry picked from commit c3b5a2ecd5)
2018-06-28 18:55:15 +03:00
Botond Dénes
c274fdf2ec querier: find_querier(): return end() when no querier matches the range
When none of the queriers found for the lookup key match the lookup
range `_entries.end()` should be returned as the search failed. Instead
the iterator returned from the failed `std::find_if()` is returned
which, if the find failed, will be the end iterator returned by the
previous call to `_entries.equal_range()`. This is incorrect because as
long as `equal_range()`'s end iterator is not also `_entries.end()` the
search will always return an iterator to a querier regardless of whether
any of them actually matches the read range.
Fix by returning `_entries.end()` when it is detected that no queriers
match the range.

Fixes: #3530
(cherry picked from commit 2609a17a23)
2018-06-28 18:55:15 +03:00
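The lookup fix in c274fdf2ec can be sketched with standard containers. The types are simplified assumptions: `toy_entries` is a std::multimap from an int key to a `toy_range` pair, and `find_entry` is a hypothetical counterpart of find_querier().

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>

// Hypothetical simplified types: lookup key -> range held by a cached querier.
using toy_range = std::pair<int, int>;
using toy_entries = std::multimap<int, toy_range>;

toy_entries::const_iterator find_entry(const toy_entries& entries, int key,
                                       const toy_range& wanted) {
    const auto candidates = entries.equal_range(key);
    const auto it = std::find_if(candidates.first, candidates.second,
            [&wanted] (const toy_entries::value_type& e) {
        return e.second == wanted;
    });
    // candidates.second only marks the end of the equal range. It may point
    // at an entry stored under a different key, so it must not be returned
    // as the result of a failed search.
    return it == candidates.second ? entries.end() : it;
}
```

Returning `it` unconditionally would hand back an unrelated entry whenever the equal range is not the last one in the container, which is the behaviour the commit describes.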
Botond Dénes
5b88d6b4d6 querier_cache: restructure entries storage
Currently querier_cache uses a `std::unordered_map<utils::UUID, querier>`
to store cache entries and an `std::list<meta_entry>` to store meta
information about the querier entries, like insertion order, expiry
time, etc.

All cache eviction algorithms use the meta-entry list to evict entries
in reverse insertion order (LRU order). To make this possible
meta-entries keep an iterator into the entry map so that given a
meta-entry one can easily erase the querier entry. This however poses a
problem as std::unordered_map can possibly invalidate all its iterators
when new items are inserted. This is use-after-free waiting to happen.

Another disadvantage of the current solution is that it requires the
meta-entry to use a weak pointer to the querier entry so that in case
that is removed (as a result of a successful lookup) it doesn't try to
access it. This has an impact on all cache eviction algorithms as they
have to be prepared to deal with stale meta-entries. Stale meta-entries
also unnecessarily consume memory.

To solve these problems redesign how querier_cache stores entries
completely. Instead of storing the entries in an `std::unordered_map`
and storing the meta-entries in an `std::list`, store the entries in an
`std::list` and an intrusive-map (index) for lookups. This new design
has several advantages over the old one:
* The entries will now be in insert order, so eviction strategies can
  work on the entry list itself, no need to involve additional data
  structures for this.
* All data related to an entry is stored in one place, no data
  duplication.
* Removing an entry automatically removes it from the index as intrusive
  containers support auto unlink. This means there is no need to store
  iterators for long terms, risking use-after-free when the container
  invalidates its iterators.

Additional changes:
* Modify eviction strategies so that they work with the `entry`
  interface rather than the stored value directly.

Ref #3424

(cherry picked from commit 7ce7f3f0cc)
2018-06-28 18:55:15 +03:00
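The layout described in 5b88d6b4d6 (entries in a list, plus an index for lookups) can be approximated with standard containers. This sketch swaps the intrusive map for a std::unordered_map of list iterators, so unlinking from the index is manual here rather than automatic; `toy_cache` and its members are hypothetical names. It relies on the fact that std::list iterators stay valid when other elements are inserted or erased.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache: all data for an entry lives in one list node.
class toy_cache {
    struct entry {
        int key;
        std::string value;
    };
    std::list<entry> _entries; // insertion order, oldest first
    std::unordered_map<int, std::list<entry>::iterator> _index;
public:
    void insert(int key, std::string value) {
        assert(_index.count(key) == 0); // this sketch assumes unique keys
        _entries.push_back(entry{key, std::move(value)});
        _index.emplace(key, std::prev(_entries.end()));
    }

    // On a hit, moves the value into `out`, removes the entry and returns true.
    bool lookup_and_remove(int key, std::string& out) {
        const auto it = _index.find(key);
        if (it == _index.end()) {
            return false;
        }
        out = std::move(it->second->value);
        _entries.erase(it->second);
        _index.erase(it);
        return true;
    }

    // Eviction works on the entry list directly; no meta-entries are needed.
    bool evict_oldest() {
        if (_entries.empty()) {
            return false;
        }
        _index.erase(_entries.front().key);
        _entries.pop_front();
        return true;
    }

    std::size_t size() const { return _entries.size(); }
};
```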
Botond Dénes
2d626e1cf8 tests/querier_cache: fix memory based eviction test
Do increment the key counter after inserting the first querier into the
cache. Otherwise two queriers with the same key will be inserted and
will fail the test. This problem is exposed by the changes the next
patches make to the querier-cache but is fixed beforehand to maintain
bisectability of the code.

Fixes: #3529
(cherry picked from commit b9d51b4c08)
2018-06-28 18:55:15 +03:00
Avi Kivity
c11bd3e1cf Merge "Do not allow compaction controller shares to grow indefinitely" from Glauber
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum it will keep
growing without bounds.

It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"

* 'tame-controller' of github.com:glommer/scylla:
  controller: do not increase shares of controllers for inputs higher than the maximum
  controller: adjust constants for compaction controller

(cherry picked from commit e0eb66af6b)
2018-06-20 10:58:20 +03:00
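The bounding behaviour described in c11bd3e1cf amounts to clamping the controller input before mapping it to shares. The sketch below assumes a simple linear mapping; `shares_for` and `shares_per_unit` are hypothetical, and only the maximum input of 30 comes from the message above.

```cpp
#include <algorithm>
#include <cassert>

// Maximum input point mentioned in the commit message.
constexpr double max_input = 30.0;
// Hypothetical slope of the input-to-shares mapping.
constexpr double shares_per_unit = 10.0;

// Inputs above max_input produce the same shares as max_input itself,
// so the result can no longer grow without bound.
double shares_for(double input) {
    const double clamped = std::min(std::max(input, 0.0), max_input);
    return clamped * shares_per_unit;
}
```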
Avi Kivity
9df3df92bc Merge "Try harder to move STCS towards zero-backlog" from Glauber
"
Tests: unit (release)

Before merging the LCS controller, we merged patches that would
guarantee that LCS would move towards zero backlog - otherwise the
backlog could get too high.

We didn't do the same for STCS, our first controlled strategy. So we may
end up with a situation where there are many SSTables inducing a large
backlog, but they are not yet meeting the minimum criteria for
compaction. The backlog, then, never goes down.

This patch changes the SSTable selection criteria so that if there is
nothing to do, we'll keep pushing towards reaching a state of zero
backlog. Very similar to what we did for LCS.
"

* 'stcs-min-threshold-v4' of github.com:glommer/scylla:
  STCS: bypass min_threshold unless configure to enforce strictly
  compaction_strategy: allow the user to tell us if min_threshold has to be strict

(cherry picked from commit f0fc888381)
2018-06-18 14:21:52 +03:00
Takuya ASADA
8ad9578a6c dist/debian: add --jobs <njobs> option just like build_rpm.sh
In some build environments we may want to limit the number of parallel jobs:
ninja-build runs ncpus jobs by default, which may be too many since g++
consumes a lot of memory.
So support --jobs <njobs> just like the rpm build script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
(cherry picked from commit 782ebcece4)
2018-06-14 15:04:50 +03:00
Tomasz Grabiec
4cb6061a9f tests: row_cache: Reduce concurrency limit to avoid bad_alloc
The test uses random mutations. We saw it failing with bad_alloc from time to time.
Reduce concurrency to reduce memory footprint.

Message-Id: <20180611090304.16681-1-tgrabiec@scylladb.com>
(cherry picked from commit a91974af7a)
2018-06-14 13:40:00 +02:00
Tomasz Grabiec
1940e6bd95 tests: row_cache: Do not hang when only one of the readers throws
Message-Id: <20180531122729.3314-1-tgrabiec@scylladb.com>
(cherry picked from commit b5e42bc6a0)
2018-06-14 13:40:00 +02:00
Avi Kivity
044cfde5f3 database: stop using incremental selectors
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.

This degrades scan performance of Leveled Compaction Strategy tables.

Fixes #3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>

(cherry picked from commit aeffbb6732)
2018-06-13 21:04:56 +03:00
Vlad Zolotarov
262a246436 locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting()
ec2_snitch::gossiper_starting() calls the base class (default) method
that sets _gossip_started to TRUE and thereby prevents the subsequent
reconnectable_snitch_helper registration.

Fixes #3454

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 2dde372ae6)
2018-06-12 19:02:19 +03:00
Botond Dénes
799dbb4f2e forwardable reader: implement fast_forward_to(position_in_partition)
Instead of throwing std::bad_function_call. Needed by the foreign_reader
unit test. Not sure how other tests didn't hit this before as the test
is using `run_mutation_source_tests()`.

(cherry picked from commit 50b67232e5)
Fixes #3491.
2018-06-05 12:34:15 +03:00
Shlomi Livne
a2fe669dd3 dist/docker: Switch to Scylla 2.2 repository
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <83b4ff801b283ade512a7035ecea9057a864dcdd.1526995747.git.shlomi@scylladb.com>
2018-06-05 12:34:15 +03:00
Avi Kivity
56de761daf Update seastar submodule
* seastar 7c6ba3a...6f61d74 (1):
  > tls: Ensure handshake always drains output before return/throw

Fixes #3461.
2018-06-05 12:34:15 +03:00
Shlomi Livne
c3187093a3 release: prepare for 2.2.rc2
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-05-30 17:32:16 +03:00
Avi Kivity
111c2ecf5d Update scylla-ami submodule
* dist/ami/files/scylla-ami 49896ec...6ed71a3 (1):
  > scylla_install_ami: Update CentOS to latest version
2018-05-28 14:02:43 +03:00
Takuya ASADA
a6ecdbbba6 Revert "dist/ami: update CentOS base image to latest version"
This reverts commit 69d226625a.
Since ami-4bf3d731 is a Marketplace AMI, it is not possible to publish a public AMI based on it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523112414.27307-1-syuu@scylladb.com>
(cherry picked from commit 6b1b9f9e602c570bbc96692d30046117e7d31ea7)
2018-05-28 13:40:15 +03:00
Glauber Costa
17cc62d0b3 commitlog: don't move pointer to segment
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.

The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.

Probably #3440.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
(cherry picked from commit 596a525950)
2018-05-19 19:12:26 +03:00
Shlomi Livne
eb646c61ed release: prepare for 2.2.rc1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-05-16 21:31:50 +03:00
Avi Kivity
782d817e84 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it causes the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>

(cherry picked from commit 3b8118d4e5)
2018-05-16 20:13:59 +03:00
Avi Kivity
3ed5e63e8a Update scylla-ami submodule
* dist/ami/files/scylla-ami 02b1853...49896ec (1):
  > Merge "AMI build fix" from Takuya
2018-05-16 12:37:03 +03:00
Tomasz Grabiec
d17ce46983 Update seastar submodule
Fixes #3339.

* seastar 491f994...7c6ba3a (2):
  > Merge "fix perftune.py issues with cpu-masks on big machines" from Vlad
  > Merge 'Handle Intel's NICs in a special way'  from Vlad
2018-05-16 09:37:41 +02:00
Takuya ASADA
7ca5e7e993 dist/redhat: replace scylla-libgcc72/scylla-libstdc++72 with scylla-2.2 metapackage
There is a conflict between scylla-libgcc72/scylla-libstdc++72 and
scylla-libgcc73/scylla-libstdc++73, so we need to replace the *72 packages with
the scylla-2.2 metapackage to prevent it.

Fixes #3373

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180510081246.17928-1-syuu@scylladb.com>
(cherry picked from commit 6fa3c4dcad)
2018-05-11 09:42:28 +03:00
Duarte Nunes
07b0ce27fa Merge 'Include OPTIONS with LIST ROLES' from Jesse
"
Fixes #3420.

Tests: dtest (`auth_test.py`), unit (release)
"

* 'jhk/fix_3420/v2' of https://github.com/hakuch/scylla:
  cql3: Include custom options in LIST ROLES
  auth: Query custom options from the `authenticator`
  auth: Add type alias for custom auth. options

(cherry picked from commit d49348b0e1)
2018-05-10 13:22:49 +03:00
Amnon Heiman
27be3cd242 scylla-housekeeping: support new 2018.1 path variation
Starting from 2018.1 and 2.2 there was a change in the repository path.
It was made to support multiple products (like manager) and to place the
enterprise in a different path.

As a result, the regular expression that looks for the repository fails.

This patch changes the way the path is searched: the rpm and debian
variations are combined, and both options of the repository path are
unified.

See scylladb/scylla-enterprise#527

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180429151926.20431-1-amnon@scylladb.com>
(cherry picked from commit 6bf759128b)
2018-05-09 15:22:55 +03:00
Calle Wilund
abf50aafef database: Fix assert in truncate
Fixes crash in cql_tests.StorageProxyCQLTester.table_test
"avoid race condition when deleting sstable on behalf..." changed
discard_sstables behaviour to only return rp:s for sstables owned
and submitted for deletion (not all matching time stamp),
which can in some cases cause zero rp:s to be returned.
Message-Id: <20180508070003.1110-1-calle@scylladb.com>
2018-05-09 10:02:09 +01:00
Duarte Nunes
dfe5b38a43 db/view: Limit number of pending view updates
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit to 100 the
number of outstanding view updates. We limit globally per shard, and
not per destination view replica. We also limit statically.
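
A minimal sketch of the limiting idea, written in Python rather than the C++
used by the patch: one semaphore shared by all destinations caps the number of
in-flight view updates. The names MAX_PENDING_VIEW_UPDATES, send_view_update
and run_updates are hypothetical, and the send itself is only simulated.

```python
import asyncio

MAX_PENDING_VIEW_UPDATES = 100  # static limit quoted in the commit message


async def send_view_update(sem, stats):
    # Hypothetical stand-in for sending one view update to a view replica.
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0)  # simulate the network round trip
        stats["in_flight"] -= 1


async def run_updates(n):
    # One semaphore for every destination: the limit is global, not per replica.
    sem = asyncio.Semaphore(MAX_PENDING_VIEW_UPDATES)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(send_view_update(sem, stats) for _ in range(n)))
    return stats["peak"]
```

Calling asyncio.run(run_updates(250)) returns the peak number of concurrent
updates, which the semaphore keeps at or below the limit.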

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
(cherry picked from commit 4b3562c3f5)
2018-05-08 00:46:33 +01:00
Duarte Nunes
9bdc8c25f5 db/view: Return a future when sending view updates
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit dc44a08370)
2018-05-08 00:46:19 +01:00
Duarte Nunes
e75c55b2db db/timeout_clock: Properly scope type names
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-1-duarte@scylladb.com>
(cherry picked from commit 2be75bdfc9)
2018-05-07 19:29:48 +01:00
Botond Dénes
756feae052 database: when dropping a table evict all relevant queriers
Queriers shouldn't outlive the table they read from as that could lead
to use-after-free problems when they are destroyed.

Fixes: #3414

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3d7172cef79bb52b7097596e1d4ebba3a6ff757e.1525716986.git.bdenes@scylladb.com>
(cherry picked from commit 6f7d919470)
2018-05-07 21:20:42 +03:00
Tomasz Grabiec
202b4e6797 storage_proxy: Request schema from the coordinator in the original DC
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:

  storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)

Fixes #3393.

Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 423712f1fe)
2018-05-07 13:08:40 +03:00
Raphael S. Carvalho
76ac200eff database: avoid race condition when deleting sstable on behalf of cf truncate
After the removal of the deletion manager, the caller is now responsible for properly
submitting the deletion of a shared sstable. That's because the deletion manager
was responsible for holding deletion until all owners agreed on it.
Resharding, for example, was changed to delete the shared sstables at the end,
but truncate wasn't changed, so a race condition could happen when deleting the
same sstable at more than one shard in parallel. Change the operation to
submit a shared sstable for deletion in only one owner.

Fixes dtest migration_test.TestMigration.migrate_sstable_with_schema_change_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180503193427.24049-1-raphaelsc@scylladb.com>
2018-05-04 13:10:12 +01:00
Tomasz Grabiec
9aa172fe8e db: schema_tables: Treat drop of scylla_tables.version as an alter
After upgrade from 1.7 to 2.0, nodes will record a per-table schema
version which matches that on 1.7 to support the rolling upgrade. Any
later schema change (after the upgrade is done) will drop this record
from affected tables so that the per-table schema version is
recalculated. If nodes perform a schema pull (they detect schema
mismatch), then the merge will affect all tables and will wipe the
per-table schema version record from all tables, even if their schema
did not change. If then only some nodes get restarted, the restarted
nodes will load tables with the new (recalculated) per-table schema
version, while not restarted nodes will still use the 1.7 per-table
schema version. Until all nodes are restarted, writes or reads between
nodes from different groups will involve a needless exchange of schema
definition.

This will manifest in logs with repeated messages indicating schema
merge with no effect, triggered by writes:

  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f

The sync will be performed if the receiving shard forgets the foreign
version, which happens if it doesn't process any request referencing
it for more than 1 second.

This may impact latency of writes and reads.

The fix is to treat schema changes which drop the 1.7 per-table schema
version marker as an alter, which will switch in-memory data
structures to use the new per-table schema version immediately,
without the need for a restart.

Fixes #3394

Tests:
    - dtest: schema_test.py, schema_management_test.py
    - reproduced and validated the fix with run_upgrade_tests.sh from git@github.com:tgrabiec/scylla-dtest.git
    - unit (release)

Message-Id: <1524764211-12868-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b1465291cf)
2018-05-03 10:51:19 +03:00
Takuya ASADA
c4af043ef7 dist/common/scripts/scylla_raid_setup: prevent 'device or resource busy' on creating mdraid device
According to this web site, there may be a race condition between
mdraid creation and udev:
http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html
It looks like this can happen on our AMI, too (see #2784).

To initialize RAID safely, we should wait for udev events to finish before and
after mdadm is executed.

Fixes #2784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1505898196-28389-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 4a8ed4cc6f)
2018-04-24 12:53:34 +03:00
Raphael S. Carvalho
06b25320be sstables: Fix bloom filter size after resharding by properly estimating partition count
We were feeding the total estimated partition count of an input shared
sstable to the output unshared ones.

So the sstable writer thinks, *from estimation*, that each sstable created
by resharding will have the same amount of data as the shared sstable it
is being created from. That's a problem because the estimate is fed to
bloom filter creation, which directly influences the filter's size.
So if we're resharding all sstables that belong to all shards, the
disk usage taken by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.

Partition count estimation for a shard S will now be done as follows:
    //
    // TE, the total estimated partition count for a shard S, is defined as
    // TE = Sum(i = 0...N) { Ei / Si }.
    //
    // where i is an input sstable that belongs to shard S,
    //       Ei is the estimated partition count for sstable i,
    //       Si is the total number of shards that own sstable i.
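
The formula above can be sketched as follows. This is an illustration in
Python, not the code from the patch; the function name and the (Ei, Si) pair
representation are assumptions made for the example.

```python
def estimated_partitions_for_shard(input_sstables):
    """Return TE for one shard.

    input_sstables is an iterable of (Ei, Si) pairs, one per input sstable
    owned by the shard: Ei is the sstable's estimated partition count and
    Si is the number of shards that own it.
    """
    return sum(ei / si for ei, si in input_sstables)


# Two shared sstables, each estimated at 1000 partitions and owned by 4 shards.
# Feeding the full estimate to every output would give 2000 per shard;
# the per-shard estimate is 500.
per_shard = estimated_partitions_for_shard([(1000, 4), (1000, 4)])
```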

Fixes #2672.
Refs #3302.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
(cherry picked from commit 11940ca39e)
2018-04-24 12:53:34 +03:00
Takuya ASADA
ff70d9f15c dist: Drop AmbientCapabilities from scylla-server.service for Debian 8
Debian 8 fails with "Invalid argument" when AmbientCapabilities is used in the systemd
unit file, so drop the line when we build the .deb package for Debian 8.
For other distributions, keep using the feature.

Fixes #3344

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180423102041.2138-1-syuu@scylladb.com>
(cherry picked from commit 7b92c3fd3f)
2018-04-24 12:53:34 +03:00
Avi Kivity
9bbd5821a2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 9b4be70...02b1853 (1):
  > scylla_install_ami: remove the host id file after scylla_setup
2018-04-24 12:53:34 +03:00
Avi Kivity
a7841f1f2e release: prepare for 2.2.rc0 2018-04-18 11:08:43 +03:00
Takuya ASADA
84859e0745 dist/debian: use ~root as HOME to place .pbuilderrc
When 'always_set_home' is specified in /etc/sudoers, pbuilder won't read
.pbuilderrc from the current user's home directory, and we don't have a way to change
that behavior with a sudo command parameter.

So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo is executed;
this works both in environments that specify always_set_home and in those that
don't.

Fixes #3366

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ace44784e8)
2018-04-17 09:38:43 +03:00
Avi Kivity
6b74e1f02d Update seastar submodule
* seastar bcfbe0c...491f994 (3):
  > tls: Ensure we always pass through semaphores on shutdown
  > cpu scheduler: don't penalize first group to run
  > reactor: fix sleep mode

Fixes #3350.
2018-04-14 20:44:11 +03:00
Avi Kivity
520f17b315 Point seastar submodule at scylla-seastar.git
This allows backporting seastar patches.
2018-04-14 20:43:28 +03:00
Gleb Natapov
9fe3d04f31 cql_server: fix a race between closing of a connection and notifier registration
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete, a stale pointer to the connection will remain in the notification
list, since the attempt to unregister the connection will happen too early.
The fix is to move notifier unregistration after the connection's gate
is closed, which will ensure that there is no outstanding registration
request. But this means that now a connection with a closed gate can be in
the notifier list, so with_gate() may throw and abort a notifier loop. Fix
that by replacing with_gate() with a call to is_closed().

Fixes: #3355
Tests: unit(release)

Message-Id: <20180412134744.GB22593@scylladb.com>
(cherry picked from commit 1a9aaece3e)
2018-04-12 16:57:07 +03:00
Raphael S. Carvalho
a74183eb1e sstables/compaction_manager: do not break lcs invariant by not allowing parallel compaction for it
After the change to serialize compaction on compaction weight (eff62bc61e),
the LCS invariant may break because parallel compaction can start, and it's
not currently supported for LCS.

The condition is that the weight is deregistered right before the last sstable
for a leveled compaction is sealed, so it may happen that a new compaction
starts for the same column family in the meantime and promotes an sstable to
an overlapping token range.

That leads to the strategy restoring the invariant when it finds the overlap,
and that means wasted resources.
The fix removes a fast path check, which is incorrect now because
we release the weight early, and also fixes a check for ongoing compaction
which prevented compaction from starting for LCS whenever the weight tracker
was not empty.

Fixes #3279.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
(cherry picked from commit 638a647b7d)
2018-04-10 20:59:48 +03:00
Raphael S. Carvalho
e059f17bf2 database: make sure sstable is also forwarded to shard responsible for its generation
After f59f423f3c, sstable is loaded only at shards
that own it so as to reduce the sstable load overhead.

The problem is that a sstable may no longer be forwarded to a shard that needs to
be aware of its existence which would result in that sstable generation being
reallocated for a write request.
That would result in a failure as follows:
"SSTable write failed due to existence of TOC file for generation..."

This can be fixed by forwarding any sstable at load to all its owner shards
*and* the shard responsible for its generation, which is determined as follows:
s = generation % smp::count
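
As an illustration only (Python, with a hypothetical helper name; the patch
itself is C++), the set of shards an sstable is forwarded to at load is the
union of its owner shards and the shard derived from its generation:

```python
def shards_to_forward(owner_shards, generation, smp_count):
    # Shard responsible for the generation, per the rule above.
    generation_shard = generation % smp_count
    # Forward to every owner plus the generation shard, without duplicates.
    return sorted(set(owner_shards) | {generation_shard})


# An sstable with generation 7 on a 4-shard node, owned by shards 0 and 1,
# must also be made known to shard 3 (7 % 4).
targets = shards_to_forward([0, 1], generation=7, smp_count=4)
```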

Fixes #3273.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>
(cherry picked from commit 30b6c9b4cd)
2018-04-05 10:58:29 +03:00
Duarte Nunes
0e8e005357 db/view: Reject view entries with non-composite, empty partition key
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is: a base table regular column, a
base table clustering key column, or a base table partition key column,
part of a composite key.

Fixes #3262
Refs CASSANDRA-14345

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
(cherry picked from commit ec8960df45)
2018-04-03 17:20:33 +03:00
Glauber Costa
8bf6f39392 docker: default docker to overprovisioned mode.
By default, overprovisioned is not enabled on docker unless it is
explicitly set. I have come to believe that this is a mistake.

If the user is running alone in the machine, and there are no other
processes pinned anywhere - including interrupts - not running
overprovisioned is the best choice.

But everywhere else, it is not: even if a user runs 2 docker containers
in the same machine and statically partitions CPUs with --smp (but
without cpuset) the docker containers will pin themselves to the same
sets of CPU, as they are totally unaware of each other.

It is also very common, especially in some virtualized environments, for
interrupts not to be properly distributed - being particularly keen on
being delivered on CPU0, a CPU which Scylla will pin by default.

Lastly, environments like Kubernetes simply don't support pinning at the
moment.

This patch enables the overprovisioned flag if it is explicitly set -
like we did before - but also by default unless --cpuset is set.

Fixes #3336.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180331142131.842-1-glauber@scylladb.com>
(cherry picked from commit ef84780c27)
2018-04-02 17:07:20 +03:00
Glauber Costa
04ba51986e parse and ignore background writer controller
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be passed over
YAML, for Cassandra compatibility.

That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.

While this problem is common to all Unused options, the likelihood for
any other unused option to appear in the command line is near zero,
except for those two - since we put them there ourselves.

There are two ways to handle this issue:

1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.

The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discreet message saying that those options are,
from now on, ignored.

v2: mark set as const (Botond)
v3: rebase on top of master, indentation suggested by Duarte.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
(cherry picked from commit a9ef72537f)
2018-03-29 17:57:43 +03:00
Asias He
1d5379c462 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw that node2 could not bootstrap; it was stuck forever waiting for schema information to complete:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is because in the nodetool removenode operation, the generation of node1 was increased from 0 to 2.

   gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
   "addrs": "127.0.0.2",
   "generation": 0,
   "is_alive": false,
   "update_time": 1521778757334,
   "version": 0
   },
```

After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
 {
     "addrs": "127.0.0.2",
     "application_state": [
         {
             "application_state": 0,
             "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
             "version": 214
         },
         {
             "application_state": 6,
             "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
             "version": 153
            }
     ],
     "generation": 2,
     "is_alive": false,
     "update_time": 1521779276246,
     "version": 0
 },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}", ep, local_generation, remote_generation);

}
```
to skip the gossip update.
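
To see why the bump from 0 to 2 matters, here is a small Python model of the
condition quoted above. It is a sketch, not the gossiper code: the function
name is hypothetical, and the value of MAX_GENERATION_DIFFERENCE (one year in
seconds) is an assumption, since the message does not state it.

```python
# Assumed value: one year expressed in seconds. Not stated in this message.
MAX_GENERATION_DIFFERENCE = 86400 * 365


def is_generation_rejected(local_generation, remote_generation):
    # Mirrors the quoted condition: a local generation of 0 bypasses the check.
    return (local_generation != 0
            and remote_generation > local_generation + MAX_GENERATION_DIFFERENCE)


# Values from the log line above: local generation 2, received 1521779704.
rejected_after_removenode = is_generation_rejected(2, 1521779704)
# Had the local generation stayed at 0, the update would have been accepted.
rejected_with_zero = is_generation_rejected(0, 1521779704)
```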

To fix, we relax generation max difference check to allow the generation
of a removed node.

After this patch, the removed node bootstraps successfully.

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
(cherry picked from commit f539e993d3)
2018-03-29 12:10:09 +03:00
Avi Kivity
cb5dc56bfd Update scylla-ami submodule
Ref #3332.
2018-03-29 10:35:54 +03:00
Duarte Nunes
b578b492cd column_family: Don't retry flushing memtable if shutdown is requested
Since we just keep retrying, this can cause Scylla to not shut down for
a while.

The data will be safe in the commit log.

Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.

Ref #3318.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
(cherry picked from commit a985ea0fcb)
2018-03-26 15:26:56 +03:00
Duarte Nunes
30c950a7f6 column_family: Increase scope of exception handling when flushing a memtable
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(), which, if it fails, will leave the
sstable in the compaction_backlog_tracker::_ongoing_writes map, which
will waste disk space, and that sstable will map to a dangling pointer
to a destroyed database_sstable_write_monitor, which causes a seg
fault when accessed (for example, through the backlog_controller,
which accounts the _ongoing_writes when calculating the backlog).

Fix this by increasing the scope of handle_exception().

Fixes #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
(cherry picked from commit 50ad37d39b)
2018-03-26 15:26:54 +03:00
Duarte Nunes
f0d1e9c518 backlog_controller: Stop update timer
On database shutdown, this timer can cause use-after-free errors if
not stopped.

Refs #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-1-duarte@scylladb.com>
(cherry picked from commit b7bd9b8058)
2018-03-26 15:26:52 +03:00
Avi Kivity
597aeca93d Merge "Bug fixes for access-control, and finalizing roles" from Jesse
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.

"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.

"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.

Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.

Tests: unit (release), dtest (PENDING)
"

* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
  Roles are implemented
  auth: Increase delay before background tasks start
  auth: Remove ordering dependence
  auth: Don't warn on rescheduled task
  auth: Wait for schema agreement
  Single-node clusters can agree on schema

(cherry picked from commit 999df41a49)
2018-03-26 12:37:41 +03:00
Duarte Nunes
1a94b90a4d Merge 'Grant default permissions' from Jesse
The functional change in this series is in the last patch
("auth: Grant all permissions to object creator").

The first patch addresses `const` correctness in `auth`. This change
allowed the new code added in the last patch to be written with the
correct `const` specifiers, and also some code to be removed.

The second-to-last patch addresses error-handling in the authorizer for
unsupported operations and is a prerequisite for the last patch (since
we now always grant permissions for new database objects).

Tests: unit (release)

* 'jhk/default_permissions/v3' of https://github.com/hakuch/scylla:
  auth: Grant all permissions to object creator
  auth: Unify handling for unsupported errors
  auth: Fix life-time issue with parameter
  auth: Fix `const` correctness

(cherry picked from commit 934d805b4b)
2018-03-26 12:37:35 +03:00
Avi Kivity
acdd42c7c8 Merge "Fix abort during counter table read-on-delete" from Tomasz
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such a case, the
sstable_mutation_reader will set up the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to an attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.

Fixes #3304.
"

* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test reads with no clustering ranges and no static columns
  tests: simple_schema: Allow creating schema with no static column
  clustering_ranges_walker: Stop after static row in case no clustering ranges

(cherry picked from commit 054854839a)
2018-03-22 18:13:29 +02:00
Takuya ASADA
bd4f658555 scripts/scylla_install_pkg: follow redirection of specified repo URL
We should follow redirection on curl, just like a normal web browser does.
Fixes #3312

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521712056-301-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit bef08087e1)
2018-03-22 12:56:58 +02:00
Vladimir Krivopalov
a983ba7aad perf_fast_forward: fix error in date formatting
Instead of 'month', 'minutes' has been used.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <1e005ecaa992d8205ca44ea4eebbca4621ad9886.1521659341.git.vladimir@scylladb.com>
(cherry picked from commit 3010b637c9)
2018-03-22 12:56:56 +02:00
Duarte Nunes
0a561fc326 gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.

Fixes #3299.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
(cherry picked from commit 810db425a5)
2018-03-18 14:54:54 +02:00
Takuya ASADA
1f10549056 dist/redhat: build only scylla, iotune
Since we don't package tests, we don't need to build them.
It reduces package building time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521066363-4859-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 1bb3531b90)
2018-03-15 10:48:36 +02:00
Takuya ASADA
c2a2560ea3 dist/debian: use 3rdparty ppa on Ubuntu 18.04
Currently Ubuntu 18.04 uses the distribution-provided g++ and boost, but the
Scylla package is easier to maintain when built with the same toolchain and
library versions everywhere, so switch to the 3rdparty ppa.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 945e6ec4f6)
2018-03-15 10:48:31 +02:00
Takuya ASADA
237e36a0b4 dist/ami: update CentOS base image to latest version
Since we require an updated version of systemd, we need to update the CentOS base
image.

Fixes #3184

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1518118694-23770-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 69d226625a)
2018-03-15 10:47:54 +02:00
Takuya ASADA
e78c137bfc dist/redhat: switch to gcc-7.3
We have hit the following bug in the debug-mode binary:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82560
Since it's fixed in gcc-7.3, we need to upgrade our gcc package.

See: https://groups.google.com/d/topic/scylladb-dev/RIdIpqMeTog/discussion
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521064473-17906-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 856dc0a636)
2018-03-15 10:39:40 +02:00
Avi Kivity
fb99a7c902 Merge "Ubuntu/Debian build error fixes" from Takuya
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
  dist/debian: build only scylla, iotune
  dist/debian: switch to boost-1.65
  dist/debian: switch to gcc-7.3

(cherry picked from commit bb4b1f0e91)
2018-03-14 22:51:44 +02:00
Asias He
9b5585ebd5 range_streamer: Stream 10% of ranges instead of 10 ranges per time
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plans to be created to stream data,
each carrying very little data. This gives each stream plan low transfer
bandwidth, so the total time to complete the streaming increases.

It makes more sense to send a percentage of the total ranges per stream
plan than a fixed number of ranges.

Here is an example of streaming a keyspace with 513 ranges in
total, 10 ranges vs. 10% of the ranges:

Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds

After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds
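
The batch-size arithmetic can be sketched as follows. This is only an
illustration, not the range_streamer code: the helper names, the rounding
down, and the floor of one range per plan are assumptions.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: how many ranges go into one stream plan when a
// percentage of the total is streamed instead of a fixed count.
// Integer division rounds down; at least one range is always streamed.
inline std::size_t ranges_per_stream_plan(std::size_t total_ranges,
                                          std::size_t percent = 10) {
    return std::max<std::size_t>(1, total_ranges * percent / 100);
}

// Hypothetical helper: number of stream plans needed to cover all ranges
// with the given batch size (ceiling division). per_plan must be non-zero.
inline std::size_t stream_plan_count(std::size_t total_ranges,
                                     std::size_t per_plan) {
    return (total_ranges + per_plan - 1) / per_plan;
}
```

Under these assumptions the 513-range keyspace needs 11 stream plans of 51
ranges each, compared with 52 plans of 10 ranges each.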

Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
2018-03-14 10:12:12 +02:00
Asias He
ad7b132188 Revert "streaming: Do not abort session too early in idle detection"
This reverts commit f792c78c96.

With the "Use range_streamer everywhere" (7217b7ab36) series,
all users of streaming now stream relatively small ranges
and can retry streaming at a higher level.

Reduce the time-to-recover from 5 hours to 10 minutes per stream session.

Even if the 10-minute idle detection causes more false positives,
that is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with a rule that
whenever the stream initiator goes away, the stream slave goes away too.

Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
2018-03-14 10:11:00 +02:00
Avi Kivity
f8613a8415 Merge "Save and recall queriers for paged singular-mutation queries" from Botond
"
Terms
-----

querier: A class encapsulating all the logic and state needed to fill a
page. This includes the reader, the compact_mutation object and all
associated state.

Preamble
--------

Currently for paged-queries we throw away all readers, compactors and
all associated state that contributed to filling the page and on the
next page we create them from scratch again. Thus on each page we throw
away a considerable amount of work, only to redo it again on the next
page. This has been one of the major contributors to latencies as from
the point of view of a replica each page is as much work as a fresh
query.

Solution
--------

The solution presented in this patch-series is to save queriers after
filling a page and reuse them on the next pages, thus doing the
considerable amount of work involved with creating them only once.
On each page the coordinator will generate a UUID that identifies this
page. This UUID is used as the key, under which the contributing
queriers will be saved in the cache. On the next page the UUID from the
previous page will be used to lookup saved queriers, and the one from
the current one to save them afterwards (if the query isn't finished).
These UUIDs (reader_recall_uuid and reader_save_uuid) are attached to
the page-state. Also attached to the page state is the list of replicas
hit on the last page. On the next page this list will be consulted to
hit the same replicas again, thus reusing the queriers saved on them.
Cached queriers will be evicted after a certain period of time to avoid
unnecessary resource consumption by abandoned reads.
Cached queriers may also be evicted when the shard faces
resource-pressure, to free up resources.

Splitting up the work
---------------------

This series only fixes the singular-mutation query path, that is queries
that either fetch a single partition, or several single partitions (IN
queries). The fix for the scanning query path will be done in a
follow-up series, however much of the infrastructure needed for the
general querier reuse is already introduced by this series.

Ref #1865

Tests: unit-tests(debug, release), dtests(paging_test, paging_additional_test)

Benchmarking summary (read-from-disk)
-------------------------------------

1) Latency

BEFORE
latency mean              : 58.0
latency median            : 57.4
latency 95th percentile   : 68.8
latency 99th percentile   : 79.9
latency 99.9th percentile : 93.6
latency max               : 93.6

AFTER
latency mean              : 41.3
latency median            : 40.5
latency 95th percentile   : 50.8
latency 99th percentile   : 68.9
latency 99.9th percentile : 89.2
latency max               : 89.2

2) Throughput (single partition query)

sum(scylla_cql_reads):
BEFORE: 173'567
AFTER:  427'774

+146% (2.46x)

3) Throughput (IN query, 2 partitions)

sum(scylla_cql_reads):
BEFORE: 85'637
AFTER: 127'431

+49% (1.49x)
"

* '1865/singular-mutations/v8.2' of https://github.com/denesb/scylla: (23 commits)
  Add unit test for resource based cache eviction
  Add unit tests for querier_cache
  Add counters to monitor querier-cache efficiency
  Memory based cache eviction
  Add buffer_size() to flat_mutation_reader
  Resource-based cache eviction
  Time-based cache eviction
  Save and restore queriers in mutation_query() and data_query()
  Add the querier_cache_context helper
  Add querier_cache
  Add querier
  Add are_limits_reached() compact_mutation_state
  Add start_new_page() to compact_mutation_state
  Save last key of the page and method to query it
  Make compact_mutation reusable
  Add the CompactedFragmentsConsumer
  Use the last_replicas stored in the page_state
  query_singular(): return the used replicas
  Consider preferred replicas when choosing endpoints for query_singular()
  Add preferred and last replicas to the signature of query()
  ...
2018-03-13 18:38:59 +02:00
Botond Dénes
c0009750c3 Add unit test for resource based cache eviction
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of its own.
2018-03-13 16:20:50 +02:00
Botond Dénes
c53b6f75c8 Add unit tests for querier_cache 2018-03-13 12:59:45 +02:00
Avi Kivity
636760c282 Merge "Introduce JSON output format to perf_fast_forward tests." from Vladimir
"
This patchset is a part of a bigger effort for bringing our
microbenchmarking tests from the source tree to be used for regression
testing purposes with CI.

Now, it is possible to export results of tests run into JSON format that
can be stored in ElasticSearch and compared among runs to detect
performance degradation should it happen.

Example of JSON output (formatted for readability):
{
	"results" :
	{
		"parameters" :
		{
			"read" : "64",
			"read,skip,test_run_count" : "64,256,1",
			"skip" : "256",
			"test_run_count" : 1
		},
		"stats" :
		{
			"(KiB)" : 126960,
			"aio" : 993,
			"blocked" : 208,
			"c blk" : 1,
			"c hit" : 0,
			"c miss" : 1,
			"cpu" : 99.779365539550781,
			"dropped" : 0,
			"frag/s" : 311939.61559016741,
			"frags" : 200000,
			"idx blk" : 0,
			"idx hit" : 0,
			"idx miss" : 0,
			"time (s)" : 0.641149729
		}
	},
	"test_group_properties" :
	{
		"message" : "Testing scanning large partition with skips.\nReads whole range interleaving reads with skips according to read-skip pattern",
		"name" : "large-partition-skips",
		"needs_cache" : false,
		"partition_type" : "large"
	},
	"versions" :
	{
		"scylla-server" :
		{
			"commit_id" : "4acfa17f4",
			"date" : "20180306",
			"run_date_time" : "2018-16-06 12:16:41",
			"version" : "666.development"
		}
	}
}
"

* 'issues/2947/v6' of https://github.com/argenet/scylla:
  Add support for JSON output format for perf_fast_forward results.
  Wrap output for customization. Move all output handling to a single managing class.
2018-03-13 12:37:34 +02:00
Benoît Canet
1d0cc7cf20 messaging_service: Start messaging service earlier
The messaging service was fully started only
after a bootstrapping node had finished joining,
which led to #2034.

Fixes #2034
Message-Id: <20180313084500.27265-1-amnon@scylladb.com>
2018-03-13 10:59:53 +02:00
Botond Dénes
b2f75a6c53 Add counters to monitor querier-cache efficiency
Add the following counters:
(1) querier_cache_lookups
(2) querier_cache_misses
(3) querier_cache_drops
(4) querier_cache_time_based_evictions
(5) querier_cache_resource_based_evictions
(6) querier_cache_memory_based_evictions
(7) querier_cache_population

(1) counts the total number of querier cache lookups. Not all
page-fetches will result in a querier lookup. For example the first page
of a query will not do a lookup as there was no previous page to reuse
the querier from. The second, and all subsequent pages however should
attempt to reuse the querier from the previous page.
(2) counts the subset of (1) where the read missed the querier
cache (failed to find a matching saved querier).
(3) counts the subset of (1) where the querier was recalled and dropped
immediately. This can happen for example if the querier was at the wrong
position.
(4) counts the cached queriers that were evicted due to their TTL
expiring.
(5) counts the cached queriers that were evicted due to reader-resource
(those limited by reader-concurrency limits) shortage.
(6) counts the cached queriers that were evicted due to reaching the
cache's memory limits (currently set to 4% of the shard's memory).
(7) is the current number of entries in the cache.

Note:
* The count of cache hits can be derived from these counters as
(1) - (2).
* cache_drop (3) also implies a cache hit (see above). This means that
the number of actually reused queriers is:
(1) - (2) - (3)
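
The two derived values in the note can be written out as a small sketch. The
struct and function names are hypothetical and only mirror counters (1) to
(3); they are not the names used in db_stats.

```cpp
#include <cstdint>

// Hypothetical snapshot of counters (1)-(3) from the list above.
struct querier_cache_counters {
    std::uint64_t lookups = 0;  // (1) querier_cache_lookups
    std::uint64_t misses = 0;   // (2) querier_cache_misses
    std::uint64_t drops = 0;    // (3) querier_cache_drops
};

// Hits are lookups that found a saved querier: (1) - (2).
inline std::uint64_t cache_hits(const querier_cache_counters& c) {
    return c.lookups - c.misses;
}

// A drop is also a hit, so queriers actually reused are (1) - (2) - (3).
inline std::uint64_t reused_queriers(const querier_cache_counters& c) {
    return c.lookups - c.misses - c.drops;
}
```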
2018-03-13 10:34:34 +02:00
Botond Dénes
8513549b55 Memory based cache eviction
To bound the memory consumption of the querier-cache the total memory
consumption of the cached queriers is limited to 4% of the shard's total
memory.
When inserting a new querier it is first checked whether its insertion
would cause the limit to be crossed. If this is the case existing
entries are evicted until the memory consumption is sufficiently reduced
so that after inserting the querier it stays below the limit.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
To calculate the memory consumption of the cached queriers
flat_mutation_reader::buffer_size() is used. While this is not very
precise as it doesn't include object sizes and member containers it
gives a good picture of the memory consumption of the queriers.

Memory based cache eviction overlaps with resource-based cache eviction
but only to some degree as that only accounts the memory consumption of
sstable readers.
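
A minimal sketch of the insert-time eviction described above, assuming a
plain list ordered from oldest to newest. The class and member names are
illustrative, and a fixed per-entry size stands in for
flat_mutation_reader::buffer_size().

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>

// Illustrative stand-in for a cached querier: only the memory it accounts
// for matters here.
struct cached_entry {
    std::string name;
    std::size_t memory;
};

class memory_bounded_cache {
    std::list<cached_entry> _entries;  // front = oldest, back = newest
    std::size_t _used = 0;
    std::size_t _limit;
public:
    explicit memory_bounded_cache(std::size_t limit) : _limit(limit) {}

    // Evicts the oldest entries until the new entry fits under the limit,
    // then inserts it. Returns the number of evicted entries. An entry that
    // is larger than the limit by itself is still inserted, into an
    // otherwise empty cache (a simplification of this sketch).
    std::size_t insert(cached_entry e) {
        std::size_t evicted = 0;
        while (!_entries.empty() && _used + e.memory > _limit) {
            _used -= _entries.front().memory;
            _entries.pop_front();
            ++evicted;
        }
        _used += e.memory;
        _entries.push_back(std::move(e));
        return evicted;
    }

    std::size_t used() const { return _used; }
    std::size_t size() const { return _entries.size(); }
};
```

With a limit of 100, inserting three entries of 40 each evicts the first one
on the third insert and leaves 80 in use.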
2018-03-13 10:34:34 +02:00
Botond Dénes
f488ae3917 Add buffer_size() to flat_mutation_reader
buffer_size() exposes the collective size of the external memory
consumed by the mutation-fragments in the flat reader's buffer. This
provides a basis to build basic memory accounting on. Although this is
not the entire memory consumption of any given reader it is the most
volatile component and usually by far the largest one too.
2018-03-13 10:34:34 +02:00
Botond Dénes
212b2dabc4 Resource-based cache eviction
Readers serving user-reads need to obtain a permit to start reading.
There exists a restriction on how many active readers can be admitted
based on their count and their memory consumption.
Since the saved readers of cached queriers are technically active (they
hold a permit) they can block new readers from obtaining a permit.
New readers have a higher priority because a cached reader might be
abandoned or used later at best so in the face of memory pressure we
evict cached readers to free up permits for new readers.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
2018-03-13 10:34:34 +02:00
Botond Dénes
d5bcadcfda Time-based cache eviction
Cached queriers should not sit in the cache indefinitely otherwise
abandoned reads would cause excess and unnecessary resource-usage. Attach
an expiry timer to each cache-entry which evicts it after the TTL
passes.
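
The expiry rule can be sketched as below. This is an assumption-laden
illustration: instead of a timer per entry it uses an explicit sweep that
takes the current time as a parameter, and the class and type names are
hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <utility>

using cache_clock = std::chrono::steady_clock;

class ttl_cache {
    // Ordered by expiry time, so expired entries are always at the front.
    std::multimap<cache_clock::time_point, std::string> _by_expiry;
    cache_clock::duration _ttl;
public:
    explicit ttl_cache(cache_clock::duration ttl) : _ttl(ttl) {}

    // Records the entry together with the point in time at which it expires.
    void insert(std::string key, cache_clock::time_point now) {
        _by_expiry.emplace(now + _ttl, std::move(key));
    }

    // Stands in for the per-entry timers firing: removes every entry whose
    // expiry time is at or before `now` and returns how many were removed.
    std::size_t evict_expired(cache_clock::time_point now) {
        auto end = _by_expiry.upper_bound(now);
        auto n = static_cast<std::size_t>(std::distance(_by_expiry.begin(), end));
        _by_expiry.erase(_by_expiry.begin(), end);
        return n;
    }

    std::size_t size() const { return _by_expiry.size(); }
};
```

Passing the time in explicitly keeps the sketch deterministic and testable
without waiting for real timers.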
2018-03-13 10:34:34 +02:00
Botond Dénes
ff808d9ce6 Save and restore queriers in mutation_query() and data_query()
Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
2018-03-13 10:34:34 +02:00
Botond Dénes
cab38c9f81 Add the querier_cache_context helper
querier_cache_context is supposed to make propagating the cache and the
key down the layers easier. It comes bundled with some of the required
parameters (the lookup and save state) and also hides all of the
boiler-plate of dealing with the cache (checking whether the key is
non-empty, etc.). It also makes it possible to not use the cache and
hide this from the lower layers.
2018-03-13 10:34:34 +02:00
Botond Dénes
bbfe17437e Add querier_cache
This is the cache where suspended queriers are going to be saved between
pages. This is not a general purpose cache. It caters to the specific
needs of the querier recall mechanism. More specifically:
(1) Cache entries are single-use, they are inserted once and the first
lookup removes them. Multiple items may be stored under a single key.
Identifying the correct one happens based on additional information like
the query range. Lookup knows to drop queriers when they cannot be used
to serve the next page.
(2) Cache entries are evicted after a certain time to avoid the
depletion of resources due to abandoned reads.
(3) Cache entries are evicted when facing reader-permit shortage, until
either enough permits are freed up or all entries are evicted.
(4) A memory limiter is set up which keeps the total memory consumption
of the cache under a limit (4% of memory) by evicting the oldest entries
when inserting a new one would cause the total memory consumption to go
above the limit.
(5) It updates the relevant counters of the db_stats.

This patch only implements (1), the other features will be implemented
in their own patches.
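
Feature (1) can be sketched roughly as follows. The sketch assumes a 64-bit
integer in place of the UUID key, an integer interval in place of the query
range and an integer position; all type and member names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Illustrative stand-ins for the query range and the saved querier.
struct key_range {
    int start;
    int end;
};

inline bool operator==(const key_range& a, const key_range& b) {
    return a.start == b.start && a.end == b.end;
}

struct saved_querier {
    key_range range;
    int position;  // where the previous page stopped
};

class single_use_cache {
    // Several queriers may be stored under the same key.
    std::unordered_multimap<std::uint64_t, saved_querier> _entries;
public:
    void insert(std::uint64_t key, saved_querier q) {
        _entries.emplace(key, q);
    }

    // Finds the querier saved under `key` for `range`. The entry is removed
    // by the lookup whether or not it can be used; it is returned only if
    // it sits at the position the next page starts from.
    std::optional<saved_querier> lookup(std::uint64_t key,
                                        const key_range& range,
                                        int position) {
        auto [first, last] = _entries.equal_range(key);
        for (auto it = first; it != last; ++it) {
            if (it->second.range == range) {
                saved_querier q = it->second;
                _entries.erase(it);
                if (q.position != position) {
                    return std::nullopt;  // wrong position: dropped
                }
                return q;
            }
        }
        return std::nullopt;  // miss
    }

    std::size_t size() const { return _entries.size(); }
};
```

A second lookup with the same key and range is a miss, because the first
lookup already removed the entry.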
2018-03-13 10:34:34 +02:00
Botond Dénes
7a5143a670 Add querier
The querier encapsulates all objects needed to serve queries, except
result builders. It is designed to be suspendable, savable and
resumable. It contains all logic needed to suspend, resume and determine
whether the querier can be resumed or not.
It is the foundation upon which the "reader-reuse" mechanism is built.
2018-03-13 10:34:34 +02:00
Botond Dénes
84d872babf Add are_limits_reached() compact_mutation_state
are_limits_reached() allows querying whether the compactor reached
the page's limits. This is needed to determine whether there will be
more pages and thus whether the compact_mutation_state has to be kept
around.
2018-03-13 10:34:34 +02:00
Botond Dénes
2c1081b0e9 Add start_new_page() to compact_mutation_state
start_new_page() resets the limits to the current page's ones and
sets the _empty_partition flag so that the partition header (if the last
page finished inside a partition) will be reemitted.
2018-03-13 10:34:34 +02:00
Botond Dénes
3fca8aaefb Save last key of the page and method to query it
Make a copy of the current decorated-key in consume_end_of_stream() so
that it persists while the compaction state is suspended.
Also add current_partition() to allow client code to query the partition
the compaction is positioned in. This is needed to determine whether
the start position of the next page matches that of the
compact_mutation_state.
2018-03-13 10:34:34 +02:00
Botond Dénes
2fcc99fe43 Make compact_mutation reusable
Currently compact_mutation is used as a use-once-then-throw-away object.
After it satisfies its consumer it's destroyed together with the
consumer. This conflicts with the effort to save and reuse readers and
associated infrastructure between pages of a query.

To resolve this conflict compact_mutation is split into two classes:
(1) compact_mutation_state
(2) compact_mutation

compact_mutation_state encapsulates all the compaction logic and state,
while compact_mutation continues to provide the same API using
compact_mutation_state behind the scenes.
compact_mutation_state doesn't store the consumer, instead its
consume_* methods are templated on the consumer and take it as an
argument. This allows compact_mutation_state to be independent of the
consumer's type.
Additionally compact_mutation can now be constructed from a shared
pointer to compact_mutation_state. This allows client code to
pre-construct a compaction state and retain it after the
compact_mutation object is destroyed.
These changes allow the state of a compaction to be saved and restored
later while code that is only interested in storing the saved state
can stay independent of the consumer's type.

This patch only contains the splitting of compact_mutation into
compact_mutation and compact_mutation_state. The next patches will add
the missing functionality that is needed to make compact_mutation_state
truly reusable across pages.
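
The shape of the split can be illustrated with a heavily simplified sketch.
It is not the real compact_mutation code: fragments are plain integers,
"compaction" just drops negative ones, and every name below is hypothetical.

```cpp
#include <memory>
#include <utility>

// Holds the compaction logic and state. It does not store the consumer;
// consume() is templated on it, so this type is independent of the
// consumer's type.
class compaction_state {
    int _consumed = 0;
public:
    template <typename Consumer>
    void consume(int fragment, Consumer& consumer) {
        if (fragment >= 0) {  // toy "compaction": drop negative fragments
            ++_consumed;
            consumer.consume(fragment);
        }
    }
    int consumed() const { return _consumed; }
};

// Keeps the original one-object API, delegating to a shared state that may
// outlive this wrapper.
template <typename Consumer>
class compactor {
    std::shared_ptr<compaction_state> _state;
    Consumer _consumer;
public:
    explicit compactor(Consumer c)
        : _state(std::make_shared<compaction_state>()), _consumer(std::move(c)) {}
    compactor(std::shared_ptr<compaction_state> state, Consumer c)
        : _state(std::move(state)), _consumer(std::move(c)) {}

    void consume(int fragment) { _state->consume(fragment, _consumer); }
    const Consumer& consumer() const { return _consumer; }
};

// Example consumer used only for illustration.
struct summing_consumer {
    int sum = 0;
    void consume(int fragment) { sum += fragment; }
};
```

Code that only stores the state between pages can hold a
std::shared_ptr<compaction_state> without knowing which consumer type was
used to fill the page.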
2018-03-13 10:34:34 +02:00
Botond Dénes
7bd500049d Add the CompactedFragmentsConsumer
Dust off the commented-out CompactMutationConsumer concept, make it usable and
rename it to CompactedFragmentsConsumer (as we now have flat readers).
2018-03-13 10:34:34 +02:00
Botond Dénes
f1171803b5 Use the last_replicas stored in the page_state
Pass the last_replicas from the page_state as the preferred_replicas
for query() and save the returned last_replicas as the last_replicas
field of the next page_state. The circle is now complete. The first page
of any query will pass an empty list as the preferred replicas (having
no previous paging_state) so the replicas will be selected according to
the load-balancing strategy. Any subsequent page will use the last
replicas from the last page as the preferred ones for the current one.
Thus if all goes well all pages of a query will hit the same replicas.
2018-03-13 10:34:34 +02:00
Botond Dénes
536a32bb5e query_singular(): return the used replicas
This patch implements the last_replicas returning part of the query()
signature changes for singular queries. It allows for client code to
save the last returned replicas and pass it to query() on the next page
as the preferred-replicas parameter, thus facilitating the read requests
for the next page hitting the same replicas.
2018-03-13 10:34:34 +02:00
Botond Dénes
aaf67bcbaa Consider preferred replicas when choosing endpoints for query_singular()
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algorithm for selecting the
endpoints is as follows:
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
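
The three steps above can be sketched as follows. This is not
db::filter_for_query(): endpoints are plain strings, the CL requirement is
reduced to a required replica count, the result is capped at that count, and
"load-balancing strategy" is replaced by taking candidates in their given
order. All of these are assumptions of the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using endpoint = std::string;

inline bool contains(const std::vector<endpoint>& v, const endpoint& e) {
    return std::find(v.begin(), v.end(), e) != v.end();
}

// Hypothetical helper: picks `required` endpoints, favouring the preferred
// ones that are also candidates.
inline std::vector<endpoint> select_endpoints(const std::vector<endpoint>& candidates,
                                              const std::vector<endpoint>& preferred,
                                              std::size_t required) {
    std::vector<endpoint> selected;
    // Intersection of candidates and preferred endpoints.
    for (const auto& c : candidates) {
        if (selected.size() < required && contains(preferred, c)) {
            selected.push_back(c);
        }
    }
    // If the intersection is not enough, fill up from the remaining
    // candidates (placeholder for the load-balancing strategy).
    for (const auto& c : candidates) {
        if (selected.size() >= required) {
            break;
        }
        if (!contains(selected, c)) {
            selected.push_back(c);
        }
    }
    return selected;
}
```

With an empty preferred list, as on the first page of a query, the function
degenerates to the plain candidate order.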
2018-03-13 10:34:34 +02:00
Botond Dénes
eac597d726 Add preferred and last replicas to the signature of query()
preferred_replicas are added to the parameters and last_replicas are
added to the return type. The preferred replicas will be used as a hint
for the selection of the replicas to send the read requests to. The last
replicas (returned) are the replicas actually selected for the read.
This will allow queries to consistently hit the same replicas for each
page thus reusing readers created on these replicas.
For convenience a query() overload is provided that doesn't take or
return the preferred and last replicas.

This patch only adds the parameters and propagates them down to
query_singular() and query_partition_key_range(). The code to actually
use these preferred-replicas will be added in later patches.
The reason for separating this is to reduce noise and improve
reviewability for those functional changes later.
2018-03-13 10:34:34 +02:00
Botond Dénes
f281b3e923 Add last_replicas to paging_state
Helps paged queries consistently hit the same replicas for each
subsequent page. Replicas that already served a page will keep the
readers used for filling it around in a cache. Subsequent page requests
hitting the same replicas can reuse these readers to fill the pages
avoiding the work of creating these readers from scratch on every page.
In a mixed cluster older coordinators will ignore this value.
The value of last_replicas may change between pages as nodes may become
available/unavailable or the coordinator may decide to send the read
requests to different replicas at its discretion.
Replicas are identified by an opaque uuid which should only make sense
to the storage-proxy.
2018-03-13 10:34:34 +02:00
Nadav Har'El
fa284f6307 Add query UUID to read command
This patch adds the parameter to read_command which is needed for
caching of readers during multiple pages of a paged query, which
we will introduce in the next patches.

The query_uuid is a UUID of a previously saved reader, which
the replica is now asked to recall and resume (if this saved reader is
no longer in the cache, it is fine, a new reader will be started).

Additionally a helper flag is_first_page is added so that the replica
can avoid doing any cache lookups (and incrementing miss counters) for
the first page.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-03-13 10:34:34 +02:00
Nadav Har'El
ec7c56d18a Add query UUID to paging state
This patch adds to the "paging_state", the opaque cookie that clients are
supposed to provide when asking for the next page on a paged query, a
unique id field. This new field will be used to tell that a new request
for a page really continues the previous page, and doesn't just by chance
start at the same position the previous page stopped.

We need to support setups with mixed versions - a client may get a paging
state from a coordinator running a new version of Scylla and send it to
a different coordinator running an old version - or vice versa. So the new
uuid field is set up to have a default uuid of UUID() (a recognizable
invalid uuid 0), so new versions receiving no uuid from an old version will
set this invalid uuid, and old versions receiving a uuid from a new version
will simply ignore it.
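
The compatibility rule can be illustrated with a toy wire format. This is an
assumption made for the sketch, not the real paging_state serialization: the
state is a sequence of 64-bit words, with the id appended as two trailing
words, and all names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the UUID: all zeroes is the recognizable invalid value.
struct query_id {
    std::uint64_t msb = 0;
    std::uint64_t lsb = 0;
    bool is_set() const { return msb != 0 || lsb != 0; }
};

struct paging_state {
    std::uint64_t position = 0;
    query_id id;  // defaults to the invalid id
};

// New-version encoder: always appends the id.
inline std::vector<std::uint64_t> encode(const paging_state& s) {
    return {s.position, s.id.msb, s.id.lsb};
}

// New-version decoder: accepts both the old (1 word) and new (3 words)
// layouts. A missing id leaves the default invalid value in place.
inline paging_state decode(const std::vector<std::uint64_t>& words) {
    paging_state s;
    if (!words.empty()) {
        s.position = words[0];
    }
    if (words.size() >= 3) {
        s.id.msb = words[1];
        s.id.lsb = words[2];
    }
    return s;
}

// Old-version decoder: reads only what it knows and ignores the rest.
inline std::uint64_t decode_old(const std::vector<std::uint64_t>& words) {
    return words.empty() ? 0 : words[0];
}
```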

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-03-13 10:34:34 +02:00
Avi Kivity
78a9ab827e Merge seastar upstream
* seastar 42159d4...bcfbe0c (1):
  > core: fix directory scanning by returning actual entry type

Fixes #3274 (hopefully).
2018-03-12 20:58:44 +02:00
Duarte Nunes
36b8c1043d Merge 'Reduce dependencies on messaging_service.hh' from Avi
Refactor some includes to reduce dependencies on messaging_service.hh,
which can change quite a lot as it includes many unrelated items itself.

Tests: build

* tag 'includes/messaging_service.hh/v1' of https://github.com/avikivity/scylla:
  tests: reduce dependencies in test_services.hh
  migration_manager: remove dependency on messaging_service.hh in header
  messaging_service: move msg_addr into its own header file
2018-03-12 18:49:13 +00:00
Avi Kivity
bd7881066a tests: reduce dependencies in test_services.hh
Convert storage_service_for_test to a pimpl implementation to
reduce dependencies.  Tests that depended on those includes were
fixed to include their dependencies directly.
2018-03-12 20:05:23 +02:00
Avi Kivity
5f2600a71d migration_manager: remove dependency on messaging_service.hh in header
Use the new msg_addr.hh header to remove a dependency on
messaging_service.hh.
2018-03-12 20:05:23 +02:00
Avi Kivity
dd12214628 messaging_service: move msg_addr into its own header file
Make it possible to use msg_addr without depending on messaging_service.hh.
2018-03-12 20:05:23 +02:00
Avi Kivity
af383228fb locator: remove empty file locator.cc
Empty but for compiler-time-consuming includes.
Message-Id: <20180312073018.21646-1-avi@scylladb.com>
2018-03-12 10:32:26 +01:00
Avi Kivity
29d0a46220 locator: add copyright and license statements to production_snitch_base.cc
Message-Id: <20180312073104.21840-1-avi@scylladb.com>
2018-03-12 10:30:48 +01:00
Asias He
8624467e26 utils: Remove utils/utils.cc
It was used to make sure the header compiles in the early days.
Message-Id: <531fc6570805bd163afedd53f5d71e1b79a477d1.1520840644.git.asias@scylladb.com>
2018-03-12 09:47:40 +02:00
Duarte Nunes
0ccf1c581a Merge 'Reduce gratuitous inclusions of system_keyspace.hh' from Avi
Try to avoid recompilations by reducing inclusions of system_keyspace.hh
in other header files.

Tests: unit (release)

* tag 'system_keyspace.hh/v1' of https://github.com/avikivity/scylla:
  storage_service: remove system_keyspace.hh include
  locator: de-inline reconnectable_snitch_helper
  locator: de-inline production_snitch_base
  cql3: remove #include of system_keyspace.hh
2018-03-11 22:56:20 +00:00
Avi Kivity
cd668061fc storage_service: remove system_keyspace.hh include
Re-distribute the include among the files that really need it.
2018-03-11 18:53:49 +02:00
Avi Kivity
b946f8b308 locator: de-inline reconnectable_snitch_helper
Reduce dependencies by de-inlining reconnectable_snitch_helper. A
new home is found in production_snitch_base.cc, which is somewhat
related.
2018-03-11 18:31:05 +02:00
Avi Kivity
84004a2574 locator: de-inline production_snitch_base
De-inlining allows us to remove some dependencies, and those functions
are too complex to inline anyway.

A few always-throwing functions get the [[noreturn]] attribute to
avoid damaging code generation.
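
A small sketch of the attribute in use. The function names are hypothetical
and unrelated to production_snitch_base; the point is only that the caller
needs no unreachable return statement after a call the compiler knows never
returns.

```cpp
#include <stdexcept>
#include <string>

// Always throws, so it is marked [[noreturn]].
[[noreturn]] inline void throw_unsupported(const std::string& what) {
    throw std::runtime_error("unsupported operation: " + what);
}

// Because throw_unsupported() is [[noreturn]], the compiler knows control
// never falls off the end of this function and emits no return path or
// missing-return warning for it.
inline int datacenter_index(const std::string& name) {
    if (name == "dc1") {
        return 0;
    }
    if (name == "dc2") {
        return 1;
    }
    throw_unsupported(name);
}
```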
2018-03-11 18:22:49 +02:00
Avi Kivity
4f6b892aa1 cql3: remove #include of system_keyspace.hh
We include system_keyspace for just the string "system" (and a related
is_system_keyspace() function). Replace with forward-declared functions.
2018-03-11 18:02:23 +02:00
Avi Kivity
7441c7153f Merge seastar upstream
* seastar 08e02dc...42159d4 (9):
  > memory: avoid unconditional calls to __tls_init
  > io_tester: bring back information about think time
  > Merge "Avoid continuations in I/O Scheduler path" from Glauber
  > Merge "Extend io_tester to support CPU loads" from Glauber
  > tutorial: fix undue complication in semaphore get_units() example
  > Tutorial: in HTML target, inline code snippets shouldn't be gray
  > tutorial: add build target for split HTML file
  > tutorial: mention seastar::thread as option for object lifetime management
  > tutorial: document new seastar::future::wait()
2018-03-11 15:45:42 +02:00
Avi Kivity
9569ba5e38 Update scylla-ami submodule
* dist/ami/files/scylla-ami 3aa87a7...5170011 (3):
  > scylla_install_ami: install enhanced networking NIC drivers
  > scylla_install_ami: set kernel-ml as default kernel
  > scylla_install_ami: fix NIC down with enhanced networking on new base AMI
2018-03-11 15:45:05 +02:00
Raphael S. Carvalho
fb8ce14a36 sstables: don't set clustering components twice when loading sstable
already called in update_info_for_opened_data() which is called by
open_data(); no need for clustering components to be set early
either.

found it when auditing the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180310225213.26017-1-raphaelsc@scylladb.com>
2018-03-11 10:10:35 +02:00
Tomasz Grabiec
3937352a9a doc: Fix row_cache.md
Dropped unfinished sentence and added missing "after".
Message-Id: <1520615404-18458-1-git-send-email-tgrabiec@scylladb.com>
2018-03-10 16:27:04 +02:00
Raphael S. Carvalho
87035bd8d1 sstables: fix min and max timestamp when negative timestamp is specified
unsigned type was incorrectly used for keeping track of min and max
timestamp, so a negative number would be treated as a very high
number that would *incorrectly* end up as max timestamp in sstable
metadata.
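
The effect can be reproduced with a small tracker templated on the value
type. The tracker is a hypothetical stand-in for the sstable metadata
collector, used here only to contrast signed and unsigned storage.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Tracks the smallest and largest value seen so far.
template <typename T>
struct min_max_tracker {
    T min_value = std::numeric_limits<T>::max();
    T max_value = std::numeric_limits<T>::min();

    void update(T v) {
        min_value = std::min(min_value, v);
        max_value = std::max(max_value, v);
    }
};

// The conversion at the heart of the bug: a negative timestamp stored in an
// unsigned 64-bit type wraps around to a value close to 2^64.
inline std::uint64_t as_unsigned(std::int64_t timestamp) {
    return static_cast<std::uint64_t>(timestamp);
}
```

Feeding the timestamps -5, 10 and 3 into min_max_tracker<std::int64_t> gives
a maximum of 10, while feeding as_unsigned() of the same values into
min_max_tracker<std::uint64_t> reports the wrapped -5 as the maximum.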

Fixes #3000.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180308162217.18963-1-raphaelsc@scylladb.com>
2018-03-08 18:31:30 +02:00
Avi Kivity
596a9d0fb3 Merge "Make reader concurrency dual-restricted by count and memory" from Botond
"
Refs #2692
Fixes #3246

The current restricting algorithm [1] restricts the active-reader queue
based on the memory consumption of the existing active readers. When
this memory consumption is above the limit new readers are not admitted.
The inactive reader queue on the other hand has a fixed length.
This caused performance regressions on two workloads:
* read-only: since the inactive-reader queue length is severely limited
  (compared to the previous situation) reads will timeout at loads
  comfortably handled before.
* mixed: since the memory consumption happens only at admission time
  (already created active readers are not limited) memory consumption
  grew significantly causing problems when compactions kicked in.

The solution is to reintroduce the old limit of 100 active concurrent
user-reads while still keeping the memory-based limit as well. For
workloads that don't consume a lot of memory or on large boxes with lots
of memory the count-based limit will be reached first, which reverts to the
old well-known behaviour. For memory-hungry workloads or on small boxes
with little memory the memory-based limit will kick in sooner avoiding
memory overconsumption.

[1] introduced by bdbbfe9390
"

* 'restricted-reader-dual-limit/v3' of https://github.com/denesb/scylla:
  Modify unit tests so that they test the dual-limits
  Use the reader_concurrency_semaphore to limit reader concurrency
  Add reader_concurrency_semaphore
  Add reader_resource_tracker param to mutation_source
  mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
2018-03-08 14:36:05 +02:00
Botond Dénes
341ddd096a Modify unit tests so that they test the dual-limits 2018-03-08 14:12:12 +02:00
Botond Dénes
1259031af3 Use the reader_concurrency_semaphore to limit reader concurrency 2018-03-08 14:12:12 +02:00
Botond Dénes
dfa04c3fea Add reader_concurrency_semaphore
This semaphore implements the new dual, count and memory based active
reader limiting. As purely memory-based limiting proved to cause
problems on big boxes admitting a large number of readers (more than any
disk could handle) the previous count-based limit is reintroduced in
addition to the existing memory-based limit.
When creating new readers first the count-based limit is checked. If
that clears the memory limit is checked before admitting the reader.
reader_concurrency_semaphore wraps the two semaphores that implement
these limits and enforces the correct order of limit checking.
This class also completely replaces the restricted_reader_config struct,
it encapsulates all data and related functionality of the latter, making
client code simpler.
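
The order of the two checks can be sketched with a non-blocking stand-in.
The real class wraps two semaphores and makes readers wait; this
illustration only answers whether a reader would be admitted right now, and
its class and method names are hypothetical.

```cpp
#include <cstddef>

class dual_limit_admission {
    std::size_t _count_left;
    std::size_t _memory_left;
public:
    dual_limit_admission(std::size_t max_count, std::size_t max_memory)
        : _count_left(max_count), _memory_left(max_memory) {}

    // Checks the count-based limit first and the memory-based limit second.
    // Resources are taken only if both checks pass.
    bool try_admit(std::size_t memory) {
        if (_count_left == 0) {
            return false;
        }
        if (memory > _memory_left) {
            return false;
        }
        --_count_left;
        _memory_left -= memory;
        return true;
    }

    // Returns the resources of a reader admitted earlier with `memory`.
    void release(std::size_t memory) {
        ++_count_left;
        _memory_left += memory;
    }
};
```

With limits of 2 readers and 100 units of memory, a second reader asking for
60 units is refused by the memory limit, and a third small reader is refused
by the count limit once two are active.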
2018-03-08 14:12:12 +02:00
Botond Dénes
872fd369ba Add reader_resource_tracker param to mutation_source
Soon, reader_resource_tracker will only be constructible after the
reader has been admitted. This means that the resource tracker cannot be
preconstructed and just captured by the lambda stored in the mutation
source and instead has to be passed in along the other parameters.
2018-03-08 14:12:09 +02:00
Botond Dénes
d5bb8a47fc mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
In preparation for reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
2018-03-08 10:29:16 +02:00
Avi Kivity
0ebfe448e3 Merge "Row-level eviction" from Tomasz
"
This series switches granularity of memory-pressure-induced eviction in cache
from a partition to a row.

Since 9b21a9b cache can store partial partitions with row granularity but they
were still evicted as a unit. This is problematic for the following reasons:

 - more is evicted than necessary, which decreases cache efficiency. In the
   worst case, whole cache gets evicted at once

 - evicting large amounts of memory (large partitions) at once may impact
   latency badly

Fixes #2576.

See the documentation added in patch titled "doc: Document row cache eviction"
for details on how eviction works.

Open issues to be fixed incrementally:

  - range tombstones are not evictable

  - cache update still has partition granularity, which
    causes bad latency on memtable flush with large partitions
"

* tag 'tgrabiec/row-level-eviction-v3' of github.com:scylladb/seastar-dev: (43 commits)
  doc: Document row cache eviction
  tests: cache: Add tests for row-level eviction
  tests: cache: Check that data is evictable after schema change
  tests: cache: Move definitions to the top
  tests: perf_cache_eviction: Switch eviction counter to row granularity
  tests: row_cache_alloc_stress: Avoid quadratic behavior
  cache: Introduce unlink_from_lru()
  cache: Add row-level stats about cache update from memtable
  mvcc: Propagate information if insertion happened from ensure_entry_if_complete()
  cache: Track number of rows and row invalidations
  cache: Evict with row granularity
  cache: Track static row insertions separately from regular rows
  tests: mvcc: Use apply_to_incomplete() to create versions
  tests: mvcc: Fix test_apply_to_incomplete()
  tests: cache: Do not depend on particular granularity of eviction
  tests: cache: Make sure readers touch rows in test_eviction()
  mvcc: Store complete rows in each version in evictable entries
  mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest()
  tests: cache: Invoke partial eviction in test_concurrent_reads_and_eviction
  cache: Ensure all evictable partition_versions have a dummy after all rows
  ...
2018-03-07 17:57:07 +02:00
Tomasz Grabiec
4caeed7e40 doc: Document row cache eviction 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
180a877db3 tests: cache: Add tests for row-level eviction 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
9fab5068c6 tests: cache: Check that data is evictable after schema change 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
f0e0c79a70 tests: cache: Move definitions to the top 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
1e4f9eb2c1 tests: perf_cache_eviction: Switch eviction counter to row granularity 2018-03-07 16:52:59 +01:00
Tomasz Grabiec
48f91b4605 tests: row_cache_alloc_stress: Avoid quadratic behavior
Partitions corresponding to keys have 40k rows. With row-level
eviction touching them inside the loop became a serious performance
issue, because touch() now needs to walk over all rows.
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
641bcd0b35 cache: Introduce unlink_from_lru()
Will be used in row_cache_alloc_stress to unlink partitions which we
don't want to get evicted, instead of repeatedly calling touch() on
them after each subsequent population. After switching to row-level
LRU, doing so greatly increases run time of the test due to quadratic
behavior.
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
b9d22584bb cache: Add row-level stats about cache update from memtable 2018-03-07 16:52:58 +01:00
Tomasz Grabiec
7c34cd04e2 mvcc: Propagate information if insertion happened from ensure_entry_if_complete()
It's needed by users to update statistics, different ones depending on
if the row already existed or not.
2018-03-07 16:50:55 +01:00
Raphael S. Carvalho
aa75684ee7 sstables: Warn when an extra-large partition is written
Based on https://issues.apache.org/jira/browse/CASSANDRA-9643

With the compaction_large_partition_warning_threshold_mb option set to 1,
an example of the output follows:

WARN  2018-02-22 19:52:11,029 [shard 0] sstable - Writing large
row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445}
(1276758 bytes)

Fixes #2209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>
2018-03-07 15:49:46 +00:00
Duarte Nunes
9254a9a6fe db/system_keyspace: Move dependency on db/schema_tables to source file
And add missing dependencies to header file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180307111304.2914-1-duarte@scylladb.com>
2018-03-07 14:45:36 +02:00
Asias He
73d8e2743f dht: Fix log in range_streamer
The address and keyspace should be swapped.

Before:
  range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded,
  took 56 seconds

After:
  range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded,
  took 56 seconds

Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>
2018-03-07 11:49:58 +02:00
Tomasz Grabiec
6ba272a610 debug: scylla_row_cache_report: Remove duplicated phrase from printout
Message-Id: <1520412164-10746-1-git-send-email-tgrabiec@scylladb.com>
2018-03-07 11:15:57 +02:00
Tomasz Grabiec
ad7e2f7460 cache: Add back partition count argument to row_cache_update_one_batch_end probe
debug/scylla_row_cache_report.stp expects it.

Removed in c4974392b7.
Message-Id: <1520412152-10680-1-git-send-email-tgrabiec@scylladb.com>
2018-03-07 11:15:56 +02:00
Vladimir Krivopalov
8028f90460 Add support for JSON output format for perf_fast_forward results.
The JSON output is arranged in a way that makes it easier to upload
results to ElasticSearch.
All test results are placed under the perf_forward_data_output/ directory.
For test groups, we create separate subdirectories where we save results
from runs of tests in those groups.
For each test run, we store results in a separate file named:
    <dash-separated-param-list>.<run-number>.json
where
    <dash-separated-param-list> is a dash-separated list of parameters of the current
    test, e.g., 1-64 (for read-skip pattern).

    <run-number> is the number of run of this test with the specified
    parameters. This is needed as the same list of parameters can be
    used more than once (for instance, when cache is enabled).
    Those numbers start with 1, i.e., 1, 2, 3.

So, the path to a resulting JSON file may look like:
    perf_fast_forward_output/large-partition-skips/64-4096.1.json

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-03-06 12:09:00 -08:00
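The file naming scheme described in 8028f90460 can be sketched as a small helper. The function name and its parameters are hypothetical; only the resulting layout, <group>/<dash-separated-param-list>.<run-number>.json, comes from the commit message.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds "<output_dir>/<group>/<p1>-<p2>-...-<pn>.<run>.json".
// Function and parameter names are illustrative, not taken from the patch.
std::string result_file_path(const std::string& output_dir,
                             const std::string& group,
                             const std::vector<std::string>& params,
                             unsigned run_number) {
    assert(run_number >= 1); // run numbers start with 1
    std::string name;
    for (const auto& p : params) {
        if (!name.empty()) {
            name += '-';     // dash-separated parameter list
        }
        name += p;
    }
    return output_dir + "/" + group + "/" + name + "."
         + std::to_string(run_number) + ".json";
}
```

Calling it with the group "large-partition-skips", the parameters 64 and 4096 and run number 1 reproduces the example path given in the message.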
Vladimir Krivopalov
e810fc4e09 Wrap output for customization. Move all output handling to a single managing class.
Instead of passing the output parameters to std::cout straight away, use
helper wrappers. This will allow us to add more formats for gathered
tests results.

Introduce helper writer classes hierarchy that can be extended to
support different output formats (JSON, XML, etc).

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-03-06 09:49:05 -08:00
Tomasz Grabiec
da901b93fc cache: Track number of rows and row invalidations 2018-03-06 11:50:29 +01:00
Tomasz Grabiec
381bf02f55 cache: Evict with row granularity
Instead of evicting whole partitions, evicts whole rows.

As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
2018-03-06 11:50:29 +01:00
Tomasz Grabiec
dce9185fc9 cache: Track static row insertions separately from regular rows
So that row eviction counter, which doesn't look at the static row,
can be in sync with row insertion counter.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
19951ede7d tests: mvcc: Use apply_to_incomplete() to create versions
So that the test doesn't depend on internal invariants.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
ed6271fc87 tests: mvcc: Fix test_apply_to_incomplete()
It should use evictable entries instead of non-evictable ones,
because they are required by apply_to_incomplete().
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
f2bdac2874 tests: cache: Do not depend on particular granularity of eviction 2018-03-06 11:50:28 +01:00
Tomasz Grabiec
c306c1050e tests: cache: Make sure readers touch rows in test_eviction()
With row-level eviction just creating a reader won't necessarily
update the LRU.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
ab407d99cc mvcc: Store complete rows in each version in evictable entries
For row-level eviction we need to ensure that each version has
complete rows so that eviction from older versions doesn't affect the
value of the row in newer snapshots.

This is achieved by copying the row from an older version before
applying the increment in the new version.

Only affects evictable entries, memtables are not affected.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
29d167bf01 mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest()
To avoid duplication of logic between cache reader and
ensure_entry_if_complete().
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
fb2107416b tests: cache: Invoke partial eviction in test_concurrent_reads_and_eviction
In hope of catching more issues.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
bee875fa7d cache: Ensure all evictable partition_versions have a dummy after all rows
Every evictable version will have a dummy entry at the end so that it can be
tracked in the LRU.

It is also needed to allow old versions to stay around (with
tombstones and static rows) after all rows are evicted. Such versions
must be fully discontinuous, and we need some entry to mark that.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
5320705300 cache: Propagate cache_tracker to places manipulating evictable entries
cache_tracker reference will be needed to link/unlink row entries.

No change of behavior in this patch.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
30df3ddd7d cache: Do not evict from cache_entry destructor
We will need to propagate a cache_tracker reference to evict(). Instead
of evicting from destructor, do so before cache_entry gets unlinked
from the tree. Entries which are not linked, don't need to be
explicitly evicted.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
4efab6f6a6 cache: Use on_evicted() in cache_tracker::clear()
In preparation for switching LRU to row level.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
2118bdce01 cache: Extract cache_entry::on_evicted() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
24c5949518 cache: cache_tracker: Rename on_merge() to on_partition_merge() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
d66e864310 cache: cache_tracker: Rename on_erase() to on_partition_erase() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
3dc9000c51 mutation_partition: Introduce rows_entry::is_last_dummy()
Will be needed by row evictor, which needs to treat last dummies
specially (not evict them).
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
e571bd5a2e mvcc: Add partition_entry::versions_from_oldest() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
654d4b76c0 anchorless_list: Introduce all_elements_reversed() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
d9a38c1c85 mutation_partition: Add API to walk from rows_entry to cache_entry
Will be needed on row eviction, to unlink containers when they become
fully evicted.
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
0ccae80332 intrusive_set_external_comparator: Introduce container_of_only_member() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
758dfd404b intrusive_set_external_comparator: Use auto_unlink on nodes
Needed for row-level eviction, which doesn't have a reference to the
container.
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
1a85c6d556 intrusive_set_external_comparator: Introduce iterator_to() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
bbe771e28f tests: Add more tests for continuity merging 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
9893e8e5f7 mvcc: Make each version have independent continuity
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.

Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:

 v2:                  <key=2, cont=1>
 v1: <key=1, cont=1>

In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:

   v2:                  <key=2, cont=1>

Here, the range (-inf, 1) would be marked as continuous, which is not true.

To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.

It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier, specifically in
mutation_partition::apply_monotonically().

MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.

The example from the beginning will become:

 v2: <key=1, cont=0>  <key=2, cont=1>
 v1: <key=1, cont=1>

When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
2018-03-06 11:50:25 +01:00
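The per-version continuity rule from 9893e8e5f7 can be modelled with a small sketch. It makes several simplifying assumptions that are not in the patch: keys are integers, every entry is a complete row, and the type and function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Simplified model: integer keys, every entry is a complete row.
struct entry {
    int key;
    bool continuous; // true: the open interval (previous key in this
                     // version, or -inf, up to key) is continuous
};

using version = std::vector<entry>; // entries sorted by key

// Is position `pos` covered by the continuity of a single version?
bool version_is_continuous(const version& v, double pos) {
    for (const entry& e : v) {
        if (pos == e.key) {
            return true;         // a complete row covers its own point
        }
        if (pos < e.key) {
            return e.continuous; // pos lies in the interval ending at e
        }
    }
    return false;                // past the last entry: nothing is known
}

// Continuity of a snapshot is the union over all of its versions.
bool snapshot_is_continuous(const std::vector<version>& snapshot, double pos) {
    for (const version& v : snapshot) {
        if (version_is_continuous(v, pos)) {
            return true;
        }
    }
    return false;
}
```

Using the final example from the message, v2 = {<1, cont=0>, <2, cont=1>} and v1 = {<1, cont=1>}: the snapshot {v2, v1} is continuous at 0.5 through v1, and after v1 is evicted the remaining {v2} correctly reports 0.5 as not continuous while 1.5 stays continuous.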
Tomasz Grabiec
bd1e730053 tests: cache: Add test for merging and reading randomly populated versions 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
1b959cb6e9 tests: cache: Take parameters by const& 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
d2744b6ad8 tests: mvcc: Don't set mutations in versions directly
Simply copying mutations which are not fully continuous may violate
MVCC invariants, like the one about non-overlapping continuity which
will be added later. Use apply_to_incomplete() instead.

This unfortunately reduces strength of the test, since the continuity
of the entry is now completely determined by the first version. We should
use populate() instead, but it doesn't exist yet. It could be extracted
from cache_streamed_mutation, but that's not an easy change.

This is alleviated by adding a similar test to row_cache_test_g, in a
later patch.
2018-03-06 11:32:09 +01:00
Tomasz Grabiec
2a0ece5205 mvcc: Allow dereferencing partition_snapshot_row_weakref 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
d0e1a3c63e mvcc: partition_snapshot_row_weakref: Introduce is_in_latest_version() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
2f956499a7 mvcc: Drop unused _evictable flag from partition_version_ref 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
313f2c2bb0 cache: Document intent of maybe_update_continuity() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
3214883a25 cache: Extract cache_streamed_mutation::ensure_population_lower_bound() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
d9f0c1f097 tests: cache: Fix invalidate() not being waited for
Probably responsible for occasional failures of subsequent assertion.
Didn't manage to reproduce.

Message-Id: <1520330967-584-1-git-send-email-tgrabiec@scylladb.com>
2018-03-06 12:14:04 +02:00
Asias He
25aa59f2f1 gossip: Fix force_after in wait_for_gossip
In commit 8af0b501a2 (gossip: wait for stabilized gossip on bootstrap)

The force_after variable was changed from int32_t to stdx::optional<int32_t>

-            if (force_after > 0 && total_polls > force_after) {
+            if (force_after && total_polls > *force_after) {

Checking force_after > 0 was dropped which is wrong because force_after
is set to -1 by default. So the if branch will always be executed after
1 poll.

We always see:

   [shard 0] gossip - Gossip not settled but startup forced by
   skip_wait_for_gossip_to_settle. Gossp total polls: 1

even if skip_wait_for_gossip_to_settle is not set at all.

Fixes #3257
Message-Id: <845d219cea6101a7c507c13879c850a5c882e510.1520297548.git.asias@scylladb.com>
2018-03-06 10:11:02 +02:00
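The pitfall described in 25aa59f2f1 is easy to reproduce in isolation. The sketch below uses std::optional as a stand-in for stdx::optional, and the function names are hypothetical. The corrected condition is one possible fix, shown as an assumption; the actual patch may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Condition as quoted in the message above: an engaged optional holding -1
// converts to true, so the branch is taken as soon as total_polls > -1.
bool forced_buggy(std::optional<int32_t> force_after, int32_t total_polls) {
    return force_after && total_polls > *force_after;
}

// One possible correction (an assumption, not necessarily the real patch):
// keep the original "> 0" guard alongside the engaged check.
bool forced_fixed(std::optional<int32_t> force_after, int32_t total_polls) {
    return force_after && *force_after > 0 && total_polls > *force_after;
}
```

With the default value of -1, forced_buggy(-1, 1) is true after the first poll, while forced_fixed(-1, 1) is false.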
Vladimir Krivopalov
2cbdb91070 Remove unused io/ directory
Commit 9309a2ee6f ("Remove obselete
files") removed all of the callers but forgot to remove the directory.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <dcdd6ac66e88fac29cc2b0a12936688e71c1d267.1520314939.git.vladimir@scylladb.com>
2018-03-06 08:08:02 +02:00
Asias He
8900e830a3 storage_service: Add missing return in pieces empty check
If pieces is empty, it is bogus to access pieces[0]:

   sstring move_name = pieces[0];

Fix by adding the missing return.

Spotted by Vlad Zolotarov <vladz@scylladb.com>

Fixes #3258
Message-Id: <bcb446f34f953bc51c3704d06630b53fda82e8d2.1520297558.git.asias@scylladb.com>
2018-03-06 08:04:39 +02:00
Vladimir Krivopalov
acdce55572 Inject CryptoPP namespace where Crypto++ byte typedef is used.
In Crypto++ v6, the `byte` typedef has been moved from the global
namespace to the CryptoPP:: namespace.
To make Scylla code compile with both old and new versions, bring the
namespace in so that the code works regardless of the scope of `byte`
definition.

Fixes #3252

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <60e7bfe868b778b1c9bbe15d7247db64b61bd406.1520272198.git.vladimir@scylladb.com>
2018-03-05 20:43:07 +02:00
Avi Kivity
eb598876e5 build: remove broken and unneeded xxhash include path
"-I$full_builddir/{mode}/xxhash" doesn't resolve to a valid path, because
full_builddir is a Python variable, not a Ninja variable.  In build.ninja
it appears as "-I/release/xxhash".

Since the build nevertheless works, we can remove the broken flag instead
of fixing it.
Message-Id: <20180305135919.13634-1-avi@scylladb.com>
2018-03-05 15:34:30 +01:00
Duarte Nunes
0c05fc0bff tests/flush_queue_test: Don't assume continuations run immediately
This patch fixes an issue with test_propagation(), where the test
assumed that after the future returned from wait_for_pending(0)
resolved, the continuations set for the post operation had already
run, which is not true.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180305131908.7667-1-duarte@scylladb.com>
2018-03-05 15:22:33 +02:00
Avi Kivity
1dae29b48d test: mutation_reader_test: fix no-timeout case in reader_wrapper
reader_wrapper's _timeout defaults to now(), which means to time
out immediately rather than no timeout.

Fix by switching to a time_point, defaulting to no_timeout, and
provide a compatible constructor (with a duration parameter) for
callers that do want a duration-based timeout.

Tests: mutation_reader_test (debug, release)
Message-Id: <20180305111739.31972-1-avi@scylladb.com>
2018-03-05 12:40:07 +01:00
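The approach taken in 1dae29b48d, storing a time_point that defaults to "no timeout" and keeping a duration-based constructor for compatibility, can be sketched with std::chrono alone. The class name, the no_timeout sentinel and the helper methods are assumptions for illustration; the test code uses Seastar's own clock types.

```cpp
#include <cassert>
#include <chrono>

using test_clock = std::chrono::steady_clock;

// Hypothetical sentinel meaning "never time out".
inline constexpr test_clock::time_point no_timeout = test_clock::time_point::max();

class reader_wrapper_sketch {
    test_clock::time_point _timeout;
public:
    // Default: no timeout, rather than a deadline of now().
    reader_wrapper_sketch() : _timeout(no_timeout) {}

    // Compatibility constructor for callers that want a relative timeout.
    explicit reader_wrapper_sketch(test_clock::duration d)
        : _timeout(test_clock::now() + d) {}

    bool has_timeout() const { return _timeout != no_timeout; }
    bool timed_out(test_clock::time_point now) const { return now >= _timeout; }
};
```

A default-constructed wrapper never reports a timeout, whereas a wrapper built from a zero duration is already expired, which is what a default of now() effectively produced before the fix.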
Avi Kivity
a9942bd84a Merge seastar upstream
* seastar f841d2d...08e02dc (3):
  > future: make future::wait() a supported function
  > scripts: perftune.py: don't allow cpu-mask that does't include any IRQ CPU
  > Tutorial: show nice dashes in HTML
2018-03-05 12:58:15 +02:00
Vlad Zolotarov
e3ca390333 tests: gce_snitch_test: drop the property file related message
The message in question is printed with printf() which is bad by itself.
And most importantly this test uses a single .property file so this message
doesn't add any interesting information to begin with. Therefore it makes
more sense to drop it than to fix it.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1519661059-13325-1-git-send-email-vladz@scylladb.com>
2018-03-04 16:16:37 +02:00
Takuya ASADA
3229a87fee dist/debian: Drop scylla-fstrim cron job from Debian 8/9
Since we install scylla-fstrim systemd unit files on Debian 8/9, there is no
need to install the cron job, so drop it.

Fixes #3249

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519950212-16231-2-git-send-email-syuu@scylladb.com>
2018-03-04 16:13:06 +02:00
Takuya ASADA
759b4de7a5 dist/debian: drop systemd unit files on Ubuntu 14.04
Ubuntu 14.04 uses upstart as its init program and doesn't need systemd unit
files, so drop them.

Fixes #3245

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519950212-16231-1-git-send-email-syuu@scylladb.com>
2018-03-04 16:13:05 +02:00
Vladimir Krivopalov
e9e9ec2d16 Guidelines for preparing patches in HACKING.md
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <93bf4d5c04848daf2157d1343748410995b224db.1520045191.git.vladimir@scylladb.com>
2018-03-04 16:12:00 +02:00
Piotr Jastrzebski
29eb9f30bc Fix memtable::clear_gently to work in debug mode.
It was getting into an infinite loop because
need_preempt was always returning true.

Tests: units (release,debug)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a324e7f576b247124080830455c920bdad1f617b.1520025213.git.piotr@scylladb.com>
2018-03-04 14:11:54 +02:00
Vladimir Krivopalov
99bd5180ba Fix Scylla compilation with Crypto++ v6.
In Crypto++ v6, the `byte` typedef has been moved from the global
namespace to the `CryptoPP::` namespace.

This fix brings in the CryptoPP namespace so that the `byte` typedef is
seen with both old and new versions of Crypto++.

Fixes #3252.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <799d055be710231884d101a52c0be8ed8b0a9806.1520125889.git.vladimir@scylladb.com>
2018-03-04 10:23:00 +02:00
Duarte Nunes
45d762703c Merge 'CQL syntax refinements for access-control' from Jesse
This patch series ties up some loose ends around CQL syntax for access-control statements.

The USER-based syntax statements are all backwards compatible. ROLE-specific statements have a new syntax which is described in "cql: Make role syntax more consistent". Other statements (like GRANT) have been updated to accept role names (instead of the more restrictive `username` rule).

Fixes #3217.

Tests: unit (debug)

* 'jhk/roles_syntax/v2' of https://github.com/hakuch/scylla:
  tests: Rename test for consistency
  cql: Eliminate uses of legacy `username` rule
  cql: Elaborate error for quoted user names
  cql: Allow role names to be string literals
  cql: Make role syntax more consistent
  tests: Add CQL syntax tests for access-control
2018-03-02 15:11:14 +00:00
Raphael S. Carvalho
954efcd209 storage_service: log sstable integrity checker status
INFO  2018-02-27 16:02:36,246 [shard 0] storage_service - SSTable data integrity checker is enabled.

Fixes #3071.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180228174253.9190-1-raphaelsc@scylladb.com>
2018-03-01 20:57:06 +01:00
Jesse Haber-Kucharsky
90af3d889a tests: Rename test for consistency
Now we have `cql_auth_query_test` and `cql_auth_syntax_test`.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
464f41d2bb cql: Eliminate uses of legacy username rule
All users of `username` are replaced with `userOrRoleName`, except in
USER-specific (legacy) statements: CREATE USER, ALTER USER, DROP USER.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
b84e22acdd cql: Elaborate error for quoted user names
Since quoted names are allowed for role names, we add a more descriptive
error message when a quoted name is (erroneously) used for a user name.

This behavior is consistent with Apache Cassandra.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
b5264d8bf7 cql: Allow role names to be string literals
This behavior matches that of Apache Cassandra. When a role name is
specified as a string literal (single quotes), the case is preserved.
2018-03-01 12:06:59 -05:00
Jesse Haber-Kucharsky
d7f2035dea cql: Make role syntax more consistent
This patch changes the syntax for CQL statements related to roles to
favor a form like

    CREATE ROLE sam WITH PASSWORD = 'shire' AND LOGIN = false;

instead of

    CREATE ROLE sam WITH PASSWORD 'shire' NOLOGIN;

This new syntax has the benefit of not imposing any ordering constraints
on the modifiers for roles and being consistent with other parts of the
CQL grammar. It is also consistent with syntax in Apache Cassandra.

The old USER-based statements (CREATE USER and ALTER USER) still have
the old forms for backwards compatibility.

A previous change modified the USER-related statements to allow for the
OPTIONS option. However, this was a mistake; only the PASSWORD option
should have been allowed. This patch also corrects this mistake.
2018-03-01 12:04:40 -05:00
Jesse Haber-Kucharsky
62bfc3939c tests: Add CQL syntax tests for access-control
These are quick-running tests for verifying the accepted forms of CQL
statements (and fragments) related to access-control: users, roles, and
permissions.

Establishing the allowed forms of statements is helpful for reference,
but also makes syntax changes (like those expected in later patches)
clearer and more safe.
2018-03-01 11:46:37 -05:00
Tomasz Grabiec
91ccf82ce4 mvcc: Improve printout of partition_snapshot_row_cursor
Multiline output is easier to read by humans.
Also, print continuity.

Message-Id: <1519909484-24531-1-git-send-email-tgrabiec@scylladb.com>
2018-03-01 13:44:00 +00:00
Takuya ASADA
101e909483 dist/debian: install scylla-housekeeping upstart script correctly on Ubuntu 14.04
Since we split the scylla-housekeeping service into two different services for systemd, we no longer share the same service name between systemd and upstart.
So handle it independently for each distribution, and install
/etc/init/scylla-housekeeping.conf on Ubuntu 14.04.

Fixes #3239

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519852659-10688-1-git-send-email-syuu@scylladb.com>
2018-03-01 10:36:11 +02:00
Takuya ASADA
69e3760920 dist/redhat: support CentOS/ppc64le
Support POWER architecture on Scylla.
Since DPDK is not fully supported on POWER (no PMD supported on it yet),
disabled it for now.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180228203048.21593-1-syuu@scylladb.com>
2018-03-01 09:59:39 +02:00
Tomasz Grabiec
30635510a2 intrusive_set_external_comparator: Fix _header having undefined color on move
swap_tree() doesn't change the color of the header, and because the header
was not initialized, it is undefined (can be either red or black). One
problem this causes is that algo::is_header() expects the header to be
always red. It is used by unlink(), which for trees which have a black
header would infinite-loop.

The fix is to initialize the header.

Fixes #3242.

Message-Id: <1519815091-13111-1-git-send-email-tgrabiec@scylladb.com>
2018-02-28 13:56:58 +02:00
Botond Dénes
ee307751e6 token_metadata: make get_host_id() and get_endpoint_for_host() const
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <febcb558848f8e06661bba592263e55e3192ed47.1519741336.git.bdenes@scylladb.com>
2018-02-27 16:29:13 +02:00
Duarte Nunes
76e6423910 database: Truncate views when truncating the base table
Fixes #3200

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180211124218.41373-1-duarte@scylladb.com>
2018-02-27 15:54:43 +02:00
Amnon Heiman
57d46c6959 scylla-housekeeping: need to support both debian/ubuntu variations
Debian and ubuntu list files come in two variations.
The housekeeping should support both.

This patch changes the regexp that matches the OS in the repository file.
After the introduction of the second list variation, the OS name can be in the middle of the path, not only at the end.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180227092543.19538-1-amnon@scylladb.com>
2018-02-27 11:40:47 +02:00
Botond Dénes
d088c7724e Make serialization-deserialization of range symmetric
Currently serializing and deserializing singular ranges is asymmetric.
When serializing a range we use the start() and end() functions to
obtain _start and _end respectively. However for singular ranges end()
will return _start and therefore the serialized range will have two
engaged optionals for bounds whereas the in-memory version will have only
one. The immediate consequence of this is that after serializing and
deserializing a range it will not compare equal to the original
serialized range. Needless to say this is *very* surprising behaviour.

To remedy the issue we fix the wrapping_range's constructor to not set
_end to the passed in value when the range is singular.
This way the on-wire format can stay compatible with how the range is
perceived by client code (when is_singular(): start() == end()) but
constructing the range from the wire-format will yield a range that will
always compare equal to the original one.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <e5f20b7b45f65ca1f7b347dcccd2ac462869e7ff.1519652739.git.bdenes@scylladb.com>
2018-02-26 20:24:55 +02:00
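The fix in d088c7724e can be illustrated with a stripped-down range type. Everything below is a hypothetical sketch: integer bounds, a plain struct as the wire format, and made-up names. Only the idea, dropping the passed-in end in the constructor when the range is singular, comes from the message.

```cpp
#include <cassert>
#include <optional>

// Hypothetical wire representation: both bounds plus the singular flag.
struct wire_range {
    std::optional<int> start;
    std::optional<int> end;
    bool singular;
};

class range_sketch {
    std::optional<int> _start;
    std::optional<int> _end;
    bool _singular;
public:
    // Used both by client code and by deserialization. For singular
    // ranges the passed-in end is deliberately not stored.
    range_sketch(std::optional<int> start, std::optional<int> end, bool singular = false)
        : _start(start)
        , _end(singular ? std::optional<int>() : end)
        , _singular(singular) {}

    static range_sketch make_singular(int v) {
        return range_sketch(v, std::nullopt, true);
    }

    const std::optional<int>& start() const { return _start; }
    // For singular ranges end() reports the start bound.
    const std::optional<int>& end() const { return _singular ? _start : _end; }
    bool is_singular() const { return _singular; }

    bool operator==(const range_sketch& o) const {
        return _start == o._start && _end == o._end && _singular == o._singular;
    }
};

wire_range serialize(const range_sketch& r) {
    return {r.start(), r.end(), r.is_singular()};
}

range_sketch deserialize(const wire_range& w) {
    return range_sketch(w.start, w.end, w.singular);
}
```

The wire form of a singular range still carries two engaged bounds, so the format is unchanged, yet the deserialized object compares equal to the original because the constructor discards the redundant end.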
Avi Kivity
d973445a94 Merge "sstable/schema extensions" from Calle
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)

Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object on boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).

Table/view tables already contain an extension element, so
we utilize this to persist config.

schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes).

Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instantiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.

Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"

* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
  extensions: Small unit test
  sstables: Process extensions on file open
  sstables::types: Add optional extensions attribute to scylla metadata
  sstables::disk_types: Add hash and comparator(sstring) to disk_string
  schema_tables: Load/save extensions table
  cql: Add schema extensions processing to properties
  schema_tables: Require context object in schema load path
  schema_tables: Add opaque context object
  config_file_impl: Remove ostream operators
  main/init: Formalize configurables + add extensions to init call
  db::config: Add extensions as a config sub-object
  db::extensions: Configuration object to store various extensions
  cql3::statements::property_definitions: Use std::variant instead of any
  sstables: Add extension type for wrapping file io
  schema: Add opaque type to represent extensions
  sstables::compress/compress: Make compression a virtual object
2018-02-26 17:15:29 +02:00
Paweł Dziepak
5dfa36c526 lsa: add basic sanitizer
LSA, being an allocator built on top of the standard one, may hide some
erroneous usage from AddressSanitizer. Moreover, it has its own classes
of bugs that could be caused by incorrect user behaviour (e.g. migrator
returning wrong object size).

This patch adds a basic sanitizer for the LSA that is active in debug
mode and verifies that the allocator is used correctly. If a problem is
found, it prints the information it has collected earlier about the
affected object. That includes the address and size of the object as well
as a backtrace of the allocation site. At the moment the following errors are
being checked for:
 * leaks, objects not freed at region destructor
 * attempts to free objects at invalid address
 * mismatch between object size at allocation and free
 * mismatch between object size at allocation and as reported by the
   migrator
 * internal LSA error: attempt to allocate object at already used
   address
 * internal LSA error: attempt to merge regions containing allocated
   objects at conflicting addresses

Message-Id: <20180226122314.32049-1-pdziepak@scylladb.com>
2018-02-26 14:35:13 +02:00
Botond Dénes
c4b5249a46 backlog_controller::adjust(): fix heap-overflow
Make sure idx will not be equal to _control_points.size() (and thus
overflow the vector) when looking for the first control-point with
a backlog not smaller than the current one, by stopping when it's equal
to _control_points.size() - 1.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <47841592792573d820650d570fa1ab7e58bdac2c.1518700405.git.bdenes@scylladb.com>
2018-02-26 13:47:38 +02:00
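
A minimal sketch of the bounded search described in the backlog_controller::adjust() fix above. The `control_point` struct and `find_control_point` function are hypothetical stand-ins, not the actual Scylla code; the point is only that the loop stops at `size() - 1`, so the returned index is always valid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a controller's control point.
struct control_point {
    float input;   // backlog value at this point
    float output;  // value the controller would emit at this point
};

// Returns the index of the first control point whose input is not smaller
// than `backlog`. The search stops at the last element, so the result can
// never equal points.size() and indexing with it cannot overflow the vector.
std::size_t find_control_point(const std::vector<control_point>& points, float backlog) {
    assert(!points.empty());
    std::size_t idx = 0;
    while (idx < points.size() - 1 && points[idx].input < backlog) {
        ++idx;
    }
    return idx;
}
```

With this bound, a backlog larger than every control point simply selects the last one instead of reading past the end.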
Avi Kivity
8fe2414b11 Merge seastar upstream
* seastar 383ccd6...f841d2d (8):
  > Merge "Randomize task queue in debug mode" from Duarte
  > tutorial: document seastar::thread
  > tutorial: add missing seastar namespace
  > tutorial: note about asynchronous functions throwing exceptions
  > thread: stop backtraces on aarch64 from underflowing the stack
  > Revert "core::thread: ARM64 version of annotating the frame"
  > core::thread: ARM64 version of annotating the frame
  > core/future-util: Release exception in repeater
2018-02-26 12:54:35 +02:00
Calle Wilund
e75d3dc997 extensions: Small unit test
Test basic operation of schema and sstable extensions
2018-02-26 10:43:37 +00:00
Paweł Dziepak
b103139e4f configure.py: do not ignore optimisation flags
Release mode flags are properly propagated through seastar --optflags
flag, but debug mode flags aren't. This is problematic since they are
used to enable additional debugging features.

After this patch we will end up with some duplicate flags, but that's
not really a problem.

Message-Id: <20180223173617.15199-1-pdziepak@scylladb.com>
2018-02-25 17:09:07 +02:00
Botond Dénes
206e7d40d4 restricted_mutation_reader: switch to std::variant
Tests: unit-tests(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a8930b764171db131d9d8d5fe4035014ecb452f4.1519391304.git.bdenes@scylladb.com>
2018-02-25 14:35:57 +02:00
Paweł Dziepak
6b66e4833b mvcc: avoid ubsan warning about uninitialised boolean
Message-Id: <20180223160133.21383-1-pdziepak@scylladb.com>
2018-02-23 16:54:23 +00:00
Jesse Haber-Kucharsky
82c8104c72 cql_test_env: Ignore error if user already exists
When a `cql_test_env` points to a data directory that was previously
populated with `cql_test_env`, then the "tester" user will already
exist. This is not an error, so we can just ignore the exception.

Fixes #3224.

Tests: unit (debug)
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <7729e5a98d8020a7ed1b6d12d8726559f0850f9d.1519315698.git.jhaberku@scylladb.com>
2018-02-22 19:30:50 +01:00
Raphael S. Carvalho
f59f423f3c Make sstable loading faster by not invoking all shards for each sstable
Before 312bd9ce25, boot had to call all shards for each sstable
such that they would agree/disagree on their deletion, an atomic
deletion manager requirement.

After its removal, we can afford to call only the shards that own
a given sstable.

This reduces the number of operations from (SSTABLES) * (SHARD_COUNT)
to usually (SSTABLES). It may be the same as before after resharding,
but resharding is a one-off operation.

Boot time should be significantly reduced for nodes with a high smp
count and column family using leveled strategy (which can end up with
thousands of sstables).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180220032554.17776-1-raphaelsc@scylladb.com>
2018-02-22 09:39:56 +00:00
Amnon Heiman
edcfab3262 dist/docker: Add support for housekeeping
This patch takes a modified version of the Ubuntu 14.04 housekeeping
service script and uses it in Docker to validate the current version.

To disable the version validation, pass the --disable-version-check flag
when running the container.

Message-Id: <20180220161231.1630-1-amnon@scylladb.com>
2018-02-21 09:26:02 +02:00
Duarte Nunes
e75f7c41d9 Merge 'Proper clean-up on closing index_reader' from Vladimir
With the changes introduced in #2981 and #3189, the lifetime management
of the objects used by index_reader became more complicated.
This patchset addresses the immediate problems caused by lack of proper
handling.

The more holistic approach to this will take more time and is to be
implemented under #3220. The current fix, however, should be good
enough as a stop-gap solution.

* 'issues/3213/v3' of https://github.com/argenet/scylla:
  Close promoted index streams when closing index_readers.
  Support proper closing of prepended_input_stream.
2018-02-21 01:02:16 +00:00
Vladimir Krivopalov
c996191411 Close promoted index streams when closing index_readers.
Promoted index input streams must be explicitly closed when closing the
index_reader in order to ensure all the pending read-aheads are
completed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-02-20 16:04:15 -08:00
Vladimir Krivopalov
8d52d809f7 Support proper closing of prepended_input_stream.
When the stream is being closed, the call is forwarded to the stored
data_source.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-02-20 16:04:05 -08:00
Vladimir Krivopalov
721bd3eef6 Added missing 'override' to skip() in buffer_input_stream and prepended_input_stream.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <4e91bead8de7f6fa9b3bfdab8bda73efdb22749d.1519152303.git.vladimir@scylladb.com>
2018-02-20 19:49:11 +00:00
Pekka Enberg
f1f691b555 Merge "Add the GoogleCloudSnitch" from Vlad
"This series adds the GoogleCloudSnitch.

 Fixes #1619"

* 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla:
  config: uncomment/add the supported snitches description
  tests: added gce_snitch_test
  locator::gce_snitch: implementation of the GoogleCloudSnitch
  locator::snitch_base: properly log the failure during the snitch startup
2018-02-19 15:58:56 +02:00
Paweł Dziepak
d97eebe82d tests/cql3: increase TTL to avoid spurious failures
The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasing the TTLs by x100 should help avoid these
false positives.

Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
2018-02-19 15:40:19 +02:00
Pekka Enberg
bd365a10d3 Merge "Add an API to get all active repairs" from Amnon
"This series adds an API to return the active repairs by their IDs.

 After this series a call to:

   curl -X GET --header "Accept: application/json" "http://localhost:10000/storage_service/active_repair/"

 Will return an array with the ids of the active repairs.

 Fixes #3193"

* 'amnon/get_active_repairs_v3' of github.com:scylladb/seastar-dev:
  API: Add get active repair api
  repair: Add a get_active_repairs function to return the active repair
2018-02-19 15:32:17 +02:00
Amnon Heiman
4a8f67aa01 conf: Remove unsupported 'stream_throughput_outbound_megabits_per_sec' option
stream_throughput_outbound_megabits_per_sec is not supported and is
found in the unsupported part of scylla.yaml.

This patch removes it from the supported part of the file.

Fixes #2876

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180219111421.30687-1-amnon@scylladb.com>
2018-02-19 15:16:23 +02:00
Duarte Nunes
d394b30882 tests/flush_queue_test: Ensure queue is closed before being destroyed
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217172008.27551-1-duarte@scylladb.com>
2018-02-19 13:10:28 +00:00
Duarte Nunes
294326b5b1 tests/commitlog_test: Close file
Operations on a append_challenged_posix_file_impl schedule asynchronous
operations when they are executed, which capture the file object. To
synchronize with them and prevent use-after-free, we need to call
close() before destroying the file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217170556.27330-1-duarte@scylladb.com>
2018-02-19 13:10:14 +00:00
Duarte Nunes
ac55210677 tests/logalloc_test: Ensure regions are reclaimed in order
This test relied on task execution order to work correctly. Namely, it
relied on parent regions being reclaimed before child regions
(reclaiming is an asynchronous process started by a call to
start_reclaiming()). This order is necessary because child regions
don't know about parent regions when calculating the biggest region
that should be reclaimed.

We fix this by forcing the reclaim order.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217121655.26057-1-duarte@scylladb.com>
2018-02-19 13:09:59 +00:00
Duarte Nunes
f665f1ab97 db/commitlog: Close the segment file
Operations on a segment's underlying append_challenged_posix_file_impl,
such as truncate(), schedule asynchronous operations when they are
executed, which capture the file object. To synchronize with them and
prevent use-after-free, we need to call close() and only delete the
segment and file when the returned future resolves.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216235754.24257-1-duarte@scylladb.com>
2018-02-19 13:09:41 +00:00
Duarte Nunes
7004f6c7ff db/commitlog: Actually prevent new requests during shutdown
When shutting down the commitlog we try to block all new requests by
acquiring all available resources. We were, however, letting go of the
semaphore permits too early, before closing the gate and shutting down
the active segments.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216234826.24111-1-duarte@scylladb.com>
2018-02-19 13:09:26 +00:00
Duarte Nunes
9ce0be60d4 utils/flush_queue: Remove unused function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216234502.23931-1-duarte@scylladb.com>
2018-02-19 13:09:11 +00:00
Duarte Nunes
4fdcd6c92f tests/serialized_action_test: Don't rely on task execution order
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180216191050.21902-1-duarte@scylladb.com>
2018-02-19 13:08:58 +00:00
Duarte Nunes
03608d269e Merge 'On the road to roles' from Jesse
This series takes Scylla most of the way to supporting roles, and
eliminates old user-based code. All the old user-based CQL statements
and functionality should exist as they did before, except now they are
backed internally by roles.

While all the functionality for supporting roles should be present,
role-specific features like granting a role to another role still warn
as "unimplemented". This will continue until the next series addresses
the final touches. These remaining items are:

- A slightly revised CQL syntax consistent with Apache Cassandra's
  revised role syntax.

- A user is automatically granted permissions on resources they create.

Users running a previous version of Scylla should be able to seamlessly
upgrade to a version of Scylla with this series merged. When a newly
upgraded node starts, it detects the presence of old metadata and copies
it to the new metadata tables if no nondefault new metadata yet exists.
A new gossiper feature flag, ROLES, also ensures that access-control
data is not modified while a cluster is in a partially-upgraded state.
If, when the cluster is in a partially upgraded state, a client connects
to an un-upgraded node then likely the change will not be propagated to
the new metadata table. We will document that changes to access-control
are not supported while upgrading in order to account for both cases
(a client connecting to an upgraded and a non-upgraded node).

All unit tests pass (except those which also fail on `master`).

I've run auth-related dtests and they all pass, except for tests which
depend on the old security model and which are therefore invalid.
Upstream dtests have been updated to account for this new security model,
and I will open an appropriate pull request to similarly update our
own version.

I have also done a test-run cluster upgrade procedure with ccm
consisting of a 3 node cluster. I began by creating the cluster from
`master` and increasing the replication factor of the `system_auth`
keyspace to 3 and repairing the nodes. I then created several users and
granted them permissions on some resources. I then stopped a node,
updated its hardlinked executable to Scylla built from this patch
series, and restarted the node. I observed the migration of legacy data
starting and finishing. Connecting to the node, I observed all the new
roles functionality was working correctly. I verified that attempting to
change access-control information failed with a message about an
upgrading cluster. I repeated the process, node by node, with the
remaining two nodes and finally observed that the entire cluster had
upgraded and that I could modify access-control information freely.  I
will encapsulate this test into a dtest if possible.

Fixes #1941.

* 'jhk/switch_to_roles/v6' of https://github.com/hakuch/scylla: (83 commits)
  cql3: Remove some unimplemented warnings
  cql3: Prevent unhandled exception for anonymous user
  auth: Add alias for set of role names
  auth: Revoke permissions on dropped role resources
  auth: Move definition to corresponding .cc file
  cql3: Fix life-time of `user` from `client_state`
  auth: Migrate legacy data on boot
  auth: Check protected resources of the role-manager
  auth: Protect authenticator resources
  service/client_state: Correct erroneous comment
  client_state: Fix error message
  cql3: Fix error handling for GRANT and REVOKE
  auth: Remove unnecessary `sstring` allocation
  cql3: Rename variables to reflect roles
  auth: Decouple authorization and role management
  auth: Add code to expand a resource family
  cql: Also add `username` col. for LIST PERMISSIONS
  cql3: Fix error handling in LIST PERMISSIONS
  auth: Change error messages to pass dtests
  cql3: Handle errors more precisely for roles
  ...
2018-02-16 13:57:29 +00:00
Tomasz Grabiec
9c3e56fb16 tests: row_cache: Improve test for snapshot consistency on eviction
Reproduces https://github.com/scylladb/scylla/issues/3215.
Message-Id: <1518710592-21925-1-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:23 +00:00
Tomasz Grabiec
b0b57b8143 mvcc: Do not move unevictable snapshots to cache
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted relative to the merged view, not
within a version, so evicting from some of the versions may mark
ranges as continuous when before they were discontinuous. Also, range
tombstones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.

The fix is to revert back to full eviction, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.

Fixes #3215.

Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:07 +00:00
Paweł Dziepak
1e218e2b80 Merge "Fixes for exception safety in cache and LSA" from Tomasz
"Fixes two issues:
  - update may abort if allocation of an empty partition_version fails
  - LSA region construction is not exception safe, it may leave the misconstructed
    region registered if allocation inside region_group::add() fails."

* tag 'tgrabiec/exception-safety-cache-update-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add test for exception safety of updates from memtable
  tests: flat_reader_assertions: Improve failure message
  cache: Handle exceptions from make_evictable()
  tests: Disable failure injection around background compactor
  lsa: Disable allocation failure injection inside merge()
  lsa: Make region deregistration robust against duplicates
  lsa: Make region allocation exception safe
2018-02-15 10:32:08 +00:00
Tomasz Grabiec
b3415880b2 tests: row_cache: Add test for exception safety of updates from memtable 2018-02-15 10:13:02 +01:00
Jesse Haber-Kucharsky
2348c303df cql3: Remove some unimplemented warnings
While there are some small remaining features for roles, all the old
user-based statements still exist as they did before (except now they're
backed by roles) and should not log warnings.
2018-02-14 14:16:00 -05:00
Jesse Haber-Kucharsky
114cfd4e5a cql3: Prevent unhandled exception for anonymous user
Since `validate` is called after `check_access`, an anonymous user would
not get the expected error message about restrictions on anonymous
users.
2018-02-14 14:16:00 -05:00
Jesse Haber-Kucharsky
a83af20311 auth: Add alias for set of role names
This shortens some type names considerably.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
39a44e3494 auth: Revoke permissions on dropped role resources
Previously, when a table or keyspace was dropped, the
authorizer (through a `migration_listener`) automatically dropped all
permissions granted on that resource.

Likewise, when a role is granted permissions and the role is dropped,
all permissions granted to the role are dropped.

In this change, we now treat role resources just like table and keyspace
resources: if a permission is granted on a role (like "GRANT AUTHORIZE
ON ROLE qa TO phil") and the "qa" role is dropped, then all permissions
on the "qa" role resource are also dropped.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
e6d9d53eca auth: Move definition to corresponding .cc file 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
89b5bf2d7a cql3: Fix life-time of user from client_state 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
fbc97626c4 auth: Migrate legacy data on boot
This change allows for seamless migration of the legacy users metadata
to the new role-based metadata tables. This process is summarized in
`docs/migrating-from-users-to-roles.md`.

In general, if any nondefault metadata exists in the new tables, then
no migration happens. If, in this case, legacy metadata still exists
then a warning is written to the log.

If no nondefault metadata exists in the new tables and the legacy tables
exist, then each node will copy the data from the legacy tables to the
new tables, performing transformations as necessary. An informational
message is written to the log when the migration process starts, and
when the process ends. During the process of copying, data is
overwritten so that multiple nodes racing to migrate data do not
conflict.

Since Apache Cassandra's auth. schema uses the same table for managing
roles and authentication information, some useful functions in
`roles-metadata.hh` have been added to avoid code duplication.

Because a superuser should be able to drop the legacy users tables from
`system_auth` once the cluster has migrated to roles and is functioning
correctly, we remove the restriction on altering anything in the
"system_auth" keyspace. Individual tables in `system_auth` are still
protected later in the function.

When a cluster is upgrading from one that does not support roles to one
that does, some nodes will be running old code which accesses old
metadata and some will be running new code which accesses new metadata.

With the help of the gossiper `feature` mechanism, clients connecting to
upgraded nodes will be notified (through code in the relevant CQL
statements) that modifications are not allowed until the entire cluster
has upgraded.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
8be0165713 auth: Check protected resources of the role-manager
A new function `auth::service::is_protected` checks the
protected-resource set of all access-control modules (including the
role-manager).
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
8440140465 auth: Protect authenticator resources
A typo meant that only the authorizer resources were protected.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
617e432540 service/client_state: Correct erroneous comment 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
e27cfd4dda client_state: Fix error message
Now that resources are not just keyspaces and tables, the word "schema"
doesn't make sense.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
f9f03bc2e1 cql3: Fix error handling for GRANT and REVOKE
This change gets rid of duplicated code for checking if the grantee or
revokee exist by moving this functionality to the auth. service.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
e18adbcb3e auth: Remove unnecessary sstring allocation
The authorizer now accepts parameters by `string_view`.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
c1a03dbf54 cql3: Rename variables to reflect roles 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
5be16247cc auth: Decouple authorization and role management
Access control in Scylla consists of three main modules: authentication,
authorization, and role-management.

Each of these modules is intended to be interchangeable with alternative
implementations. The `auth::service` class composes these modules
together to perform all access-control functionality, including caching.

This architecture implies two main properties of the individual
access-control modules:

- Independence of modules. An implementation of authentication should
  have no dependence or knowledge of authorization or role-management,
  for example.

- Simplicity of implementing the interface. Functionality that is common
  to all implementations should not have to be duplicated in each
  implementation. The abstract interface for a module should capture
  only the differences between particular implementations.

Previously, the authorization interface depended on an instance of
`auth::service` for certain operations, since it required aggregation
over all the roles granted to a particular role or required checking if
a given role had superuser.

This change decouples authorization entirely from role-management: the
authorizer now manages only permissions granted directly to a role, and
not those inherited through other roles.

When a query needs to be authorized, `auth::service::get_permissions`
first uses the role manager to check if the role has superuser. Then, it
aggregates calls to `auth::authorizer::authorize` for each role granted
to the role (again, from the role-manager) to determine the sum-total
permission set. This information is cached for future queries.

This structure allows for easier error handling and
management (something I hope to improve in the future for both the
authorizer and authenticator interfaces), easier system testing, easier
implementation of the abstract interfaces, and clearer system
boundaries (so the code is easier to grok).

Some authorizers, like the "TransitionalAuthorizer", grant permissions
to anonymous users. Therefore, we could not unconditionally authorize an
empty permission set in `auth::service` for anonymous users. To account
for this, the interface of the authorizer has changed to accept an
optional name in `authorize`.

One additional notable change to the authorizer is the
`auth::authorizer::list`: previously, the filtering happened at the CQL
query layer and depended on the roles granted to the role in question.
I've changed the function to simply query for all roles and I do the
filtering in `auth::system` in-memory with the STL. This was necessary
to allow the authorizer to be decoupled from role-management. This
function is only called for LIST PERMISSIONS (so performance is not a
concern), and it significantly reduces demand on the implementation.

Finally, we unconditionally create a user in `cql_test_env` since
authorization requires its existence.
2018-02-14 14:15:59 -05:00
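
The aggregation flow described in the commit above can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions: `role_manager_sketch`, `authorizer_sketch`, the `perm_*` constants and this free function `get_permissions` are all hypothetical, futures and caching are omitted, and the superuser check looks only at the role itself.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using permission_set = std::uint32_t;          // hypothetical bitmask of permissions
constexpr permission_set perm_select = 1u;
constexpr permission_set perm_modify = 2u;
constexpr permission_set perm_all = 0xFFFFFFFFu;

// Hypothetical role manager: knows who is granted which roles, and who is a superuser.
struct role_manager_sketch {
    std::map<std::string, std::set<std::string>> granted;
    std::set<std::string> superusers;

    bool has_superuser(const std::string& role) const {
        return superusers.count(role) != 0;
    }

    // The role itself plus every role reachable through grants.
    std::set<std::string> all_granted(const std::string& role) const {
        std::set<std::string> seen;
        std::vector<std::string> pending{role};
        while (!pending.empty()) {
            std::string current = pending.back();
            pending.pop_back();
            if (!seen.insert(current).second) {
                continue;  // already visited
            }
            auto it = granted.find(current);
            if (it != granted.end()) {
                pending.insert(pending.end(), it->second.begin(), it->second.end());
            }
        }
        return seen;
    }
};

// Hypothetical authorizer: only knows permissions granted directly to a role.
struct authorizer_sketch {
    std::map<std::string, permission_set> direct;

    permission_set authorize(const std::string& role) const {
        auto it = direct.find(role);
        return it == direct.end() ? 0u : it->second;
    }
};

// The service layer combines both modules; neither module knows about the other.
permission_set get_permissions(const role_manager_sketch& rm,
                               const authorizer_sketch& az,
                               const std::string& role) {
    if (rm.has_superuser(role)) {
        return perm_all;
    }
    permission_set total = 0;
    for (const auto& r : rm.all_granted(role)) {
        total |= az.authorize(r);
    }
    return total;
}
```

Keeping inheritance in the service layer is what lets the authorizer stay unaware of role-management in this sketch.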
Jesse Haber-Kucharsky
0ac7d9922d auth: Add code to expand a resource family
This will be useful for the next change, where it is used for
refactoring LIST PERMISSIONS.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
d0ddb354d0 cql: Also add username col. for LIST PERMISSIONS
The value for the `role` column is equal to the value for the `username`
column.

This change makes LIST PERMISSIONS backwards compatible with clients
that expect the `username` column to exist. This functionality also
exists in Apache Cassandra.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
cccfe269cf cql3: Fix error handling in LIST PERMISSIONS
This patch replaces duplicated code for checking the existence of a user
with the same mechanism for doing so as elsewhere: by checking for
`auth::nonexistent_role` being thrown during the course of checking
access-control.

This patch also ensures that exceptions thrown while querying the list
of permissions on a resource get handled correctly.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
13ba128967 auth: Change error messages to pass dtests
The fixed dtests which only failed due to differences in wording and
grammar for error messages are:

- altering_nonexistent_user_throws_exception_test
- cant_create_existing_user_test
- dropping_nonexistent_user_throws_exception_test
- users_cant_alter_their_superuser_status_test
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
f372bbb4bc cql3: Handle errors more precisely for roles
This patch ensures that all the CQL statements for managing roles
correctly catch exceptions in the underlying `role_manager` and re-throw
them as top-level exceptions (like "invalid request").

This patch also refines exception handling so that only the applicable
errors are explicitly caught. This should allow easier auditing in the
future and help to reveal faulty assumptions.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
ce3be07556 auth: Move resource existence checks
Previously, a "data" auth. resource knew how to check its own existence by
accessing a global variable.

This patch accomplishes two things: it adds existence checking to all
kinds of resources, and moves these checks outside of `auth::resource`
itself and into `auth::service` (so that global variables are no longer
accessed).
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
cf5f6aa4c5 auth: Fix fragile variable life-times
According to the Seastar convention, a parameter passed to a function
taking a reference parameter must live for the duration of the execution
of the returned future.

When possible, variables are statically allocated. When this is not
possible, we use `do_with`.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
5f323a3530 cql3: Check only filtered permissions
When a user executes GRANT or REVOKE, Scylla ensures that they
themselves are granted the permissions they are changing.

The code previously checked a static list of permissions, which we could
have replaced with `auth::permissions::ALL`. Even better, we now expand
the set of filtered permissions into an iterable container.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
f4fc12fbf0 enum_set: Add iterator
Sometimes it is useful to be able to query for all the members of an
`enum_set`, rather than just add, remove, and query for membership. (The
patch following this one makes use of this in the auth. sub-system).

We use the bitset iterator in Seastar to help with the implementation.
2018-02-14 14:15:59 -05:00
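
As an illustration of iterating over the members of a bitmask-backed set, here is a small self-contained sketch. It does not use the Seastar bitset iterator mentioned in the commit above; `permission` and `permission_set_sketch` are hypothetical names, and the iterator simply walks the set bits from lowest to highest.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class permission : unsigned { create = 0, alter = 1, drop = 2, select = 3 };

class permission_set_sketch {
    std::uint32_t _mask = 0;
public:
    void set(permission p) { _mask |= 1u << static_cast<unsigned>(p); }
    bool contains(permission p) const {
        return (_mask >> static_cast<unsigned>(p)) & 1u;
    }

    // Forward iterator over the set bits of the mask.
    class iterator {
        std::uint32_t _remaining;
    public:
        explicit iterator(std::uint32_t mask) : _remaining(mask) {}
        permission operator*() const {
            unsigned idx = 0;
            while (!((_remaining >> idx) & 1u)) {
                ++idx;  // skip to the lowest set bit
            }
            return static_cast<permission>(idx);
        }
        iterator& operator++() {
            _remaining &= _remaining - 1;  // clear the lowest set bit
            return *this;
        }
        bool operator!=(const iterator& other) const {
            return _remaining != other._remaining;
        }
    };

    iterator begin() const { return iterator(_mask); }
    iterator end() const { return iterator(0); }
};

// Collects all members, in ascending order of their enumeration index.
std::vector<permission> members(const permission_set_sketch& s) {
    std::vector<permission> out;
    for (permission p : s) {
        out.push_back(p);
    }
    return out;
}
```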
Jesse Haber-Kucharsky
bbe09a4793 enum_set: Throw on bad mask
`super_enum::valid_is_valid_sequence` determines if the numeric index
corresponding to an enumeration value is valid. This is important,
because it is undefined behavior to cast an invalid index into an
enumeration value.

This function is used to check the validity of the `enum_set` mask when
an `enum_set` is constructed in `enum_set::from_mask`. If the mask has
set bits that correspond to invalid enumeration indices, then we throw
`bad_enum_set_mask`.
2018-02-14 14:15:59 -05:00
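
A rough sketch of the mask validation described in the commit above. The names `color`, `color_set_sketch` and `bad_mask_sketch` are hypothetical and stand in for `enum_set` and `bad_enum_set_mask`; the sketch assumes the enumeration indices are contiguous starting at zero.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

enum class color : unsigned { red = 0, green = 1, blue = 2 };
constexpr unsigned color_count = 3;

// Bits 0..color_count-1 are the only ones that map to a valid enumerator.
constexpr std::uint32_t valid_bits = (1u << color_count) - 1;

struct bad_mask_sketch : std::invalid_argument {
    using std::invalid_argument::invalid_argument;
};

struct color_set_sketch {
    std::uint32_t mask;

    // Rejects masks with bits that do not correspond to any enumerator, so
    // later code never casts an invalid index into the enumeration type.
    static color_set_sketch from_mask(std::uint32_t mask) {
        if (mask & ~valid_bits) {
            throw bad_mask_sketch("mask has bits for invalid enumeration indices");
        }
        return color_set_sketch{mask};
    }
};

// Helper for callers that prefer a boolean over an exception.
bool accepts(std::uint32_t mask) {
    try {
        color_set_sketch::from_mask(mask);
        return true;
    } catch (const bad_mask_sketch&) {
        return false;
    }
}
```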
Jesse Haber-Kucharsky
1cf6dd85fb tests: Add basic tests for enum_set
This is motivated by a small addition to `enum_set` and `super_enum`
that follows this patch.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
7db675b298 cql3: Remove std::move on return value
This prevents guaranteed return-value optimization (RVO).
2018-02-14 14:15:59 -05:00
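
The effect described in the commit above can be seen with a small sketch. The changed cql3 code is not shown in this log, so `tracked`, `make_plain` and `make_moved` are hypothetical; the sketch only counts how often the move constructor runs when a temporary is returned directly versus wrapped in `std::move`. Compilers may warn about the pessimizing move in `make_moved`, which is the point.

```cpp
#include <cassert>
#include <utility>

int move_count = 0;  // counts move constructions of `tracked`

struct tracked {
    tracked() = default;
    tracked(const tracked&) = default;
    tracked(tracked&&) noexcept { ++move_count; }
};

// Returning a prvalue: in C++17 the result object is initialized directly,
// so no move constructor runs.
tracked make_plain() {
    return tracked();
}

// Wrapping the temporary in std::move turns it into an xvalue, so the
// result has to be move-constructed from it.
tracked make_moved() {
    return std::move(tracked());
}
```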
Jesse Haber-Kucharsky
357f3afb60 auth: Remove outdated "TODO"
Authorization never happens at this level of the stack, though it
formerly did.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
b1d9d0e4ff auth: Reorder authorizer args for consistency 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
c1504cd4ff auth: Pass resource by const ref.
This has the dual benefit of not enforcing copying on implementations of
the abstract interface and also limiting unnecessary copies.

As usual with Seastar, we follow the convention that a reference
parameter to a function is assumed valid for the duration of the
`future` that is returned. `do_with` helps here.

By adding some constants for root resources, we can avoid using
`seastar::do_with` at some call-sites involving `resource` instances.
2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
45631604b0 auth: Use string_view for parameters 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
c4f686c10f auth: Put definitions inside namespace 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
3de8b4c898 auth/resource: Don't store exn. argument 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
7fd3539d94 cql3: Avoid redundant return when throwing 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
81f38edc61 auth/service: Rename function for consistency 2018-02-14 14:15:59 -05:00
Jesse Haber-Kucharsky
ac3c68b0ac auth/role_manager.hh: Unify doc. style 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
0c6bd791c2 auth/role_manager: Remove unnecessary exn. info
We can add it back on an as-needed basis. The other exceptions in the
module do not make similar information available.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
0590dcf6cd auth/authorizer: Add missing const 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a3eaf9e697 auth: Remove unused "performer" argument
This argument used to be used for access-control checks, but this has
all moved to the CQL layer.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
5fe464d999 auth/default_authorizer: Move access-checks to CQL
All authorization checking lives in the CQL layer. The individual
authenticator, authorizer, and role-manager enforce no access-checks.

It may be a good idea to move these checks a level downward in the
future for ease of testing, but for now we aim for consistency.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
4d2c4177df cql3/list_permissions_statement: Fix formatting
Something strange must have happened with somebody's editor.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
45c6d13812 auth: Remove useless try-catch block
This looks to have been a typo in the original porting work.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
2dc9f00fe3 cql3: Use authenticated_user-specific overload
This prevents us from accidentally accessing a non-existent value.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
68ba6a481b auth: Add has_role helper 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
f8bbbfd8f9 auth: Check role existence when querying perms 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a0f0e07554 auth: Check for unsupported authentication options
While it's undefined behavior to pass an unsupported option to a
specific authenticator directly, the `auth::service` layer will check
options and throw this exception. It is turned into a
an `invalid_request_exception` by the CQL layer.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
e6363e15de auth/resource: Construct from ctor
The motivation behind this change is the idea that constructing a new
instance of an object is the job of the constructor.

One big benefit of this structure (with the addition of helpers for
convenience) is that calls for emplacing instances (like
`std::make_shared`, or `std::vector::emplace_back`) work without any
difficulty. This would not be true for static construction functions.
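
The emplacement point can be sketched as follows. This is a minimal illustration with hypothetical names (`resource_kind`, `make_role_resource`, `build_resources`), not the real `auth::resource` interface:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a resource type; not the real auth::resource.
enum class resource_kind { data, role };

class resource {
    resource_kind _kind;
    std::string _name;
public:
    // Construction is the constructor's job.
    resource(resource_kind kind, std::string name)
        : _kind(kind), _name(std::move(name)) {}

    resource_kind kind() const { return _kind; }
    const std::string& name() const { return _name; }
};

// Convenience helper layered on top of the constructor (assumed name).
inline resource make_role_resource(std::string name) {
    return resource(resource_kind::role, std::move(name));
}

inline std::vector<resource> build_resources() {
    std::vector<resource> rs;
    // emplace_back forwards its arguments straight to the constructor.
    rs.emplace_back(resource_kind::data, "ks1");
    rs.push_back(make_role_resource("admin"));
    return rs;
}
```

`std::make_shared<resource>(resource_kind::role, "bob")` works the same way, because it also forwards to a public constructor.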
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
12d6f5817d auth: Switch to std::optional
Now that Scylla is a C++17 application, we should no longer use
`std::experimental::optional` (which is a distinct type from
`std::optional`).
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a633777378 auth/authorizer.hh: Use default keyword 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
739f0e2dbd auth: Move static member function decl. up 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
2e1c3823d0 auth/authorizer: Delete unused member function 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
59c100b37f auth: Use virtual and override
According to previous discussions on the mailing-list with Avi, using
both has the benefits of making virtual functions stand out and also
warning about functions which unintentionally do not override.
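
A small sketch of the convention, using made-up types (`base_module`, `derived_module`) rather than the real auth interfaces:

```cpp
#include <string>

// Hypothetical base interface, for illustration only.
struct base_module {
    virtual ~base_module() = default;
    virtual std::string qualified_name() const { return "base"; }
};

struct derived_module final : base_module {
    // `virtual` makes the function stand out as part of the polymorphic
    // interface; `override` makes the compiler reject the declaration if
    // the signature does not match a virtual function in the base class.
    virtual std::string qualified_name() const override { return "derived"; }
};

// Calls through the base reference, so the override is what runs.
inline std::string name_via_base(const base_module& m) {
    return m.qualified_name();
}
```

If `const` were dropped from the derived declaration, `override` would turn the silent mismatch into a compile error.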
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
4d9f957dc2 auth/authenticator.hh - Use default keyword 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
f78d89968e auth/authorizer.hh: Replace documentation 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a66896dd8f auth/authenticator.hh: Replace documentation 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
053b6b4d04 auth: Unify formatting
The goal is for all files in `auth/` to conform to the Seastar/Scylla
`coding-style.md` document.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
a4c7aee238 auth: Fix includes 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
de33124c39 Don't store authenticated_user in shared_ptr
All we require are value semantics.

`client_state` still stores `authenticated_user` in a `shared_ptr`, but
the behavior of that class is complex enough to warrant its own
discussion/design/refactor.
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
f7b4f62dab auth/authenticated_user: Add some documentation 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
e11de26d50 auth: Simplify authenticated_user interface
The most important change is replacing `auth::authenticated_user::name`
with a public `std::optional<sstring>` member. Anonymous users have no
name. This replaces the insecure and bug-prone special-string of
"anonymous" for anonymous users, which does unfortunate things with the
authorizer.

The new `auth::is_anonymous` function exists for convenience since
checking the absence of a `std::optional` value can be tedious.

When a caller really wants a name unconditionally, a new stream output
function is also available.
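
Roughly, the resulting interface could look like the sketch below. It makes two assumptions: `std::string` is substituted for `sstring` so that the example is self-contained, and the stream operator is assumed to print "anonymous" when there is no name. `describe` is a hypothetical helper added only for the example.

```cpp
#include <optional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Simplified stand-in; the commit uses sstring rather than std::string.
struct authenticated_user final {
    std::optional<std::string> name;  // empty for anonymous users

    authenticated_user() = default;
    explicit authenticated_user(std::string n) : name(std::move(n)) {}
};

inline bool is_anonymous(const authenticated_user& u) noexcept {
    return !u.name.has_value();
}

// For callers that want a name unconditionally.
// The text printed for anonymous users is an assumption.
inline std::ostream& operator<<(std::ostream& os, const authenticated_user& u) {
    return os << (u.name ? *u.name : std::string("anonymous"));
}

// Hypothetical helper: render a user through the stream operator.
inline std::string describe(const authenticated_user& u) {
    std::ostringstream os;
    os << u;
    return os.str();
}
```

With the name held in an optional, a real user called "anonymous" can no longer be confused with an unauthenticated one.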
2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
308a0be5c2 auth/authenticated_user: Make ctor explicit 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
9ac6035f5d auth/authenticated_user: Use std::optional 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
0d1ea0a357 auth/authenticated_user: Mark functions noexcept 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
6cb3b06112 auth/authenticated_user: Remove outdated comment 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
64f844b870 auth/authenticated_user: Hide internal constant 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
15a2b93970 auth/authenticated_user: Use default ctors 2018-02-14 14:15:58 -05:00
Jesse Haber-Kucharsky
fa94ee5a3a auth/authenticated_user: Move defns into namespace 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
4fad30ef42 auth/authenticated_user: Remove whitespace 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
2dd632f6e8 auth/authenticated_user: Use string_view in ctor 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
fa159c0ac4 auth: Mark authenticated_user final 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
f18dd25e7e cql3: Fix DROP ROLE IF EXISTS
Checking if the role to be dropped has superuser requires that the role
exists, which means `auth::nonexistent_role` was thrown even when IF
EXISTS was specified.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
b69c27d210 auth/standard_role_manager: Avoid string copies 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
bcc1fbad3a auth/service.hh: Fix documentation for errors
There is a distinct difference between throwing an exception
immediately and returning an exceptional future.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
741d215516 auth: Switch to roles from users
This is a large change, but it's a necessary evil.

This change brings us to a minimally-functional implementation of roles.
There are many additional changes that are necessary, including refined
grammar, bug fixes, code hygiene, and internal code structure changes.
In the interest of keeping this patch somewhat readable, those changes
will come in subsequent patches. Until that time, roles are still marked
"unimplemented".

IMPORTANT: This code does not include any mechanism for transitioning a
cluster from user-based access-control to role-based access control. All
existing access-control metadata will be ignored (though not deleted).

Specific changes:

- All user-specific CQL statements now delegate to their roles
  equivalent. The statements are effectively the same, but CREATE USER
  will include LOGIN automatically. Also, LIST USERS only lists roles
  with LOGIN.

- A call to LIST PERMISSIONS will now also list permissions of roles
  that have been granted to the caller, in addition to permissions which
  have been granted directly.

- Much of the logic of creating, altering, and deleting roles has been
  moved to `auth::service`, since these operations require cooperation
  between the authenticator, authorizer, and role-manager.

- LIST USERS actually works as expected now (fixes #2968).
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
41f893d676 Don't use "experimental" optional
We're in C++17 country now.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
903ea32f30 auth/standard_role_manager: Fix life-time bug
It worked most of the time, but changes in other areas of the code must
have triggered the conditions necessary to make it fail.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
8878ce456c cql3/statements: Use convenient type alias 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
36b283f7ea auth: Allow empty role updates 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
34280c18bb tests: Rename helper function for clarity 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
635dc3d5ed auth: Include missing header 2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
f2b78499fe auth: Fix logic in service::role_has_superuser
The previous code has an off-by-one error since the iterator is
incremented unconditionally prior to being compared to the end of the
collection.

This new version is also shorter thanks to `seastar::do_until`.
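
For illustration only, the class of bug described here can be avoided with a loop of the following shape. This is plain synchronous C++ with hypothetical names (`role`, `any_superuser`), not the actual Scylla code and not `seastar::do_until`:

```cpp
#include <string>
#include <vector>

// Hypothetical role record for the example.
struct role {
    std::string name;
    bool is_superuser;
};

inline bool any_superuser(const std::vector<role>& roles) {
    auto it = roles.begin();
    // Compare against end() *before* dereferencing or advancing. Advancing
    // first would skip the first element and could step past the end.
    while (it != roles.end()) {
        if (it->is_superuser) {
            return true;
        }
        ++it;
    }
    return false;
}
```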
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
28a840db72 auth: Add error handling for incompatible modules
The components of access-control (authentication, authorization, and
role-management) are designed as abstract interfaces, but due to
decisions of Apache Cassandra, certain implementations are dependent on
other particular implementations.

This change throws a new exception,
`auth::incompatible_module_combination`, when a dependency is not
satisfied.
2018-02-14 14:15:57 -05:00
Jesse Haber-Kucharsky
b3dc90d5d2 auth: Refactor authentication options
The set of allowed options is quite small, so we benefit from a static
representation (member variables) over a dynamic map.

We also logically move the "OPTIONS" option to the domain of the
authenticator (from user management), since this is where it is applied.

This refactor also aims to reduce compilation time by moving
`authentication_options` into its own header file.

While changes to `user_options` were necessary to accommodate the new
structure, that class will be deprecated shortly in the switch to roles.
Therefore, the changes are strictly temporary.
2018-02-14 14:15:57 -05:00
Tomasz Grabiec
1039850515 tests: flat_reader_assertions: Improve failure message 2018-02-14 16:42:49 +01:00
Tomasz Grabiec
27b114fe45 cache: Handle exceptions from make_evictable()
cache_entry constructor was marked noexcept, yet make_evictable() may
fail in rare cases due to allocation in add_version(). Lift the
annotation and make sure that construction has strong exception
guarantees for the moved-in state so that it can be retried without
data loss inside allocating section.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
74986f31e8 tests: Disable failure injection around background compactor
Failure could be injected into the compactor if the main code under
test defers before reaching allocation failure point, and compactor
gets hit. This is not what the test is supposed to stress, and it
causes abort when memtable_snapshot_source is destroyed, so disable
failure injection there.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
7e0ff8a920 lsa: Disable allocation failure injection inside merge()
Fixes termination in tests due to throw from merge(), which is noexcept.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
66701c1671 lsa: Make region deregistration robust against duplicates 2018-02-14 16:42:49 +01:00
Tomasz Grabiec
cf876bbe2d lsa: Make region allocation exception safe
We were not unregistering in case add() fails.
2018-02-14 16:42:49 +01:00
Paweł Dziepak
6c1503241d Merge seastar upstream
* seastar 2b0a81d...383ccd6 (9):
  > future-util: relax concept requirements for do_for_each()
  > seastar-addr2line: improve UX for bactraces read from stdin
  > noncopyable_function: Lift the noexcept guarantee
  > queue: doxygen documentation
  > queue: documentation
  > build: reinstate -Wsign-compare
  > iotune: don't compare sign and unsigned types
  > future-util: Remove unused local in with_scheduling_group()
  > tests/test-utils: Add macro for running tests within a seastar thread
2018-02-14 14:37:42 +00:00
Amnon Heiman
827723cec8 API: Add get active repair api
This patch adds an API to return an array of the IDs of the currently active repairs.

After this patch a call to:
curl http://localhost:10000/storage_service/active_repair/

will return the IDs of the active repairs.

Fixes #3193

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-02-14 11:43:41 +02:00
Amnon Heiman
3f2eae35fd repair: Add a get_active_repairs function to return the active repair
This patch adds a function that returns an array with the ids of the
active repairs by filtering the RUNNING ones in the repair tracker status.
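
The filtering step might look like this sketch. The tracker is modelled as a plain map from repair ID to status; the container type, the enum values other than RUNNING, and the sorting are assumptions made for the example:

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Hypothetical status enum; only RUNNING is named in the commit message.
enum class repair_status { RUNNING, SUCCESSFUL, FAILED };

inline std::vector<int> get_active_repairs(
        const std::unordered_map<int, repair_status>& tracker) {
    std::vector<int> ids;
    for (const auto& [id, status] : tracker) {
        if (status == repair_status::RUNNING) {
            ids.push_back(id);
        }
    }
    // Sorted only so that the example has a deterministic result.
    std::sort(ids.begin(), ids.end());
    return ids;
}
```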

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-02-14 11:43:37 +02:00
Duarte Nunes
6f7233fbaf cql3/statements/truncate_statement: Prevent MV from being truncated
To truncate an MV, one must truncate the base table.

Fixes #3188

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180209162720.32757-1-duarte@scylladb.com>
2018-02-13 11:37:27 +00:00
Duarte Nunes
771852e731 Merge 'Fix possible stall in calculate_pending_ranges' from Asias
When the cluster is large or num_tokens is big, calculate_pending_ranges
can take a long time to complete. It currently runs in the gossip thread, so it
can block gossip processing. Another problem is that it runs in a plain for loop
and can cause a reactor stall.

Users see this stall during decommission operations.

I can reproduce a stall of up to 4 seconds in a two-node cluster, each node with
`--num-tokens 3072`, during decommission.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout

Fixes #3203

* tag 'asias/issue_3203_v2.1' of github.com:scylladb/seastar-dev:
  storage_service: Do not wait for update_pending_ranges in handle_state_leaving
  token_metadata: Handle affected_ranges with do_for_each
  token_metadata: Split token_metadata::calculate_pending_ranges
  token_metadata: Futurize calculate_pending_ranges
  storage_service: Futurize storage_service::do_update_pending_ranges
  token_metadata: Speed up token_metadata::get_endpoint
2018-02-13 11:12:22 +00:00
Asias He
74b4035611 storage_service: Do not wait for update_pending_ranges in handle_state_leaving
The call chain is:

storage_service::on_change() -> storage_service::handle_state_leaving()
-> storage_service::update_pending_ranges()

Listeners run as part of gossip message processing, which is
serialized. This means we won't be processing any gossip messages until
update_pending_ranges completes. update_pending_ranges takes time to
complete.

Since we no longer wait for update_pending_ranges to complete,
multiple update_pending_ranges operations can run at the same time, so use
serialized_action to serialize them.

Tested with update_cluster_layout_tests.py
2018-02-13 19:00:43 +08:00
Asias He
c17ce79977 token_metadata: Handle affected_ranges with do_for_each
affected_ranges can be very large in a large cluster or on a node with a big
num_tokens count. calculate_natural_endpoints takes more time to
process in this case as well.

Futurize calculate_pending_ranges_for_leaving and handle the loop with
do_for_each to give the reactor some time to breathe, so it does not
block.
2018-02-13 19:00:43 +08:00
Asias He
60143a7517 token_metadata: Split token_metadata::calculate_pending_ranges
token_metadata::calculate_pending_ranges is a complicated function.
Split it into 3 parts: the leaving operation, the moving operation,
and the bootstrap operation.
2018-02-13 19:00:43 +08:00
Asias He
1834dd023f token_metadata: Futurize calculate_pending_ranges
Now, do_update_pending_ranges is futurized. We can finally futurize
token_metadata::calculate_pending_ranges in order to convert the loops
inside it to do_for_each instead of plain for loops to avoid reactor
stalls.
2018-02-13 19:00:43 +08:00
Asias He
33c43b78c7 storage_service: Futurize storage_service::do_update_pending_ranges
Preparation work for the futurizing of the time consuming
token_metadata::calculate_pending_ranges.

In addition, we use do_for_each for the loop. It is better than the
plain for loop because the reactor can yield to avoid stalls when
there are tons of keyspaces.
2018-02-13 19:00:43 +08:00
Asias He
96266fc76a token_metadata: Speed up token_metadata::get_endpoint
token_metadata::calculate_pending_ranges ->
abstract_replication_strategy::calculate_natural_endpoints
-> token_metadata::get_endpoint()

With std::map

   INFO  2018-02-09 14:58:32,960 [shard 0] token_metadata - In
   calculate_pending_ranges: affected_ranges.size=6145 stars
   Reactor stalled for 4000 ms on shard 0.
   Backtrace:
     0x00000000004b12cb
     0x00000000004b1561
     /lib64/libpthread.so.0+0x00000000000123af
     0x0000000001159e25
     0x00000000011581eb
     0x000000000114f122
     0x000000000119f8c7
     0x00000000011985a4
     0x00000000011a7e16
     0x0000000001364741
     0x00000000013fe9fd
     0x00000000013ff792
     0x00000000014024b2
     0x000000000141a66f
     0x000000000141d7be
     0x00000000010ed234
     0x000000000112fdaa
     0x00000000011301f4
     0x000000000043543d
   INFO  2018-02-09 14:58:35,993 [shard 0] token_metadata - In
   calculate_pending_ranges: affected_ranges.size=6145 ends

With std::unordered_map

    INFO  2018-02-09 14:47:50,251 [shard 0] token_metadata - In
    calculate_pending_ranges: affected_ranges.size=6145 stars
    INFO  2018-02-09 14:47:51,585 [shard 0] token_metadata - In
    calculate_pending_ranges: affected_ranges.size=6145 ends
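
The container difference behind these numbers: a lookup in `std::map` is logarithmic in the number of entries, while a lookup in `std::unordered_map` is constant time on average. A lookup in the latter can be sketched as below; `token_t`, `endpoint_t` and the free-function signature are hypothetical simplifications, not the real token_metadata interface:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

using token_t = std::int64_t;    // hypothetical token representation
using endpoint_t = std::string;  // hypothetical endpoint representation

// Average O(1) lookup; std::map::find would be O(log n) per call.
inline std::optional<endpoint_t> get_endpoint(
        const std::unordered_map<token_t, endpoint_t>& token_to_endpoint,
        token_t t) {
    auto it = token_to_endpoint.find(t);
    if (it == token_to_endpoint.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

The trade-off is that an unordered map gives up sorted iteration, so this only fits call sites that do point lookups.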
2018-02-13 19:00:42 +08:00
Duarte Nunes
ac6abf8021 Merge 'CQL clustering column secondary indexing support' from Pekka
"This patch series adds support for clustering column secondary indexing.

Fixes #2961

Tests: unit-tests (release)"

* 'penberg/cql-2i-clustering-key-indexing/v2' of github.com:penberg/scylla:
  tests/cql_query_test: Add indexed clustering key query test
  cql3: Fix clustering column secondary indexing
  cql3/statements: Add values() helper to restrictions
  cql3/restrictions: Fix multi_column_restriction::values()
  cql3/restrictions: Fix single_column_primary_key_restrictions::values()
2018-02-12 18:49:34 +00:00
Amnon Heiman
d88c27614e scylla-housekeeping: add configuration for api-address
This patch makes the api address and port configurable.

Fixes #2332

Message-Id: <20180204095628.1210-1-amnon@scylladb.com>
2018-02-12 15:26:46 +02:00
Amnon Heiman
449f9af0db API: Use stream_range_as_array to return token endpoints
The token_to_endpoint map can get so big that trying to convert it to a
vector will cause a large allocation warning.

This patch replaces the implementation so that the returned JSON array is
created directly from the map by using the stream_range_as_array helper
function.

Fixes #3185

Message-Id: <20180207153306.30921-1-amnon@scylladb.com>
2018-02-12 15:24:07 +02:00
Avi Kivity
e77ecda1da tests: avoid signed/unsigned compares
Container indices are size_t, and in other places we gratuitously
declare a limit as unsigned and the loop index as signed.
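
A minimal example of the pattern being fixed (`sum_all` is a made-up function, not one of the affected tests):

```cpp
#include <cstddef>
#include <vector>

inline long sum_all(const std::vector<int>& v) {
    long total = 0;
    // The index type matches v.size() (std::size_t), so the comparison
    // mixes no signed and unsigned operands. With `int i`, the same loop
    // would trigger -Wsign-compare.
    for (std::size_t i = 0; i < v.size(); ++i) {
        total += v[i];
    }
    return total;
}
```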

Tests: unit (release)
Message-Id: <20180212121642.10525-1-avi@scylladb.com>
2018-02-12 12:25:21 +00:00
Avi Kivity
87f10bc853 sstables: continuous_data_consumer: make _remain an unsigned type
All of the adjustments to _remain already ensure it is greater than 0,
and indeed a negative _remain doesn't make sense.

Switching to an unsigned type allows us to re-enable -Wsign-compare.

Tests: unit (release)
Message-Id: <20180212121636.10463-1-avi@scylladb.com>
2018-02-12 12:25:21 +00:00
Avi Kivity
55168592ad compaction_manager: fix use-after-free of column_family
Commit cce1a2bce8 ("Use the CPU scheduler")
placed some compaction manager code in a scheduling_group. Unfortunately,
downstream code relied on the callers not deferring, so it can rely
on the column_family's existence. That doesn't happen if the column_family
is removed quickly, as with_scheduling_group() always defers.

Fix by applying the scheduling group after we've taken the lock and guaranteed
the stability of the column_family object.

Fixes #3196.
Message-Id: <20180211165155.18179-1-avi@scylladb.com>
2018-02-11 17:53:35 +00:00
Avi Kivity
3f5a8229ac tests: fix for sstable::get_index_reader() removal
71495691aa removed sstable::get_index_reader(),
but forgot to update its callers in tests/.  Update the callers to construct
a temporary shared_index_list and create the index_reader directly.

This is none too clean, but shared_index_lists needs to be retired, and then
the changes in this patch can go away too.

Tests: unit (release)
Message-Id: <20180211164739.17862-1-avi@scylladb.com>
2018-02-11 17:53:08 +00:00
Vladimir Krivopalov
71495691aa Use separate shared_index_lists per sstable_mutation_reader instead of a single one per sstable.
With the changes introduced in #2981, it is no longer safe to share
index_entries among multiple sstable_mutation_readers.
The original intent behind sharing index_entries among index_readers was
to avoid re-reading same pages twice as we have two index readers -
lower and upper bound - for every sstable_mutation_reader. In fact, the
shared entries were held at the sstable object level so index_readers
from different sstable_mutation_readers could have accessed them.

Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos),
index_entry can be accessed in a way that modifies its state if we need
to read more promoted index blocks. It is safe to keep sharing them
between two index_readers within the same sstable_mutation_reader as the
invariant is maintained that readers can be only moved forward.
We cannot safely assume, however, that this invariant holds for multiple
sstable_mutation_readers as it may happen that one of them has read and
thrown away some promoted index blocks that another one needs. So we
restrict sharing to per-sstable_mutation_reader level.

Fixes #3189.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>
2018-02-10 15:08:45 +02:00
Duarte Nunes
d757c87107 cql3/query_processor: Remove prepared statements upon dropping a view
Fixes #3198

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180209143652.31852-1-duarte@scylladb.com>
2018-02-09 16:30:28 +00:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after new resharding
algorithm, which is solely responsible for working with shared
sstables; regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that the shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstables
across shards. It brings extra complexity that's no longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00
Duarte Nunes
456b678e0b database.hh: Fix data query stage argument type
Fixes a merge gone wrong.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180208163338.25238-1-duarte@scylladb.com>
2018-02-08 16:35:10 +00:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.
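
The caching idea can be sketched as follows. `std::hash` stands in for xxHash so that the example needs no third-party library, and the `cell` class with its `computations()` counter is hypothetical rather than the series' actual data structure:

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical cell type that remembers its hash after the first request.
class cell {
    std::string _value;
    mutable std::optional<std::uint64_t> _cached_hash;
    mutable int _computations = 0;  // only here to make the caching observable
public:
    explicit cell(std::string v) : _value(std::move(v)) {}

    std::uint64_t hash() const {
        if (!_cached_hash) {
            // Stand-in hash function; the series uses xxHash instead.
            _cached_hash = std::hash<std::string>{}(_value);
            ++_computations;
        }
        return *_cached_hash;
    }

    int computations() const { return _computations; }
};
```

Repeated digest requests for the same in-memory cell then pay for hashing only once.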

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvement for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvement for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
Avi Kivity
6298655178 Merge "Inline and optimise more aggressively" from Paweł
"We have noticed in the past that the compiler is too conservative when it comes
to deciding which functions to inline. Since inlining functions enables further
optimisations such as constant folding, in some cases the difference in performance
was significant enough to force us to add the [[gnu::always_inline]] attribute in
numerous places. However, this is neither a practical nor an elegant solution.
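
For reference, the per-function workaround looks like this (GCC and Clang attribute syntax; `scale` and `scale_sum` are made-up functions):

```cpp
// Forces inlining of this one function regardless of the inliner's
// heuristics. Every hot function needs its own annotation, which is
// what makes the approach hard to maintain.
[[gnu::always_inline]] inline int scale(int x) {
    return x * 3;
}

inline int scale_sum(int a, int b) {
    // Both calls are inlined here, so the compiler can fold constants
    // when a and b are known at compile time.
    return scale(a) + scale(b);
}
```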

A better way to deal with the problem is to adjust the compiler tunables that
control the heuristics used for making inlining decisions. In particular,
inline-unit-growth seems to affect the performance of the emitted code most.

Apart from making the compiler more eager to inline functions, bumping the
optimisation level to -O3 also seems to have a positive impact on the
performance.

Fixes #1644.

Tests: unit-test (release)

Performance tested with gcc 7.3.

Macrobenchmark
perf_simple_query
Flags: -c4 --duration 60
All results are medians.

         ./before    ./after   diff
 read   338662.12  405377.80  19.7%
 write  387378.89  466744.15  20.5%

Microbenchmarks
single run duration:      1.000s
number of runs:           5

BEFORE
test                                      iterations      median         mad         min         max
combined.one_row                              858933   536.389ns     0.819ns   534.823ns   537.208ns
combined.single_active                          8469    77.131us    11.000ns    77.118us    77.145us
combined.many_overlapping                       1199   664.105us   160.807ns   663.818us   668.527us
combined.disjoint_interleaved                   8100    75.522us    22.254ns    75.500us    75.732us
combined.disjoint_ranges                        8288    72.580us    10.571ns    72.568us    72.599us
memtable.one_partition_one_row               1216233   825.581ns     0.446ns   821.450ns   826.027ns
memtable.one_partition_many_rows              127336     7.855us     2.153ns     7.853us     7.898us
memtable.many_partitions_one_row               57919    17.356us     6.028ns    17.259us    17.362us
memtable.many_partitions_many_rows              4751   210.496us   102.339ns   210.393us   211.188us

AFTER
test                                      iterations      median         mad         min         max
combined.one_row                             1002321   450.292ns     0.313ns   447.202ns   450.605ns
combined.single_active                          9605    67.086us     8.620ns    67.073us    67.115us
combined.many_overlapping                       1476   519.554us     5.334ns   519.549us   519.953us
combined.disjoint_interleaved                   9280    64.363us     5.328ns    64.335us    64.369us
combined.disjoint_ranges                        9481    61.893us     3.620ns    61.885us    61.903us
memtable.one_partition_one_row               1432668   699.775ns     0.106ns   696.023ns   699.918ns
memtable.one_partition_many_rows              153692     6.536us     6.885ns     6.501us     6.543us
memtable.many_partitions_one_row               63319    15.879us     5.080ns    15.793us    15.884us
memtable.many_partitions_many_rows              5659   176.717us    66.770ns   176.650us   177.778us"

* tag 'optimise-and-inline/v2' of https://github.com/pdziepak/scylla:
  configure.py: set optimisation level to -O3
  configure.py: set inline-unit-growth to 300
  configure.py: flag_supported: support flags with spaces
  configure.py: rename warning_supported to flag_supported
  configure.py: pass optimisation flags to seastar/configure.py
  cql3/select_statement: do not capture stack variables by reference
2018-02-08 17:45:41 +02:00
Tomasz Grabiec
cce1a2bce8 Merge "Use the CPU scheduler" from Glauber & Avi
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
on his behalf. I've done a ton of testing on the series and there are
some improvements / changes that I had previously sent as a separate series.

What you see here is the result of merging that work.

After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.

We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.

* git@github.com:glommer/scylla.git cpusched-v7:

Avi Kivity (4):
  database, sstables, compaction: convert use of thread_scheduling_group
    to seastar cpu scheduler
  memtable, database: make memtable::clear_gently() inherit
    scheduling_group
  config: mark background_writer_scheduling_quota as Unused
  database: place data_query execution stage into scheduling_group

Glauber Costa (9):
  database, main: set up scheduling_groups for our main tasks
  row_cache: actually use the scheduling group for update_cache
  allow update_cache and clear_gently to use the entire task quota.
  database: remove cpu_flush_quota metric
  controllers: retire auto_adjust_flush_quota
  controllers: allow memtable I/O controller to have shares statically
    set
  controllers: update control points for memtable I/O controller
  controllers: allow a static priority to override the controller output
  controllers: unify the I/O and CPU controllers
2018-02-08 15:58:40 +01:00
Paweł Dziepak
eb5b76ea50 configure.py: set optimisation level to -O3 2018-02-08 14:46:11 +00:00
Paweł Dziepak
bc65659a46 configure.py: set inline-unit-growth to 300
It has been discovered that the compiler is too conservative when
deciding which functions to inline. In particular, the limiting tunable
turned out to be inline-unit-growth which limits inlining in large
translation units.
2018-02-08 14:46:11 +00:00
Paweł Dziepak
89063a9cc0 configure.py: flag_supported: support flags with spaces 2018-02-08 14:46:11 +00:00
Paweł Dziepak
8f4b30b572 configure.py: rename warning_supported to flag_supported
warning_supported() can be used to detect support of any compiler flag,
not just warnings.
2018-02-08 14:46:11 +00:00
Paweł Dziepak
a8372b87eb configure.py: pass optimisation flags to seastar/configure.py 2018-02-08 14:46:11 +00:00
Paweł Dziepak
b635fec9bf cql3/select_statement: do not capture stack variables by reference
Default capture by reference considered harmful in async code.
2018-02-08 14:46:10 +00:00
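The concern behind the select_statement fix above can be illustrated with a minimal sketch. It is not the actual Scylla code: `task_queue`, `schedule_greeting` and `run_all` are hypothetical names, and a plain vector of callbacks stands in for a reactor that runs continuations after the scheduling function has returned.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a reactor's task queue: callbacks run later,
// after the function that scheduled them has already returned.
using task_queue = std::vector<std::function<std::string()>>;

// Unsafe pattern (shown only as a comment): a default [&] capture would hold
// a reference to the local `name`, which dangles once this function returns.
//   q.push_back([&] { return "hello " + name; });

// Safer pattern: move the value into the closure so the closure owns it.
void schedule_greeting(task_queue& q, std::string name) {
    q.push_back([name = std::move(name)] { return "hello " + name; });
}

// Runs every queued task and collects the results.
std::vector<std::string> run_all(task_queue& q) {
    std::vector<std::string> out;
    for (auto& task : q) {
        out.push_back(task());
    }
    q.clear();
    return out;
}
```

Calling `schedule_greeting(q, "scylla")` and later `run_all(q)` is safe because the closure carries its own copy of the string.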
Avi Kivity
ee763d889a Merge seastar upstream
* seastar 6d02263...2b0a81d (7):
  > configure.py: add -Wno-stringop-overflow
  > configure.py: add --optflags for specifying optimisation flags
  > build: add protobuf-compiler to docker dev image
  > build: update docker builder to newer Fedora
  > json_element: stream_object to get its parameter by value
  > json_element: stream range object
  > build: add yaml-cpp-devel installation to Dockerfile
2018-02-08 16:45:01 +02:00
Raphael S. Carvalho
312bd9ce25 Remove SSTable's atomic deletion manager
Not used anymore, can be deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:38:45 -02:00
Raphael S. Carvalho
1472cfcc19 Stop using SSTable's atomic deletion manager
The motivation is that it's no longer needed now that the new resharding
algorithm is solely responsible for working with shared sstables;
regular compaction will not work with those.
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:27:17 -02:00
Raphael S. Carvalho
b78881c0e9 database: split column_family::rebuild_sstable_list
The motivation is that resharding will not want the code that is
specific to regular compaction after atomic deletion is removed.
Resharding will eventually only need to replace old tables with
new ones, and it will be in charge of deletion of old tables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:18:18 -02:00
Glauber Costa
4272279bbb controllers: unify the I/O and CPU controllers
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.

Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one.  We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.

In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:30 -05:00
Glauber Costa
7b6f188e27 controllers: allow a static priority to override the controller output
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it we should allow a static value to override the
controller's output as well.

That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
6f295a2a8a controllers: update control points for memtable I/O controller
Right now CPU and I/O controllers have slightly different control points
for no good reason. Let's use the CPU controller ones as the standard, as
we have been using it in the field for longer and trust it more.

The end goal is to fully integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
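As a rough illustration of what "control points" could mean for a controller, the sketch below interpolates linearly between (input, output) pairs. This is an assumption about the general technique, not a copy of the Scylla controller; `control_point` and `interpolate` are hypothetical names, and any numbers passed to them are up to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical control point: maps a controller input (for example a backlog
// measure) to a controller output (for example scheduler shares).
struct control_point {
    float input;
    float output;
};

// Piecewise-linear interpolation between control points sorted by input.
// Inputs outside the covered range are clamped to the first or last output.
float interpolate(const std::vector<control_point>& points, float input) {
    assert(!points.empty());
    if (input <= points.front().input) {
        return points.front().output;
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        const control_point& lo = points[i - 1];
        const control_point& hi = points[i];
        if (input <= hi.input) {
            // Here lo.input < input <= hi.input, so the divisor is non-zero.
            float t = (input - lo.input) / (hi.input - lo.input);
            return lo.output + t * (hi.output - lo.output);
        }
    }
    return points.back().output;
}
```

With points `{0, 10}`, `{0.5, 200}`, `{1, 1000}` (values invented for illustration), an input of 0.25 lands halfway along the first segment.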
Glauber Costa
b895d495cc controllers: allow memtable I/O controller to have shares statically set
This is so it looks more like the CPU controller. The end goal is to integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c099c98676 controllers: retire auto_adjust_flush_quota
It no longer makes sense now that we have the full scheduler +
controllers.  In its place, we will provide an option to statically set
the controller's shares as a safeguard against us getting this wrong.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
2c1d5cf966 database: remove cpu_flush_quota metric
We can now grab that from the CPU scheduler, that exports both runtime
and shares.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c4974392b7 allow update_cache and clear_gently to use the entire task quota.
We have had a quota of partitions to process in clear_gently /
update_cache, so that we don't overwork. However, with those things now
being in their own task group there is no harm in allowing it to run
until we reach a natural preemption point.

While we are at it, clear_gently did not check for need_preempt()
before, so this patch fixes it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
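The idea of running until a natural preemption point, rather than after a fixed quota of partitions, might look roughly like the sketch below. It is a simplified model: `clear_some` is a hypothetical function, and the `need_preempt` callback is passed in explicitly here so the example stays self-contained.

```cpp
#include <deque>
#include <functional>

// Removes entries from `items` until the container is empty or the
// caller-supplied need_preempt callback asks us to yield.
// Returns true when all work is done, false when it should be resumed later.
bool clear_some(std::deque<int>& items, const std::function<bool()>& need_preempt) {
    while (!items.empty()) {
        items.pop_front();          // one unit of work
        if (need_preempt()) {       // check after each unit, no fixed quota
            return items.empty();
        }
    }
    return true;
}
```

A caller would keep invoking `clear_some` from its own scheduling group until it returns true.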
Glauber Costa
a3a4d0a17a row_cache: actually use the scheduling group for update_cache
We have moved clear_gently from using a seastar::thread's scheduling_group to
using the CPU scheduler's. However, update_cache was forgotten.

This patch fixes that and gets rid of the old group just in case.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
ce94e6deb7 database: place data_query execution stage into scheduling_group
Because execution stages defer and batch processing of the function
they run, they escape their fiber's context and therefore the
scheduling group.

Fix (for data_query) by initializing the execution_stage with the
query scheduling_group. To do that we have to move the execution
stage into the database object, so it has access to the scheduling
group during initialization.
2018-02-07 17:19:29 -05:00
Avi Kivity
2ee163d32b config: mark background_writer_scheduling_quota as Unused
Since the background writer flush quota config is no longer used, mark
it Unused.
2018-02-07 17:19:29 -05:00
Avi Kivity
ac525c9124 memtable, database: make memtable::clear_gently() inherit scheduling_group
Instead of using a private thread_scheduling_group, make clear_gently use
its caller's scheduling_group to control resource usage.
2018-02-07 17:19:29 -05:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based on a patch from Avi with the same subject, but the
differences are significant enough so that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, the submission of compactions itself
   now runs in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initialization defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Glauber Costa
98549775fa sstable_tests: make sure min_threshold is set explicitly
The SSTable tests are a bit fragile now because they rely on min_threshold
having a particular value. That is the default value, but if I change that
default - which I am planning to do - the test breaks.

Right now the test is not broken, but if we are planning on relying on a
property having a particular value in tests, we should explicitly set it.

So I am proactively changing min_threshold in the tests to have the value
of 4 explicitly, so we can change that in the future without breaking anything.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180207155513.12498-1-glauber@scylladb.com>
2018-02-07 18:45:52 +01:00
Tomasz Grabiec
d398aa913e cache: Fix calculation of active_reads()
Message-Id: <1518023341-27855-1-git-send-email-tgrabiec@scylladb.com>
2018-02-07 17:20:00 +00:00
Takuya ASADA
2c2173917c dist/common/scripts/scylla_raid_setup: skip blkdiscard when disk does not support TRIM
Since we unconditionally run blkdiscard on disks, we may get an ioctl error
message on some disks which do not support TRIM.

This can be ignored but it's bad UX, so let's skip running blkdiscard when TRIM
is not supported on the disk.

Fixes #2774

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517992904-13838-1-git-send-email-syuu@scylladb.com>
2018-02-07 13:30:05 +02:00
Calle Wilund
264b9d2da0 sstables: Process extensions on file open
Allowing them to wrap/replace an opened file, and add to/read from
scylla metadata.
2018-02-07 10:11:46 +00:00
Calle Wilund
b0c0c3c0ad sstables::types: Add optional extensions attribute to scylla metadata
Allowing storing key:value pairs.
2018-02-07 10:11:46 +00:00
Calle Wilund
68fc076f80 sstables::disk_types: Add hash and comparator(sstring) to disk_string 2018-02-07 10:11:46 +00:00
Calle Wilund
97f9f572f8 schema_tables: Load/save extensions table
Parses the extension map in tables/views using the registered extension.
If a schema row contains an unknown extension, we just preserve the data
in a placeholder.
2018-02-07 10:11:46 +00:00
Calle Wilund
dcc75263c6 cql: Add schema extensions processing to properties
Automatically accept registered schema extensions into the properties
set, and when building, generate the corresponding extension object into
the resulting schema.
2018-02-07 10:11:46 +00:00
Calle Wilund
2b56bbfa7d schema_tables: Require context object in schema load path
Requires "workaround" fix for schema_registry and frozen_mutation, since
the former is a free-float thread local, and the latter is a pure data
carrier. frozen_schema can take a parameter for unfreeze, but schema
registry requires being told which the system extensions are.
2018-02-07 10:11:46 +00:00
Calle Wilund
c2b49ec2e2 schema_tables: Add opaque context object
To allow carrying extensions and potentially more
2018-02-07 10:11:46 +00:00
Calle Wilund
2ee68ce0d4 config_file_impl: Remove ostream operators
We don't generate default strings for command line, so these are not
needed as such, and conflict with other operators in to_string.hh
2018-02-07 10:11:46 +00:00
Calle Wilund
6e31842049 main/init: Formalize configurables + add extensions to init call
Move the configurables to init so tests can link this as well. 
Add extensions object to db config in main and provide to 
configurables. These can then add extensions at this phase.
2018-02-07 10:11:46 +00:00
Calle Wilund
c19d8dd602 db::config: Add extensions as a config sub-object
The idea being that we should have config be a global, immutable
singleton, set up by startup/test then owned/referenced by db etc. 

Extensions are read-only in this context, so init code should set it up
before handing to the config. Or keep a ref to the ext param.
2018-02-07 10:11:46 +00:00
Calle Wilund
78174c6c59 db::extensions: Configuration object to store various extensions
A singular, yet not static global, container for schema/sstable 
extensions.
2018-02-07 10:11:46 +00:00
Calle Wilund
3e8cfbf2a0 cql3::statements::property_definitions: Use std::variant instead of any
Formalizing what stuff we actually keep in the props. And c++17.
2018-02-07 10:11:46 +00:00
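A minimal sketch of why a `std::variant` is a tighter fit than `any` for property values: the set of alternatives is spelled out in the type. The alternatives chosen here (a string or a string map) and the names `property_value`, `property_map`, `get_simple` and `get_map` are assumptions for illustration, not the actual class layout.

```cpp
#include <map>
#include <optional>
#include <string>
#include <variant>

// Hypothetical property value: either a plain string or a string-to-string map.
using string_map = std::map<std::string, std::string>;
using property_value = std::variant<std::string, string_map>;
using property_map = std::map<std::string, property_value>;

// Returns the property as a string, or nullopt if absent or of another kind.
std::optional<std::string> get_simple(const property_map& props, const std::string& name) {
    auto it = props.find(name);
    if (it == props.end()) {
        return std::nullopt;
    }
    if (const auto* s = std::get_if<std::string>(&it->second)) {
        return *s;
    }
    return std::nullopt;
}

// Returns the property as a map, or nullopt if absent or of another kind.
std::optional<string_map> get_map(const property_map& props, const std::string& name) {
    auto it = props.find(name);
    if (it == props.end()) {
        return std::nullopt;
    }
    if (const auto* m = std::get_if<string_map>(&it->second)) {
        return *m;
    }
    return std::nullopt;
}
```

Adding a third alternative later means changing one type alias, and the compiler then points at every `std::visit` that does not yet handle it.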
Calle Wilund
0dcf287230 sstables: Add extension type for wrapping file io 2018-02-07 10:11:45 +00:00
Calle Wilund
3ab760b375 schema: Add opaque type to represent extensions
A virtual opaque object meant to represent the "extensions" mapping
in schema_tables::tables/views
2018-02-07 10:11:45 +00:00
Calle Wilund
74758c87cd sstables::compress/compress: Make compression a virtual object
Make a "compressor" an actual class, that can be implemented and
registered via class registry. 

For "common" compressors, the objects will be shared, but complex
implementors can be semi-stateful. 

sstable compression is split into two parts: The "static" config
which is shared across shards, and a "local" one, which holds 
a compressor pointer. The latter is encapsulated, along with 
actual compressed data writers, in sstables/compress.cc.

For compression (write), compression writer is instantiated
with the settings active in table metadata.

For decompression (read), compression reader is instantiated
with the settings stored in sstable metadata, which can
differ from the currently active table metadata.

v2:
* Structured patch sets differently (dependencies)
* Added more comments/api descs
* Added patch to move all sstable compression into compress.cc,
  effectively separating top-level virtual compressor object
  from sstable io knowledge
v3:
* Rebased
v4: 
* Moved all sstable compression logic/knowledge into  
  compress.cc (local compression). Merged the two patches 
  (separation just confuses reader).
2018-02-07 10:11:45 +00:00
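The "virtual compressor registered via class registry" design described above can be sketched as follows. This is an illustrative model only: `compressor`, `identity_compressor` and `compressor_registry` are hypothetical names, and the interface is reduced to string-in, string-out for brevity.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical abstract compressor interface.
class compressor {
public:
    virtual ~compressor() = default;
    virtual std::string name() const = 0;
    virtual std::string compress(const std::string& in) const = 0;
    virtual std::string uncompress(const std::string& in) const = 0;
};

// Trivial implementation used only to exercise the registry.
class identity_compressor final : public compressor {
public:
    std::string name() const override { return "identity"; }
    std::string compress(const std::string& in) const override { return in; }
    std::string uncompress(const std::string& in) const override { return in; }
};

using compressor_factory = std::function<std::shared_ptr<compressor>()>;

// Maps a configured name to a factory, so new compressors can be plugged in
// without touching the code that reads or writes compressed data.
class compressor_registry {
    std::map<std::string, compressor_factory> _factories;
public:
    void add(std::string name, compressor_factory factory) {
        _factories[std::move(name)] = std::move(factory);
    }
    std::shared_ptr<compressor> create(const std::string& name) const {
        auto it = _factories.find(name);
        if (it == _factories.end()) {
            throw std::invalid_argument("unknown compressor: " + name);
        }
        return it->second();
    }
};
```

A reader would look up the name stored in the sstable metadata, while a writer would use the name from the current table settings; both go through the same `create` call.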
Pekka Enberg
3e4c6cc4da tests/cql_query_test: Add indexed clustering key query test 2018-02-06 16:57:27 +02:00
Pekka Enberg
0128f802ed cql3: Fix clustering column secondary indexing
Fix clustering column indexing by lifting the limitation of only
considering non-primary key restrictions in
select_statement::find_index_partition_ranges().
2018-02-06 16:57:27 +02:00
Pekka Enberg
1fdc13d230 cql3/statements: Add values() helper to restrictions
Add values() helper to restrictions class so that we can easily obtain
restriction values for all indexed restrictions.
2018-02-06 16:57:27 +02:00
Paweł Dziepak
6ccd317c38 Merge "Do not evict from memtable snapshots" from Tomasz
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

The solution is to record evictability of partition version references (snapshots)
and avoid eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.

Tests: unit (release)"

* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test for eviction with non-evictable snapshots
  mutation_partition: Define + operator on tombstones
  tests: mvcc: Check that partition is fully discontinuous after eviction
  tests: row_cache: Add test for memtable readers surviving flush and eviction
  memtable: Make printable
  mvcc: Take partition_entry by const ref in operator<<()
  mvcc: Do not evict from non-evictable snapshots
  mvcc: Drop unnecessary assignment to partition_snapshot::_version
  tests: Use partition_entry::make_evictable() where appropriate
  mvcc: Encapsulate construction of evictable entries
2018-02-06 14:46:24 +00:00
Tomasz Grabiec
3c51cc79d5 tests: mvcc: Add test for eviction with non-evictable snapshots 2018-02-06 14:24:19 +01:00
Tomasz Grabiec
d37131d320 mutation_partition: Define + operator on tombstones 2018-02-06 14:24:19 +01:00
Tomasz Grabiec
ec5fe5b207 tests: mvcc: Check that partition is fully discontinuous after eviction
evict() should remove everything, including range tombstones, so whole
clustering range should be marked as discontinuous.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
c1b82e60e3 tests: row_cache: Add test for memtable readers surviving flush and eviction
Reproduces https://github.com/scylladb/scylla/issues/3186
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
d85d651e0f memtable: Make printable
Useful when debugging test failures.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
06b7b54c3d mvcc: Take partition_entry by const ref in operator<<()
Some users will only have const&.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
50f5bee12e mvcc: Do not evict from non-evictable snapshots
When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

The solution is to record evictability of partition version references (snapshots)
and avoid eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
c391bff1d2 mvcc: Drop unnecessary assignment to partition_snapshot::_version
merge_partition_versions() is responsible for merging versions
unpinned by the current snapshot. If that fails, we don't need to set
_version back since versions must be still referenced by someone else,
this snapshot is not a unique owner.

This change makes it easier to add tracking of evictability.
2018-02-06 14:24:18 +01:00
Tomasz Grabiec
439cbada2c tests: Use partition_entry::make_evictable() where appropriate 2018-02-06 14:24:18 +01:00
Raphael S. Carvalho
09f4ee808f sstables/compress: Fix race condition in segmented offset reading of shared sstable
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.

So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.

We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and fed to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.

Tests: release mode.

Fixes #3148.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
2018-02-06 12:10:10 +02:00
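The shape of the fix described above, keeping the shared structure immutable and moving the reading state into a per-reader accessor, can be sketched like this. The classes are simplified stand-ins: `offset_table` is a hypothetical flat table rather than the real segmented_offset, and `sequential_accessor` keeps only a cursor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical immutable offset table that may be shared between readers.
// All accessors are const, so concurrent lookups do not interfere.
class offset_table {
    std::vector<std::uint64_t> _offsets;
public:
    explicit offset_table(std::vector<std::uint64_t> offsets)
        : _offsets(std::move(offsets)) {}
    std::uint64_t at(std::size_t i) const { return _offsets.at(i); }
    std::size_t size() const { return _offsets.size(); }
};

// Per-reader cursor. The mutable reading state lives here, not in the table,
// so two readers of the same table cannot disturb each other.
class sequential_accessor {
    const offset_table& _table;
    std::size_t _next = 0;
public:
    explicit sequential_accessor(const offset_table& table) : _table(table) {}
    bool has_next() const { return _next < _table.size(); }
    std::uint64_t next() {
        assert(has_next());
        return _table.at(_next++);
    }
};
```

Each shard would construct its own accessor over the shared table, which avoids both the race and the contention a lock around `at()` would bring.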
Tomasz Grabiec
d899ae0f02 mvcc: Encapsulate construction of evictable entries
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
2018-02-05 17:54:03 +01:00
Vlad Zolotarov
bc90aa79b3 config: uncomment/add the supported snitches description
Uncomment descriptions of Ec2SnitchXXX, which have been supported for a long
time already.
Add the description of the new GoogleCloudSnitch.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 10:37:13 -05:00
Vlad Zolotarov
d312aeebf3 tests: added gce_snitch_test
Tests the GoogleCloudSnitch.
Uses the dummy GCE meta server that would be listening on 127.0.0.1:80 by default.
To change the IP of the dummy server one can use the DUMMY_META_SERVER_IP
environment macro.
To use the real GCE meta server (from inside the GCE VM) one should define
the USE_GCE_META_SERVER environment macro.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 10:37:08 -05:00
Vlad Zolotarov
8ae2996bf8 locator::gce_snitch: implementation of the GoogleCloudSnitch
This is a snitch that should be used when Scylla runs in GCE VMs in both
single and multi data center (DC) configurations.

This snitch interacts with the GCE (instance metadata) API (as
described here: https://cloud.google.com/compute/docs/storing-retrieving-metadata)
similarly to how ec2_snitchXXX interacts with the AWS API.

However unlike ec2_multi_region_snitch the GCE snitch only gets the instance's zone and sets
the DC and the RACK based on it, e.g. for us-central1-a the DC is set to 'us-central'
and the RACK - to 'a'.

GCE snitch doesn't have to learn the internal and external IPs of the instance because in
GCE instances from different regions can interact using internal IPs (in the AWS they can't).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 09:57:03 -05:00
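The zone-to-DC/RACK mapping in the example above (us-central1-a gives DC 'us-central' and RACK 'a') could be expressed along these lines. This is a sketch built only from that one example: `dc_and_rack_from_zone` is a hypothetical function, and dropping all trailing digits from the region is an assumption, since the commit message does not say how other numeric suffixes are handled.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <utility>

// Splits a zone such as "us-central1-a" into {datacenter, rack}.
// Assumption: the rack is the text after the last '-', and the datacenter is
// the remaining region name with its trailing digits dropped.
std::pair<std::string, std::string> dc_and_rack_from_zone(const std::string& zone) {
    auto dash = zone.rfind('-');
    if (dash == std::string::npos || dash == 0 || dash + 1 == zone.size()) {
        throw std::invalid_argument("unexpected zone format: " + zone);
    }
    std::string rack = zone.substr(dash + 1);
    std::string dc = zone.substr(0, dash);
    while (!dc.empty() && std::isdigit(static_cast<unsigned char>(dc.back()))) {
        dc.pop_back();
    }
    return {dc, rack};
}
```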
Vlad Zolotarov
0a8549abf1 locator::snitch_base: properly log the failure during the snitch startup
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-05 09:49:54 -05:00
Avi Kivity
a94564a637 Merge seastar upstream
* seastar 21badbd...6d02263 (4):
  > build: detect name of ninja executable
  > queue: pop_eventually/push_eventually should throw when called after abort
  > build: compile libfmt out-of-line
  > core/gate: Ensure with_gate leaves gate on exception
2018-02-05 14:42:07 +02:00
Tomasz Grabiec
d21fbc26c7 tests: range_tombstone_list: Do not depend on argument evaluation order
next_pos() calls could be reordered resulting in invalid tombstones being
generated.
Message-Id: <1517833688-20022-1-git-send-email-tgrabiec@scylladb.com>
2018-02-05 12:31:37 +00:00
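The evaluation-order pitfall fixed above is a general C++ issue: the order in which function call arguments are evaluated is unspecified. A small sketch, with the hypothetical names `position_source`, `range` and `next_range`, shows the robust form.

```cpp
// Hypothetical position generator whose calls have a side effect.
struct position_source {
    int next = 0;
    int next_pos() { return next++; }
};

struct range {
    int start;
    int end;
};

// Fragile form (comment only): with a call such as
//   make_range(src.next_pos(), src.next_pos())
// the compiler may evaluate the second argument first, so start could end up
// greater than end.

// Robust form: sequence the calls explicitly through named locals.
range next_range(position_source& src) {
    int start = src.next_pos();
    int end = src.next_pos();
    return range{start, end};
}
```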
Tomasz Grabiec
d2baa49313 tests: Do not produce invalid range tombstones
Upper bound should not be smaller than lower bound. Found by
asserting on valid bounds.
Message-Id: <1517833602-19732-1-git-send-email-tgrabiec@scylladb.com>
2018-02-05 12:29:03 +00:00
Takuya ASADA
6d134c0c2b dist/redhat: block installing Scylla on older kernel
We use the AmbientCapabilities directive in the systemd unit, but it does not
work on older kernels and causes the following error:
"systemd[5370]: Failed at step CAPABILITIES spawning /usr/bin/scylla: Invalid argument"

It only works on kernel-3.10.0-514 == CentOS7.3 or later, so block installing
the rpm to prevent the error.

Fixes #3176

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517822764-2684-1-git-send-email-syuu@scylladb.com>
2018-02-05 12:57:17 +02:00
Duarte Nunes
46099e4f58 tests/role_manager_test: Stop role_manager
Not stopping them may cause the tests to fail due to an asynchronous
process being scheduled and accessing freed data.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180202221640.28609-1-duarte@scylladb.com>
2018-02-05 09:39:59 +00:00
Avi Kivity
6919c7434e Merge seastar upstream
* seastar 19efbd9...21badbd (4):
  > reactor: change adjustment method for tasks becoming active
  > Merge 'Update ARM port' from Avi
  > http: Do not wait for close connection on stop if listen did not completed
  > core/future-util: Don't allow rvalues in do_for_each()
2018-02-04 14:28:28 +02:00
Avi Kivity
2173e74212 tests: de-template cql_query_test
cql_query_test contains many continuations that are generic lambdas:

  foo().then([] (auto x) { ... })

These templates prevent Eclipse's indexer from inferring the type of x,
and so everything below that point is one big error as far as Eclipse is
concerned.

De-template these lambdas by specifying the real types.

Unfortunately, compile time decrease was not observed.

Tests: cql_query_test (release)
Message-Id: <20180204113503.23297-1-avi@scylladb.com>
2018-02-04 11:48:52 +00:00
Takuya ASADA
dc2b17b3da dist/redhat: link yaml-cpp statically
To avoid incompatibilities between distribution-provided versions of
libyaml-cpp, link it statically.

Fixes #3173

Message-Id: <1517546935-15858-2-git-send-email-syuu@scylladb.com>
2018-02-03 16:34:36 +02:00
Takuya ASADA
82f217d62a configure.py: make --static-yaml-cpp work properly for Scylla
We are doing static linking of libyaml-cpp for libseastar as well, but
mistakenly not for Scylla; this needs to be fixed.

Message-Id: <1517546935-15858-1-git-send-email-syuu@scylladb.com>
2018-02-03 16:34:32 +02:00
Amnon Heiman
836876d81a main: stop prometheus server when shutting down
This patch adds an engine().on_exit cleanup for the prometheus server,
similar to other components in the system.

It will stop the server when shutting down.

Fixes #2520
Message-Id: <20180201132647.17638-1-amnon@scylladb.com>
2018-02-02 11:03:51 +01:00
Tomasz Grabiec
582dd36303 Merge 'Fixes for exception safety in memtable range reads' from Paweł
These patches deal with the remaining exception safety issues in the
memtable partition range readers. That includes moving the assignment
to iterator_reader::_last outside of allocating section to avoid
problems caused by exception-unsafe assignment operator. Memory
accounting code is also moved out of the retryable context to improve
the code robustness and avoid potential problems in the future.

Fixes #3172.

Tests: unit-test (release)

* https://github.com/pdziepak/scylla.git memtable-range-read-exception-safety/v1:
  memtable: do not update iterator_reader::_last in alloc section
  memtable: do not change accounting state in alloc section
  tests/memtable: add more reader exception safety tests
2018-02-02 11:00:58 +01:00
Paweł Dziepak
c2a5fd520f cql3/role-management: avoid static local shared_ptr
Even if shared_ptr is const it doesn't mean that its internal state is
immutable and it still cannot be freely shared across shards.

Fixes assertion failure in build/debug/tests/cql_roles_query_test.

Message-Id: <20180201125221.30531-1-pdziepak@scylladb.com>
2018-02-01 16:28:36 +02:00
Paweł Dziepak
ea50806172 tests/mutation_reader: avoid static local lw_shared_ptr
Shared pointers don't like being shared across shards.

Fixes assertion failure in build/debug/tests/mutation_reader_test.
Message-Id: <20180201125017.30259-1-pdziepak@scylladb.com>
2018-02-01 13:53:55 +01:00
Duarte Nunes
992de302a2 tests/row_cache_test: Test hash caching
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
d28bdb25c5 tests/memtable_test: Test hash caching
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
78508e8e43 tests/mutation_test: Use xxHash instead of MD5 for some tests
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
6cb0bbd978 tests/mutation_test: Test xx_hasher alongside md5_hasher
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
20132fe1b5 schema: Remove unneeded include
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
d7af8ff0e0 service/storage_proxy: Enable hash caching
Set the option that enables the underlying memtable and cache readers
to request caching of a cell's hash, for requests that require a
digest.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
0bab3e59c2 service/storage_service: Add and use xxhash feature
We add a cluster feature that informs whether the xxHash algorithm is
supported, and allow nodes to switch to it. We use a cluster feature
because older versions are not ready to receive a different digest
algorithm than MD5 when answering a data request.

If we ever should add a new hash algorithm, we would also need to
add a new cluster feature for that algorithm. The alternative would be
to add code so a coordinator could negotiate what digest algorithm to
use with the set of replicas it is contacting.

Fixes #2884

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
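The cluster-feature gating described above can be modelled in a few lines: the new digest algorithm is chosen only once every node advertises support for it. The names `feature_set`, `feature_enabled`, `choose_digest_algorithm` and the feature string "XXHASH" are illustrative assumptions, not the identifiers used in the patch.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

enum class digest_algorithm { md5, xxhash };

// Hypothetical model: each node advertises the set of features it supports.
using feature_set = std::set<std::string>;

// A feature counts as enabled only when every known node supports it.
bool feature_enabled(const std::vector<feature_set>& nodes, const std::string& feature) {
    return !nodes.empty() &&
           std::all_of(nodes.begin(), nodes.end(), [&feature](const feature_set& node) {
               return node.count(feature) > 0;
           });
}

// Falls back to MD5 while any node might not understand the new digest.
digest_algorithm choose_digest_algorithm(const std::vector<feature_set>& nodes) {
    return feature_enabled(nodes, "XXHASH") ? digest_algorithm::xxhash
                                            : digest_algorithm::md5;
}
```

During a rolling upgrade the mixed cluster keeps using MD5, and the switch happens only after the last node has been upgraded.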
Duarte Nunes
440ea56010 message/messaging_service: Specify algorithm when requesting digest
While not strictly needed, specify which algorithm to use when requesting
a digest from a remote node. This is more flexible than relying on a
cluster wide feature, although that's what we'll do in subsequent
patches. It also makes the verb more consistent with the data request.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
1ee7413b6e storage_proxy: Extract decision about digest algorithm to use
Introduce the digest_algorithm() function, which encapsulates the
decision of which digest algorithm to use. Right now it is set to MD5,
but future patches will change this.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
712c051de6 cache_flat_mutation_reader: Pre-calculate cell hash
When digest is requested, pre-calculate the cell's hash. We consider
the case when the cell is already in the cache, and the case when it
is added by the underlying reader.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
ec5b7fb553 partition_snapshot_reader: Pre-calculate cell hash
When digest is requested, pre-calculate the cell's hash. A downside of
this approach is that more work will be done when there are multiple
versions of a row that contain values for the same cell, but we expect
these cases to be rare and the upside of caching a cell's hash to
compensate for the extra work.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
4ea2f52ddb query::partition_slice: Add option to specify when digest is requested
Having this option enables us to communicate from the upper to the
lower layers whether a digest was requested, so that we can pre-calculate
and cache a cell's hash in the readers that have access to the actual
in-memory cells (within the memtable and the row cache).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
42f407ad9e row: Use cached hash for hash calculation
This entails doing the cell hash calculation slightly differently,
where the cell is hashed individually, the resulting hash being added
to the running one.

Instead of propagating a flag all through the call chain, we detect
whether we are in the new mode by the employed hash algorithm.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:49 +00:00
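The idea of hashing each cell on its own and feeding the result into the running row hash can be sketched as follows. This is a minimal illustration, not the code from the patch: `row_sketch`, `cell_hash()` and `row_digest_sketch()` are hypothetical names, cells are modelled as plain strings, `std::hash` stands in for the real hasher, and the FNV-style combining step is an arbitrary choice made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a row: cached hashes live beside the cells,
// not inside them (an assumption of this sketch).
struct row_sketch {
    std::vector<std::string> cells;
    std::vector<std::optional<std::uint64_t>> cell_hashes;
    std::size_t computations = 0; // counts real hash computations

    explicit row_sketch(std::vector<std::string> c)
        : cells(std::move(c)), cell_hashes(cells.size()) {}

    // Returns the hash of cell i, computing and caching it on first use.
    std::uint64_t cell_hash(std::size_t i) {
        if (!cell_hashes[i]) {
            cell_hashes[i] = std::hash<std::string>{}(cells[i]);
            ++computations;
        }
        return *cell_hashes[i];
    }
};

// Feeds each per-cell hash into a running value instead of re-hashing
// the cell contents every time a digest is requested.
std::uint64_t row_digest_sketch(row_sketch& r) {
    std::uint64_t running = 14695981039346656037ull; // arbitrary seed
    for (std::size_t i = 0; i < r.cells.size(); ++i) {
        running = (running ^ r.cell_hash(i)) * 1099511628211ull;
    }
    return running;
}
```

In this sketch a second digest of the same row reuses the cached per-cell hashes and only repeats the cheap combining step.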
Duarte Nunes
d773e4b9d4 mutation_partition: Replace hash_row_slice with appending_hash
This enables us to only branch once per row on the actual hash
algorithm, instead of once per row data item.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:49 +00:00
Duarte Nunes
99a3e3aa76 mutation_partition: Allow caching cell hashes
We add storage to a row to hold the cached hashes of each individual
cell. We don't store the hash in each cell because that would a)
change the cell equality function, and b) require us to change a cell
in a potentially fragmented buffer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:47 +00:00
Duarte Nunes
71ba99d53e mutation_partition: Force vector_storage internal storage size
This patch forces the size of vector_storage's internal storage to 5,
meaning that the underlying managed_vector will ensure it doesn't need
to externally allocate a buffer to hold the row, if only its first 5
cells are set.

We define this size explicitly so we can change the vector's value
type in upcoming patches without affecting the optimization.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:51 +00:00
Duarte Nunes
996e47a6f9 test.py: Increase memory for row_cache_stress_test
Cells and rows will require more memory when we start caching the cell
hash.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:51 +00:00
Duarte Nunes
7ba63b1521 atomic_cell_hash: Add specialization for atomic_cell_or_collection
Replace the atomic_cell_or_collection::feed_hash() member function
with the specialization of appending_hash, and use that instead.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:51 +00:00
Duarte Nunes
b2e1a91f4d query-result: Use digester instead of md5_hasher
Use the digester class instead of md5_hasher to encapsulate the
decision of which hash algorithm to use.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
a0d748c71c range_tombstone: Replace feed_hash() member function with appending_hash
Replace range_tombstone::feed_hash() with the specialization of
appending_hash, so that we can use the general feed_hash() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
12507fb9ce keys: Replace feed_hash() member function with appending_hash
Replace the feed_hash() member function of partition_key and
clustering_key_prefix with the specialization of appending_hash,
so that we can use the general feed_hash() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
6b4b429883 query-result: Introduce class result_options
Introduce class result_options to carry result options through the
request pipeline, which at this point mean the result type and the
digest algorithm. This class allows us to encapsulate the concrete
digest algorithm to use.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
041acb7aea query: Add class to encapsulate digest algorithm
This patch paves the way for us to encapsulate the actual digest
algorithm used for a query. The digester class dispatches to a
concrete implementation based on the digest algorithm being used. It
wraps the xxHash algorithm to provide a 128 bit hash, which is the
size of digest expected by the inter-node protocol.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
839ed4e3a4 md5_hasher: Extract hash size
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
5f6aab832b digest_algorithm: Add xxHash option
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
c803ae24fc digest: Introduce xxHash hash algorithm
This patch introduces xx_hasher, a class conforming to the Hasher
concept, which will be used to calculate the data digest in subsequent
patches. It is expected to be an order of magnitude faster than md5.

We use the 64 bit variant of the algorithm, the 128 bit one still
being under development.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
4f0295a35c CMakeLists: Add xxhash directory
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
edb9193c9c configure.py: Configure xxhash
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
102cf40bb7 Add xxhash (fast non-cryptographic hash) as submodule
Signed-off-by: Duarte Nunes <duarte@scylladb.com>

Note:
  xxhash repo should be cloned to Scylla organization, and that
  git url should be used instead.
2018-02-01 00:22:50 +00:00
Paweł Dziepak
20c460d8f0 tests/memtable: add more reader exception safety tests 2018-01-31 16:05:35 +00:00
Paweł Dziepak
c945bdc7f6 memtable: do not change accounting state in alloc section
Allocating sections can be retried so code that has side effects (like
updating flushed bytes accounting) has no place there.
2018-01-31 16:04:31 +00:00
Paweł Dziepak
d803370868 memtable: do not update iterator_reader::_last in alloc section
iterator_reader::_last is a part of the state that survives allocating
section retries, therefore, it should not be modified in the retryable
context.
2018-01-31 16:03:16 +00:00
Avi Kivity
4463e9071a Merge "Adding the API V2 Swagger definition file" from Amnon
"This series adds the base for the V2 Swagger definition file.
After the series, the definition file will be at:
http://localhost:10000/v2

It can be used with the swagger ui, by replacing the url in the search
path."

* 'amnon/swagger_20' of github.com:scylladb/seastar-dev:
  Register the API V2 swagger file
  Adding the header part of the swagger2.0 API
2018-01-31 14:47:50 +02:00
Duarte Nunes
cf6110d840 tests/cell_locker_test: Ensure timeout test finishes in useful time
Use saturating_substract to prevent a really long timeout and having
the test hang.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180130221336.1773-1-duarte@scylladb.com>
2018-01-31 11:34:08 +01:00
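A saturating subtraction clamps at the lowest representable value instead of wrapping around. The helper used by the test is not shown in this log, so the following is only a sketch under stated assumptions: `saturating_subtract_sketch` is a hypothetical name, it works on `std::chrono` time points, and it assumes the duration being subtracted is non-negative.

```cpp
#include <chrono>

// Hypothetical helper, for illustration only. Subtracts d from tp, but
// returns the minimum time point instead of underflowing.
// Assumes d is non-negative.
template <typename Clock, typename Duration>
std::chrono::time_point<Clock, Duration>
saturating_subtract_sketch(std::chrono::time_point<Clock, Duration> tp, Duration d) {
    using time_point = std::chrono::time_point<Clock, Duration>;
    // Duration::min() + d cannot overflow when d >= 0, so this check is safe.
    if (tp.time_since_epoch() < Duration::min() + d) {
        return time_point::min();
    }
    return tp - d;
}
```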
Duarte Nunes
01a8e5abb9 Merge 'Materialized views: add local locking' from Nadav
"Before this patch set, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows in the view table, instead of just one with the latest
data. In this series we add locking which serializes the two conflicting
updates, and solves this problem.

I explain in more detail why such locking is needed, and what kinds of
locks are needed, in the third patch."

* 'master' of https://github.com/nyh/scylla:
  Materialized views: serialize read-modify-update of base table
  Materialized views: test row_locker class
  Materialized views: implement row and partition locking mechanism
2018-01-30 17:40:12 +00:00
Tomasz Grabiec
cdd31918d0 Merge 'Make memtable reads exception safe' from Paweł
These patches change the memtable reader implementation (in particular
partition_snapshot_reader) so that the existing exception safety
problems are fixed, but also in a way that, hopefully, would make it
easier to reason about the error handling and avoid future bugs in that
area.

The main difficulty related to exception safety is that when an
exception is thrown out of an allocating section, that code is run again
with increased memory reserved. If the retryable code has side effects
it is very easy to get incorrect behaviour.

In addition to that, entering an allocating section is not exactly cheap
which encourages doing so rarely and having large sections.

The approach taken by this series is to, first, make entering allocating
sections cheaper and then reduce the amount of logic that runs inside
of them to a minimum.

This means that instead of entering a section once per call to
flat_mutation_reader::fill_buffer() the allocation section is entered
once for each emitted row. The only state modified from within the
section are cached iterators to the current row, which are dropped on
retry. Hopefully, this would make the reader code easier to reason
about.

The optimisations to the allocating sections and managed_bytes
linearised context have successfully eliminated any penalty caused by
much more fine grained allocating sections.

Fixes #3123.
Fixes #3133.

Tests: unit-tests (release)

BEFORE
test                                      iterations      median         mad         min         max
memtable.one_partition_one_row               1155362   869.139ns     0.282ns   868.465ns   873.253ns
memtable.one_partition_many_rows              127252     7.871us    15.252ns     7.851us     7.886us
memtable.many_partitions_one_row               58715    17.109us     2.765ns    17.013us    17.112us
memtable.many_partitions_many_rows              4839   206.717us   212.385ns   206.505us   207.448us

AFTER
test                                      iterations      median         mad         min         max
memtable.one_partition_one_row               1194453   839.223ns     0.503ns   834.952ns   842.841ns
memtable.one_partition_many_rows              133785     7.477us     4.492ns     7.473us     7.507us
memtable.many_partitions_one_row               60267    16.680us    18.027ns    16.592us    16.700us
memtable.many_partitions_many_rows              4975   201.048us   144.929ns   200.822us   201.699us

        ./before_sq  ./after_sq  diff
 read     337373.86   353694.24  4.8%
 write    388759.99   394135.78  1.4%

* https://github.com/pdziepak/scylla.git memtable-exception-safety/v2:
  tests/perf: add microbenchmarks for memtable reader
  flat_mutation_reader: add allocation point in push_mutation_fragment
  linearization_context: remove non-trivial operations from fast path
  lsa: split alloc section into reserving and reclamation-disabled parts
  lsa: optimise disabling reclamation and invalidation counter
  mutation_fragment: allow creating clustering row in place
  paratition_snapshot_reader: minimise amount of retryable code
  memtable: drop memtable_entry::read()
  tests/memtable: add test for reader exception safety
2018-01-30 18:33:27 +01:00
Paweł Dziepak
1406ac5088 tests/memtable: add test for reader exception safety 2018-01-30 18:33:26 +01:00
Paweł Dziepak
ea7248056f memtable: drop memtable_entry::read() 2018-01-30 18:33:26 +01:00
Paweł Dziepak
0420ca48a5 paratition_snapshot_reader: minimise amount of retryable code
Retryable code that has side effects is a recipe for bugs. This patch
reworks the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has very limited side effects.
2018-01-30 18:33:26 +01:00
Paweł Dziepak
b1cb7d214e mutation_fragment: allow creating clustering row in place
Moving clustering_row is expensive due to the amount of data stored
internally. Adding a mutation_fragment constructor that builds a
clustering_row in-place saves some of that moving.
2018-01-30 18:33:26 +01:00
Paweł Dziepak
dcd79af8ed lsa: optimise disabling reclamation and invalidation counter
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.

However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.

In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
2018-01-30 18:33:26 +01:00
Paweł Dziepak
d825ae37bf lsa: split alloc section into reserving and reclamation-disabled parts
An allocating section reserves a certain amount of memory, then disables
reclamation and attempts to perform given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.

Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.

In order to reduce the performance penalty of such a solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
2018-01-30 18:33:26 +01:00
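The split described above, an expensive reserving part that retries on std::bad_alloc and a cheap part that only disables reclamation, can be sketched as follows. This is not the LSA implementation: `alloc_section_sketch` and its members are hypothetical, the reserve is plain bookkeeping, and doubling the reserve on failure is an assumption made for the example.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for an allocating section; not the LSA API.
class alloc_section_sketch {
    std::size_t _reserve = 1;
    bool _reclaim_enabled = true;
public:
    std::size_t reserve() const { return _reserve; }
    bool reclaim_enabled() const { return _reclaim_enabled; }

    // Expensive part: run fn with the current reserve; on std::bad_alloc
    // grow the reserve and run fn again from the start.
    template <typename Func>
    decltype(auto) with_reserve(Func&& fn) {
        for (;;) {
            try {
                return fn();
            } catch (const std::bad_alloc&) {
                _reserve *= 2; // growth policy is an assumption
            }
        }
    }

    // Cheap part: flip a flag for the duration of fn, restoring it even
    // if fn throws. Can be entered many times inside one with_reserve().
    template <typename Func>
    decltype(auto) with_reclaiming_disabled(Func&& fn) {
        struct guard {
            bool& flag;
            bool prev;
            ~guard() { flag = prev; }
        } g{_reclaim_enabled, _reclaim_enabled};
        _reclaim_enabled = false;
        return fn();
    }
};
```

Because the whole callable passed to `with_reserve()` is re-run after a failure, anything it does must be safe to repeat, which is why the series keeps the logic running with reclamation disabled as small as possible.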
Paweł Dziepak
eb2e88e925 linearization_context: remove non-trivial operations from fast path
Since linearization_context is thread_local every time it is accessed
the compiler needs to emit code that checks if it was already
constructed and does so if it wasn't. Moreover, upon leaving the context
from the outermost scope the map needs to be cleared.

All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that linearization_context is trivially
constructible and the map is cleared only if it was modified.
2018-01-30 18:33:25 +01:00
Paweł Dziepak
a1278b4d6a flat_mutation_reader: add allocation point in push_mutation_fragment
Exception safety tests inject a failure at every allocation and verify
whether the error is handled properly.

push_mutation_fragment() adds a mutation fragment to a circular_buffer,
in theory any call to that function can result in a memory allocation,
but in practice that depends on the implementation details. In order to
improve the effectiveness of the exception safety tests this patch adds
an explicit allocation point in push_mutation_fragment().
2018-01-30 18:33:25 +01:00
Paweł Dziepak
486e0d8740 tests/perf: add microbenchmarks for memtable reader 2018-01-30 18:33:25 +01:00
Avi Kivity
00d70080af Merge "Consume promoted index incrementally" from Vladimir
"This patchset makes index_reader consume promoted index incrementally
on demand as the reader advances through the current partition instead
of storing the entire promoted index which can be huge.

When the current page is parsed, data for promoted indices are turned
into input streams that are only read and parsed if a particular
position within a partition is sought. This avoids potentially large
allocations for big partitions."

* 'issues/2981/v10' of https://github.com/argenet/scylla:
  Use advance_past for single partition upper bound.
  Remove obsolete types and methods.
  Simplify continuous_data_consumer::consume_input() interface.
  Parse promoted index entries lazily upon request rather than immediately.
  Add helper input streams: buffer_input_stream and prepended_input_stream.
  Support skipping over bytes from input stream in parsers based on continuous_data_consumer
  Add performance tests for large partition slicing using clustering keys.
2018-01-30 18:22:28 +02:00
Nadav Har'El
2ea1922a4d Materialized views: serialize read-modify-update of base table
Before this patch, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows added to the view table, instead of just one with the latest
data. In this patch we add locking which serializes the two conflicting
updates, and solves this problem. The locking for a single base-table
column_family is implemented by the row_locker class introduced in a
previous patch.

A long comment in the code of this patch explains in more detail why
this locking is needed, when, and what types of locks are needed: We
sometimes need to lock a single clustering row, sometimes an entire
partition, sometimes an exclusive lock and sometimes a shared lock.

Fixes #3168

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:21:43 +02:00
Nadav Har'El
52e91623ce Materialized views: test row_locker class
This is a unit test for the row_locker facility. It tests various
combinations of shared and exclusive locks on rows and on partitions,
some should succeed immediately and some should block.

This tests the row_locker's API only; it does not use or test anything
in Materialized Views.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:19:43 +02:00
Nadav Har'El
31d0a1dd0c Materialized views: implement row and partition locking mechanism
This patch adds a "row_locker" class providing locking (shard-locally) of
individual clustering rows or entire partitions, and both exclusive and
shared locks (a.k.a. reader/writer lock).

As we'll see in a following patch, we need this locking capability for
materialized views, to serialize the read-modify-update modifications
which involve the same rows or partitions.

The new row_locker is significantly different from the existing cell_locker.
The two main differences are that 1. row_locker also supports locking the
entire partition, not just individual rows (or cells in them), and that
2. row_locker supports also shared (reader) locks, not just exclusive locks.
For this reason we opted for a new implementation, instead of making large
modifications to the existing cell_locker. And we put the source files
in the view/ directory, because row_locker's requirements are pretty
specific to the needs of materialized views.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:16:27 +02:00
Takuya ASADA
bec2b015e3 dist/debian: link yaml-cpp statically
To avoid incompatibilities between distribution-provided versions of
libyaml-cpp, link it statically.

Fixes #3164

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517313320-10712-1-git-send-email-syuu@scylladb.com>
2018-01-30 14:22:02 +02:00
Botond Dénes
b7d902a9e9 database: remove unused concurrency config members
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b257c7e9d403c55aaec34fc48863c18f9c9ae11a.1517314398.git.bdenes@scylladb.com>
2018-01-30 14:21:25 +02:00
Botond Dénes
71be2e1d0d test.py: don't fail if test's exit code is not 0 on --help
test.py invokes all test executables once with --help to determine
whether it needs a -- to separate scylla args or not. For this check it
doesn't matter what exit code the test exits with, so don't fail if it's
not 0.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d05be7c3819349e3b22b6249bb83fbf9269d14cb.1517314408.git.bdenes@scylladb.com>
2018-01-30 14:21:01 +02:00
Piotr Jastrzebski
d9415e8ed0 Remove unused consume_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Tests: units (release)

Message-Id: <fec7f2d01d42921270c90198a7b77b76960ff705.1517310923.git.piotr@scylladb.com>
2018-01-30 13:24:55 +02:00
Duarte Nunes
1e3fae5bef db/schema_tables: Only drop UDTs after merging tables
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.

When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.

Fixes #3068

Tests: unit-tests (release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
2018-01-30 12:07:04 +01:00
Avi Kivity
e1f4b06295 Merge seastar upstream
* seastar 770c450...19efbd9 (3):
  > configure.py: add --static-yaml-cpp option to link libyaml-cpp statically
  > Merge 'Avoid kernel stalls due to fsync' from Avi
  > rwlock: add exception-safe lock/unlock alternative
2018-01-30 11:44:00 +02:00
Pekka Enberg
da06339b13 scripts/find-maintainer: Find subsystem maintainer
This patch adds a scripts/find-maintainer script, similar to
scripts/get_maintainer.pl in Linux, which looks up maintainers and
reviewers for a specific file from a MAINTAINERS file.

Example usage looks as follows:

$ ./scripts/find-maintainer cql3/statements/create_view_statement.cc
CQL QUERY LANGUAGE
  Tomasz Grabiec <tgrabiec@scylladb.com>   [maintainer]
  Pekka Enberg <penberg@scylladb.com>      [maintainer]
MATERIALIZED VIEWS
  Duarte Nunes <duarte@scylladb.com>       [maintainer]
  Pekka Enberg <penberg@scylladb.com>      [maintainer]
  Nadav Har'El <nyh@scylladb.com>          [reviewer]
  Duarte Nunes <duarte@scylladb.com>       [reviewer]

The main objective of this script is to make it easier for people to
find reviewers and maintainers for their patches.
Message-Id: <20180119075556.31441-1-penberg@scylladb.com>
2018-01-30 09:42:35 +00:00
Vladimir Krivopalov
b91c3fd47e Use advance_past for single partition upper bound.
Instead of advancing to the next partition, first try to find the more
precise position using promoted index blocks.
advance_past() only seeks within currently available PI blocks (or reads
the first batch, if never read before) and uses the position if found,
otherwise resorts to advance_to_next_partition()

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:45 -08:00
Vladimir Krivopalov
6f8c6a0933 Remove obsolete types and methods.
These types and methods are no longer in use since the index_reader is
now consuming promoted index incrementally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:35 -08:00
Vladimir Krivopalov
0a7a56edd5 Simplify continuous_data_consumer::consume_input() interface.
Remove the redundant input parameter, as continuous_data_consumer
derivatives would only use themselves as a context. So obtain it
internally and make the function a regular (non-template) one with no
parameters.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:26 -08:00
Vladimir Krivopalov
7e15e436de Parse promoted index entries lazily upon request rather than immediately.
Now promoted index is converted into an input_stream and skipped over
instead of being consumed immediately and stored as a single buffer.
The only part that is read right away is the deletion time as it is
likely to be there in the already read buffer and reading it should both
be cheap and avoid reading the whole promoted index if only
deletion time mark is needed.

When accessed, promoted index is parsed in chunks, buffer by buffer, to
limit memory consumption.

Fixes #2981

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:15 -08:00
Vladimir Krivopalov
9fdf4b24b5 Add helper input streams: buffer_input_stream and prepended_input_stream.
buffer_input_stream is a simple input_stream wrapping a single
temporary_buffer.

prepended_input_stream suits the case when some data has been read
into a buffer and the rest is still in a stream. It accepts a buffer and
a data_source and first reads from the buffer and then, when it ends,
proceeds reading from the data_source.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:57:04 -08:00
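The behaviour described for prepended_input_stream, serving the already-read buffer first and then falling through to the underlying source, can be sketched as follows. This is a simplified, synchronous stand-in rather than the patch's code: `prepended_source_sketch` is a hypothetical name, std::string replaces temporary_buffer, and a std::function returning an empty string at end of stream replaces data_source.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical, synchronous stand-in for illustration only.
class prepended_source_sketch {
    std::string _buffered;                 // data that was already read
    std::function<std::string()> _source;  // returns "" at end of stream
public:
    prepended_source_sketch(std::string buffered, std::function<std::string()> source)
        : _buffered(std::move(buffered)), _source(std::move(source)) {}

    // Returns the prepended buffer once, then delegates to the source.
    std::string get() {
        if (!_buffered.empty()) {
            return std::exchange(_buffered, std::string());
        }
        return _source();
    }
};
```

The underlying source is not touched until the prepended buffer has been handed out, so a caller that only needs the buffered bytes never triggers a read from the source.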
Vladimir Krivopalov
5dca3100ed Support skipping over bytes from input stream in parsers based on continuous_data_consumer
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:56:55 -08:00
Vladimir Krivopalov
ebdcffab1a Add performance tests for large partition slicing using clustering keys.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-01-29 11:56:35 -08:00
Takuya ASADA
5f835be3aa dist/common/scripts/scylla_io_setup: check data_file_directories existance before running iotune
Currently we don't check that data_file_directories exists before running
iotune, so it shows an unclear error message.
To make the message better, check that the directory exists in scylla_io_setup.

Fixes #3137

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1517200647-6347-1-git-send-email-syuu@scylladb.com>
2018-01-29 18:11:12 +02:00
Avi Kivity
3ce5ad3c7c Merge seastar upstream
* seastar d03896d...770c450 (10):
  > tls_test: Fix echo test not setting server trust store
  > tls: Do not restrict re-handshake to client
  > tls: Actually verify client certificate if requested
  > rwlock: add method for determining if an rwlock is locked
  > metrics: Add missing `break` to metric_value::operator+()
  > memory: fix error injector throwing from noexcept memory allocator functions
  > systemwide_memory_barrier: don't use mprotect() on ARM
  > sharded: Add const version of sharded::local()
  > Add const overloads of front() and back() to the circular_buffer.
  > Remove unused lambda captures

Fixes #3072
2018-01-29 15:28:44 +02:00
Botond Dénes
12b1520415 exponential_backoff_retry::do_until_value(): restore indentation
Deferred from previous patch.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a10053f6c0ed8a24a74e51f1df4e9a5acf59922d.1517222195.git.bdenes@scylladb.com>
2018-01-29 10:50:01 +00:00
Botond Dénes
e0c082616a exponential_backoff_retry::do_until_value(): fix use-after-move
The exponential_backoff_retry instance is captured by move and is then
indirectly moved again as repeat_until_value() moves the lambda it is
passed into its internal state. This caused problems as internal
lambdas store references to the instance and these references go stale
after the move.
To fix this keep hold of the exponential_backoff_retry instance in an
enclosing do_with() to make it safe for internal lambdas to reference
it.

Indentation will be fixed by the next patch.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <adc49d25a6176756d60e092f3713c0c897732382.1517222195.git.bdenes@scylladb.com>
2018-01-29 10:50:01 +00:00
Duarte Nunes
bfe5a8e96f utils/managed_vector: Return reference to emplaced element
We are in 2018, after all.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180126105417.54285-1-duarte@scylladb.com>
2018-01-26 13:49:56 +01:00
Duarte Nunes
269a4aec23 test.py: Rename streamed_mutation_test
96c97ad1db changed the name of the test,
but didn't update the test.py file.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-01-26 01:04:23 +01:00
Tomasz Grabiec
1219120c00 Merge cleanup of non-flat mutation readers from Piotr
Removes uses of obsolete mutation_reader and streamed_mutation.
Superseded by flat_mutation_reader.

* seastar-dev.git haaawk/cleanup:
  Rename streamed_mutation* files to mutation_fragment*
  Delete unused streamed_mutation
  Delete unused  consume_all(streamed_mutation&)
  Delete unused fill_buffer_from<streamed_mutation>
  Delete unused do_consume_streamed_mutation_flattened
  streamed_mutation: delete operator<<
  streamed_mutation: delete unused make_forwardable
  Delete unused streamed_mutation_opt
  Delete unused check_order_of_fragments
  Delete unused streamed_mutation_from_mutation
  Move test_abandoned_flat_mutation_reader_from_mutation to
  Change test_abandoned_streamed_mutation_from_mutation
  test_mutation_merger_conforms_to_mutation_source: use flat reader
  Delete unused consume(streamed_mutation&)
  Delete unused mutation_from_streamed_mutation(streamed_mutation_opt)
  Delete unused mutation_from_streamed_mutation(streamed_mutation&)
  Delete test_mutation_from_streamed_mutation_from_mutation
  Delete unused freeze(streamed_mutation)
  Delete test_freezing_streamed_mutations
  streamed_mutation: delete unused transform
  test_schema_upgrader_is_equivalent_with_mutation_upgrade: use flat reader
  streamed_mutation: delete unused consume_mutation_fragments_until
  Delete unused merge_mutations
  Delete test_mutation_merger
  Delete unused make_empty_streamed_mutation
  Delete unused streamed_mutation_from_forwarding_streamed_mutation
  Delete unused streamed_mutation_assertions
  Turn test_streamed_mutation_fragments_have_monotonic_positions
  Delete run_conversion_to_mutation_reader_tests
  Delete unused assert_that(streamed_mutation_opt)
  Delete unused assert_that(streamed_mutation)
  Delete unused mutation_reader
  perf_fast_forward: delete unused consume_all
  Delete unused consume(mutation_reader&, Consumer)
  Remove unused mutation_reader_assertions
  Remove unused query_state::reader
  Delete unused make_reader_returning
  Delete unused make_reader_returning_many
  Delete unused make_empty_reader
  Delete unused mutation_reader_from_flat_mutation_reader
  Delete unused flat_mutation_reader_from_mutation_reader
  Delete tests for mutation readers converters
  dummy_incremental_selector: use flat reader
  Delete unused streamed_mutation_from_flat_mutation_reader
  perf_fast_forward: use flat reader in test_forwarding_with_restriction
  perf_fast_forward: use flat reader in slice_partitions
  perf_fast_forward: use flat reader in slice_rows_single_key
  perf_fast_forward: use flat reader in test_reading_all
  perf_fast_forward: use flat reader in slice_rows
  perf_fast_forward: add consume_all_with_next_partition
  perf_fast_forward: use flat reader in scan_with_stride_partitions
  perf_fast_forward: use flat reader in scan_rows_with_stride
  perf_fast_forward: add assert_partition_start
  perf_fast_forward: add consume_all(flat_mutation_reader&)
  partition_checksum::compute_legacy: use only flat reader
  row_cache: rename make_flat_reader to make_reader
  row_cache: Delete unused make_reader
  test_mvcc: use flat reader
  test_cache_population_and_clear_race: use flat reader
  test_cache_population_and_update_race: use flat reader
  test_continuity_flag_and_invalidate_race: use flat reader
  test_update_failure: use flat reader
  row_cache_test: use flat reader in verify_has
  row_cache_test: use flat reader in has_key
  test_sliced_read_row_presence: use flat reader
  test_lru: use flat reader
  test_update_invalidating: use flat reader
  test_scan_with_partial_partitions: use flat reader
  test_cache_populates_partition_tombstone: use flat reader
  test_tombstone_merging_in_partial_partition: use flat reader
  consume_all,populate_range: use flat reader
  test_readers_get_all_data_after_eviction: use flat reader
  test_tombstones_are_not_missed_when_range_is_invalidated: use flat reader
  test_exception_safety_of_reads: use flat reader
  test_exception_safety_of_transitioning_from_underlying_read_to_read_from_cache: use flat reader
  test_exception_safety_of_partition_scan: use flat reader
  test_concurrent_population_before_latest_version_iterator: use flat reader
  test_concurrent_populating_partition_range_reads: use flat reader
  test_random_row_population: use flat reader
  test_continuity_is_populated_when_read_overlaps_with_older_version: use flat reader
  test_continuity_population_with_multicolumn_clustering_key: use flat reader
  test_continuity_is_populated_for_single_row_reads: use flat reader
  flat_mutation_reader_assertions: add produces_compacted
  test_concurrent_setting_of_continuity_on_read_upper_bound: use flat reader
  test_reading_from_random_partial_partition: use flat reader
  test_tombstone_merging_of_overlapping_tombstones_in_many_versions: use flat reader
  test_concurrent_reads_and_eviction: use flat reader
  test_eviction: use flat reader
  test_random_partition_population: use flat reader
  test_single_key_queries_after_population_in_reverse_order: use flat reader
  test_query_of_incomplete_range_goes_to_underlying: use flat reader
  test_cache_delegates_to_underlying_only_once_with_single_partition: use flat reader
  test_cache_uses_continuity_info_for_single_partition_query: use flat reader
  test_cache_delegates_to_underlying_only_once_empty_single_partition_query: use flat reader
  test_cache_delegates_to_underlying_only_once_empty_full_range: use flat reader
  test_cache_works_after_clearing: use flat reader
  test_cache_delegates_to_underlying: use flat reader
  cache_flat_mutation_reader_test: use flat reader
  row_cache_alloc_stress: use flat reader
2018-01-24 21:54:08 +01:00
Piotr Jastrzebski
1f9df7aade Fix master
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 21:00:51 +01:00
Piotr Jastrzebski
96c97ad1db Rename streamed_mutation* files to mutation_fragment*
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
d590a063c6 Delete unused streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
6f468802f4 Delete unused consume_all(streamed_mutation&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
970a863950 Delete unused fill_buffer_from<streamed_mutation>
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
28c36d8884 Delete unused do_consume_streamed_mutation_flattened
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
6c6068f1da streamed_mutation: delete operator<<
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
f907073bde streamed_mutation: delete unused make_forwardable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
a346b32584 Delete unused streamed_mutation_opt
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
7161781586 Delete unused check_order_of_fragments
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
41b23a619e Delete unused streamed_mutation_from_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
795102a0f8 Move test_abandoned_flat_mutation_reader_from_mutation to
flat_mutation_reader_test.cc

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
6b78956563 Change test_abandoned_streamed_mutation_from_mutation
to test_abandoned_flat_mutation_reader_from_mutation

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
9e06711805 test_mutation_merger_conforms_to_mutation_source: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
916a9c339c Delete unused consume(streamed_mutation&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
d9cbb9fedc Delete unused mutation_from_streamed_mutation(streamed_mutation_opt)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
759271f866 Delete unused mutation_from_streamed_mutation(streamed_mutation&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
a39ddc8cf6 Delete test_mutation_from_streamed_mutation_from_mutation
It tests mutation_from_streamed_mutation that is no longer
used and will be removed in the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
a1cf4b4cae Delete unused freeze(streamed_mutation)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
0f78e9c24a Delete test_freezing_streamed_mutations
It tests freeze(streamed_mutation) which is no longer used
and will be removed in the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
05ae4f5d15 streamed_mutation: delete unused transform
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
1c12884fba test_schema_upgrader_is_equivalent_with_mutation_upgrade: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
eec6c2efb5 streamed_mutation: delete unused consume_mutation_fragments_until
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
ca905d38b1 Delete unused merge_mutations
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
8abbabef30 Delete test_mutation_merger
merge_mutations is no longer used and will be removed
by the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
b82f00fafb Delete unused make_empty_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
fb42022f03 Delete unused streamed_mutation_from_forwarding_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
5959337234 Delete unused streamed_mutation_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
8bdc74c9e2 Turn test_streamed_mutation_fragments_have_monotonic_positions
into test_mutation_reader_fragments_have_monotonic_positions

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
a546cfd0d5 Delete run_conversion_to_mutation_reader_tests
It's no longer needed because converters are no longer used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
05ed42c08d Delete unused assert_that(streamed_mutation_opt)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
912a38d60b Delete unused assert_that(streamed_mutation)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
61f0ac257f Delete unused mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
a944a1f7f1 perf_fast_forward: delete unused consume_all
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
9ce48bc5fc Delete unused consume(mutation_reader&, Consumer)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
7729bc5e7b Remove unused mutation_reader_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
5636a97c81 Remove unused query_state::reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
37285ad7fa Delete unused make_reader_returning
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
864db78fcf Delete unused make_reader_returning_many
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
ff4ffc1c64 Delete unused make_empty_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
0b8aedcc59 Delete unused mutation_reader_from_flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
c9575078a1 Delete unused flat_mutation_reader_from_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
f20c19b0e6 Delete tests for mutation readers converters
The converters are not used anywhere any longer and
will be deleted in the next patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
88ca42fa69 dummy_incremental_selector: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
8aaf5dc900 Delete unused streamed_mutation_from_flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
93355372a0 perf_fast_forward: use flat reader in test_forwarding_with_restriction
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
252909c8ab perf_fast_forward: use flat reader in slice_partitions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
7d082e6ea7 perf_fast_forward: use flat reader in slice_rows_single_key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
177aa88dc1 perf_fast_forward: use flat reader in test_reading_all
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
899e471222 perf_fast_forward: use flat reader in slice_rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
e66c73839e perf_fast_forward: add consume_all_with_next_partition
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
b9bfa49088 perf_fast_forward: use flat reader in scan_with_stride_partitions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
f75c58915d perf_fast_forward: use flat reader in scan_rows_with_stride
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
52021dc605 perf_fast_forward: add assert_partition_start
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
5c213b9cbc perf_fast_forward: add consume_all(flat_mutation_reader&)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
ee6f2ca554 partition_checksum::compute_legacy: use only flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
39ec13133f row_cache: rename make_flat_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
0f45df96ca row_cache: Delete unused make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
0d76091a28 test_mvcc: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
425c1624cd test_cache_population_and_clear_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
dc97acb778 test_cache_population_and_update_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
1bead9747a test_continuity_flag_and_invalidate_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
4266b9759e test_update_failure: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
d5366026b1 row_cache_test: use flat reader in verify_has
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
56b0157831 row_cache_test: use flat reader in has_key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
06bca9f4d5 test_sliced_read_row_presence: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
6c3d9cdb9f test_lru: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
a979869a15 test_update_invalidating: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
781d9a324d test_scan_with_partial_partitions: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f199aab1ad test_cache_populates_partition_tombstone: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
9755f7677c test_tombstone_merging_in_partial_partition: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
2e1b12b6ce consume_all,populate_range: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
d08f4a40b2 test_readers_get_all_data_after_eviction: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f99992261f test_tombstones_are_not_missed_when_range_is_invalidated: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
50fb2a57b6 test_exception_safety_of_reads: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f0af5a1321 test_exception_safety_of_transitioning_from_underlying_read_to_read_from_cache: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
98b97be19a test_exception_safety_of_partition_scan: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
5010c082f6 test_concurrent_population_before_latest_version_iterator: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
f8964f3aff test_concurrent_populating_partition_range_reads: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
3e1da7525e test_random_row_population: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
e6cf785829 test_continuity_is_populated_when_read_overlaps_with_older_version: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
2b61411c7b test_continuity_population_with_multicolumn_clustering_key: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
561f5fbb5a test_continuity_is_populated_for_single_row_reads: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
b4cfe4dde2 flat_mutation_reader_assertions: add produces_compacted
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
a1b6557877 test_concurrent_setting_of_continuity_on_read_upper_bound: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
6bbd0c7301 test_reading_from_random_partial_partition: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
327eb8fbbd test_tombstone_merging_of_overlapping_tombstones_in_many_versions: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
07df1a6f87 test_concurrent_reads_and_eviction: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
63f45d522e test_eviction: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
57d19a390a test_random_partition_population: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
e9e8121ffe test_single_key_queries_after_population_in_reverse_order: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
9acbb1e0f4 test_query_of_incomplete_range_goes_to_underlying: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
7456c31e10 test_cache_delegates_to_underlying_only_once_with_single_partition: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
4a3f5249ce test_cache_uses_continuity_info_for_single_partition_query: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
869443e11f test_cache_delegates_to_underlying_only_once_empty_single_partition_query: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
4cc9a0d852 test_cache_delegates_to_underlying_only_once_empty_full_range: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
5cdc77b66e test_cache_works_after_clearing: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
5091474f14 test_cache_delegates_to_underlying: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
e290f46d2d cache_flat_mutation_reader_test: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
4119a61155 row_cache_alloc_stress: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00
Piotr Jastrzebski
c0c88b3d4e Fix master
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:53:11 +01:00
Amnon Heiman
a0a1961b6d database: correct the label creation for database reads
The labels in the database active_reads metrics were not defined correctly.

Labels should be created so that it is possible to select based on their
value.

The current implementation defines a label "class" with three instances:
user, streaming, system.

Fixes: #2770

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180123125206.23660-1-amnon@scylladb.com>
2018-01-24 20:09:40 +01:00
Piotr Jastrzebski
c394dd9288 row_cache_test: add tests for small_buffer
When the buffer of a flat reader is small, the reader can't
handle range_tombstones correctly.

This is not a problem in production, where the buffer is large.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:09:11 +01:00
Piotr Jastrzebski
19e1f7c285 cache_flat_mutation_reader: fix tombstones handling with small buffer
Previously, when the buffer was so small that it could fit only a single
range_tombstone, cache_flat_mutation_reader would keep returning
the same tombstone over and over again.

The fix is to set _lower_bound to the next fragment we want to return.

Fixes #3139

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:09:11 +01:00
Tomasz Grabiec
6654fa6df7 row_cache: Drop unnecessary assignment to _lower_bound on exception
We no longer drain cached tombstones since commit
41ede08a1d, so this adjustment of
lower_bound is not needed.

Message-Id: <1516796248-11290-1-git-send-email-tgrabiec@scylladb.com>
2018-01-24 16:39:34 +02:00
Tomasz Grabiec
bf4a90fa51 flat_mutation_reader: Fix use-after-scope on timeout
timeout parameter was captured by reference, and could be accessed out
of scope in case the repeat loop deferred.

Fixes debug-mode failure of flat_mutation_reader_test.
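
A minimal sketch of this class of bug, using plain std::function instead of Seastar futures (an assumption made to keep the example self-contained; make_timeout_check is a hypothetical name, not code from the patch). Capturing the parameter by value keeps the deferred callable valid after the enclosing function has returned:

```cpp
#include <chrono>
#include <functional>

using time_point = std::chrono::steady_clock::time_point;

// Returns a callable that may run long after this function has returned,
// much like a continuation that runs once a loop has deferred.
// Capturing `timeout` by value copies it into the closure. Capturing it by
// reference ([&timeout]) would leave the closure referring to a parameter
// that no longer exists once this function returns.
std::function<bool(time_point)> make_timeout_check(time_point timeout) {
    return [timeout](time_point now) { return now >= timeout; };
}
```

The returned callable owns its own copy of the deadline, so it can be stored and invoked at any later point.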

Message-Id: <1516699230-19545-1-git-send-email-tgrabiec@scylladb.com>
2018-01-23 11:39:44 +02:00
Raphael S. Carvalho
2c181b69c9 sstables: fix wildly inaccurate sstable key estimation after dynamic index sampling
The reason sstable key estimation is inaccurate is that it doesn't account for
index sampling now being dynamic.

The estimation is done as follows:
    uint64_t get_estimated_key_count() const {
        return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) *
                _components->summary.header.min_index_interval;
    }

The biggest problem is that _components->summary.header.min_index_interval isn't
actually the minimum interval, but instead the default interval value set in the
schema.
So the estimation gets worse the larger the average partition, because the larger
the average partition the lower the index sampling interval.
One of the problems is that estimation has a big influence on bloom filter size,
and so for large partitions we were generating bigger filters than we had to.

From now on, size at full sampling is calculated as if sampling were static
(which was the case until commit 8726ee937d which introduced size-based
sampling), using the minimum index interval as a strict sampling interval.
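
The effect on the estimate can be sketched with plain arithmetic. The first function is the formula quoted above; static_size_at_full_sampling is a hypothetical helper that assumes one summary entry per min_index_interval partitions, rounded down, which may differ from the rounding used in the actual patch:

```cpp
#include <cstdint>

// Formula quoted in the commit message above.
inline std::uint64_t estimated_key_count(std::uint64_t size_at_full_sampling,
                                         std::uint64_t min_index_interval) {
    return (size_at_full_sampling + 1) * min_index_interval;
}

// Hypothetical helper: size at full sampling computed as if sampling were
// static, i.e. one summary entry for every `min_index_interval` partitions.
// Rounding down is an assumption of this sketch.
inline std::uint64_t static_size_at_full_sampling(std::uint64_t partition_count,
                                                  std::uint64_t min_index_interval) {
    return partition_count / min_index_interval;
}
```

For example, assume 1000 large partitions, an interval of 128, and a dynamically sampled summary that ended up with one entry per partition. Feeding that summary size into the formula gives (1000 + 1) * 128 = 128128 keys, while the static size 1000 / 128 = 7 gives (7 + 1) * 128 = 1024.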

Tests: units (release)

Fixes #3113.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>
2018-01-23 10:42:24 +02:00
Asias He
5bae9b4e22 gossip: Check get_application_state_ptr in get_host_id
Check the pointer returned from get_application_state_ptr before using it.

Refs #2136

Message-Id: <e2ea32993754a79837dd97a7c5c601461dc5e1d1.1516581663.git.asias@scylladb.com>
2018-01-22 12:56:20 +02:00
Avi Kivity
1193e7d2e2 Merge "CAST from integers to decimal" from Daniel
"It turned out that decimal numbers obtained by casting from integers
should always contain exactly one decimal place, 0.

This is most noticeable when calculating avg(.) over such numbers,
because the result contains just one decimal place.

Fixes #3111."

* 'danfiala/integers-to-decimal' of github.com:hagrid-the-developer/scylla:
  tests: Add test that decimal obtained as CAST from integer always contain one decimal place.
  types: Decimal that is obtained from integer always contain one decimal place.
2018-01-21 20:21:00 +02:00
Daniel Fiala
4b31348463 tests: Add test that decimal obtained as CAST from integer always contain one decimal place.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-21 19:09:03 +01:00
Daniel Fiala
39a08cac6b types: Decimal that is obtained from integer always contain one decimal place.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-21 17:37:24 +01:00
Alexys Jacob
bd3517efd8 scyllatop: PEP8 python coding style compliance
this patch fixes the following remarks:
./defaults.py:2:9: E126 continuation line over-indented for hanging indent
./fake.py:15:1: E305 expected 2 blank lines after class or function definition, found 1
./livedata.py:49:17: F402 import 'metric' from line 5 shadowed by loop variable
./scyllatop.py:44:1: E305 expected 2 blank lines after class or function definition, found 1

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180119162939.17866-1-ultrabug@gentoo.org>
2018-01-21 17:15:29 +02:00
Alexys Jacob
604bc40d8a dist: migrate gentoo variant setup scripts from /sbin/service to /sbin/rc-service
the 'service' binary has been removed from gentoo as per news 2017-10-13:
https://gitweb.gentoo.org/data/gentoo-news.git/plain/2017-10-13-openrc-service-binary-removal/2017-10-13-openrc-service-binary-removal.en.txt

this patch updates the scylla setup related scripts where it was used to
make use of the 'rc-service' binary instead

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180119161310.15435-1-ultrabug@gentoo.org>
2018-01-21 17:15:26 +02:00
Glauber Costa
0c00667206 streaming big: keep write_monitor alive until the end of flush
After the new compaction controller code, the monitor has to be kept
alive until the sstable is added to the SSTable set.

This is correctly handled for all the writers, except the streaming big.
That flusher is a bit confusing, as it builds an sstable list first and
only later adds the elements in the list to the sstable set. The
monitors are destroyed at the end of phase 1, so we will SIGSEGV later
when calling add_sstable().

The fix for this is to make sure the lifetime of the monitors is tied
to the lifetime of the sstables being handled by the big streaming
flush process.
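
The lifetime rule can be sketched as follows. All names here (write_monitor as a counting stub, sstable_with_monitor, write_sstables, add_to_set) are hypothetical stand-ins and not the types used in the patch. The point is only that each monitor is owned by the same list entry as its sstable, so it survives until phase 2 has added that sstable to the set:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Stub monitor that counts how many instances are currently alive.
struct write_monitor {
    int* live_count;
    explicit write_monitor(int* counter) : live_count(counter) { ++*live_count; }
    ~write_monitor() { --*live_count; }
    write_monitor(const write_monitor&) = delete;
    write_monitor& operator=(const write_monitor&) = delete;
};

// Each pending sstable owns its monitor, tying the two lifetimes together.
struct sstable_with_monitor {
    int sstable_id;
    std::unique_ptr<write_monitor> monitor;
};

// Phase 1: build the list of written sstables together with their monitors.
std::vector<sstable_with_monitor> write_sstables(int count, int* live) {
    std::vector<sstable_with_monitor> pending;
    for (int id = 0; id < count; ++id) {
        pending.push_back(sstable_with_monitor{
            id, std::unique_ptr<write_monitor>(new write_monitor(live))});
    }
    return pending;
}

// Phase 2: add every sstable to the set. Returns the smallest number of live
// monitors observed while adding. The monitors are destroyed only when
// `pending` itself is destroyed, after the loop has finished.
int add_to_set(std::vector<sstable_with_monitor> pending,
               std::vector<int>& sstable_set, const int* live) {
    int min_live = *live;
    for (const auto& entry : pending) {
        if (*live < min_live) {
            min_live = *live;
        }
        sstable_set.push_back(entry.sstable_id);
    }
    return min_live;
}
```

With three sstables, add_to_set observes three live monitors throughout phase 2, and the count drops to zero only after the call returns.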

Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test

Fixes #3131
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180118202230.17107-1-glauber@scylladb.com>
2018-01-21 14:09:43 +02:00
Amnon Heiman
1715ccf978 Register the API V2 swagger file
This adds a registration of the V2 swagger file.
V2 uses the Swagger 2.0 format, the initial definition is empty and can
be reached at:

http://localhost:10000/v2

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-01-21 14:00:27 +02:00
Amnon Heiman
4ccf76c62b Adding the header part of the swagger2.0 API
In Swagger 2.0 the entire API is exported as a single file.
The header part of the file contains general information. It is stored
as an external file so it will be easy to modify when needed.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-01-21 14:00:27 +02:00
Avi Kivity
c743d1258d Merge "Reverse order of version merging in MVCC" from Tomasz
"Changes merging in MVCC to apply newer version to older instead of older to
newer.

Before (v0 = oldest):

  (((v3 + v2) + v1) + v0)

After:

  (v0 + (v1 + (v2 + v3)))

or:

  (((v0 + v1) + v2) + v3)

There are several reasons to do this:

  1) When continuity merging will change semantics to support eviction
     from older versions, it will be easier to implement apply() if we
     can assume that we merge newer to older instead of older to
     newer, since newer version may have entries falling into a
     continuous interval in older, but not the other way around. If we
     didn't reverse the order, apply() would have to keep track of
     lower bound of a continuous interval in the right-hand side
     argument (older version) as it is applied and update continuity
     flags in the left hand side by scanning all entries overlapping
     with it. If order is reversed, merging only needs to deal with
     the current entry. Also, if we were to keep the old order, we
     cannot simply move entries from the left hand side as we merge
     because we need to keep track of the lower bound of a continuous
     interval, and we need to provide monotonic exception
     guarantees. So merging would be both more complicated and slower.

  2) With large partitions older versions are typically larger than
     newer versions, and since merging is O(N_right*(1 + log(N_left))),
     it's better to merge newer into older.
     This fixes latency spikes seen in perf_cache_eviction.

Fixes #2715."
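
The cost argument in point 2 can be illustrated with a small model. This is a sketch under simplifying assumptions: a version is modelled as a std::map from key to value, apply() is a hypothetical stand-in for the real merge, and continuity and tombstones are ignored. Both merge orders produce the same result, but the number of right-hand-side entries visited differs sharply when the oldest version is the largest:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using version = std::map<int, int>;  // key -> value

// Applies `src` onto `dst` and returns the number of entries visited in
// `src` (N_right). Each visit costs one O(log N_left) lookup in `dst`.
// When `src_is_newer`, values from `src` overwrite those in `dst`;
// otherwise existing (newer) values in `dst` are kept.
std::size_t apply(version& dst, const version& src, bool src_is_newer) {
    std::size_t visited = 0;
    for (const auto& entry : src) {
        ++visited;
        if (src_is_newer) {
            dst[entry.first] = entry.second;
        } else {
            dst.insert(entry);  // no-op if the key already exists
        }
    }
    return visited;
}

struct merge_result {
    version merged;
    std::size_t visited;
};

// Old order, (((v3 + v2) + v1) + v0): start from the newest version and
// apply each older one onto it. `versions[0]` is the oldest.
merge_result merge_older_into_newer(const std::vector<version>& versions) {
    assert(!versions.empty());
    merge_result result{versions.back(), 0};
    for (std::size_t i = versions.size() - 1; i-- > 0;) {
        result.visited += apply(result.merged, versions[i], false);
    }
    return result;
}

// New order, (((v0 + v1) + v2) + v3): start from the oldest version and
// apply each newer one onto it.
merge_result merge_newer_into_older(const std::vector<version>& versions) {
    assert(!versions.empty());
    merge_result result{versions.front(), 0};
    for (std::size_t i = 1; i < versions.size(); ++i) {
        result.visited += apply(result.merged, versions[i], true);
    }
    return result;
}
```

With an oldest version of 1000 entries and two newer versions of one and two entries, the old order visits 1001 right-hand-side entries and the new order visits 3, while the merged contents are identical.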

* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
  mvcc: Reverse order of version merging
  anchorless_list: Introduce last()
  mvcc: Implement partition_entry::upgrade() using squashed()
  mvcc: Extract version merging functions
  mutation_partition: Add rows_entry::set_dummy()
  position_in_partition: Introduce after_key()
2018-01-21 13:56:57 +02:00
José Guilherme Vanz
380bc0aa0d Swap arguments order of mutation constructor
Swap the arguments of the mutation constructor so that it follows the same
convention as the other constructor variants. Refs #3084

Signed-off-by: José Guilherme Vanz <guilherme.sft@gmail.com>
Message-Id: <20180120000154.3823-1-guilherme.sft@gmail.com>
2018-01-21 12:58:42 +02:00
Raphael S. Carvalho
20179c415b service/storage_proxy: dont copy schema to primary_key::less_compare_clustering ctor
schema is expensive to copy, and the copy is made in a possibly hot path.
Found while reading the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180120211217.7273-1-raphaelsc@scylladb.com>
2018-01-20 23:16:15 +02:00
Duarte Nunes
a66c8d7973 row_cache: Don't require external_updater to be copyable
No good reason to copy it around, and even less reason to impose that
constraint on callers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180118181142.15408-1-duarte@scylladb.com>
2018-01-19 13:00:49 +01:00
Tomasz Grabiec
16e06b5b46 Merge "remove ability to create a non-flat mutation reader" from Piotr
* seastar-dev.git haaawk/flat_reader_clean_up_mutation_source_v3:
  test_range_queries: create flat reader from source
  run_sstable_resharding_test: create flat reader from source
  make_sstable_containing: create flat reader from source
  test_cache_delegates_to_underlying_only_once_multiple_mutation: use
    flat reader
  Migrate materalized views to flat_mutation_reader
  test_can_write_and_read_non_compound_range_tombstone_as_compound: use
    flat reader
  test_writing_combined_stream_with_tombstones_at_the_same_position: use
    flat reader
  Add flat_mutation_reader::peek()
  Add flat_mutation_reader_assertions::produces_range_tombstone
  Accept clustering_row_ranges in
    flat_mutation_reader_assertions::produces
  Add flat_mutation_reader_assertions::produces_eos_or_empty_mutation
  Add flat_mutation_reader_assertions::fast_forward_to overload
  test_query_only_static_row: use flat reader
  Move mutation_rebuilder to header
  test_streamed_mutation_forwarding_is_consistent_with_slicing: use flat
    reader
  test_clustering_slices: use flat reader
  test_streamed_mutation_forwarding_guarantees: use flat reader
  test_streamed_mutation_forwarding_across_range_tombstones: use flat
    reader
  test_streamed_mutation_slicing_returns_only_relevant_tombstones: use
    flat reader
  Add flat_mutation_reader_assertions::is_buffer_full
  test_fast_forwarding_across_partitions_to_empty_range: use flat reader
  Remove unused mutation_source::operator()
  mutation_source: rename make_flat_mutation_reader to make_reader
  Clean up imports in tests
2018-01-19 12:43:50 +01:00
Piotr Jastrzebski
eeef0e0f07 Clean up imports in tests
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 09:30:57 +01:00
Piotr Jastrzebski
d266eaa01e mutation_source: rename make_flat_mutation_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 09:30:12 +01:00
Piotr Jastrzebski
380d5c3402 Remove unused mutation_source::operator()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
bf06c78415 test_fast_forwarding_across_partitions_to_empty_range: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
872b1c9122 Add flat_mutation_reader_assertions::is_buffer_full
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
7ad640a64b test_streamed_mutation_slicing_returns_only_relevant_tombstones: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
6bdfe2a870 test_streamed_mutation_forwarding_across_range_tombstones: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
94480d3e05 test_streamed_mutation_forwarding_guarantees: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
873e3014fb test_clustering_slices: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
494fabc925 test_streamed_mutation_forwarding_is_consistent_with_slicing: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
667ce36981 Move mutation_rebuilder to header
It will be used in tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
c7ce24be06 test_query_only_static_row: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
5a5a5149e3 Add flat_mutation_reader_assertions::fast_forward_to overload
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
82bdc54588 Add flat_mutation_reader_assertions::produces_eos_or_empty_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
f0716d34df Accept clustering_row_ranges in flat_mutation_reader_assertions::produces
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
16e2bc8741 Add flat_mutation_reader_assertions::produces_range_tombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:56:37 +01:00
Piotr Jastrzebski
36771c5c2a Add flat_mutation_reader::peek()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 08:55:48 +01:00
Raphael S. Carvalho
f779877f43 tests/sstable_test: fix tests by not triggering compiler bug with c++17
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)

The following code

struct S
{
    S(int i = 42);
};

void f()
{
    S( {} );
}

produces this assembly with g++ --std=c++14

  lea rax, [rbp-1]
  mov esi, 0
  mov rdi, rax
  call S::S(int)

and this one with g++ --std=c++17

  lea rax, [rbp-1]
  mov esi, 42
  mov rdi, rax
  call S::S(int)

For more details about the compiler bug, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83937

NOTE: clang isn't affected by it.

The test relied on braced initialization of compressor (an enum class)
working properly when used as an argument to compression_parameters's
ctor. Braced initialization of an integer-based type should yield zero,
but the default argument (lz4) was used instead, which means compression
was enabled when it shouldn't have been.

The course of action is to work around the bug by explicitly setting the
compressor type to none.
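
A minimal sketch of that workaround, assuming simplified stand-ins for
compressor and compression_parameters (the real Scylla classes have more
members and different signatures; make_uncompressed() is a hypothetical
helper used only for illustration):

```cpp
#include <cassert>

// Simplified stand-ins; the real Scylla types differ.
enum class compressor { none, lz4 };

struct compression_parameters {
    compressor c;
    // As in the test's situation: lz4 is used unless the caller says otherwise.
    compression_parameters(compressor c = compressor::lz4) : c(c) {}
    bool enabled() const { return c != compressor::none; }
};

// Before: the test passed {} and expected it to value-initialize the enum
// to its zero value (none). With the affected gcc in C++17 mode the default
// argument (lz4) was picked up instead.
// After: name the enumerator explicitly, so the result no longer depends on
// how the compiler treats the braced initializer.
inline compression_parameters make_uncompressed() {
    return compression_parameters(compressor::none);
}
```

With the enumerator spelled out, make_uncompressed().enabled() is false on
any conforming or non-conforming compiler alike.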

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180119013655.32564-1-raphaelsc@scylladb.com>
2018-01-19 09:27:39 +02:00
Tomasz Grabiec
60d3c25c02 mvcc: Reverse order of version merging
Change merging to apply newer version to older instead of older to
newer.

Before:

  (((v3 + v2) + v1) + v0)

After:

  (v0 + (v1 + (v2 + v3)))

or equivalent:

  (((v0 + v1) + v2) + v3)

There are several reasons to do this:

  1) When continuity merging will change semantics to support eviction
     from older versions, it will be easier to implement apply() if we
     can assume that we merge newer to older instead of older to
     newer, since newer version may have entries falling into a
     continuous interval in older, but not the other way around. If we
     didn't revert the order, apply() would have to keep track of
     lower bound of a continuous interval in the right-hand side
     argument (older version) as it is applied and update continuity
     flags in the left hand side by scanning all entries overlapping
     with it. If order is reversed, merging only needs to deal with
     the current entry. Also, if we were to keep the old order, we
     cannot simply move entries from the left hand side as we merge
     because we need to keep track of the lower bound of a continuous
     interval, and we need to provide monotonic exception
     guarantees. So merging would be both more complicated and slower.

  2) With large partitions older versions are typically larger than
     newer versions, and since merging is O(N_right*(1 + log(N_left))),
     it's better to merge newer into older.
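
     To put rough numbers on reason 2, here is a small sketch of the cost
     model quoted above. The function name merge_cost and the use of a
     base-2 logarithm are assumptions made for illustration only:

```cpp
#include <cassert>
#include <cmath>

// Cost model from the commit message: merging a right-hand side with
// n_right entries into a left-hand side with n_left entries costs
// O(n_right * (1 + log(n_left))). The log base (2) is an assumption.
inline double merge_cost(double n_left, double n_right) {
    assert(n_left >= 1 && n_right >= 0);
    return n_right * (1.0 + std::log2(n_left));
}
```

     For an older version with 1,000,000 entries and a newer one with 10,
     merge_cost(1e6, 10) (newer into older) comes to roughly 210, while
     merge_cost(10, 1e6) (older into newer) comes to roughly 4.3 million.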

Fixes #2715.
2018-01-18 13:52:08 +01:00
Pekka Enberg
fab73dbdc3 cql3/restrictions: Fix multi_column_restriction::values()
Fix multi_column_restriction::values() similar to
single_column_primary_key_restrictions::values().
2018-01-18 14:38:06 +02:00
Tomasz Grabiec
1292315579 anchorless_list: Introduce last() 2018-01-18 11:32:49 +01:00
Tomasz Grabiec
5331b7b8e2 mvcc: Implement partition_entry::upgrade() using squashed()
To reduce duplication of version merging logic.
2018-01-18 11:32:49 +01:00
Tomasz Grabiec
88aff526df mvcc: Extract version merging functions 2018-01-18 11:32:49 +01:00
Tomasz Grabiec
da0c48a987 mutation_partition: Add rows_entry::set_dummy() 2018-01-18 11:32:49 +01:00
Tomasz Grabiec
bbd9ef6b59 position_in_partition: Introduce after_key() 2018-01-18 11:32:48 +01:00
Pekka Enberg
8b0b9b43b8 cql3/restrictions: Fix single_column_primary_key_restrictions::values()
This patch changes single_column_primary_key_restrictions::values() to
return values obtained via components() instead of the serialized form
that's returned by representation(). We need this to turn clustering key
restriction keys into partition keys for clustering key indexed queries.
2018-01-18 12:14:44 +02:00
Piotr Jastrzebski
0d382e89d7 test_writing_combined_stream_with_tombstones_at_the_same_position: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:59 +01:00
Piotr Jastrzebski
d6aede88d3 test_can_write_and_read_non_compound_range_tombstone_as_compound: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:59 +01:00
Piotr Jastrzebski
4c74b8c7e7 Migrate materialized views to flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:35 +01:00
Piotr Jastrzebski
b99dd17dcd test_cache_delegates_to_underlying_only_once_multiple_mutation: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-17 19:51:03 +01:00
Glauber Costa
378f2ba8e4 mutation_reader_test: adjust sleep time to timeout clock and duration
Raphael recently caught this test failing. I can't really reproduce it,
but it seems to me that it is a timing issue: we execute two different
statements, each one should timeout after 10ms. After 20ms, we make sure
that they both timed out.

They don't (in his system), which is explained by the fact that we are
no longer using high resolution clocks for the timeouts. Expirations for
lowres clocks will only happen every 10ms, and in the worst case we
will miss two.

So the fix I am proposing here is to just account for potential
inaccuracies in the clocks and calculations by waiting a bit longer.

Ideally, we would use the manual clock for this. But in this case, this
would mean adding template parameters to pretty much all of the
mutation_reader path.

Currently, not only did the test fail, it also hit a use-after-free
SIGSEGV. That happens because we give up on the reader while the
timeout is still pending.

It is the caller's responsibility to ensure the lifetime of the reader is
correct. Dealing with that cleanly would require a cancellation mechanism
that we don't have, so we'll just add an assertion that will fail more
gracefully than the SIGSEGV.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-17 17:17:40 +01:00
Glauber Costa
01274774c3 mutation_reader_test: propagate timeouts to fast_forward_to
We are not propagating timeouts to fast_forward_to in the
mutation_reader_test. This is not currently causing any issue, but I
noticed it while chasing one - so let's fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-17 17:17:40 +01:00
Tomasz Grabiec
ab6ec571cb test.py: set BOOST_TEST_CATCH_SYSTEM_ERRORS=no
This will make boost UTF abort execution on SIGABRT rather than trying
to continue running other test cases. Continuing doesn't work well with
the seastar integration; the suite hangs.
Message-Id: <1516205469-16378-1-git-send-email-tgrabiec@scylladb.com>
2018-01-17 16:15:27 +00:00
Vladimir Krivopalov
73b6e9fbb1 main: Fix warnings when running "scylla --version"
Print Scylla version, if requested, before running Seastar application.

Fixes #3124

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <bbd0f303f612327446ce1f10ebd17ebed8d76048.1516144651.git.vladimir@scylladb.com>
2018-01-17 16:56:10 +02:00
Takuya ASADA
f3c8574135 dist/debian: follow gcc-7.2 package naming changes on 3rdparty repo for Debian 9
Switch to renamed gcc-7.2 package on Debian 9, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516191853-2562-1-git-send-email-syuu@scylladb.com>
2018-01-17 14:38:41 +02:00
Takuya ASADA
15e266eea4 dist/debian: fix package name typo on Debian 8
Correct package name is scylla-gcc72-g++-7, not scylla-g++-7.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516189354-5880-1-git-send-email-syuu@scylladb.com>
2018-01-17 13:45:24 +02:00
Duarte Nunes
dc74ba21ab tests/sstable_utils: Inline make_local_key()
Or the compiler complains about it not being used in some units where
the header is included.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180116235557.96046-1-duarte@scylladb.com>
2018-01-17 12:17:17 +01:00
Avi Kivity
6d7d02315e dist/redhat: support nowait aio even on old distributions
Since we sometimes recommend that the user update to a newer kernel,
it's good to compile support for features that the new kernel supports.
Rather than play games with build-time dependencies, just #define
those features in. It's ugly, but better than depending on third-party
repositories and handling package conflicts.
Message-Id: <20180115143129.22190-1-avi@scylladb.com>
2018-01-17 12:13:44 +01:00
Paweł Dziepak
5efa713344 Merge "revive the round-robin load balancing #2" from Vlad
"The previous series handled a passing of the copy of the client_state from process_request(...)
to the process_request_one(...). However the modified copy of the client_state is returned by the
process_request_one(...) back to the process_request(...) and handling of this direction was missing
in the previous series.

This series completes the #2351 fix."

* 'fix-round-robin-cont-v2' of https://github.com/vladzcloudius/scylla:
  transport::cql_server::process_request_one: return only the required information instead of the whole client_state object
  service::client_state: move auth_state from cql_server::connection to service::client_state
  transport::cql_server: don't cache sasl_challenge object in the cql_server::connection
  service::client_state::merge(): remove not needed timestamp merge
2018-01-16 16:56:05 +00:00
Avi Kivity
4ad212dc01 Merge "Fix memory leak on zone reclaim" from Tomek
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120"

* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
  tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
  lsa: Expose max_zone_segments for tests
  lsa: Expose tracker::non_lsa_used_space()
  lsa: Fix memory leak on zone reclaim
2018-01-16 15:54:03 +02:00
Duarte Nunes
176fefdebc tests/sstable_utils: Don't assume seastar test context
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180116131722.86230-1-duarte@scylladb.com>
2018-01-16 15:42:33 +02:00
Tomasz Grabiec
f20958ae3d tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
Reproducer for https://github.com/scylladb/scylla/issues/3129
2018-01-16 13:17:20 +01:00
Tomasz Grabiec
5c85e9c2db lsa: Expose max_zone_segments for tests 2018-01-16 13:17:20 +01:00
Tomasz Grabiec
99708cc498 lsa: Expose tracker::non_lsa_used_space()
So that it can be used in unit tests.
2018-01-16 13:17:20 +01:00
Tomasz Grabiec
e5f8176c32 lsa: Fix memory leak on zone reclaim
_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120
2018-01-16 13:17:11 +01:00
Takuya ASADA
912a14eb9b dist/debian: follow renaming of gcc-7.2 packages on Ubuntu 14.04/16.04
Now we applied our scylla-$(pkg)$(ver) style package naming on gcc-7.2,
so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516103292-26942-1-git-send-email-syuu@scylladb.com>
2018-01-16 13:52:05 +02:00
Tomasz Grabiec
b5d5bf5bc4 database: Invalidate only affected ranges from flush_streaming_mutations()
Invalidating whole range causes larger latency spikes.

Regression from 2.0 introduced in d22fdf4261.

Refs #3119

Tests: units (release)

Message-Id: <1516046938-26855-1-git-send-email-tgrabiec@scylladb.com>
2018-01-16 11:17:57 +02:00
Asias He
5107b6ad16 storage_service: Do not wait for restore_replica_count in handle_state_removing
The call chain is:

storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()

Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.

In fact, there is no need to wait for restore_replica_count to complete
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.

Tested with update_cluster_layout_tests.py

Fixes #2886

Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
2018-01-16 11:01:31 +02:00
Avi Kivity
0cd656ec68 Revert "Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported."
This reverts commit ef3324129a. It breaks cqlsh, and
further was sneaked into mainline in an unrelated patchset rather than merged
on its own.
2018-01-16 10:58:08 +02:00
Piotr Jastrzebski
767e105b24 make_sstable_containing: create flat reader from source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-16 09:33:05 +01:00
Piotr Jastrzebski
a64aa3fae3 run_sstable_resharding_test: create flat reader from source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-16 09:32:35 +01:00
Piotr Jastrzebski
2e9f03099c test_range_queries: create flat reader from source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-16 09:32:35 +01:00
Asias He
3c8ed255ac storage_service: Set NORMAL status after token_metadata is replicated
Commit 2d5fb9d109 (gms/gossiper: Replicate changes incrementally to
other shards) changed the way we replicate _token_metadata and
endpoint_state_map. Before that commit they were replicated at the same
time; after it they no longer are. As a result, a shard in NORMAL status
can still have an empty _token_metadata.

We saw errors:

   [shard 12] token_metadata - sorted_tokens is empty in first_token_index!

during CorruptThenRepairNemesis.

Fix by setting the gossip status to NORMAL after replication of
_token_metadata, so that once a node is in NORMAL, we can do repair. The
commit 69c81bcc87 (repair: Do not allow repair until node is in NORMAL
status) prevents the early repair operation by checking if a node is in
NORMAL status.

Fixes #3121

Message-Id: <af6a223733d2e11351f1fa35f59eacfa7d65dd30.1516065564.git.asias@scylladb.com>
2018-01-16 09:41:22 +02:00
Raphael S. Carvalho
2b0b703615 tests: sstable_mutation_test: fix sstable write in tests due to use of non-local keys
That's required from fa5a26f12d on, because an sstable write fails when the
sharding metadata is empty due to a lack of keys that belong to the current shard.

make_local_key* were moved to header to avoid compiling sstable_utils.cc into
all those tests that rely on simple_schema.hh, which is a lot.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180116052052.7819-1-raphaelsc@scylladb.com>
2018-01-16 09:28:12 +02:00
Vlad Zolotarov
d06b577b86 transport::cql_server::process_request_one: return only the required information instead of the whole client_state object
client_state used in the process_request_one(...) contains all sorts of information irrelevant
to the caller (process_request(...)), e.g. Tracing state. Therefore instead of returning
the whole client_state object (which becomes even a bigger problem if process_one(...) and process_request_one(...)
are executed on different shards) we will return only the pieces of information we really need.

To do that we introduce a new class - processing_result, which is cross-shard-access-ready to begin with.
We are going to return an instance of this new class from the process_request_one(...).

Fixes #2351

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 13:09:57 -05:00
Vlad Zolotarov
6cba14c272 service::client_state: move auth_state from cql_server::connection to service::client_state
Move the requests-handling-related state into the client_state. This is needed to properly
define the interface between the process_request(...) and process_request_one(...).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 13:09:56 -05:00
Vlad Zolotarov
c2509d290a transport::cql_server: don't cache sasl_challenge object in the cql_server::connection
The benefit of such caching is rather limited because the object is likely to be used exactly once
and then destroyed anyway (in case of a successful authentication).
If the authentication has failed no harm is going to be done if we create this object again when
needed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 13:09:49 -05:00
Vlad Zolotarov
88932cbcf0 service::client_state::merge(): remove not needed timestamp merge
Since the connection::_client_state is the only generator of new timestamps
now there is no need for this merge.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-15 12:54:20 -05:00
Avi Kivity
93076d25b6 Merge "mutation_source: remove support for creation with mutation_reader" from Piotr
"After this patchset it's only possible to create a mutation_source with a function that produces flat_mutation_reader."

* 'haaawk/mutation_source_v1' of ssh://github.com/scylladb/seastar-dev:
  Merge flat_mutation_reader_mutation_source into mutation_source
  Remove unused mutation_reader_mutation_source
  Remove unused mutation_source constructor.
  Migrate make_source to flat reader
  Migrate run_conversion_to_mutation_reader_tests to flat reader
  flat_mutation_reader_from_mutations: add support for slicing
  Remove unused mutation_source constructor.
  Migrate partition_counting_reader to flat reader
  Migrate throttled_mutation_source to flat reader
  Extract delegating_reader from make_delegating_reader
  row_cache_test: call row_cache::make_flat_reader in mutation_sources
  Remove unused friend declaration in flat_mutation_reader::impl
  Migrate make_source_with to flat reader
  Migrate make_empty_mutation_source to flat reader
  Remove unused mutation_source constructor
  Migrate test_multi_range_reader to flat reader
  Remove unused mutation_source constructors
2018-01-15 18:15:53 +02:00
Paweł Dziepak
f6434c9941 tests/perf: add microbenchmarks for the combined reader
Message-Id: <20180111120153.3911-1-pdziepak@scylladb.com>
2018-01-15 17:49:47 +02:00
Avi Kivity
3e0e4a9b56 Merge seastar upstream
* seastar a7a3e6f...d03896d (11):
  > Update dpdk submodule
  > Merge "C++17 aligned allocations" from Avi
  > Prometheus should check that the iterator is valid before using it
  > future-util: failure to allocate internal state is unrecoverable
  > Merge "Introduce simple microbenchmarking framework" from Paweł
  > tutorial: document debuging ignored exceptions
  > Revert "Merge "Introduce simple microbenchmarking framework" from Paweł"
  > Merge "Introduce simple microbenchmarking framework" from Paweł
  > tests/futures: add more tests for parallel_for_each()
  > Add a prometheus.md file
  > prometheus: Support metric family name parameter
2018-01-15 16:16:08 +02:00
Duarte Nunes
83e983d4d0 mutation_partition: Remove unused operator==()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013546.67260-1-duarte@scylladb.com>
2018-01-15 11:16:35 +02:00
Duarte Nunes
9d1d9883ff mutation_partition: Remove unused for_each_cell() overload
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013618.67351-1-duarte@scylladb.com>
2018-01-15 11:16:34 +02:00
Duarte Nunes
b607662d2e collection_type_impl: Make for_each_cell static
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013532.67200-1-duarte@scylladb.com>
2018-01-15 11:16:33 +02:00
Avi Kivity
fe788e0a5d mutation_reader: adjust FragmentProducer concept for timeout
forward_to() now accepts a timeout parameter, and the concept should
reflect it, or it breaks the build when concepts are enabled.
2018-01-14 18:09:37 +02:00
Avi Kivity
90dc409c83 Merge "Support for MIN/MAX aggregation functions over date-types" from Dan
"Added support for min/max functions over date/timestamp/timeuuid.

There was one issue with Scylla's type system internals: no C++ type
was mapped to these types. So special "native_types" were added for them.
It required some changes to native functions because these types don't support
the same operations as their real native counterparts.

Fixes #3104."

* 'danfiala/3104-v1' of https://github.com/hagrid-the-developer/scylla:
  tests: Tests for min/max aggregate functions over date/timestamp and timeuuid.
  functions: Added min/max functions for date/timestamp/timeuuid.
  types: Added native types for timestamp and timeuuid.
  Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported.
2018-01-14 17:26:27 +02:00
Daniel Fiala
1d0d419693 tests: Tests for min/max aggregate functions over date/timestamp and timeuuid.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-14 13:17:09 +01:00
Daniel Fiala
5bad03b5a6 functions: Added min/max functions for date/timestamp/timeuuid.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-14 13:13:36 +01:00
Daniel Fiala
0d71194da6 types: Added native types for timestamp and timeuuid.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-14 13:11:36 +01:00
Mika Eloranta' via ScyllaDB development
bc1248e62a build: rpm build script --xtrace option
Enables bash "set -o xtrace" printing of full executed command lines for
debugging purposes.

Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212944.86008-1-mel@aiven.io>
2018-01-14 12:32:32 +02:00
Mika Eloranta' via ScyllaDB development
7266446227 build: fix rpm build script --jobs N handling
Fixes argument misquoting at $SRPM_OPTS expansion for the mock commands
and makes the --jobs argument work as supposed.

Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
2018-01-14 12:30:19 +02:00
Raphael S. Carvalho
fd2b4a7eb3 mutation_reader_test: remove schema left over from dummy selector
it now lives in base class, and this one is useless.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180114032943.28228-1-raphaelsc@scylladb.com>
2018-01-14 10:59:48 +02:00
Raphael S. Carvalho
16f8150916 tests: mutation_reader_test: Fix test_combined_reader_slicing_with_overlapping_range_tombstones
Test fails after fa5a26f12d because generated sstable doesn't contain data for the
shard it was created at, so sharding metadata is empty, resulting in exception
added in the aforementioned commit. That's fixed by using the new make_local_key()
to generate data that belongs to current shard.

make_local_keys(), on top of which make_local_key() is built, will be useful
for making the sstable tests work again with any smp count.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180114032025.26739-1-raphaelsc@scylladb.com>
2018-01-14 10:59:29 +02:00
Tomasz Grabiec
9c391970b8 Merge 'per-request timeouts' from Glauber
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
want no timeout there at all.

We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.

This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.

In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.

We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.

After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.

Fixes #2462

* git@github.com:glommer/scylla.git timeouts-v8.1:
  database: delete unused function
  consolidate timeout_clock
  mutation_query: add a timeout to the mutation query path
  flat_mutation_reader: pass timeout down to consume()
  add a timeout to fill_buffer
  add a timeout to fast forward to
  restricted_mutation_reader: don't pass timeouts through the config
    structure
  allow request-specific read timeouts in storage proxy reads
2018-01-12 17:06:27 +01:00
Glauber Costa
08a0c3714c allow request-specific read timeouts in storage proxy reads
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
want no timeout there at all.

We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.

This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.

In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.

We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.

After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.

Fixes #2462

Implementation notes, and general comments about open discussion in 2462:

* Because we are not bypassing the timeout, just setting it high enough,
  I consider the concerns about the batchlog moot: if we fail for any
  other reason that will be propagated. Last case, because the timeout
  is per-CF, we could do what we do for the dirty memory manager and
  move the batchlog alone to use a different timeout setting.

* Storage proxy likes specifying its timeouts as a time_point, whereas
  when we get low enough as to deal with the read_concurrency_config,
  we are talking about deltas. So at some point we need to convert time_points
  to durations. We do that in the database query functions.
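
  A minimal sketch of such a conversion, assuming std::chrono::steady_clock
  as a stand-in for the timeout clock and a hypothetical helper named
  remaining(); the actual code in the database query functions may look
  quite different:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Stand-ins for illustration only; the real clock type lives in Scylla.
using timeout_clock = std::chrono::steady_clock;
constexpr timeout_clock::time_point no_timeout = timeout_clock::time_point::max();

// Convert an absolute deadline into the time left as seen from `now`.
// Returns std::nullopt when there is no timeout; a deadline that has
// already passed yields a zero duration rather than a negative one.
inline std::optional<timeout_clock::duration>
remaining(timeout_clock::time_point deadline, timeout_clock::time_point now) {
    if (deadline == no_timeout) {
        return std::nullopt;
    }
    if (deadline <= now) {
        return timeout_clock::duration::zero();
    }
    return deadline - now;
}
```

  Checking for no_timeout first also avoids subtracting from the maximum
  time_point, which is only a sentinel and not a meaningful deadline.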

v2:
- use per-request instead of per-table timeouts.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
3c9eeea4cf restricted_mutation_reader: don't pass timeouts through the config structure
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-request timeouts.

The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove it.

The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
5140aaea00 add a timeout to fast forward to
In the last patch, as part of the work to enable per-request timeouts,
we enabled timeouts in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.

In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:19 -05:00
Glauber Costa
d965af42b0 add a timeout to fill_buffer
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.

The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.
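
The pattern can be sketched as follows. This is a simplified illustration,
not the actual flat_mutation_reader: the class names reader and
recording_impl are hypothetical, std::chrono::steady_clock stands in for
the timeout clock, and fill_buffer() returns void here instead of a future:

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <utility>

using timeout_clock = std::chrono::steady_clock; // stand-in clock
constexpr timeout_clock::time_point no_timeout = timeout_clock::time_point::max();

class reader {
public:
    class impl {
    public:
        virtual ~impl() = default;
        // No default argument: every implementation has to take the
        // timeout explicitly, so a forgotten one fails to compile.
        virtual void fill_buffer(timeout_clock::time_point timeout) = 0;
    };

    explicit reader(std::unique_ptr<impl> i) : _impl(std::move(i)) {
        assert(_impl);
    }

    // Optional at the top level: existing callers keep compiling and
    // get "no timeout" by default.
    void fill_buffer(timeout_clock::time_point timeout = no_timeout) {
        _impl->fill_buffer(timeout);
    }

private:
    std::unique_ptr<impl> _impl;
};

// Toy implementation that only records the timeout it was handed.
class recording_impl : public reader::impl {
public:
    explicit recording_impl(timeout_clock::time_point& seen) : _seen(seen) {}
    void fill_buffer(timeout_clock::time_point timeout) override { _seen = timeout; }
private:
    timeout_clock::time_point& _seen;
};
```

Calling fill_buffer() with no argument on such a reader hands no_timeout to
the implementation, while an explicit time_point is passed through as is.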

At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
54d3ebde4e flat_mutation_reader: pass timeout down to consume()
We pass the timeout that we received from data_query/mutation_query
down to consume, which is responsible for actually reading the data.

To make those timeouts actionable, though, we'll have to patch
fill_buffer(). This will happen in the next patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
8433702c90 mutation_query: add a timeout to the mutation query path
data_query and mutation_query are patched so that they start accepting a
per-query timeout. We will default to no timeout, and then no callers
will be changed yet.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
80c4a211d8 consolidate timeout_clock
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.

As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution--are also moved to the common header.

In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
40c428dc19 database: delete unused function
no in-tree users.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Paweł Dziepak
bd6fa8b331 configure.py: add dependency seastar/configure.py
Scylla's configure.py calls seastar/configure.py and uses seastar.pc
that it produces to generate Scylla's build.ninja. However, there is no
appropriate dependency in build.ninja and changes to
seastar/configure.py alone do not trigger regeneration of Scylla's
build.ninja. This patch remedies that problem.

Message-Id: <20180111144237.5259-1-pdziepak@scylladb.com>
2018-01-11 16:48:06 +02:00
Takuya ASADA
b68ee98310 dist/debian: make pbuilder works on Debian 9
On Debian 9, 'pbuilder create' fails because the GPG key for the
3rdparty repo is missing, so we need --allow-untrusted on 'pbuilder create'
and 'pbuilder update'.

Also, apt-key adv --fetch-keys does not work correctly on it, but we can use
"curl <URL> | apt-key add -" as a workaround.

Fixes #3088

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com>
2018-01-11 15:02:05 +02:00
Takuya ASADA
420b61b466 dist/debian: follow renaming of gcc-7.2 packages on Debian 8
We now apply our scylla-$(pkg)$(ver) style package naming to gcc-7.2,
so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1515522920-8266-1-git-send-email-syuu@scylladb.com>
2018-01-11 15:02:04 +02:00
Duarte Nunes
cbbdfde979 sstables/compaction_backlog_tracker: Constify backlog()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-1-duarte@scylladb.com>
2018-01-11 13:20:57 +02:00
Duarte Nunes
43ad5bd182 sstables/compaction_backlog_manager: Fix user-after-free
If the compaction_backlog_manager's lifetime ends before the linked
compaction_backlog_tracker's, the latter's _manager pointer is not
cleared, which can lead to a use-after-free error when running
~compaction_backlog_tracker(), as evidenced by failing unit tests.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-2-duarte@scylladb.com>
2018-01-11 13:20:55 +02:00
Amnon Heiman
372b02676a register the cache API before gossip settle
cache service API does not need to wait for the gossip to settle.

Fixes: #2075

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180103094757.13270-1-amnon@scylladb.com>
2018-01-11 10:27:52 +01:00
Paweł Dziepak
b4a4c04bab combined_reader: optimise for disjoint partition streams
The legacy mutation_reader/streamed_mutation design made it very easy
to skip the partition merging logic if only one underlying reader had
emitted a given partition.

That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.

The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".

perf_simple_query -c4 read results (medians of 60):

original regression
             before 8731c1     after 8731c1   diff
 read            326241.02        300244.09  -8.0%

this patch
                    before            after  diff
 read            313882.59        325148.05  3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>
2018-01-11 10:21:17 +01:00
Duarte Nunes
891c22904b partition_snapshot_reader: Don't push empty static rows
This patch fixes a regression introduced in 259f6759b4, which pushed
static row fragments regardless of them being empty.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180110222936.23085-1-duarte@scylladb.com>
2018-01-11 10:05:51 +01:00
Pekka Enberg
92b2e56211 Merge "Revive round-robin coordinator load balancing" from Vlad
"This series revives the round-robin load balancing added by Pekka back in 2015.

 If somebody tries to enable it with the current master it would quite quickly
 lead to a crash due to a few unresolved issues in the corresponding code.

 Fixes #2351
 Fixes #3118"

* 'fix-round-robin-balancing-v2' of github.com:vladzcloudius/scylla:
  transport::server::process_request(): avoid extra copy of the client_state
  service::cql_server::connection::process_request: use client_state "request copy" constructor
  service::client_state: introduce "request copy" copy-constructor
  service::storage_service: add the get_local_auth_service() accessor
  service::client_state: remove the unused _tracing_session_id field
2018-01-11 09:02:13 +02:00
Daniel Fiala
ef3324129a Advertise compatibility with CQL Version 3.3.2, since CAST functions are supported.
Fixes #3103.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-01-10 15:01:22 +01:00
Avi Kivity
56801d1b8c Update scylla-ami submodule
* dist/ami/files/scylla-ami 3366c93...3aa87a7 (1):
  > Move to kernel-ml kernel stream
2018-01-10 11:58:27 +02:00
Vlad Zolotarov
26a9aa5157 transport::server::process_request(): avoid extra copy of the client_state
Don't use submit_to(...) when we are going to handle the request on the local
shard. Otherwise there is an unneeded copy of the _client_state in the submit_to(...)
lambda capture list.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-09 14:00:04 -05:00
Vlad Zolotarov
0b88c52639 service::cql_server::connection::process_request: use client_state "request copy" constructor
Create a cross-shard copy of the client_state object, pass it to the function that handles
a single request, and give it a timestamp generated by the original client_state instance
(which is promised to be monotonic).

Fixes #3118

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-09 14:00:04 -05:00
Vlad Zolotarov
430d172040 service::client_state: introduce "request copy" copy-constructor
A new constructor creates a copy of the current client_state to be
used in the context of the handling of a single request.

The copy may take place at a shard different from the one where the
request has been received.

In order to ensure the monotonicity of the timestamps used by the requests handled
on the same connection, the created copy of the client_state uses the timestamp
provided by the caller instead of generating its own.

It's the caller's responsibility to ensure the monotonicity of given timestamps.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-09 14:00:03 -05:00
Duarte Nunes
c142b6d0ee atomic_cell: Remove revert flag
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109184420.7556-1-duarte@scylladb.com>
2018-01-09 19:54:51 +01:00
Duarte Nunes
259f6759b4 partition_snapshot_reader: Use static_row() to read static_row
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109162815.5811-2-duarte@scylladb.com>
2018-01-09 19:17:02 +01:00
Duarte Nunes
16c975edcc partition_version: Return static_row fragment from static_row()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109162815.5811-1-duarte@scylladb.com>
2018-01-09 19:17:02 +01:00
Avi Kivity
7e898d2745 Merge seastar upstream
* seastar 6972a1e...a7a3e6f (1):
  > Update dpdk submodule
2018-01-09 18:17:33 +02:00
Tomasz Grabiec
5a32cf9008 tests: Make bad_alloc from test_concurrent_reads_and_eviction less likely
With -m1G, the test failed sporadically, because too many large
mutations were accumulated in memory. Avoid this by limiting the backlog.

Message-Id: <1515486430-4778-1-git-send-email-tgrabiec@scylladb.com>
2018-01-09 13:52:38 +02:00
Tomasz Grabiec
40ea74a934 tests: Drop unconditional mutation printing from assertions
sprint() may need to allocate a significant amount of memory if the
mutation is large, which can cause bad_alloc in
row_cache_test::test_concurrent_reads_and_eviction.

Message-Id: <1515486454-4913-1-git-send-email-tgrabiec@scylladb.com>
2018-01-09 13:52:19 +02:00
Avi Kivity
d340a03e81 Merge seastar upstream
* seastar b0f5591...6972a1e (8):
  > Merge NOWAIT AIO from Avi
  > configure: Allow overriding protoc compiler path
  > Tutorial: fix default of --reserve-memory
  > future-util: optimise parallel_for_each()
  > future-utils: avoid defining a template with its default template parameter
  > fix socket_address output stream operator
  > test: fix spelling of "abort_source_test"
  > Make dependencies and doc more arch-friendly
2018-01-09 12:33:40 +02:00
Piotr Jastrzebski
3bddf3415f flat_mutation_reader: Add test for make_forwardable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <71c8b195e25c3c5c5b97f12e2d7b2f011c0d3162.1515490058.git.piotr@scylladb.com>
2018-01-09 10:46:04 +01:00
Piotr Jastrzebski
945f45f490 Fix fast_forward_to(partition_range&) in forwardable flat reader.
Making sure fast_forward_to(const partition_range&) sets _current
correctly.

Fixes #3089

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6c29cf273f191da0e21035bcbe1592042ecffc70.1515490058.git.piotr@scylladb.com>
2018-01-09 10:46:04 +01:00
Asias He
774307b3a7 streaming: Do send failed message for uninitialized session
The uninitialized session has no peer associated with it yet. There is
no point in sending the failed message when aborting the session. Sending
the failed message in this case will send to a peer with an uninitialized
dst_cpu_id, which will cause the receiver to pass a bogus shard id to
smp::submit_to, which causes a segfault.

In addition, to be safe, initialize the dst_cpu_id to zero, so that an
uninitialized session will send the message to shard zero instead of a
random bogus shard id.

Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test

Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
2018-01-08 15:04:06 +02:00
Raphael S. Carvalho
4610e994e1 sstables: cure our blindness on sstable read failure
After 611774b, we're blind again as to which sstable caused a compaction
to fail, leaving us with a cryptic message such as the following:
compaction_manager - compaction failed: std::runtime_error (compressed
chunk failed checksum)

After this change, a read failure in either a compaction or a regular
read will report the guilty sstable, for example:
compaction_manager - compaction failed: std::runtime_error (SSTable reader
found an exception when reading sstable ./data/.../keyspace1-standard1
ka-1-Data.db : std::runtime_error(compressed chunk failed checksum))

Fixes #3006.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>
2018-01-08 13:43:13 +02:00
Avi Kivity
72c673fcc3 Merge "I/O Controller for memtables and compactions" from Glauber
"This patchset implements the compaction controller for I/O shares. The
goal is to automatically adjust compaction shares based on a
strategy-specific backlog. A higher backlog will translate into higher
shares.

As compaction progresses, that reduces the backlog. As new data is
flushed, that increases the backlog. The goal of the controller is to
keep the backlog constant at a certain rate, so that we go neither
too fast nor too slow.

Tracking reads and writes:
==========================

Tracking of reads and writes happens through the read_monitor and the
write_monitor. The write monitor is an existing interface that has the
purpose of releasing the write permit at particular points of the write
process. We enhance it so as to get a reference to an instance that tracks
the current offset inside the sstables::file_writer. This way the
backlog tracker can always know for sure what's the offset of the
current write.

A similar thing is done for reads. The data_consumer already tracks the
position of the current read, and we isolate that into a structure to
which we can get a reference. A read_monitor allows us to connect the
compaction to that reference.

Lifetime management:
====================

In general, tracking objects will be owned by their callers and passed
down as references. The compaction object will own the read monitors and
the compaction write monitors, and the memtable flush write monitor will
be kept alive in a do_with block around the flush itself.

The backlog_{write,read}_progress_manager needs to be kept alive until
the SSTable is no longer in progress. For writes, that means until we
are able to add the SSTable charges in full, and for reads (compaction)
that means until we are able to remove the charges in full.

It is important to do that to avoid spikes in the graph. If we remove
the progress managers in a different operation than updating the SSTable
list we will be left in a temporary state where charges appear or
disappear abruptly, to be fixed when the final
add_sstable/remove_sstable happens. So we want those things to happen
together.

The compaction_backlog_tracker is kept alive until the strategy changes,
for example, through ALTER TABLE. Current charges are transferred to the
new strategy's compaction_backlog_tracker object when we do that. If the
type of strategy changes, the current read charges are forgotten. We can
do that because those running compaction will not really contribute to
decrease the backlog of the new compaction strategy.

Transfer of Charges
===================

When ALTER TABLE happens, we need to transfer ongoing writes to the new
backlog manager. Ongoing reads will still be tracked by the
backlog_manager that originated them.

The rationale for that is that reads still belong to the current
compaction, with the strategy that generated them. But new Tables being
written will add to the backlog of the new strategy.

Note that ALTER TABLE operations do not necessarily cause a change of
Strategy. We can be using the same strategy but just changing
properties. If that is the case, we expect no discontinuity in the
backlog graph (tested).

Resharding
==========

Resharding compactions are more complex than normal compactions because
the SSTables are created in one shard and later sent to another shard.
It is better, then, to track resharding compactions separately and let
them have their own backlog tracker, which will insert backlog in
proportion to the amount of data to be resharded.

Memtable Flush I/O Controller
=============================

With the current infrastructure it becomes trivial to add a new
controller, for either I/O or CPU. This patchset then adds an I/O
controller for memtable flushes, using the same backlog algorithm that
we already used for CPU."

* 'compaction-controller-io-v5' of github.com:glommer/scylla:
  database: add a controller for I/O on memtable flushes.
  document the compaction controller
  compaction: adjust shares for compactions
  backlog_controllers: implement generic I/O controller
  factor out some of the controller code
  io shares: multiply all shares by 10
  compaction_strategy: implement backlog manager for the SizeTiered strategy
  infrastructure for backlog estimator for compaction work.
  sstables: notify about end of data component write
  sstables: add read_monitor_generator
  sstables: add read_monitor
  sstables: enhance data consumer with a position tracker
  sstables: enhance the file_writer with an offset tracker
  sstables: pass references instead of pointers for write_monitor
  compaction: control destruction of readers
2018-01-07 15:00:10 +02:00
Avi Kivity
375ed938b4 Merge "Fix potential infinite recursion in leveled compaction" from Raphael
'"The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when the interval map of a partitioned sstable set
stores intervals such as the following:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When the selector is called for the first interval above, the exclusive
lower bound of the second interval is returned as the next token, but the
inclusiveness info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made."

Fixes #2908.'

* 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla:
  tests: test for infinite recursion bug when doing high-level compaction
  Fix potential infinite recursion when combining mutations for leveled compaction
  dht: make it easier to create ring_position_view from token
  dht: introduce is_min/max for ring_position
2018-01-07 13:22:17 +02:00
Vlad Zolotarov
f0d5619634 service::storage_service: add the get_local_auth_service() accessor
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-05 18:00:11 -05:00
Vlad Zolotarov
1d978b9caa service::client_state: remove the unused _tracing_session_id field
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-01-05 18:00:11 -05:00
Asias He
34f6218dc5 gossip: Show correct nodetool status against the shutdown node itself
If a node shuts itself down due to an I/O error (such as ENOSPC), then
nodetool status will show the cluster status at the time the shutdown
occurred.

In fact the node will be in shutdown status (nodetool gossipinfo shows
the correct status), however, `nodetool status` does not interpret the
shutdown status; instead it uses the output of:

curl -X GET --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/endpoint/live"

to decide if a node is in UN status.

To fix, do not include the node itself in the output of get_live_members.

Without this patch, when a node is shutdown due to I/O error:
UN  127.0.0.1  296.2 MB   256          ?  056ff68e-615c-4412-8d35-a4626569b9fd  rack1

With this patch, when a node is shutdown due to I/O error:
?N  127.0.0.1  296.2 MB   256          ?  056ff68e-615c-4412-8d35-a4626569b9fd  rack1

Fixes #1629
Message-Id: <039196a478b5b1a8749b3fdaf7e16cfe2eb73a2f.1498528642.git.asias@scylladb.com>
2018-01-04 08:31:01 +02:00
Glauber Costa
4f1b875784 database: add a controller for I/O on memtable flushes.
The algorithm and principle of operation is the same as the CPU
controller. It is, however, always enabled and we will operate on
I/O shares.

I/O-bound workloads are expected to hit the maximum once virtual
dirty fills up and stay there while the load is steady.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
da792641c6 document the compaction controller
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
244c564aac compaction: adjust shares for compactions
Compactions can be a heavy disk user and the I/O scheduler can always
guarantee that they use their fair share of disk.

Such a fair share can, however, be a lot more than what compaction
actually needs. This patch draws on the controller infrastructure to
adjust the I/O shares that the compaction class will get so that
compaction bandwidth is dynamically adjusted.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
4b44a22236 backlog_controllers: implement generic I/O controller
Like the CPU controller, but will act on I/O priorities.
Shares can go from 0 to 1000.
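
As a rough illustration only, the mapping from a backlog value to I/O
shares in the 0 to 1000 range could look like the sketch below. The
function name, the control points and the piecewise-linear
interpolation are assumptions made for this example; the message above
only states the share range and that the controller mirrors the CPU one.

```python
def shares_for_backlog(backlog, control_points):
    """Map a backlog value to shares by linear interpolation.

    control_points is a list of (backlog, shares) pairs sorted by backlog.
    The points themselves are hypothetical tuning values.
    """
    first_backlog, first_shares = control_points[0]
    if backlog <= first_backlog:
        return first_shares
    for (b0, s0), (b1, s1) in zip(control_points, control_points[1:]):
        if backlog <= b1:
            # Interpolate between the two surrounding control points.
            return s0 + (backlog - b0) * (s1 - s0) / (b1 - b0)
    # Past the last control point: stay at the maximum shares.
    return control_points[-1][1]


# Hypothetical control points: shares saturate at 1000.
POINTS = [(0.0, 0.0), (1.0, 200.0), (2.0, 1000.0)]
```

With these points a higher backlog always yields equal or higher
shares, and anything beyond the last point is clamped to 1000.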

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:56:54 -05:00
Glauber Costa
1671d9c433 factor out some of the controller code
The control algorithm we are using for memtables has proven itself
quite successful. We will very likely use the same for other processes,
like compactions.

Make the code a bit more generic, so that a new controller only has to
set the desired parameters.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:56:54 -05:00
Raphael S. Carvalho
e641c0d333 tests: test for infinite recursion bug when doing high-level compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 16:23:02 -02:00
Raphael S. Carvalho
818830715f Fix potential infinite recursion when combining mutations for leveled compaction
The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when the interval map of a partitioned sstable set
stores intervals such as the following:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When the selector is called for the first interval above, the exclusive
lower bound of the second interval is returned as the next token, but the
inclusiveness info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made.

The fix is to use ring_position in reader_selector, so that
inclusiveness is respected.
So reader_selector::has_new_readers() won't return a false positive
under the conditions described above.
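
The difference between comparing bare tokens and comparing positions
that carry inclusiveness can be sketched as follows. This is only an
illustration: the Bound class and both functions are hypothetical and
do not reproduce the real ring_position or reader_selector code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """Hypothetical lower bound of the next interval in the sstable set."""
    token: int
    inclusive: bool


def has_new_readers_token_only(next_token, current_token):
    # Old behaviour in this sketch: the bound is always treated as inclusive.
    return current_token >= next_token


def has_new_readers(next_bound, current_token):
    # Behaviour after the fix in this sketch: an exclusive bound only
    # matches tokens strictly greater than it.
    if next_bound.inclusive:
        return current_token >= next_bound.token
    return current_token > next_bound.token


TOKEN = -3695961740249769322
NEXT = Bound(token=TOKEN, inclusive=False)  # "(" of the second interval
```

For the current token equal to TOKEN, the token-only check reports new
readers while the bound-aware check does not, which is the false
positive described above.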

Fixes #2908.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 16:23:01 -02:00
Raphael S. Carvalho
19d994cfff dht: make it easier to create ring_position_view from token
that's done by adding a separate explicit constructor

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 15:26:26 -02:00
Raphael S. Carvalho
68ac0832b7 dht: introduce is_min/max for ring_position
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 15:26:25 -02:00
Vlad Zolotarov
976f444813 tests: commitlog_test: fix the compilation and test errors introduced by the hinted_handoff series
Use the default commitlog configuration with the hinted_handoff disabled
in the tests.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1514942938-3844-1-git-send-email-vladz@scylladb.com>
2018-01-03 12:20:34 +00:00
Raphael S. Carvalho
e29b598c5f sstables: make compaction_descriptor's ctor explicit to avoid bad conversion
perf sstable used old sstables::compact_sstables() interface and still compiled
due to bad implicit conversion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180103041900.21186-1-raphaelsc@scylladb.com>
2018-01-03 12:37:12 +02:00
Calle Wilund
35b9ec868a auth: Fix transitional auth for non-valid credentials
Fixes #3096

The credentials processing for transitional auth was broken
in ba6a41d, "auth: Switch to sharded service", which effectively removed
the "virtualization" of underlying auth in the SASL challenge.

As a quick workaround, add the permissive exception handling to
sasl object as well.

Message-Id: <20180103102724.1083-1-calle@scylladb.com>
2018-01-03 12:33:04 +02:00
Amnon Heiman
3ec84a0b1d API tokens_endpoint: use streams
Returning token_endpoints when there are many tokens and end points can
take a long time.

This patch uses output stream to return the result.

Instead of returning a vector, it uses the streaming functionality in
json layer.
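
The general idea of streaming a large JSON array piece by piece,
instead of materialising the whole result first, can be sketched like
this. It is a generic illustration, not the json layer used by the
patch; the function name and the "key"/"value" field names are
assumptions.

```python
import json


def stream_token_endpoints(pairs):
    """Yield a JSON array chunk by chunk from an iterable of pairs.

    Only one element is serialised at a time, so memory use does not
    grow with the number of tokens.
    """
    yield "["
    first = True
    for token, endpoint in pairs:
        if not first:
            yield ","
        first = False
        yield json.dumps({"key": token, "value": endpoint})
    yield "]"


# Joining the chunks is done here only to show the result is valid JSON;
# a real server would write each chunk to the output stream directly.
body = "".join(stream_token_endpoints([("1", "127.0.0.1"), ("2", "127.0.0.2")]))
```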

Fixes #2476

Message-Id: <20180103081907.5175-1-amnon@scylladb.com>
2018-01-03 11:11:49 +02:00
Glauber Costa
bb29d082d2 io shares: multiply all shares by 10
Technically all that matters is the proportion among the shares so this
change is functionally a noop. However, the CPU scheduler being proposed
has shares that go all the way up to 1000. In the hopes of being able to
unify I/O and CPU controllers one day, this patch brings the I/O shares
more in line with what Avi is doing for the CPU scheduler.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
074a13ecf1 compaction_strategy: implement backlog manager for the SizeTiered strategy
The SizeTiered backlog for a single SSTable is defined as:

   Bi = Ei * log4(T / Si)

Where:

  - Si is the size of this individual SSTable
  - T is the sum of sizes for all individual SSTables
  - Ei is the effective bytes in this SSTable.

The Effective size of an SSTable is:
 - The uncompacted size for an SSTable under compaction
 - The partially written size for an SSTable being written
 - The SSTable size for an SSTable that is not undergoing
   any of those processes.

The Aggregate Backlog for the entire Table is just the sum of
all individual SSTable backlogs, including the SSTables currently
being written.

Care is taken to avoid iterating over all SSTables, by separating
the aggregate backlog into a static component (sstables not changing) and
a component of SSTables that are undergoing change.
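
A direct transcription of the formula above, as a small sketch. The
function names and the (effective_bytes, size) tuple layout are made up
for this example, and the static/changing split used to avoid iterating
over all SSTables is deliberately left out.

```python
import math


def sstable_backlog(effective_bytes, sstable_size, total_size):
    """Bi = Ei * log4(T / Si) for a single SSTable."""
    return effective_bytes * math.log(total_size / sstable_size, 4)


def table_backlog(sstables):
    """Sum of per-SSTable backlogs.

    sstables is a list of (effective_bytes, size) pairs, where T is the
    sum of all sizes.
    """
    total = sum(size for _, size in sstables)
    return sum(sstable_backlog(e, s, total) for e, s in sstables)
```

For example, four equally sized SSTables of 100 bytes each give
T / Si = 4, so each contributes 100 * log4(4) = 100 and the table
backlog is 400, while a single SSTable on its own has T / Si = 1 and
therefore no backlog.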

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
ca284174d0 infrastructure for backlog estimator for compaction work.
This patch adds infrastructure in various points in the system to allow
us to determine the amount of work present as backlog from compactions.

What needs to be done can be explained in three major pieces:

1) Add hooks in the points where sstables are added or inserted to a
   column family (or more precisely, to a compaction_strategy object).

2) Add hooks in read and write monitors that allow a compaction
   backlog estimator (tracker) to become aware of bytes that are
   partially written and compacted away.

3) Add a per-column family class (compaction_backlog_tracker) that
   can be used to track work that is done and relevant to compactions
   (like the two above), and a compaction manager to provide a
   system-wide backlog based on the response of the individual trackers.

The definition of how much backlog one has is strategy-specific. The
Null strategy is easy, as it never really has any backlog, and so is the
major strategy, since what really matters is the backlog of the
underlying compaction strategy.

Although backlogs are strategy-specific, they should be "compatible", in
the sense that if a particular strategy has more work to do, it should
yield a higher number than its counterparts.

All the others are presented in this patch as unimplemented: they will
always advertise a mild backlog that should yield a constant
CPU-utilization if used alone.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
86d7c160fd sstables: notify about end of data component write
We need to notify the monitor that the offset tracker that we are using is
about to be destroyed and will no longer be valid.

While we could modify the file_writer interface so that we could capture
the offset_tracker and take ownership of it - guaranteeing it is alive
until we reach the existing on_write_completed(), this feels like a
layer violation.

It is also potentially useful in general to let the monitor callers
know that writing the data portion is done.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
3bd6bceaf0 sstables: add read_monitor_generator
Passing the read monitor down to the sstable readers is tricky. The
points of interest - like compaction - are usually very far from the
interfaces that register the monitor, like read_rows. Between the two,
there is usually a mutation_reader, which is and ought to be totally
unaware of the read monitor: technically, a mutation_reader may not even
know it is backed by sstables.

The solution is to create a read_monitor_generator, that can be passed
from the upper layers, like compaction, to the layers that are actually
making the decision of which sstables to create readers for.

Note that we don't need an equivalent piece of infrastructure for
writes, because writes don't happen through hidden layers and have all
the information they need to initialize their monitors.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
9702a0935b sstables: add read_monitor
Similar to the write_monitor, it will track progress of an sstable
being read. In the current interface, we will notify interested users
about what is the current position in the data file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
f0391bf9a0 sstables: enhance data consumer with a position tracker
Callers, like compactions, will be able to know at any time the current
progress of a read.

As we do that, the currently unimplemented position() method of
data_consume_context becomes redundant and is removed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
110b8531f4 sstables: enhance the file_writer with an offset tracker
Callers, like the memtable flusher or compactions will be able to find
out the current amount of bytes written at any time.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Glauber Costa
00df0a5ad3 sstables: pass references instead of pointers for write_monitor
This came from Avi's review on the read_monitors. He suggests we
wouldn't keep shared pointers, and would instead have the caller
ensuring lifetime. That makes sense, but having the writer interface
using shared_ptr and the read interface using references would lead to
an inconsistent interface.

For the sake of consistency we will change the write monitor to take
references before we do that. From database.cc's perspective, we could
now keep the monitors in a do_with() block, but we will keep the
shared_ptrs to manage their lifetime in anticipation of upcoming patches
in this series, where we'll have to pass them somewhere else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:06 -05:00
Glauber Costa
d4109ebb80 compaction: control destruction of readers
Compactions run from a seastar::thread, in run(). They will either fail
or succeed, and from the point of view of ordering of destruction
between the compaction object and its readers:

- if compaction succeeds, we have no control over who gets destructed
  first since both objects will be going out of scope.
- if it fails, we will forcibly destruct the compaction object, at
  which point the readers are still alive

From the point of view of lifetime management, it would be nice to make
sure that the compaction object outlives whichever other objects it
needs during compaction.

This nice-to-have will become paramount when we start adding
read_monitors to the compaction object, which themselves have to
outlive the readers.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:06 -05:00
Avi Kivity
8795238869 Merge "Fix handling of range tombstones starting at same position" from Tomasz
"When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstables), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.
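
A toy model of that de-overlapping step and of why the second fragment
ends up outside the slice. Everything here is a simplification assumed
for illustration: ranges are plain (start, end) tuples over integers,
every range is treated as half-open, tombstone timestamps are ignored,
and the function names are hypothetical.

```python
def deoverlap_same_start(a, b):
    """Split two ranges sharing a start into non-overlapping fragments."""
    if a[0] != b[0]:
        raise ValueError("both ranges must start at the same position")
    shorter, longer = sorted((a, b), key=lambda r: r[1])
    if shorter[1] == longer[1]:
        return [shorter]
    # The remainder of the longer range now starts where the shorter ends.
    return [shorter, (shorter[1], longer[1])]


def starts_inside_slice(fragment, slice_end):
    """True if the fragment's position lies before the slice's upper bound."""
    return fragment[0] < slice_end


fragments = deoverlap_same_start((1, 10), (1, 20))
```

With a clustering slice whose upper bound is 5, the first fragment
starts inside the slice, but the second one starts at 10 and so looks
out of range even though its source tombstone covered the slice.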

One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with positions smaller than those already emitted, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can drop
 fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has the disadvantage of making those more complicated and slower.

There is only one consumer which needs the tombstones with monotonic positions, and
that is the sstable writer.

Fixes #3093."

* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev:
  tests: row_cache: Introduce test for concurrent read, population and eviction
  tests: sstables: Add test for writing combined stream with range tombstones at same position
  tests: memtable: Test that combined mutation source is a mutation source
  tests: memtable: Test that memtable with many versions is a mutation source
  tests: mutation_source: Add test for stream invariants with overlapping tombstones
  tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones
  tests: mutation_reader: Test combined reader slicing on random mutations
  tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys()
  mutation_fragment: Introduce range()
  clustering_interval_set: Introduce overlaps()
  clustering_interval_set: Extract private make_interval()
  mutation_reader: Allow range tombstones with same position in the fragment stream
  sstables: Handle consecutive range_tombstone fragments with same position
  tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
  streamed_mutation: Introduce peek()
  mutation_fragment: Extract mergeable_with()
  mutation_reader: Move definition of combining mutation reader to source file
  mutation_reader: Use make_combined_reader() to create combined reader
2018-01-02 18:32:09 +02:00
Raphael S. Carvalho
2a7eaa4933 tests:perf: add compaction mode to perf_sstable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171209175759.7769-1-raphaelsc@scylladb.com>
2018-01-02 10:16:13 +01:00
Duarte Nunes
39c1987ad7 CMakeLists: Require C++17
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180101165631.2182-1-duarte@scylladb.com>
2018-01-01 19:01:24 +02:00
Avi Kivity
73814db0f1 Merge "auth: Replace delayed_tasks with sleep_abortable" from Duarte
"delayed_tasks has a bug that if the object is destroyed while a timer
callback is queued, the callback will then try to access freed memory.
This series replaces the whole thing with sleep_abortable()."

* 'auth-delayed-tasks/v2' of https://github.com/duarten/scylla:
  auth: Replace delayed_tasks with sleep_abortable
  utils/exponential_backoff_retry: Add helper to automate retries
  utils/exponential_backoff_retry: Add abort_source-based retry
2018-01-01 13:44:01 +02:00
Raphael S. Carvalho
3dcf00ec67 sstables: feed new sstable with its owner shard
We missed the opportunity to feed the shard id to the sstable being
written when working on 67c5c8dc67. Feed it now, so that when the sstable
is reopened after being sealed, its shard doesn't need to be recomputed
by the open procedure.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171231024529.13664-1-raphaelsc@scylladb.com>
2018-01-01 10:17:07 +02:00
Avi Kivity
d7a91f5b84 build: require C++17 unconditionally
We can now use C++17 in Scylla.
Message-Id: <20171228112934.28659-1-avi@scylladb.com>
2017-12-28 16:44:59 +00:00
Duarte Nunes
81b1455b22 auth: Replace delayed_tasks with sleep_abortable
delayed_tasks has a bug that if the object is destroyed while a timer
callback is queued, the callback will then try to access freed memory.
This could be fixed by providing a stop() function that waits for
pending callbacks, but we can just replace the whole thing by leveraging
the abort_source-enabled exponential_backoff_retry.
2017-12-28 13:00:28 +00:00
Duarte Nunes
40ad65666f utils/exponential_backoff_retry: Add helper to automate retries
This patch adds the do_until_value static member function to
exponential_backoff_retry, which retries the specified function until
it returns an engaged optional.
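
The retry-until-engaged idea can be illustrated with a minimal synchronous
sketch. This is not the Seastar-based code from the patch: the name
retry_until_value, its parameters and the blocking sleep are assumptions
made only for illustration, and std::optional stands in for whatever
optional type the real helper returns.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <optional>
#include <thread>

// Hypothetical, blocking stand-in for a "retry until value" helper.
// Calls f() until it returns an engaged optional, doubling the delay
// between attempts up to max_delay.
template <typename Func>
auto retry_until_value(std::chrono::milliseconds base_delay,
                       std::chrono::milliseconds max_delay,
                       Func f) -> decltype(f()) {
    auto delay = base_delay;
    while (true) {
        auto result = f();
        if (result) {
            return result; // engaged: stop retrying
        }
        std::this_thread::sleep_for(delay);
        delay = std::min<std::chrono::milliseconds>(delay * 2, max_delay);
    }
}
```

A caller passes a callable returning std::optional<T>; the loop only
returns once that optional holds a value.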

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 13:00:28 +00:00
Duarte Nunes
9a602c7796 utils/exponential_backoff_retry: Add abort_source-based retry
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 13:00:28 +00:00
Duarte Nunes
89b353cd95 Delete unused nway_merger.hh
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1514463536-7732-1-git-send-email-duarte@scylladb.com>
2017-12-28 14:21:40 +02:00
Raphael S. Carvalho
c76356fb39 sstables: make shard computation resilient to empty sharding metadata
Scylla metadata could be empty due to bugs like the one introduced by
115ff10. Let's make shard computation resilient to empty sharding
metadata by falling back to the approach that uses first and last
keys to compute shards.

Refs #2932.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-2-raphaelsc@scylladb.com>
2017-12-28 14:07:06 +02:00
Raphael S. Carvalho
fa5a26f12d sstables: fail sstable write if unable to generate sharding metadata
An SSTable can end up with empty sharding metadata after a bug like
the one introduced in 115ff10, which results in tokens being
generated using the base table for the view table. That leads to the
sstable being deleted on a subsequent boot, because all shards will
agree on its deletion given that it will not belong to anybody,
and also to compaction crashing, because it relies on the resulting
sstable belonging to at least one shard.

I wouldn't like to spend days debugging it again because sstable
write silently generated empty sharding metadata, so let's make
write fail when it happens (see issue #2932 for details).

Refs #2932.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171223120140.3642-1-raphaelsc@scylladb.com>
2017-12-28 14:07:05 +02:00
Duarte Nunes
2618209c2d Remove obsolete includes and fix build
move.hh was deleted, but files weren't updated to reflect that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 12:03:44 +00:00
Avi Kivity
fc03ba1c08 streamed_mutation: remove non-missing include
"move.hh" should have been missing, but wasn't.
2017-12-28 14:00:34 +02:00
Duarte Nunes
1374f898b9 Merge seastar upstream
Class optimized_optional was moved into seastar, and its usage
simplified so move_and_disengage() is replaced by
std::exchange(_, { }).
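
As a rough illustration of the std::exchange(_, { }) idiom, the sketch
below moves a value out of a member and leaves the member disengaged in a
single expression. The holder class is hypothetical, and std::optional is
used here only as a stand-in for optimized_optional.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical holder; std::optional stands in for optimized_optional.
class holder {
    std::optional<std::string> _value;
public:
    explicit holder(std::string v) : _value(std::move(v)) {}

    // Returns the old contents and resets the member to a
    // default-constructed (disengaged) optional.
    std::optional<std::string> release() {
        return std::exchange(_value, {});
    }

    bool engaged() const { return _value.has_value(); }
};
```

After release() the member no longer holds a value, so a second call
returns a disengaged optional.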

* seastar adaca37...b0f5591 (9):
  > Merge "core: Introduce cancellation mechanism" from Duarte
  > Fix Seastar build that no longer builds with --enable-dpdk after the recent commit fd87ea2
  > noncopyable_function: support function objects whose move constructors throw
  > Adding new hardware options to new config format, using new config format for dpdk device
  > Fix check for Boost version during pre-build configuration.
  > variant_utils: add variant_visitor constructor for C++17 mode
  > Merge "Allows json object to be stream to an" from Amnon
  > Merge 'Default to C++17' from Avi
  > Add const version of subscript operator to circular_buffer

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171228112126.18142-1-duarte@scylladb.com>
2017-12-28 13:24:18 +02:00
Nadav Har'El
58f2b6c285 Drop "VIEWS" as unimplemented reason
After materialized views have been implemented (although not enabled by
default), unimplemented::cause::VIEWS is no longer used. I think we can
drop it.

By the way, there are other no longer used unimplemented reasons, we
should probably drop them too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171224131318.4893-1-nyh@scylladb.com>
2017-12-27 15:08:41 +02:00
Amos Kong
68a3d1e9b2 auth: delete auth/authorizer.cc
This file has not been used since commit ba6a41d397.
Jesse wanted to delete this file, but the deletion got lost.

Signed-off-by: Amos Kong <amos@scylladb.com>
Cc: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <9af5aee2b8d492b865b9b15c9fb16941880600d8.1514305358.git.amos@scylladb.com>
2017-12-26 18:29:38 +02:00
Takuya ASADA
51013f561d dist/debian: rename boost1.63 to scylla-boost163 on Debian 8
We provided the "boost1.63" package for Debian 8 since we couldn't build the
"scylla-boost163" package, which is available on Ubuntu14/16, but I fixed the
problem and now we have it for Debian 8 too, so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1514220163-25985-1-git-send-email-syuu@scylladb.com>
2017-12-25 18:51:36 +02:00
Piotr Jastrzebski
0430968426 Merge flat_mutation_reader_mutation_source into mutation_source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 22:32:38 +01:00
Piotr Jastrzebski
3817519844 Remove unused mutation_reader_mutation_source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:42:50 +01:00
Piotr Jastrzebski
e0e2fcc013 Remove unused mutation_source constructor.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:27:43 +01:00
Piotr Jastrzebski
66f603fc0a Migrate make_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:27:30 +01:00
Piotr Jastrzebski
d39f8cfb37 Migrate run_conversion_to_mutation_reader_tests to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:26:44 +01:00
Piotr Jastrzebski
ab8918c9c3 flat_mutation_reader_from_mutations: add support for slicing
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 21:25:37 +01:00
Piotr Jastrzebski
093d6f06f0 Remove unused mutation_source constructor.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 16:10:41 +01:00
Piotr Jastrzebski
da39ee5ba0 Migrate partition_counting_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 16:10:29 +01:00
Piotr Jastrzebski
0b34906da3 Migrate throttled_mutation_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 16:06:23 +01:00
Piotr Jastrzebski
fa938aafdd Extract delegating_reader from make_delegating_reader
and make it a template to enable using it both with reference_wrapper
and flat_mutation_reader directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-22 15:36:53 +01:00
Tomasz Grabiec
37ddc8bcfd tests: row_cache: Introduce test for concurrent read, population and eviction 2017-12-22 11:58:17 +01:00
Tomasz Grabiec
42ec01661c tests: sstables: Add test for writing combined stream with range tombstones at same position 2017-12-22 11:06:34 +01:00
Tomasz Grabiec
cb34420e1c tests: memtable: Test that combined mutation source is a mutation source 2017-12-22 11:06:34 +01:00
Tomasz Grabiec
7ce02bc22e tests: memtable: Test that memtable with many versions is a mutation source 2017-12-22 11:06:34 +01:00
Tomasz Grabiec
9cd35f4b90 tests: mutation_source: Add test for stream invariants with overlapping tombstones 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
7ce52df88b tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
ca6de9e78c tests: mutation_reader: Test combined reader slicing on random mutations 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
73a79372a4 tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
2be3cbbb81 mutation_fragment: Introduce range() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
0c0d52a933 clustering_interval_set: Introduce overlaps() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
c1d96bda88 clustering_interval_set: Extract private make_interval() 2017-12-22 11:06:33 +01:00
Tomasz Grabiec
41ede08a1d mutation_reader: Allow range tombstones with same position in the fragment stream
When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstables), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20). The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.
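
The problem can be sketched with a simplified model. Everything below is
an assumption for illustration: positions are plain integers, bounds are
half-open, and the names range_tombstone, deoverlap and starts_within are
hypothetical stand-ins, not the reader code.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical, simplified tombstone covering [start, end).
struct range_tombstone {
    int start;
    int end;
};

// De-overlap two tombstones sharing a lower bound so that each emitted
// fragment has a distinct start position. Which deletion timestamp wins
// on the overlapping part is ignored in this sketch.
std::vector<range_tombstone> deoverlap(range_tombstone a, range_tombstone b) {
    assert(a.start == b.start);
    if (a.end > b.end) {
        std::swap(a, b);
    }
    std::vector<range_tombstone> out{a};
    if (b.end > a.end) {
        out.push_back({a.end, b.end}); // remainder starts where a ends
    }
    return out;
}

// True if the tombstone starts before the (exclusive) slice upper bound.
bool starts_within(const range_tombstone& rt, int slice_upper_bound) {
    return rt.start < slice_upper_bound;
}
```

With inputs [1, 10) and [1, 20) and a slice ending at 5, the second
fragment produced starts at 10, which is outside the slice even though
the tombstone it came from starts at 1.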

One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone). This can happen if the
second range tombstone is the last fragment read for a discontinuous
range in cache, reading stopped at that point because of a full buffer,
and the cache was evicted before reading resumed, so that reading
continued from the sstable reader. There could be more cases in which
this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with positions smaller than those already emitted, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can drop
 fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has the disadvantage of making those more complicated and slower.

Fixes #3093.
2017-12-22 11:06:20 +01:00
Tomasz Grabiec
f9038d5d78 sstables: Handle consecutive range_tombstone fragments with same position
In preparation for allowing fragment streams to produce range_tombstones
with the same position.
2017-12-22 11:04:02 +01:00
Tomasz Grabiec
92b89d576d tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
In preparation for allowing fragment stream to produce consecutive
range tombstones with the same position.
2017-12-21 22:45:35 +01:00
Tomasz Grabiec
815cd254e2 streamed_mutation: Introduce peek()
Will be used in assertions to merge consecutive range tombstones.
2017-12-21 22:45:35 +01:00
Piotr Jastrzebski
963b128a87 row_cache_test: call row_cache::make_flat_reader in mutation_sources
instead of calling row_cache::make_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 22:22:11 +01:00
Piotr Jastrzebski
fd1b27c89d Remove unused friend declaration in flat_mutation_reader::impl
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:41:15 +01:00
Tomasz Grabiec
c5f82aa5bd mutation_fragment: Extract mergeable_with() 2017-12-21 21:24:11 +01:00
Tomasz Grabiec
60ed5d29c0 mutation_reader: Move definition of combining mutation reader to source file
So that the whole world doesn't recompile when it changes.
2017-12-21 21:24:11 +01:00
Tomasz Grabiec
52285a9e73 mutation_reader: Use make_combined_reader() to create combined reader
So that we can hide the definition of combined_mutation_reader. It's
also less verbose.
2017-12-21 21:24:11 +01:00
Piotr Jastrzebski
a02434120a Migrate make_source_with to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:18:35 +01:00
Piotr Jastrzebski
2c1f0250c2 Migrate make_empty_mutation_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:17:46 +01:00
Piotr Jastrzebski
b5ad96c9ca Remove unused mutation_source constructor
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:01:35 +01:00
Piotr Jastrzebski
5eb702a405 Migrate test_multi_range_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 20:48:51 +01:00
Piotr Jastrzebski
b583ef7c8b Remove unused mutation_source constructors
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 20:34:20 +01:00
Paweł Dziepak
dfb6296d08 Merge "Migrate all clients of make_combined_reader to flat reader" from Piotr
"Remove old overloads that use mutation_reader."

* 'haaawk/combined_reader_clients_migration_v1_after_comments_2' of github.com:scylladb/seastar-dev:
  Remove unused make_combined_reader overload.
  Migrate test_fast_forwarding_combining_reader to flat reader
  flat_mutation_reader_from_mutations: support partition_range
  Don't pass fwd to flat_mutation_reader_from_mutations if it's no
  Remove unused make_combined_reader overload.
  Migrate test_combining_two_partially_overlapping_readers to flat reader
  Migrate test_combining_two_non_overlapping_readers to flat reader
  Migrate combined_mutation_reader_test to flat reader
  Migrate test_sm_fast_forwarding_combining_reader to flat reader
  Migrate test_combining_one_empty_reader to flat reader
  Migrate test_combining_two_empty_readers to flat reader
  Migrate test_combining_two_readers_with_one_reader_empty to flat reader
  Migrate test_combining_one_reader_with_many_partitions to flat reader
  Migrate test_combining_two_readers_with_the_same_row to flat reader
  Migrate make_combined_mutation_source to flat reader
  mutation_source: Add constructors for sources that ignore forwarding
2017-12-21 16:04:49 +00:00
Piotr Jastrzebski
04ce7dfb84 Remove unused make_combined_reader overload.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
759baa3a11 Migrate test_fast_forwarding_combining_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
83e55283f7 flat_mutation_reader_from_mutations: support partition_range
This is needed to make it possible for
flat_mutation_reader_from_mutations to replace
make_reader_returning_many.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
9e3da50ed1 Don't pass fwd to flat_mutation_reader_from_mutations if it's no
Default value for fwd is no so there's no need to pass it explicitly.
This is important because we will add additional parameter to
flat_mutation_reader_from_mutations in next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
b3b6db4f50 Remove unused make_combined_reader overload.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
202c562f68 Migrate test_combining_two_partially_overlapping_readers to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
6c62454076 Migrate test_combining_two_non_overlapping_readers to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
bef2cf8ed9 Migrate combined_mutation_reader_test to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
19d4bce624 Migrate test_sm_fast_forwarding_combining_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
17e6f6b089 Migrate test_combining_one_empty_reader to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
1f77370d9e Migrate test_combining_two_empty_readers to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
a702d0ec3f Migrate test_combining_two_readers_with_one_reader_empty to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
9a5d6bd8af Migrate test_combining_one_reader_with_many_partitions to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
13551e6f50 Migrate test_combining_two_readers_with_the_same_row to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:43 +01:00
Piotr Jastrzebski
b1c1709127 Migrate make_combined_mutation_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 17:00:42 +01:00
Piotr Jastrzebski
024e01ad9e mutation_source: Add constructors for sources that ignore forwarding
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 16:59:57 +01:00
Paweł Dziepak
4dfddc97c7 db/schema_tables: do not use moved from shared pointer
Shared pointer view is captured by two continuations, one of which is
moving it away. Using do_with() solves the problem.

Fixes #3092.
Message-Id: <20171221111614.16208-1-pdziepak@scylladb.com>
2017-12-21 15:13:25 +01:00
Tomasz Grabiec
b0a56a91c2 Merge "Remove memtable::make_reader" from Piotr
Migrate all the places that used memtable::make_reader to use
memtable::make_flat_reader and remove memtable::make_reader.

* seastar-dev.git haaawk/remove_memtable_make_reader_v2_rebased:
  Remove memtable::make_reader
  Stop using memtable::make_reader in row_cache_stress_test
  Stop using memtable::make_reader in row_cache_test
  Stop using memtable::make_reader in mutation_test
  Stop using memtable::make_reader in streamed_mutation_test
  Stop using memtable::make_reader in memtable_snapshot_source.hh
  Stop using memtable::make_reader in memtable::apply
  Add consume_partitions(flat_mutation_reader& reader, Consumer consumer)
  Add default parameter values in make_combined_reader
  Migrate test_virtual_dirty_accounting_on_flush to flat reader
  Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
  Simplify flat_reader_assertions& produces(const mutation& m)
  Migrate test_partition_version_consistency_after_lsa_compaction_happens
  flat_mutation_reader: Allow setting buffer capacity
  Add next_mutation() to flat_mutation_reader_assertions
  cf::for_all_partitions::iteration_state: don't store schema_ptr
  read_mutation_from_flat_mutation_reader: don't take schema_ptr
  Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader
2017-12-21 14:02:56 +01:00
Piotr Jastrzebski
17f2eb8ff7 Remove memtable::make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
85d2b24415 Stop using memtable::make_reader in row_cache_stress_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
129a282cbf Stop using memtable::make_reader in row_cache_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
dc75df6353 Stop using memtable::make_reader in mutation_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
660086f2d6 Stop using memtable::make_reader in streamed_mutation_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
2a9cd5bffe Stop using memtable::make_reader in memtable_snapshot_source.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
6bcee5976b Stop using memtable::make_reader in memtable::apply
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
a67d6bef29 Add consume_partitions(flat_mutation_reader& reader, Consumer consumer)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
ff718d6573 Add default parameter values in make_combined_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
b1676db658 Migrate test_virtual_dirty_accounting_on_flush to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
b90677272f Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
to flat reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
20e31e462e Simplify flat_reader_assertions& produces(const mutation& m)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
ddecd385c1 Migrate test_partition_version_consistency_after_lsa_compaction_happens
to flat reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
5f8fba8a61 flat_mutation_reader: Allow setting buffer capacity
Needed in tests to limit amount of prefetching done by readers, so
that it's easier to test interleaving of various events.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
b18c075470 Add next_mutation() to flat_mutation_reader_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
308ec43ea5 cf::for_all_partitions::iteration_state: don't store schema_ptr
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
570703a169 read_mutation_from_flat_mutation_reader: don't take schema_ptr
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
681dc26dd1 Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Tomasz Grabiec
71cc63dfa6 Merge "Fixes for multi_range_reader" from Paweł
The following patches contain fixes for skipping to the next partition
in multi_range_reader and completely disable support for fast
forwarding inside a single partition, which is not needed and would only
add unnecessary complexity.

* https://github.com/pdziepak/scylla.git fix-multi_range_reader/v1:
  flat_multi_range_mutation_reader: disallow
    streamed_mutation::forwarding
  flat_multi_range_mutation_reader: clear buffer on next_partition()
  tests/flat_multi_range_mutation_reader: test skipping to next
    partition
2017-12-21 11:06:57 +01:00
George Tavares
ceecd542cd db/view: Consume updated rows regardless of static row
When using Materialized Views, if the base table has static columns
and an update to the base table mutates both static and non-static rows,
the streamed_mutation is stopped before the non-static row is processed.
This patch avoids stopping the streamed_mutation and adds a test case.

Message-Id: <20171220173434.25091-1-tavares.george@gmail.com>
2017-12-21 00:49:15 +01:00
Paweł Dziepak
da0655ab3c tests/flat_multi_range_mutation_reader: test skipping to next partition 2017-12-20 16:08:09 +00:00
Paweł Dziepak
5d72acac0c flat_multi_range_mutation_reader: clear buffer on next_partition() 2017-12-20 16:08:09 +00:00
Paweł Dziepak
3cf46a31a6 flat_multi_range_mutation_reader: disallow streamed_mutation::forwarding
Properly implementing streamed_mutation::forwarding::yes in multi range
reader would noticeably increase its complexity and is not needed.
2017-12-20 14:50:11 +00:00
Tomasz Grabiec
dfe48bbbc7 range_tombstone_list: Fix insert_from()
end_bound was not updated in one of the cases in which end and
end_kind were changed; as a result, later merging decisions using
end_bound were incorrect. end_bound was using the new key, but the old
end_kind.

Fixes #3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>
2017-12-20 12:20:20 +00:00
Raphael S. Carvalho
daaadfd515 compaction_manager: remove dead sstable rewrite submission function
This rewrite submission was used by the old resharding, but it's no longer
needed, so let's remove it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171219191052.13689-1-raphaelsc@scylladb.com>
2017-12-20 09:29:43 +02:00
Avi Kivity
772d1f47d7 Merge "Fix read amplification in sstable reads" from Paweł
"4b9a34a85425d1279b471b2ff0b0f2462328929c "Merge sstable_data_source
into sstable_mutation_reader" has introduced unintentional changes, some
of them causing excessive read amplification during empty range reads.
The following patches restore the previous behaviour."

* tag 'fix-read-amplification/v1' of https://github.com/pdziepak/scylla:
  sstables: set _read_enabled to false if possible
  sstables: set _single_partition_read for single partition reads
2017-12-19 18:17:14 +02:00
Tomasz Grabiec
6a6bf58b98 flat_mutation_reader: Fix make_nonforwardable()
It emitted end-of-stream prematurely if buffer was full.
Message-Id: <1513697716-32634-1-git-send-email-tgrabiec@scylladb.com>
2017-12-19 15:56:49 +00:00
Avi Kivity
2137d753b3 Merge "Serialize compaction of same size tier for different cfs" from Raphael
"Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly number of running compactions, which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstrapping after loading node with
new sstables or repairing, or if lots of cfs are being actively written."

Fixes #1295.
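
The serialization idea can be sketched as a registry of weights that is
shared by all column families, so that a second job with the same weight
is refused regardless of which column family submitted it. The class and
method names below are hypothetical and simplified; they are not the
compaction_manager interface.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical registry of weights with a compaction in progress.
// A single instance is assumed to be shared by every column family.
class weight_registry {
    std::unordered_set<int> _in_progress;
public:
    // Returns true if no compaction of this weight is running yet.
    bool try_register(int weight) {
        return _in_progress.insert(weight).second;
    }

    // Called when the compaction holding this weight finishes.
    void deregister(int weight) {
        _in_progress.erase(weight);
    }
};
```

Because the registry is not keyed by column family, two column families
asking for the same weight contend for one slot, while different weights
can still run concurrently.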

* 'similar_sized_compaction_serialization_v4' of github.com:raphaelsc/scylla:
  sstables: remove column_family from compaction_weight_registration
  compaction_manager: serialize compaction of same size tier for different cfs
  sstables: introduces deregister() and weight() to compaction_weight_registration
  sstables: move compaction_weight_registration to its own header
  sstables: improve compact_sstables() interface
2017-12-19 16:32:27 +02:00
Tomasz Grabiec
7b36c8423c row_cache: Fix single_partition_populating_reader not waiting on create_underlying() to resolve
Results in undefined behavior.
Message-Id: <1513691679-27081-1-git-send-email-tgrabiec@scylladb.com>
2017-12-19 16:12:11 +02:00
Paweł Dziepak
574c6006f6 sstables: set _read_enabled to false if possible 2017-12-19 13:59:13 +00:00
Paweł Dziepak
1beb3552fc sstables: set _single_partition_read for single partition reads 2017-12-19 13:59:13 +00:00
Piotr Jastrzebski
570fc5afed Use row_cache::make_flat_reader in column_family::make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ba1659ceed8676f45942ce6e7506158026947345.1513687259.git.piotr@scylladb.com>
2017-12-19 14:42:32 +02:00
Raphael S. Carvalho
928beae242 Fix compilation of db/hints/manager.cc and row_cache.cc
compiler: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

Problems introduced in f6a461c7a4
and 37b19ae6ba, respectively.

They both fail to compile due to the use of a method in a lambda without
an explicit mention of this. Some of the failures are fixed by not using
auto in the lambda parameter.
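
The two workarounds can be illustrated with a small standalone sketch.
The counter class is hypothetical and unrelated to the files touched
here; it only shows a member call written as this->add(x) inside a
generic lambda, and the same call inside a lambda with a concrete
parameter type.

```cpp
#include <cassert>
#include <vector>

// Hypothetical class used only to show the two lambda styles.
class counter {
    int _total = 0;
    void add(int x) { _total += x; }
public:
    // Generic lambda: spell out this-> when calling the member function.
    int sum_generic(const std::vector<int>& xs) {
        auto visit = [this](auto x) { this->add(x); };
        for (int x : xs) {
            visit(x);
        }
        return _total;
    }

    // Non-generic lambda: a concrete parameter type instead of auto.
    int sum_typed(const std::vector<int>& xs) {
        auto visit = [this](int x) { add(x); };
        for (int x : xs) {
            visit(x);
        }
        return _total;
    }
};
```

Both forms behave the same on compilers without the problem; the
difference only matters for the affected compiler version.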

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171218222144.12297-1-raphaelsc@scylladb.com>
2017-12-19 11:15:45 +01:00
Avi Kivity
d97ea6b0f4 Merge seastar upstream
* seastar 2b23547...adaca37 (7):
  > Merge "Support for skipping over bytes from input stream in input_stream::consume" from Vladimir
  > build: enforce Boost >= 1.58 during configuration.
  > Tutorial: beginning of documentation of CPU scheduling et al.
  > circular_buffer: make move-constructor noexcept
  > circular_buffer: convert existing documentation to doxygen format
  > build: fix detection of membarrier syscall support
  > Merge "Improve systemwide_memory_barrier() on newer Linuces" from Avi
2017-12-19 11:21:35 +02:00
Avi Kivity
8dbc6bbcdc Update scylla-ami submodule
* dist/ami/files/scylla-ami be90a3f...3366c93 (1):
  > scylla_install_ami: skip ec2_check while building AMI
2017-12-19 10:10:22 +02:00
Takuya ASADA
77fbdd487c dist/ami: Switch to official CentOS base image
We had switched to our own CentOS base image because we couldn't make the built
AMI public due to base image settings, probably because that image was provided
via the AWS Marketplace.
However, I've found an official image outside of the Marketplace, and I succeeded
in making an AMI built from that image public.
URL: https://wiki.centos.org/Cloud/AWS

Now that we are able to use the official image, we should probably use it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513659164-28029-1-git-send-email-syuu@scylladb.com>
2017-12-19 10:07:14 +02:00
Tomasz Grabiec
37b19ae6ba Merge "Migrate cache to use flat_mutation_reader" from Piotr 2017-12-18 17:53:20 +01:00
Piotr Jastrzebski
d756c49baf Rename cache_streamed_mutation_test to cache_flat_mutation_reader_test
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
14d98aaa0b Rename row_cache::create_underlying_flat_reader to
create_underlying_reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
49993e56a9 Remove unused row_cache::create_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
b976872c1a Rename all *_underlying_flat methods in read_context to *_underlying.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
1457a3d771 Rename cache_entry::*read_flat to cache_entry::*read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
8b796a884f Rename read_context::enter_flat_partition to enter_partition
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
8d37b71843 Rename autoupdating_underlying_flat_reader to autoupdating_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
9789c37e9d Remove autoupdating_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
893e434207 Stop using autoupdating_underlying_reader in read_context
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
6e9b54cc77 Remove unused cache_streamed_mutation
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
df17bad13b Remove unused cache_entry::read and do_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
003670c3cd Remove unused read_directly_from_underlying overload
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
9fab29be82 Rename _sm to _reader in scanning_and_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
610fa7a2c2 Stop using streamed_mutation in scanning_and_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
3153d5d2c2 Rename _sm to _reader in single_partition_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
556edfab29 Stop using streamed mutation in single_partition_populating_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
fec4468669 Add read_directly_from_underlying that returns flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
7012dc1049 Add make_delegating_reader
It creates a flat_mutation_reader from a reference to another reader.

This makes it easier to compose code in a more elegant way.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
4088dcba5a Add make_nonforwardable for flat_mutation_reader.
It turns a reader that allows fast forwarding into
a reader that does not allow it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:53 +01:00
Piotr Jastrzebski
47eb609aeb Change fill_buffer_from_streamed_mutation to fill_buffer_from
that can handle both streamed_mutation and flat_mutation_reader
as source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:24:16 +01:00
Nadav Har'El
ba3cb057f5 Fix compilation of tests/hint_test.cc
Starting with commit fb0866ca20, tests
do not have to, and MUST NOT, define the disk error handlers. If they
do, we get a re-definition of variables already defined in
disk-error-handler.cc.

tests/hint_test.cc was apparently written before that commit, so we
need to remove the duplicate variables to get it to link.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218133635.20500-1-nyh@scylladb.com>
2017-12-18 15:37:19 +02:00
Nadav Har'El
101cce3c79 Fix compilation of tests/commitlog_test.cc
In commit 878d58d23a, a new parameter was
added to commitlog::descriptor. The commit message says that "It's default
value is a descriptor::FILENAME_PREFIX." while in reality, it did not have
a default value and compilation of tests/commitlog_test.cc broke, because
it didn't specify a value.

So this patch adds a default value for this parameter, as was suggested
by the original commit message.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218131020.17883-1-nyh@scylladb.com>
2017-12-18 15:35:34 +02:00
Nadav Har'El
73aad5736f Fix compilation of tests/cql_test_env.cc
In commit 1f4f71e619, an
stdx::optional<std::vector<sstring>> parameter was added to storage_proxy's
constructor. However, this parameter was not made optional, and
tests/cql_test_env.cc failed to compile because it didn't provide this
parameter.

This patch makes this parameter optional (if missing, it's like an empty
stdx::optional) so cql_test_env.cc compiles.
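
As a rough sketch of that fix, assuming a simplified stand-in class (toy_proxy and its members are hypothetical, and std::optional is used in place of stdx::optional): giving the parameter a default of an empty optional lets existing callers keep constructing the object without it.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a class whose constructor gained a new parameter.
class toy_proxy {
    std::optional<std::vector<std::string>> _hinted_dcs;
public:
    // Defaulting to std::nullopt keeps old call sites compiling.
    explicit toy_proxy(std::optional<std::vector<std::string>> hinted_dcs = std::nullopt)
        : _hinted_dcs(std::move(hinted_dcs)) {}

    bool has_hinted_dcs() const { return _hinted_dcs.has_value(); }

    std::size_t hinted_dc_count() const {
        return _hinted_dcs ? _hinted_dcs->size() : 0;
    }
};
```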

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20171218132121.18782-1-nyh@scylladb.com>
2017-12-18 15:32:54 +02:00
Piotr Jastrzebski
527b48564d Fix fast_forward_to in make_forwardable
It wasn't setting _end_of_stream to false which
is necessary.
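
A toy illustration of why the flag has to be cleared, assuming a much simplified reader over sorted integers (toy_forwardable_reader is hypothetical and unrelated to the real flat_mutation_reader API). Once the reader hits the end of its current range it latches _end_of_stream, so forwarding to a new range must reset it or no further data is ever produced:

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Toy stand-in for a forwardable reader over sorted integer positions.
class toy_forwardable_reader {
    std::vector<int> _data;       // sorted positions
    std::size_t _next = 0;        // index of the next element to emit
    int _range_end;               // exclusive upper bound of the current range
    bool _end_of_stream = false;
public:
    toy_forwardable_reader(std::vector<int> data, int range_end)
        : _data(std::move(data)), _range_end(range_end) {}

    std::optional<int> next() {
        if (_end_of_stream) {
            return std::nullopt;
        }
        if (_next == _data.size() || _data[_next] >= _range_end) {
            _end_of_stream = true;    // latched until the next fast-forward
            return std::nullopt;
        }
        return _data[_next++];
    }

    void fast_forward_to(int start, int end) {
        while (_next < _data.size() && _data[_next] < start) {
            ++_next;
        }
        _range_end = end;
        _end_of_stream = false;       // without this, next() stays empty forever
    }

    bool is_end_of_stream() const { return _end_of_stream; }
};
```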

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
880623e2e9 Use cache_entry::read_flat in make_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
a9b6551584 Add cache_entry::read_flat
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
8a275dfaeb Create transform for flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
a322268416 Turn cache_flat_mutation_reader into a flat reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
f4e048f6ff Add consume_mutation_fragments_until to flat_mutation_reader.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
82c603069b Make cache_flat_mutation_reader a friend of row_cache and cache_tracker
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
f467e84424 Rename cache_streamed_mutation to cache_flat_mutation_reader
in cache_flat_mutation_reader.hh

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
3075780097 Make copy of cache_streamed_mutation.hh
and call it cache_flat_mutation_reader.hh.
It will be turned into a flat mutation reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
072fc2a309 Move lsa_manager to row_cache.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
d525a306a0 Add reserve_one to flat_mutation_reader::impl
This will be used in cache.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
714868db2d Use autoupdating_underlying_flat_reader in read_context
and add read_context::enter_flat_partition. This will
temporarily coexist with read_context::enter_partition
but after everything in cache is migrated to flat reader
the new method will replace old one.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
3e980cac3d Make autoupdating_underlying_flat_reader use flat reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
77b6f7c599 read_context: create a copy of autoupdating_underlying_reader
called autoupdating_underlying_flat_reader. It will be modified
in the next patch to use flat reader to underlying.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
bf4e1c0c54 Add row_cache::create_underlying_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
16a0d306fd Turn scanning_and_populating_reader into flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
656e8622e1 Turn single_partition_populating_reader into flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
1a7011b6b5 Extract fill_buffer_from_streamed_mutation
it will be reused in other readers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:26:44 +01:00
Raphael S. Carvalho
38318c753a sstables: remove column_family from compaction_weight_registration
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:42:52 -02:00
Raphael S. Carvalho
eff62bc61e compaction_manager: serialize compaction of same size tier for different cfs
Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly number of running compactions, which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger during bootstrapping, after loading a node with
new sstables or repairing, or if lots of cfs are being actively written.

That being said, compaction jobs of the same size tier are now serialized
on a given shard, so that the maximum number of compactions (system-wide)
is now:
(SHARDS) * (SIZE TIERS)
instead of:
(SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS)

We'll work hard to release a size tier (weight) for a column family
waiting on it as fast as possible, given that we wouldn't like to
underutilize resources available for compaction. We want one starting
after the other. Compaction for a column family that cannot run now
because the size tier is taken, will be postponed. There's a worker
that will be sleeping on a condition variable that will be signalled
whenever a compaction completes. FIFO ordering is used on postponed
list for fairness.
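
The scheduling rule can be sketched roughly as follows. This is a simplified, synchronous model under stated assumptions: toy_compaction_scheduler, submit() and complete() are hypothetical names, and the condition variable and worker described above are replaced by a direct call when a compaction finishes.

```cpp
#include <cstddef>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Toy per-shard scheduler: at most one running compaction per weight
// (size tier), regardless of which column family it belongs to.
class toy_compaction_scheduler {
    struct job { std::string cf; int weight; };
    std::set<int> _weights_in_use;
    std::deque<job> _postponed;   // FIFO for fairness
public:
    // Returns true if the job starts now, false if it was postponed.
    bool submit(const std::string& cf, int weight) {
        if (_weights_in_use.insert(weight).second) {
            return true;
        }
        _postponed.push_back({cf, weight});
        return false;
    }

    // Called when a compaction of the given weight finishes. Returns the
    // column families whose postponed jobs start as a result, oldest first.
    std::vector<std::string> complete(int weight) {
        _weights_in_use.erase(weight);
        std::vector<std::string> started;
        for (auto it = _postponed.begin(); it != _postponed.end();) {
            if (_weights_in_use.insert(it->weight).second) {
                started.push_back(it->cf);
                it = _postponed.erase(it);
            } else {
                ++it;
            }
        }
        return started;
    }

    std::size_t running() const { return _weights_in_use.size(); }
};
```

In this model the number of running jobs on a shard can never exceed the number of distinct weights, which corresponds to the (SHARDS) * (SIZE TIERS) bound.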

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:42:48 -02:00
Raphael S. Carvalho
fa0e53f626 sstables: introduces deregister() and weight() to compaction_weight_registration
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:34:08 -02:00
Raphael S. Carvalho
20d8a2c045 sstables: move compaction_weight_registration to its own header
That will be needed for using it in compaction.hh. We can't declare
compaction_weight_registration in compaction_manager.hh, because
compaction.hh can't include the former due to cyclic dependency,
so compaction_weight_registration will be declared in its own
header.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:26:51 -02:00
Raphael S. Carvalho
49f3cfe746 sstables: improve compact_sstables() interface
Motivation is that a new field in the descriptor will be forwarded
to compaction procedure without extending parameter list even more.
Also beautifies the interface, making it concise and easier to
play with.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-17 17:22:19 -02:00
Michael Munday
0a67a505a5 sstables: write summary in little-endian byte order on big-endian systems
The summary positions are defined to be in 'native' byte order.
Unfortunately this makes sharing files between big- and little-endian
machines much more difficult. For example, test files need to be
generated for both potential byte orders.

This change sets the byte order of the affected data to little-endian.
Ideally there would still be a way to deal with files generated on
big-endian systems using the 'native' byte order (see #3056).
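
A host-independent way to produce little-endian bytes can be sketched as below. The helper names are hypothetical and this is not the code from the patch; it only shows that shifting and masking yields the same byte sequence on both big- and little-endian machines.

```cpp
#include <array>
#include <cstdint>

// Encode a 64-bit value as 8 bytes, least significant byte first.
// Works identically on any host because it never reinterprets memory.
std::array<std::uint8_t, 8> to_little_endian(std::uint64_t v) {
    std::array<std::uint8_t, 8> out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<std::uint8_t>(v >> (8 * i));
    }
    return out;
}

// Decode 8 little-endian bytes back into a 64-bit value.
std::uint64_t from_little_endian(const std::array<std::uint8_t, 8>& in) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        v |= static_cast<std::uint64_t>(in[i]) << (8 * i);
    }
    return v;
}
```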

Message-Id: <20171212183652.87881-1-mike.munday@ibm.com>
2017-12-17 11:10:49 +02:00
Glauber Costa
b8f49fcc14 conf: document listen_on_broadcast_address
That's a supported feature that is listed in our help message, but it
is not present in the yaml file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171215011240.16027-1-glauber@scylladb.com>
2017-12-17 10:55:09 +02:00
Vlad Zolotarov
be6f8be9cb messaging_service: fix multi-NIC support
Don't enforce the outgoing connections from the 'listen_address'
interface only.

If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.

Fixes #3066

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
2017-12-17 10:51:20 +02:00
Avi Kivity
11de20fc33 Merge "SSTable summary regeneration fixes" from Raphael
"Fixes #3057."

* 'summary_recreation_fixes_v2' of github.com:raphaelsc/scylla:
  tests: sstable summary recreation sanity test
  sstables: make loading of sstable without summary to work again
  sstables: fix summary generation with dynamic index sampling
2017-12-17 09:17:36 +02:00
Takuya ASADA
c2e87f4677 dist/common/systemd: specify correct repo file path for housekeeping service on Ubuntu/Debian
Currently scylla-housekeeping-daily.service/-restart.service hardcode
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify the CentOS .repo file,
but we use the same .service files for Ubuntu/Debian.
That doesn't work correctly; we need to specify the .list file for Debian variants.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
2017-12-16 22:03:25 +02:00
Duarte Nunes
f6a461c7a4 Merge 'hinted handoff' from Vlad
"This series is the first part of hinted handoff implementation.
It includes:
   - Minor adjustment of commitlog layer.
   - Generation of hints when storage_proxy calls hint_to_dead_endpoints(...).
   - Sending the hints to the Node that becomes UP.

It doesn't include:
   - Node decommissioning.
   - Resharding."

* 'hinted_handoff-v7-1' of github.com:vladzcloudius/scylla:
  main + storage_service: wire up hints generation
  config: add hints related options
  db::hints::manager: initial commit
  tracing: make the session state modifying methods and tracing::trace(...) noexcept
  utils::fb_utilities: add is_me(addr) method
  tests: hint_test: initial commit
  db::commitlog::replay_position: added std::hash<replay_position>
  db::commitlog: truncate segments to their actual sizes during shutdown
  db::commitlog: allow defining a metrics category name
  db/commitlog/commitlog::descriptor: add a filename_prefix parameter
  db::commitlog::descriptor::descriptor(filename): pass a filename as a const ref
  docs: hinted_handoff_design.md: high level design of a Hinted Handoff feature
2017-12-14 21:16:40 +01:00
Vlad Zolotarov
1f4f71e619 main + storage_service: wire up hints generation
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:11 -05:00
Vlad Zolotarov
c2296c9575 config: add hints related options
   - hints_directory:
      - Defines the directory where hints files are going
        to be stored if hinted handoff is enabled.

   - hinted_handoff_enabled:
      - May receive either a boolean value or a list of DCs. In the latter case this
        defines the DCs for whose Nodes hints are going to be generated.

   - max_hint_window_in_ms:
      - Maximum number of milliseconds for which hints are generated for a Node that is DOWN.
        After this time period hints are no longer generated until the Node is seen UP.
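
How these options might combine can be sketched as follows. Everything here is an assumption for illustration: toy_hints_config and should_generate_hint() are hypothetical, and the 3-hour default window is an assumed value rather than one stated in this commit.

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <variant>
#include <vector>

// Hypothetical, simplified view of the hint-related options.
struct toy_hints_config {
    // Either a plain on/off switch or a list of DCs to generate hints for.
    std::variant<bool, std::vector<std::string>> hinted_handoff_enabled = true;
    // Assumed default of 3 hours, expressed in milliseconds.
    std::chrono::milliseconds max_hint_window{3 * 60 * 60 * 1000};
};

// Decide whether a hint should be generated for a node in `dc` that has
// been DOWN for `down_for`.
bool should_generate_hint(const toy_hints_config& cfg,
                          const std::string& dc,
                          std::chrono::milliseconds down_for) {
    if (down_for > cfg.max_hint_window) {
        return false;   // node has been down longer than the window
    }
    if (auto* enabled = std::get_if<bool>(&cfg.hinted_handoff_enabled)) {
        return *enabled;
    }
    const auto& dcs = std::get<std::vector<std::string>>(cfg.hinted_handoff_enabled);
    return std::find(dcs.begin(), dcs.end(), dc) != dcs.end();
}
```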

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:11 -05:00
Vlad Zolotarov
51bbf18c08 db::hints::manager: initial commit
Currently implemented:
   - Hints generation: db::hints::manager::store_hint(...).
   - Sending: db::hints::manager::on_timer().

TODO:
   - Resharding.
   - Node decommission.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:07 -05:00
Vlad Zolotarov
fcff872089 tracing: make the session state modifying methods and tracing::trace(...) noexcept
Make state session creation, stop_forground() and tracing::trace(...) methods
noexcept.
Most of them have already been implemented in a way that won't throw,
but this patch makes it official...

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
6c037899b5 utils::fb_utilities: add is_me(addr) method
Add a widely used method that returns TRUE if a given address is a broadcast
address of the local node.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
b20dbe16d8 tests: hint_test: initial commit
Test the regular commitlog with the custom file name prefix.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
ec15d60a2d db::commitlog::replay_position: added std::hash<replay_position>
It's needed for hinted handoff.
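
A sketch of what such a specialization can look like, assuming a simplified replay_position with a 64-bit segment id and a 32-bit position (the field names and types here are assumptions, and the combining formula is a common boost-style mix rather than the one used in the patch):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

// Simplified stand-in; field names and types are assumptions.
struct replay_position {
    std::uint64_t id;    // segment id
    std::uint32_t pos;   // position within the segment

    bool operator==(const replay_position& o) const {
        return id == o.id && pos == o.pos;
    }
};

namespace std {
template <>
struct hash<replay_position> {
    size_t operator()(const replay_position& rp) const noexcept {
        size_t h = hash<uint64_t>()(rp.id);
        // Mix in the second field (boost-style hash_combine).
        h ^= hash<uint32_t>()(rp.pos) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};
}
```

With the specialization in place the type can be used directly as a key in std::unordered_set or std::unordered_map.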

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
af70c0a709 db::commitlog: truncate segments to their actual sizes during shutdown
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:48 -05:00
Vlad Zolotarov
033af6c950 db::commitlog: allow defining a metrics category name
Add a new field to db::commitlog::config that would define the metrics category name.
If not given - metrics are not going to be registered.
Set it to "commitlog" in db::commitlog::config(const db::config&).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
878d58d23a db/commitlog/commitlog::descriptor: add a filename_prefix parameter
This parameter is used when creating a new segment.
Its default value is descriptor::FILENAME_PREFIX.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
719b1fb24f db::commitlog::descriptor::descriptor(filename): pass a filename as a const ref
Avoid an unneeded copy by passing the file name as a reference.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Vlad Zolotarov
1ddb6e6509 docs: hinted_handoff_design.md: high level design of a Hinted Handoff feature
Hinted Handoff is a feature that allows replaying failed writes.
The mutation and the destination replica are saved in a log and replayed
later according to the feature configuration.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:05:47 -05:00
Raphael S. Carvalho
b5ace682a4 tests: sstable summary recreation sanity test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-14 16:59:36 -02:00
Raphael S. Carvalho
cdfa4d5c0d sstables: make loading of sstable without summary to work again
Boot failed when loading an sstable with a missing summary because an
internal procedure failed to take into account that an sstable
can have its summary recreated from the index. Make it work again
by making that procedure aware of that.

Fixes #3057.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-14 16:59:22 -02:00
Raphael S. Carvalho
7c6a19fcc8 sstables: fix summary generation with dynamic index sampling
When recreating summary, data length was passed as data offset to
procedure that decides whether to sample or not. The problem is
that the procedure decides to sample index entry if data offset
is beyond a threshold. So the resulting summary will contain
only N sequential index entries starting from the first one,
which makes it quite inefficient. What should be done instead
is to pass position of current index entry, so summary content
will be as if it was created by a regular sstable write.
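
The effect can be reproduced with a toy model. The sampler below is a hypothetical simplification (a fixed byte interval between samples, with made-up names), not the real dynamic sampling code, but it shows how handing in the total data length for every entry makes the first few consecutive entries get sampled and nothing after them:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sampler: an index entry is sampled once the offset handed
// in reaches the next threshold, which then advances by a fixed interval.
class toy_summary_sampler {
    std::uint64_t _interval;
    std::uint64_t _next_threshold = 0;
public:
    explicit toy_summary_sampler(std::uint64_t interval) : _interval(interval) {}

    bool maybe_sample(std::uint64_t offset) {
        if (offset < _next_threshold) {
            return false;
        }
        _next_threshold += _interval;
        return true;
    }
};

// Returns the 0-based indexes of the entries that end up in the summary.
// With buggy_pass_total_length set, every call sees the same large offset.
std::vector<std::size_t> sampled_entries(const std::vector<std::uint64_t>& entry_positions,
                                         std::uint64_t interval,
                                         bool buggy_pass_total_length,
                                         std::uint64_t total_length) {
    toy_summary_sampler sampler(interval);
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < entry_positions.size(); ++i) {
        std::uint64_t offset = buggy_pass_total_length ? total_length : entry_positions[i];
        if (sampler.maybe_sample(offset)) {
            out.push_back(i);
        }
    }
    return out;
}
```

For eight entries at positions 0, 10, ..., 70 with an interval of 30, passing each entry's own position samples entries 0, 3 and 6, while passing a constant length of 80 samples entries 0, 1 and 2.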

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-14 16:34:00 -02:00
Piotr Jastrzebski
ceaf0dee99 Introduce row_cache::make_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-14 12:49:39 +01:00
Piotr Jastrzebski
ac1d2f98e4 Fix build by removing semicolon after concept
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4504cf47be0a451c58052476bc8cc4f9cba59472.1513248094.git.piotr@scylladb.com>
2017-12-14 10:46:13 +00:00
Raphael S. Carvalho
95d1995876 fix compilation of stream_session.cc
stream_session.cc:417:62: error: cannot call member function ‘utils::UUID streaming::stream_session::plan_id()’ without object
         sslog.warn("[Stream #{}] Failed to send: {}", plan_id(), ep);

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171214022621.19442-1-raphaelsc@scylladb.com>
2017-12-14 10:57:33 +01:00
Amos Kong
b07de93636 Reset default cluster_name back to 'Test Cluster' for compatibility
Some users rely on the original default cluster_name 'Test Cluster';
their nodes will fail to start because of the cluster_name change if they use
the new scylla.yaml.

'ScyllaDB Cluster' isn't more beautiful than 'Test Cluster', so reset it back
to the original value to avoid problems for users.

Fixes #3060

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <8c9dab8a64d0f4ab3a5d6910b87af696c60e5076.1513072453.git.amos@scylladb.com>
2017-12-13 16:57:43 +02:00
Avi Kivity
6cb3b29168 Merge "Convert sstable readers to flat streams" from Paweł
"While aa8c2cbc16 'Merge "Migrate sstables
to flat_mutation_reader" from Piotr' has converted the low-level sstable
reader to the new flat_mutation_reader interface there were still
multiple readers related to sstables that required converting,
including:
 - restricted reader
 - filtering reader
 - single partition sstable reader
This series completes their conversion to the flat stream interface."

* tag 'flat_mutation_reader-sstable-readers/v2' of https://github.com/pdziepak/scylla:
  db: convert single_key_sstable_reader to flat streams
  db: fully convert incremental_reader_selector to flat readers
  db: make make_range_sstable_reader() return flat reader
  db: make column_family::make_reader() return flat reader
  db: make column_family::make_sstable_reader() return a flat reader
  filtering_reader: switch to flat mutation fragment streams
  filtering_reader: pass a const dht::decorated_key& to the callback
  mutation_reader: drop make_restricted_reader()
  db: use make_restricted_flat_reader
  mutation_reader: convert restricted reader to flat streams
2017-12-13 15:37:26 +02:00
Paweł Dziepak
8e0da776ab db: convert single_key_sstable_reader to flat streams
Before flat mutation readers sstable::read_row() returned a
future<streamed_mutation>. That required a helper reader that would wait
for the streamed_mutations from all relevant sstables to be created and
then construct a mutation merger.
With flat mutation readers sstable::read_row_flat() returns a
flat_mutation_reader (no futures) so that the code can be simplified by
collecting all the relevant readers and creating a combined reader
without suspension points.
The unfortunate disadvantage of the flat_mutation_reader-based approach
is the fact that combined reader now needlessly compares the partition
keys even though we know that we read only a single partition, but
optimising that is out of scope of this patch.
2017-12-13 12:01:03 +00:00
Paweł Dziepak
24026a0c7d db: fully convert incremental_reader_selector to flat readers 2017-12-13 12:01:03 +00:00
Paweł Dziepak
73b3d02cc0 db: make make_range_sstable_reader() return flat reader 2017-12-13 12:01:03 +00:00
Paweł Dziepak
8b3c3fc832 db: make column_family::make_reader() return flat reader 2017-12-13 12:01:03 +00:00
Paweł Dziepak
e12959616c db: make column_family::make_sstable_reader() return a flat reader 2017-12-13 12:01:03 +00:00
Paweł Dziepak
a0a13ceb46 filtering_reader: switch to flat mutation fragment streams 2017-12-13 12:01:03 +00:00
Paweł Dziepak
3bbb3b300d filtering_reader: pass a const dht::decorated_key& to the callback
All users of the filtering reader need only the decorated key of a
partition, but currently the predicate is given a reference to
streamed_mutations which are obsolete now.
2017-12-13 11:57:27 +00:00
Paweł Dziepak
d8dad04564 mutation_reader: drop make_restricted_reader()
make_restricted_reader() has been replaced by
make_restricted_flat_reader().
2017-12-13 11:57:22 +00:00
Paweł Dziepak
f3901eb154 db: use make_restricted_flat_reader 2017-12-13 10:46:41 +00:00
Paweł Dziepak
3839bc5d60 mutation_reader: convert restricted reader to flat streams 2017-12-13 10:46:41 +00:00
Asias He
a9dab60b6c streaming: One cf per time on sender
When there is a large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family gets 1/N of it for
its memtable, where N is the number of in-flight column families. A large N
causes a lot of small sstables to be generated.

There may be multiple senders to a single receiver, e.g.,
when a new node joins the cluster, so the maximum number of in-flight column
families is the number of peer nodes. The column families are sent in
cf_id order. It is not guaranteed that all peers proceed at the same speed and
therefore send the same cf_id at the same time, but there is still a
chance that some of the peers are sending the same cf_id.
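
The arithmetic behind the small-sstable problem, as a sketch (memtable_budget_per_cf is a hypothetical helper, and integer division stands in for whatever rounding the real accounting uses):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Memory available to each in-flight column family's streaming memtable:
// 20% of shard memory, split evenly across N in-flight column families.
std::uint64_t memtable_budget_per_cf(std::uint64_t shard_memory, std::size_t in_flight_cfs) {
    assert(in_flight_cfs > 0);
    return shard_memory / 5 / in_flight_cfs;
}
```

With 10 GiB per shard, a single in-flight column family gets 2 GiB, while 100 in-flight column families get roughly 20 MiB each, so each flush produces a correspondingly small sstable.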

Fixes #3065

Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
2017-12-13 12:32:41 +02:00
Glauber Costa
1aabbc75ab database: delete created SSTables if streaming writes fail
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.

We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same issue.

Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.

Fixes #3062

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
2017-12-13 10:09:20 +02:00
Avi Kivity
73428c96bd Merge "Refined security model for roles" from Jesse
"This patch series refines the security model for the upcoming switch to
roles-based access control. Roles still do not have any function,
but CQL statements related to roles manipulate metadata. The next major
patch series after this one will switch the system to roles.

Previously, most operations around roles required superuser, but this
violates an important idea in security called the "principle of least
privilege": that a user should have only the minimum access possible to
resources in order to achieve their objective.

To that end, this patch series introduces permissions on role resources.
For example, to grant a role to a user, the performing user must have
been granted AUTHORIZE on the role being granted.

In the table below, a user (role) that has been granted the permission
in the left-most column can perform the CQL query in the right columns
depending on if the permission has been granted to the root role
resource (all roles), or a particular role resource.

Perm.           All roles               Specific role (r)
---------------------------------------------------------
CREATE          CREATE ROLE

ALTER           ALTER ROLE *            ALTER ROLE r

DROP            DROP ROLE *             DROP ROLE r

AUTHORIZE       GRANT ROLE */REVOKE     GRANT ROLE r/
                ROLE *                  REVOKE ROLE r

DESCRIBE        LIST ROLES

The following restrictions around superuser exist:

- CREATE ROLE: Only a superuser can create a superuser role.

- ALTER ROLE: Only a superuser can alter the superuser status of a role.

- ALTER ROLE: You cannot alter the superuser status of yourself or of a
  role granted to you.

- DROP ROLE: Only a superuser can drop a role that has superuser.

The following additional "escape hatches" apply:

- ALTER ROLE: You can alter yourself (except to give yourself
  superuser).

- LIST ROLES: You can list your own roles and list the roles of any role
  granted to you.

Finally, a note on terminology: I like to say a role (or user) "is"
superuser if the role (user) has directly been marked as a superuser. A
role (user) "has" superuser if they have been granted a role that is a
superuser. The second statement encompasses the first, since a role can
always be said to have been granted to itself.
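
The "is" versus "has" distinction can be sketched as a reachability check over role grants. The toy_role_graph type below is hypothetical and is not the auth service API; it only models the terminology.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of roles and grants.
struct toy_role_graph {
    std::set<std::string> superusers;                         // roles that "are" superuser
    std::map<std::string, std::vector<std::string>> granted;  // role -> roles granted to it

    // A role "is" superuser if it has directly been marked as one.
    bool is_superuser(const std::string& role) const {
        return superusers.count(role) > 0;
    }

    // A role "has" superuser if it, or any role reachable through grants,
    // is a superuser. A role counts as granted to itself.
    bool has_superuser(const std::string& role) const {
        std::set<std::string> visited;
        std::vector<std::string> pending{role};
        while (!pending.empty()) {
            std::string r = pending.back();
            pending.pop_back();
            if (!visited.insert(r).second) {
                continue;   // already examined; also guards against cycles
            }
            if (is_superuser(r)) {
                return true;
            }
            auto it = granted.find(r);
            if (it != granted.end()) {
                pending.insert(pending.end(), it->second.begin(), it->second.end());
            }
        }
        return false;
    }
};
```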

Fixes #2988."

* 'jhk/role_permissions/v2' of https://github.com/hakuch/scylla: (24 commits)
  auth: Move permissions cache instance to service
  auth: Add roles query function to service
  cql3: Update access checks for `revoke_role_statement`
  cql3: Update access checks in `grant_role_statement`
  cql3: Update access checks in `list_roles_statement`
  cql3: Update access checks in `drop_role_statement`
  cql3: Update access checks in `alter_role_statement`
  cql3: Update access checks in `create_role_statement`
  tests: Switch to dedicated testing superuser
  auth: Publicize enforcing check for service
  tests: Expose client state from test env
  Allow checking permissions from `client_state`
  auth: Support querying for granted superuser
  auth/service.hh: Document the class
  cql3: Change `create_role_statement` base
  cql3/Cql.g: Add role resources to grammar
  cql3/Cql.g: Avoid extra copy of `auth::resource`
  auth:resource.cc: Use `string_view` in reverse map
  auth: Add `role` resource kind
  auth: Add the DESCRIBE permission
  ...
2017-12-12 19:52:10 +02:00
Jesse Haber-Kucharsky
092f2e659c auth: Move permissions cache instance to service
Instead of a single sharded service shared by all instances of
`auth::service`, it makes more sense for each instance of
`auth::service` to own its own instance of the permissions cache.
2017-12-12 12:22:46 -05:00
Jesse Haber-Kucharsky
59911411ed auth: Add roles query function to service
While it just calls into the underlying role manager, this level of
indirection allows us to add a roles cache in the future (which is
consistent with the behavior of Apache Cassandra).
2017-12-12 12:22:42 -05:00
Jesse Haber-Kucharsky
fff120a2be cql3: Update access checks for revoke_role_statement
A role can be revoked from another role if the user has AUTHORIZE
permission on the role being revoked.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
de05baafd2 cql3: Update access checks in grant_role_statement
A role can be granted to another role if the user has AUTHORIZE on the
role being granted.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
749575bbf6 cql3: Update access checks in list_roles_statement
A user with DESCRIBE on the root role resource can list the roles granted
to any role, and also all the roles in the system.

Otherwise, a user can list all the roles it has been granted and can
list all roles granted to those roles.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
4618766431 cql3: Update access checks in drop_role_statement
A role can be dropped if the performer has DROP permission on the role.
A role that has superuser (either directly or through another role
it has been granted) cannot be dropped except by a superuser.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
9f4281cc77 cql3: Update access checks in alter_role_statement
Only superusers can alter superuser status, but only to roles not
granted to them. You can always alter your own role. You can alter
another role if you have ALTER permission on the role.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
fe6e9fe923 cql3: Update access checks in create_role_statement
CREATE ROLE requires CREATE on <ALL ROLES>. Creating a superuser role
requires that the performer is a superuser.

This change also forms the beginning of a test suite for the CQL
interface to roles. We start with verifying access-control properties of
CREATE ROLE as written in this patch.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
10d3dab9ac tests: Switch to dedicated testing superuser
The auth service will eventually add the default
superuser ("cassandra"), but the current code does so after a delay.
Using a dedicated superuser for unit tests side-steps the issue and
allows the user to be created immediately.
2017-12-12 12:07:11 -05:00
Jesse Haber-Kucharsky
56d84d4e26 auth: Publicize enforcing check for service 2017-12-12 12:06:49 -05:00
Jesse Haber-Kucharsky
af670328e1 tests: Expose client state from test env
This is useful for manipulating and querying the current user.
2017-12-12 12:03:01 -05:00
Jesse Haber-Kucharsky
6f9df19eb8 Allow checking permissions from client_state
Previously, this function was private and only `ensure_has_permission`
was public. `ensure_has_permission` throws in the absence of a
permission, but it can also be useful to query a permission without it
being an error.
2017-12-12 12:03:01 -05:00
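The difference between the throwing and the non-throwing check can be sketched as follows. This is a minimal illustration, not the Scylla `client_state` class: the type `client_state_sketch`, the method name `check_has_permission`, and the string-based permission set are all assumptions made for the example. Only `ensure_has_permission` is a name taken from the commit message.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for client_state; permissions are plain strings here.
class client_state_sketch {
    std::set<std::string> _granted;
public:
    explicit client_state_sketch(std::set<std::string> granted)
        : _granted(std::move(granted)) {}

    // Non-throwing query: the absence of a permission is an ordinary result.
    bool check_has_permission(const std::string& p) const {
        return _granted.count(p) != 0;
    }

    // Throwing variant: the absence of a permission is an error.
    void ensure_has_permission(const std::string& p) const {
        if (!check_has_permission(p)) {
            throw std::runtime_error("missing permission: " + p);
        }
    }
};
```

A statement that merely adapts its behavior to the caller's permissions would use the query form, while a statement that must be rejected outright would use the throwing form.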
Jesse Haber-Kucharsky
7339295969 auth: Support querying for granted superuser
This functionality is useful for implementing CQL statements and will
replace `auth::is_super_user` once roles have replaced users in Scylla.

Since eventually the auth service will have a roles cache, this function
is here rather than a part of `role_manager`.
2017-12-12 12:02:38 -05:00
Jesse Haber-Kucharsky
56e1f2e30f auth/service.hh: Document the class 2017-12-12 11:24:44 -05:00
Jesse Haber-Kucharsky
daea70abe3 cql3: Change create_role_statement base
It is an `authentication_statement`, not an
`authorization_statement` (really, it's neither, but we're being
consistent with Apache Cassandra).
2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
a0cffead69 cql3/Cql.g: Add role resources to grammar 2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
4ae6b02572 cql3/Cql.g: Avoid extra copy of auth::resource 2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
77da3c4496 auth/resource.cc: Use string_view in reverse map
This avoids unnecessary copies.
2017-12-12 11:06:49 -05:00
Jesse Haber-Kucharsky
0546007fb5 auth: Add role resource kind 2017-12-12 11:06:35 -05:00
Jesse Haber-Kucharsky
9452533230 auth: Add the DESCRIBE permission
When a user is granted DESCRIBE on all roles (a resource kind that
doesn't exist yet in the code, but will exist soon), they gain the
ability to execute LIST ROLES queries.
2017-12-12 10:59:26 -05:00
Jesse Haber-Kucharsky
d29463beba auth: Support resource-specific permission sets
Different kinds of resources support different permissions. For example,
a keyspace supports the CREATE permission, which allows a user to
create tables in that keyspace. However, a table does not have an
applicable CREATE permission.

If a non-applicable permission is requested, an
`invalid_request_exception` is thrown.
2017-12-12 10:59:26 -05:00
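The idea of resource-specific permission sets can be sketched as below. This is an illustration under stated assumptions, not the code from the patch: the `permission` subset, the `data_level` enum, the bit-flag encoding, and the use of `std::invalid_argument` in place of `invalid_request_exception` are all hypothetical. Only the keyspace/table CREATE example comes from the commit message.

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative subset of permissions, encoded as bit flags.
enum class permission : uint8_t {
    create = 1u << 0,
    alter  = 1u << 1,
    select = 1u << 2,
};

using permission_set = uint8_t;

constexpr permission_set bit(permission p) {
    return static_cast<permission_set>(p);
}

// Hypothetical levels of a data resource.
enum class data_level { keyspace, table };

// Each level has its own applicable set: CREATE applies to a keyspace
// (it allows creating tables in it) but not to an individual table.
constexpr permission_set applicable_permissions(data_level level) {
    return level == data_level::keyspace
        ? static_cast<permission_set>(bit(permission::create) | bit(permission::alter) | bit(permission::select))
        : static_cast<permission_set>(bit(permission::alter) | bit(permission::select));
}

// Stand-in for the validation step; std::invalid_argument substitutes
// for invalid_request_exception in this sketch.
inline void ensure_applicable(data_level level, permission p) {
    if ((applicable_permissions(level) & bit(p)) == 0) {
        throw std::invalid_argument("permission is not applicable to this resource");
    }
}
```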
Avi Kivity
e6940d8d4a Merge "Gossip propagation and stabilization" from Calle
"Fixes #2866
Fixes #2894

Changes gossip propagation to allow "atomic" grouping of values to ensure
their respective order.
Modifies gossip bootstrap startup to potentially wait longer in cases
where stabilization (messages done) takes time, to avoid data loss
in repair."

* 'calle/gossip' of github.com:scylladb/seastar-dev:
  gossip: wait for stabilized gossip on bootstrap
  gossiper: Prevent race condition in propagation
  utils::to_string: Add printers for pairs+maps
  utils::in: Add helper type for perfect forwarding initializer lists
2017-12-12 17:59:00 +02:00
Jesse Haber-Kucharsky
eb0de39c98 auth/resource.hh: Use Doxygen-style formatting
Though we're still selective about its application.
2017-12-12 10:45:26 -05:00
Jesse Haber-Kucharsky
b986f48960 auth: Remove ALL_DATA permission set
This set is equal to `permissions::ALL`. When we switch over to
resource-specific permission sets, we will filter the set of all
permissions to only those that are applicable for the resource in
question.
2017-12-12 10:30:19 -05:00
Jesse Haber-Kucharsky
b14dc07f14 auth: Move particular permission set to caller
Applicable permission sets will soon be specific to each kind of
resource. This change prepares us for dynamic querying of permission
sets by resource.
2017-12-12 10:30:19 -05:00
Avi Kivity
eda35d2a57 Merge seastar upstream
* seastar ac78eec...2b23547 (10):
  > Merge "update shares for I/O classes" from Glauber
  > Merge "Resumable tasks" from Avi
  > input_stream: un-unroll input_stream::consume()
  > net: adding yaml-based parser for network configuration supporting multiple interfaces
  > scripts: perftune.py: don't attempt to set IRQs' affinity when IRQs list is empty
  > tutorial: fix example code
  > http: api_docs add swagger 2.0 support
  > Support custom function for reading of config-files.
  > Revert "provide an interface for updating the shares of an I/O class"
  > provide an interface for updating the shares of an I/O class
2017-12-12 11:00:38 +02:00
Michael Munday
b68b82dc8d tests: loading_cache_test: align DMA buffers
DMA reads and writes require that data be correctly aligned.

Message-Id: <20171211130202.77608-1-mike.munday@ibm.com>
2017-12-11 15:04:26 +02:00
Michael Munday
aea5f3bd1c sstables: fix compression on big endian systems
The encoding logic was incorrect for big endian systems (shift needed
to be in the opposite direction). Rather than fix that issue I have
re-written the relevant code to restrict the storage format to little
endian byte order on all systems. My hope is that this will be a bit
easier to maintain.

Message-Id: <20171211124454.77488-1-mike.munday@ibm.com>
2017-12-11 14:54:22 +02:00
Michael Munday
9e99105aa2 configure.py: use default system linker if gold is not available
Most distros on s390x don't currently have gold installed by default.
Rather than disable gold on the platform add a check to see if gold
is installed and switch back to using the default system linker if it
isn't. The try_compile_and_link functionality is copied from the
seastar project.

Message-Id: <20171211122156.77385-1-mike.munday@ibm.com>
2017-12-11 14:29:43 +02:00
Paweł Dziepak
d10b74b9cf Merge "Preparatory changes before changing semantics of continuity merging" from Tomasz
"The changes in this series fall into one of the following:
  1) improve unit tests
  2) improve code reuse in mvcc so that later changes will be easier
  3) fix minor issues which were exposed by the above"

* tag 'tgrabiec/improve-and-fix-mvcc-tests-v4' of github.com:scylladb/seastar-dev:
  tests: mvcc: Add more tests for consistency of continuity merging
  tests: mvcc: Fix test_apply_is_atomic()
  tests: mvcc: Do not assume that continuity of current row is updated on partition_snapshot_row_cursor::maybe_refresh()
  mvcc: Reuse partition_snapshot_row_cursor in apply_to_incomplete()
  mvcc: Propagate region reference to partition_entry::apply_to_incomplete()
  mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_if_complete()
  mvcc: partition_snapshot_row_cursor: Extract prepare_heap()
  mvcc: Add const-qualified partition_version_ref::operator*()
  tests: mvcc: Use mutation_partition_assertions
  tests: Introduce mutation_partition_assertions
  tests: Randomize static row continuity in random_mutation_generator
  tests: mutation_assertion: Introduce is_continuous()
  mvcc: Introduce partition_snapshot_row_cursor::read_partition()
  mutation_partition: Introduce deletable_row::apply() from a clustering_row fragment
  mutation_partition: Extract sliced() from mutation into mutation_partition
  mvcc: Introduce partition_snapshot::static_row_continuous()
  mvcc: Introduce partition_snapshot::range_tombstones() for full range
  mvcc: Don't require external schema in partition_snapshot::range_tombstones()
  mutation_partition: Define equal_continuity() using get_continuity()
  mutation_partition: Make check_continuity() const-qualified
  mutation_partition: Make check_continuity() public
  mutation_partition: Introduce mutation_partition::get_continuity()
  Introduce clustering_interval_set
  mutation_partition: Leave moved-from row in an empty state
  mutation_partition: Fix upgrade() not preserving static row continuity
2017-12-11 09:31:00 +00:00
Amnon Heiman
bc356a3c15 scylla_setup support private repo on debian during setup
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917145248.19677-1-amnon@scylladb.com>
2017-12-11 10:36:30 +02:00
Jesse Haber-Kucharsky
7e3a344460 cql3: Add missing return
Since `return` is missing, the "else" branch is also taken and this
results in a user being created from scratch.

Fixes #3058.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bf3ca5907b046586d9bfe00f3b61b3ac695ba9c5.1512951084.git.jhaberku@scylladb.com>
2017-12-11 09:55:05 +02:00
Avi Kivity
b29b091f4e Merge "Power8 porting" from Vlad
"This series includes a few patches from Michael Munday <mike.munday@ibm.com> (Z-project)
and a few from me. The most significant is PATCH10 that introduces a vectorized version
of CRC32 calculation (based on Anton Blanchard's work)."

* 'scylla-power64-port-v2-1' of https://github.com/vladzcloudius/scylla:
  test.py: limit the tests to run on 2 shards with 4GB of memory
  tests: sstable_datafile_test: fix the compilation error on Power
  tests: compound_test: fix the 'narrowing' compilation error on Power
  cql3::constants::literal: fix the empty string parser
  utils::crc32: add power64 crc32 HW accelerated implementation
  repair: use seastar::cache_line_size for aligning to the cache line size
  build: add -lcryptopp to libs
  utils/allocation_strategy: force alignment to be at least sizeof(void*)
  utils::crc: introduce process_le/be(T) methods
  utils/crc: use zlib for crc32 on non-x86 platforms
  main: only perform SSE 4.2 check on x86-family CPUs
  configure.py: don't use 'gold' linker on Power
  configure.py: add --target flag to override -march value
2017-12-08 20:48:41 +02:00
Vlad Zolotarov
57a6ed5aaa test.py: limit the tests to run on 2 shards with 4GB of memory
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
22ca5d2596 tests: sstable_datafile_test: fix the compilation error on Power
'char' and int8_t ('signed char') are different types. 'bytes' base type
is int8_t - use the correct type for casting.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
6a51e6fe33 tests: compound_test: fix the 'narrowing' compilation error on Power
'bytes' has int8_t as a base type and 0xff value is out of this type's range.
Use the corresponding signed value instead.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
3ebaf86ebc cql3::constants::literal: fix the empty string parser
Don't assume that 'char' is signed; this is implementation-dependent.
Compare to the '\xFF' value, which is the actual intent.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
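Why the comparison against '\xFF' is portable can be sketched like this. The helper `is_single_ff` is hypothetical and is not the parser code from the patch; it only shows a comparison that behaves the same whether plain `char` is signed or unsigned.

```cpp
#include <string_view>

// Hypothetical helper: true if `s` consists of exactly one byte 0xFF.
inline bool is_single_ff(std::string_view s) {
    // Both operands have type char, so they are promoted identically.
    // Comparing s[0] against -1 would only match where char is signed,
    // and comparing against 0xFF would only match where it is unsigned.
    return s.size() == 1 && s[0] == '\xFF';
}
```

On Linux for Power, plain `char` is unsigned by default, which is how an assumption of signedness can go unnoticed on x86.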
Vlad Zolotarov
0145ae2b4b utils::crc32: add power64 crc32 HW accelerated implementation
Based on the work of Anton Blanchard <anton@au.ibm.com>, IBM that may be found
here: https://github.com/antonblanchard/crc32-vpmsum

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Vlad Zolotarov
97506f39b2 repair: use seastar::cache_line_size for aligning to the cache line size
Use seastar::cache_line_size for cache line alignment instead of a hard coded value (64) - this value is
not always correct, e.g. PPC64 platform, where cache line size is 128B.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
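The effect of aligning to a platform-dependent cache line size rather than a fixed 64 can be sketched as follows. To stay self-contained, this example does not use `seastar::cache_line_size`; the constant `cache_line_size_sketch` and the type `padded_counter` are hypothetical, and the 128-byte value for 64-bit Power follows the commit message.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-platform constant standing in for seastar::cache_line_size.
#if defined(__powerpc64__)
constexpr std::size_t cache_line_size_sketch = 128;
#else
constexpr std::size_t cache_line_size_sketch = 64;
#endif

// A counter padded to a full cache line so that neighbouring instances
// never share a line, whatever the platform's line size is.
struct alignas(cache_line_size_sketch) padded_counter {
    uint64_t value = 0;
};

static_assert(alignof(padded_counter) == cache_line_size_sketch,
              "padded_counter must be aligned to the cache line size");
```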
Tomasz Grabiec
e81a4476c8 tests: mvcc: Add more tests for consistency of continuity merging 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
3b6167b4c4 tests: mvcc: Fix test_apply_is_atomic()
partition_entry::apply() requires that mutations are fully continuous.
2017-12-08 17:50:48 +01:00
Tomasz Grabiec
33c1f33c90 tests: mvcc: Do not assume that continuity of current row is updated on partition_snapshot_row_cursor::maybe_refresh()
It currently is updated only when iterators are invalidated. Better
to not assume that, because it's not really needed, and
maintaining this would complicate maybe_refresh() after continuity
merging rules change later.
2017-12-08 17:50:48 +01:00
Tomasz Grabiec
4094c66979 mvcc: Reuse partition_snapshot_row_cursor in apply_to_incomplete()
Reduces duplication of knowledge about how logical mutation_partition
view is obtained for multiple versions.
2017-12-08 17:50:48 +01:00
Tomasz Grabiec
12704fd679 mvcc: Propagate region reference to partition_entry::apply_to_incomplete() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
376033af13 mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_if_complete() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
8e9f8d93ef mvcc: partition_snapshot_row_cursor: Extract prepare_heap() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
a6e083ef6f mvcc: Add const-qualified partition_version_ref::operator*() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
230ca7d01b tests: mvcc: Use mutation_partition_assertions 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
c7539f2ed0 tests: Introduce mutation_partition_assertions
mutation_assertions are now delegating to mutation_partition_assertions.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
0ddb419eca tests: Randomize static row continuity in random_mutation_generator 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
a3f9799d70 tests: mutation_assertion: Introduce is_continuous() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
05a19737e4 mvcc: Introduce partition_snapshot_row_cursor::read_partition()
Useful in tests.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
8e8ece5dec mutation_partition: Introduce deletable_row::apply() from a clustering_row fragment 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
b3709047b0 mutation_partition: Extract sliced() from mutation into mutation_partition
So that we can call it on mutation_partition.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
b26ce36d4b mvcc: Introduce partition_snapshot::static_row_continuous() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
c283744fcb mvcc: Introduce partition_snapshot::range_tombstones() for full range 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
df964c70f8 mvcc: Don't require external schema in partition_snapshot::range_tombstones() 2017-12-08 17:50:47 +01:00
Michael Munday
8df2afc255 build: add -lcryptopp to libs
Not sure why this is necessary on s390x but not x86.
2017-12-08 10:12:41 -05:00
Michael Munday
18c0ab539e utils/allocation_strategy: force alignment to be at least sizeof(void*)
The alignment of packed structs can be 1. The system¹ posix_memalign
function will return EINVAL when passed this alignment. This fix
forces the alignment to be at least sizeof(void*).

¹ The seastar implementation of posix_memalign does not appear to
  have this limitation currently.
2017-12-08 10:12:41 -05:00
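The fix can be sketched as a clamp applied before calling `posix_memalign`. The wrapper `aligned_alloc_at_least_ptr` is hypothetical and is not the actual allocation_strategy code; it assumes a POSIX system and a power-of-two requested alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical wrapper: allocate `size` bytes with at least the requested
// alignment. Returns nullptr on failure; release the result with free().
inline void* aligned_alloc_at_least_ptr(std::size_t alignment, std::size_t size) {
    // The requested alignment is expected to be a power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // posix_memalign requires the alignment to be a multiple of
    // sizeof(void*); a packed struct may ask for 1, so clamp upwards.
    const std::size_t effective = std::max(alignment, sizeof(void*));
    void* p = nullptr;
    if (posix_memalign(&p, effective, size) != 0) {
        return nullptr;
    }
    return p;
}
```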
Michael Munday
5158b3f484 utils::crc: introduce process_le/be(T) methods
Replace the oblique process(T) overloads for integer types with
explicit process_le/be(T) methods that would interpret the given integer
as a stream of bytes using the corresponding endianness.

For instance

process_le(0x11223344) would treat this integer as the following array of bytes:
{0x44, 0x33, 0x22, 0x11}.

process_be(0x11223344) on the other hand would treat this integer as if it's
{0x11, 0x22, 0x33, 0x44}.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 10:12:21 -05:00
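The byte orders described above can be reproduced with a small sketch. The helpers `bytes_le` and `bytes_be` are hypothetical and are not part of `utils::crc`; they only show which byte sequence each interpretation yields, independent of the host's byte order.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: bytes of `v`, least significant byte first.
template <typename T>
std::array<uint8_t, sizeof(T)> bytes_le(T v) {
    static_assert(std::is_unsigned<T>::value, "unsigned integer expected");
    std::array<uint8_t, sizeof(T)> out{};
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out[i] = static_cast<uint8_t>(v >> (8 * i));
    }
    return out;
}

// Hypothetical helper: bytes of `v`, most significant byte first.
template <typename T>
std::array<uint8_t, sizeof(T)> bytes_be(T v) {
    static_assert(std::is_unsigned<T>::value, "unsigned integer expected");
    std::array<uint8_t, sizeof(T)> out{};
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out[i] = static_cast<uint8_t>(v >> (8 * (sizeof(T) - 1 - i)));
    }
    return out;
}
```

Because the helpers use shifts rather than reinterpreting memory, the result does not depend on whether the code runs on x86, Power, or s390x.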
Michael Munday
26b7c2622e utils/crc: use zlib for crc32 on non-x86 platforms
Ideally we should use the Castagnoli polynomial to match the SSE 4.2
crc32 instructions, but this works for now.
2017-12-08 09:47:50 -05:00
Michael Munday
f2be7d3e9e main: only perform SSE 4.2 check on x86-family CPUs
The check doesn't make sense on other architectures (e.g. s390x).
2017-12-08 09:47:50 -05:00
Vlad Zolotarov
03693de803 configure.py: don't use 'gold' linker on Power
'gold' linker is not a part of binutils on Power yet.
Let's not use it on Power.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 09:47:50 -05:00
Michael Munday
92d6a2b76c configure.py: add --target flag to override -march value
This is probably the simplest way to make the build work on other
architectures. --target can be set to an empty string to allow
the compiler's default to be used.

If --target is not set then the default is going to be 'nehalem' on
x86 machines and the compiler's default on all other platforms.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Signed-off-by: Michael Munday <mike.munday@ibm.com>
2017-12-08 09:47:50 -05:00
Tomasz Grabiec
5541c9fd63 mutation_partition: Define equal_continuity() using get_continuity()
This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
bde050835f mutation_partition: Make check_continuity() const-qualified 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
f9257886cb mutation_partition: Make check_continuity() public 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
865bd8a594 mutation_partition: Introduce mutation_partition::get_continuity()
Intended to be used in tests.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
7e5d243a95 Introduce clustering_interval_set
Will make it easy to represent and manipulate continuity in tests.

Could also replace clustering_row_ranges in the future, which is
currently a naked vector<> with no semantic methods.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
22138554e6 mutation_partition: Leave moved-from row in an empty state
Needed by apply_monotonically(). Fixes SIGSEGV in mutation_test_g.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
a305a28574 mutation_partition: Fix upgrade() not preserving static row continuity
We do not rely on this yet, but will.
2017-12-08 12:01:27 +01:00
Paweł Dziepak
051cbbc9af Merge "Fix range tombstone emitting which led to skipping over data" from Tomasz
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.

Introduced in 2.0

Fixes #3053."

* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add reproducer for issue #3053
  tests: mvcc: Add test for partition_snapshot::range_tombstones()
  mvcc: Optimize partition_snapshot::range_tombstones() for single version case
  mvcc: Fix partition_snapshot::range_tombstones()
  tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
2017-12-08 10:27:17 +00:00
Tomasz Grabiec
4cc4c661f3 tests: row_cache: Add reproducer for issue #3053
The issue is that partition_snapshot::range_tombstones() is
deoverlapping tombstones coming from different versions, and it may
happen that due to range tombstone splitting that function will return
a tombstone which starts after the requested range. This breaks
assumptions made by the cache reader. It keeps track of the maximum
fragment position, and if the cache reader then needs to read from
sstables due to a miss, it does so starting from the position
marked by that out-of-range tombstone, possibly skipping over some
rows.
2017-12-08 10:15:58 +01:00
Tomasz Grabiec
b6f4637aec tests: mvcc: Add test for partition_snapshot::range_tombstones() 2017-12-08 10:15:58 +01:00
Tomasz Grabiec
183554cbc4 mvcc: Optimize partition_snapshot::range_tombstones() for single version case 2017-12-08 10:15:58 +01:00
Tomasz Grabiec
1303320377 mvcc: Fix partition_snapshot::range_tombstones()
partition_snapshot::range_tombstones() is deoverlapping tombstones
coming from different versions and it may happen that due to range
tombstone splitting the method will return a tombstone which starts
after the requested range. This would cause it to return a tombstone
which doesn't overlap with the requested range.

This breaks assumptions made by the cache reader. It keeps track of the
maximum fragment position, and if the cache reader then needs to read
from sstables due to a miss, it does so starting from the position
marked by that out-of-range tombstone, possibly skipping over some
rows.

Exposed by a change in row_cache_test.cc::test_mvcc() which fills the
buffer of sm5 reader after it is created.

Fixes #3053.
2017-12-08 10:15:58 +01:00
Tomasz Grabiec
89e3b734ed tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
It is assumed that dummy entries are only at !is_clustering_row() positions.
Violating this causes cache_streamed_mutation to assert when trying to trim
a range tombstone.
2017-12-07 20:20:37 +01:00
Avi Kivity
d934ca55a7 Merge "SSTable resharding fixes" from Raphael
"Didn't affect any release. Regression introduced in 301358e.

Fixes #3041"

* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
  tests: add sstable resharding test to test.py
  tests: fix sstable resharding test
  sstables: Fix resharding by not filtering out mutation that belongs to other shard
  db: introduce make_range_sstable_reader
  rename make_range_sstable_reader to make_local_shard_sstable_reader
  db: extract sstable reader creation from incremental_reader_selector
  db: reuse make_range_sstable_reader in make_sstable_reader
2017-12-07 16:42:48 +02:00
Amos Kong
8fd5d27508 dist/debian: add scylla-tools-core to depends list
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <db39cbda0e08e501633556ab238d816e357ad327.1512646123.git.amos@scylladb.com>
2017-12-07 13:40:10 +02:00
Amos Kong
eb3b138ee2 dist/redhat: add scylla-tools-core to requires list
Fixes #3051

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f7013a4fbc241bb4429d855671fee4b845b255cd.1512646123.git.amos@scylladb.com>
2017-12-07 13:40:08 +02:00
Gleb Natapov
8f104bab5d storage_proxy: send negative write replies only when entire cluster supports the feature
Message-Id: <20171207102934.GM1885@scylladb.com>
2017-12-07 12:31:35 +02:00
Botond Dénes
1ff65f41fd mutation_reader_merger: don't query the kind of moved-from fragment
Call mutation_fragment_kind() on the fragment *before* it's moved, as
there are no guarantees about the state of a moved-from object (apart
from it being a valid one).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c47b1e22877bb9499f1fbb9d513093c29ef1901b.1512635422.git.bdenes@scylladb.com>
2017-12-07 10:40:31 +02:00
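The ordering issue can be sketched with a simplified fragment type. The names `fragment_sketch`, `fragment_kind`, and `push_and_report_kind` are hypothetical and do not correspond to the reader merger's actual code; the point is only that the kind is read before the object is moved.

```cpp
#include <string>
#include <utility>
#include <vector>

enum class fragment_kind { partition_start, clustering_row, partition_end };

// Hypothetical, simplified stand-in for a mutation fragment.
struct fragment_sketch {
    fragment_kind kind;
    std::string payload;
};

// Appends the fragment to `out` and returns its kind.
inline fragment_kind push_and_report_kind(std::vector<fragment_sketch>& out,
                                          fragment_sketch&& f) {
    // Read the kind first: after the move below, `f` is only guaranteed
    // to be in a valid but unspecified state.
    const fragment_kind kind = f.kind;
    out.push_back(std::move(f));
    return kind;
}
```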
Avi Kivity
060e5d3354 Merge "Improve time-series performance by not actually compacting fully expired tables" from Raphael
"In time-series, it's common for tables in a given time window to be eventually
fully expired. The deletion of such tables is done by compaction, but there's
*no* need to *actually* compact such fully expired sstables *iff* their full
deletion will not cause older data to be resurrected. In other words, a fully
expired table can be actually skipped (but deleted in the end) by compaction
*iff* it doesn't contain newer data than its overlapping counterparts. So there
may be false negatives, but never false positives.
All that said, the goal behind this patchset is to save read bandwidth of disk
in such scenarios. Given that fully expired sstables will not be read by
compaction process anymore, read amplification will be greatly reduced too.

Fixes #2620."

* 'time_series_performance_improvement_v2_2' of github.com:raphaelsc/scylla:
  tests: check sstable auto correct bad max deletion time
  tests: add test for compaction with fully expired table
  sstables/compaction: do not actually compact fully expired sstables
  sstables: make sstable auto correct max_local_deletion_time
  sstables: switch to const ref wherever possible
  sstables: use gc_clock::time_point for gc_before
  gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
  sstables: introduce sstable::get_max_local_deletion_time
  sstables: remove unnecessary copy in time series strategies
  sstables: change return value type of get_fully_expired_sstables
  dtcs: make code to extract non expired tables faster
  sstables: add has_correct_max_deletion_time to sstable
2017-12-07 10:29:31 +02:00
Avi Kivity
908daa67bd Merge "Generalize data_resource" from Jesse
"Soon we will have resources beyond just keyspaces and table names. There
will be resources for roles, for user-defined functions (UDFs), and
possibly resources for REST end-points. This change generalizes the
implementation of a `data_resource` to many different kinds of
resources, though there is still only one kind (`data`).

The most important patch is 2/5 ("auth/resource: Generalize to different
kinds"), which re-writes `auth::data_resource`. The patch message should
sufficiently explain the design decisions involved.

The other patches rename files and identifiers based on the expanded
role of this class, except for 5/5 ("auth/resource.hh: Rename
`resource_ids`"): this patch gives a more appropriate name to a type
alias.

Fixes #3027."

* 'jhk/generalize_resource/v3' of https://github.com/hakuch/scylla:
  auth/resource.hh: Rename `resource_ids`
  auth: Rename `data_resource` files
  cql3/authorization_statement: Fix typo
  auth/resource: Generalize to different kinds
  auth: Rename `data_resource` to `resource`
2017-12-07 10:25:58 +02:00
Botond Dénes
9fce51f8a0 Add streamed mutation fast-forwarding unit test for the flat combined-reader
Test for the bug fixed by 9661769.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <fc917bae8e9c99f026bf7b366e6e9d39faf466af.1512630741.git.bdenes@scylladb.com>
2017-12-07 09:45:12 +02:00
Raphael S. Carvalho
39f7404436 tests: add sstable resharding test to test.py
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:27 -02:00
Raphael S. Carvalho
fc193c29cf tests: fix sstable resharding test
wrong sstable was used when checking for content, and storage service
for test was missing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:27 -02:00
Raphael S. Carvalho
bad21ba444 sstables: Fix resharding by not filtering out mutation that belongs to other shard
After 301358e, sstable resharding stopped working because shared sstables
would use a filtering reader, which excludes mutations that belong to other
shards. That completely breaks resharding, which relies on compaction of
mutations that belong to different shards. The fix is to use the recently
introduced non-local shard reader.

Fixes #3041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:26 -02:00
Raphael S. Carvalho
f1b65a115a db: introduce make_range_sstable_reader
introduce reader variant that will allow its caller to read a range
in a given table without any filter applied.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:26 -02:00
Raphael S. Carvalho
d1b146baa6 rename make_range_sstable_reader to make_local_shard_sstable_reader
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:25 -02:00
Raphael S. Carvalho
3d725d6823 db: extract sstable reader creation from incremental_reader_selector
step closer to divorcing incremental_selector from sstables

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 01:53:16 -02:00
Raphael S. Carvalho
ab82bacddd db: reuse make_range_sstable_reader in make_sstable_reader
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 01:53:14 -02:00
Raphael S. Carvalho
5eef7371b3 tests: check sstable auto correct bad max deletion time
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
a86ee38638 tests: add test for compaction with fully expired table
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
809b30c4a2 sstables/compaction: do not actually compact fully expired sstables
There's no need to actually compact an sstable that is fully expired
when deleting all of its data will not resurrect older data.
For that, an sstable will only be considered fully expired if it
doesn't contain data newer than its overlapping counterparts.
That way, there could be a false negative, but never a false positive.
Currently, a fully expired sstable unnecessarily wastes disk read
bandwidth. This will greatly help time series workloads in
which data for a given time window is all deleted at once using TTL.

Fixes #2620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
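The skip condition described above can be sketched roughly as follows. This is a conservative reading of the commit message, not the code from the patch: the struct `sstable_sketch`, its fields, and the function `can_skip_compaction` are hypothetical, and timestamps are plain integers.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical summary of the per-sstable statistics used by the check.
struct sstable_sketch {
    int64_t max_local_deletion_time;  // latest time at which any data expires
    int64_t min_timestamp;            // oldest write timestamp in the sstable
    int64_t max_timestamp;            // newest write timestamp in the sstable
};

// True if `candidate` can be dropped without being read: everything in it
// has expired before `gc_before`, and nothing in it is newer than the data
// in any overlapping sstable (so dropping it cannot resurrect older data).
inline bool can_skip_compaction(const sstable_sketch& candidate,
                                const std::vector<sstable_sketch>& overlapping,
                                int64_t gc_before) {
    if (candidate.max_local_deletion_time >= gc_before) {
        return false;  // not fully expired yet
    }
    for (const auto& other : overlapping) {
        if (candidate.max_timestamp >= other.min_timestamp) {
            return false;  // may shadow older data; compact normally
        }
    }
    return true;
}
```

The comparison is deliberately pessimistic, which matches the stated property that the check may produce false negatives but never false positives.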
Raphael S. Carvalho
810e2ec3d9 sstables: make sstable auto correct max_local_deletion_time
sstables created prior to cc6c383 can contain bad max deletion time stat,
which would make get_fully_expired_sstables return sstables that aren't
actually fully expired. Let's make sstable invalidate the stat if it
is potentially incorrect.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d2ab154f12 sstables: switch to const ref wherever possible
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d916c8cdad sstables: use gc_clock::time_point for gc_before
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
1d0e6496ec gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:32 -02:00
Raphael S. Carvalho
fcdce38e7f sstables: introduce sstable::get_max_local_deletion_time
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:47:05 -02:00
Raphael S. Carvalho
18bdf496fe sstables: remove unnecessary copy in time series strategies
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:46:46 -02:00
Raphael S. Carvalho
45c11865fa sstables: change return value type of get_fully_expired_sstables
unordered_set will allow us to quickly extract fully expired tables
from a set of compacting sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:45:55 -02:00
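Why an `unordered_set` helps here can be sketched as below. The function `without_expired` and the use of integer ids in place of sstable handles are assumptions made for the example; the real interface of `get_fully_expired_sstables` is not shown in this log.

```cpp
#include <algorithm>
#include <iterator>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the candidates that are not in `expired`.
// Each membership test is O(1) on average, so the whole pass is O(n)
// on average and needs no sorting.
inline std::vector<int> without_expired(const std::vector<int>& candidates,
                                        const std::unordered_set<int>& expired) {
    std::vector<int> out;
    out.reserve(candidates.size());
    std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(out),
                 [&expired](int id) { return expired.count(id) == 0; });
    return out;
}
```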
Raphael S. Carvalho
4fe6fea758 dtcs: make code to extract non expired tables faster
since it's O(n) and not O(n log n).

change also needed for change in interface of function to retrieve
fully expired tables, or sort lambda would need to be parametrized.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:40:16 -02:00
Raphael S. Carvalho
11176324bd sstables: add has_correct_max_deletion_time to sstable
Commit cc6c38324 fixes the stat. Prior to the fix it was only updated for
range tombstones, so an sstable that had a regular cell with no expiration
time could be considered fully expired, which can lead to bad decisions in
compaction for time series workloads.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:40:05 -02:00
Jesse Haber-Kucharsky
aea262cdc4 auth/resource.hh: Rename resource_ids 2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
3cad18631d auth: Rename data_resource files
Now that there can be many kinds of resources, the old name doesn't fit.
2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
3665261a90 cql3/authorization_statement: Fix typo 2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
1bb22bb190 auth/resource: Generalize to different kinds
This change generalizes the implementation of a `resource` to many
different kinds of resources, though there is still only one
kind (`data`). In the future, we also expect resource kinds for roles,
user-defined functions (UDFs), and possibly on particular REST
end-points.

I considered several approaches to generalizing to different kinds of
resources.

One approach is to have a base class that is inherited from by different
resource kinds. The common functionality would be accessed through
virtual member functions and kind-specific functions would exist in
sub-classes. I rejected this approach because dealing with different
kinds of resources uniformly requires storage and life-time management
through something like `std::unique_ptr<auth::resource>`, which means
that we lose value semantics (including comparison) and must deal with
complications around ownership.

Another option was to use `boost::variant` (or, in future,
`std::variant`). This is closer to what we want, since there is a static
set of resource kinds that we support. I rejected this approach for two
reasons. The first is that all resource kinds share the same data (a
list of segments and a root identifier), which would be duplicated in
each type that composed the variant. The second is that the complexity
and source-code overhead of `boost::variant` didn't seem warranted.

The solution I ended up with is a home-grown variant. All resources are
described in the same `final` class: `auth::resource`. This class has
value semantics, supports equality comparison, and has a strict
ordering. All resources have in common a tag ("kind") and a list of
parts. Most operations on resources don't care about the kind of
resource (like getting its name, parsing a name, querying for the
parent, etc). These are just member functions of the class.

When we care about a kind-specific interpretation of a resource, we can
produce a "view" of the resource. For example, `data_resource_view`
allows for accessing the (optional) keyspace and table names.
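
As a rough sketch of the shape described above (not the actual code in
this patch): the member names, the '/'-joined name format, and the
layout of the parts are assumptions made for the example.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Illustrative sketch only; names and layout are assumptions.
enum class resource_kind { data };

class resource final {
    resource_kind _kind;
    std::vector<std::string> _parts; // assumed layout: {"data", "ks", "cf"}
public:
    resource(resource_kind kind, std::vector<std::string> parts)
        : _kind(kind), _parts(std::move(parts)) {}

    resource_kind kind() const { return _kind; }
    const std::vector<std::string>& parts() const { return _parts; }

    // Kind-agnostic operation: join the parts with '/'.
    std::string name() const {
        std::string out;
        for (const auto& p : _parts) {
            if (!out.empty()) {
                out += '/';
            }
            out += p;
        }
        return out;
    }

    // Value semantics: equality and a strict ordering over (kind, parts).
    friend bool operator==(const resource& a, const resource& b) {
        return std::tie(a._kind, a._parts) == std::tie(b._kind, b._parts);
    }
    friend bool operator<(const resource& a, const resource& b) {
        return std::tie(a._kind, a._parts) < std::tie(b._kind, b._parts);
    }
};

// Kind-specific, non-owning interpretation of a resource.
class data_resource_view final {
    const resource& _r;
public:
    explicit data_resource_view(const resource& r) : _r(r) {
        if (r.kind() != resource_kind::data) {
            throw std::invalid_argument("not a data resource");
        }
    }
    std::optional<std::string> keyspace() const {
        if (_r.parts().size() < 2) {
            return std::nullopt;
        }
        return _r.parts()[1];
    }
    std::optional<std::string> table() const {
        if (_r.parts().size() < 3) {
            return std::nullopt;
        }
        return _r.parts()[2];
    }
};
```

The view holds only a reference, so the resource it was built from must
outlive it.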

In the future, I anticipate adding functions for creating role
resources (`auth::resource::role`) and also `role_resource_view`.

The functional behaviour of the system should be unchanged with this
patch.

I've added new unit tests in `auth_resource_test.cc` and removed the old
test from `auth_test.cc`.

Fixes #3027.
2017-12-06 14:37:56 -05:00
Jesse Haber-Kucharsky
8fe53ecf78 auth: Rename data_resource to resource
The implementation and interface of `auth::resource` will change soon to
support different kinds of resources beyond just data (keyspaces and
tables).
2017-12-06 10:18:05 -05:00
Gleb Natapov
ddf117535a storage_proxy: add counters for speculative reads
Fixes #3030

Message-Id: <20171206143611.8756-1-gleb@scylladb.com>
2017-12-06 16:38:16 +02:00
Avi Kivity
ccc315bcfe Merge "storage_proxy: allow fail request earlier if CL cannot be reached due to errors" from Gleb
"This is CASSANDRA-7886 and CASSANDRA-8592. The patch series detects
that CL of a request can no longer be reached due to errors and fails
the request earlier. New types of errors are reported: read/write failures,
which were introduced in the cql v4 protocol. For compatibility, if an older
protocol is used, the error is translated to a timeout error."

* 'gleb/request-failure_v2' of github.com:scylladb/seastar-dev:
  storage_proxy: fail read/write requests early if it cannot be completed due to errors
  storage_service: add WRITE_FAILURE_REPLY_FEATURE feature
  gossiper: add node_has_feature() function
  cql: add read/write failure exceptions
  storage_proxy: fix data presence reporting in read timeout error during
  storage_proxy: remove inheritance from enable_shared_from_this for abstract_write_response_handler
  storage_proxy: remove unneeded field in abstract_write_response_handler
  storage_proxy: fix pending endpoint accounting for EACH_QUORUM
  consistency_level: constify quorum_for() and local_quorum_for()
2017-12-06 16:17:19 +02:00
Botond Dénes
9661769313 combined_mutation_reader: fix fast-forwarding related row-skipping bug
When fast forwarding is enabled and all readers positioned inside the
current partition return EOS, return EOS from the combined-reader
too, instead of skipping to the next partition if there are idle readers
(positioned at some later partition) available. Skipping ahead causes
rows to be skipped in some cases.

The fix is to distinguish EOS'd readers that are only halted (waiting
for a fast-forward) from those really out of data. To achieve this we
track the last fragment-kind the reader emitted. If that was a
partition-end then the reader is out of data, otherwise it might emit
more fragments after a fast-forward. Without this additional information
it is impossible to determine why a reader reached EOS and the code
later may make the wrong decision about whether the combined-reader as
a whole is at EOS or not.
Also when fast-forwarding between partition-ranges or calling
next_partition() we set the last fragment-kind of forwarded readers
because they should emit a partition-start, otherwise they are out of
data.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <6f0b21b1ec62e1197de6b46510d5508cdb4a6977.1512569218.git.bdenes@scylladb.com>
2017-12-06 16:09:05 +02:00
Takuya ASADA
aeb6ebce5a dist/debian: need apt-get update after installing GPG key for 3rdparty repo
We need apt-get update after installing the GPG key; otherwise we still get
an unauthenticated package error on Debian package build.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512556948-29398-1-git-send-email-syuu@scylladb.com>
2017-12-06 12:43:17 +02:00
Jesse Haber-Kucharsky
772b432345 auth: Copying role exceptions cannot throw
This is a small correctness change.

According to cppreference.com [1], derived classes of `std::exception`
are not permitted to throw exceptions when they are copied.

To satisfy this requirement for `auth::roles_argument_exception`, we
store exception members as `std::shared_ptr` which has a `noexcept` copy
ctor. Since exceptions can cross shards, we cannot use a
`seastar::shared_ptr`.
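
A minimal sketch of the technique, using a hypothetical `role_error`
type rather than the real `auth::roles_argument_exception`; the single
message member is an assumption made for the example:

```cpp
#include <cassert>
#include <exception>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical exception type; not the actual auth::roles_argument_exception.
class role_error : public std::exception {
    // Copying a shared_ptr only bumps a reference count and is noexcept,
    // whereas copying a std::string member could throw std::bad_alloc.
    std::shared_ptr<const std::string> _msg;
public:
    explicit role_error(std::string msg)
        : _msg(std::make_shared<const std::string>(std::move(msg))) {}

    const char* what() const noexcept override { return _msg->c_str(); }
};

// The implicit copy constructor is noexcept because the base class and
// the only member both have noexcept copy constructors.
static_assert(std::is_nothrow_copy_constructible<role_error>::value,
              "copying the exception must not throw");
```

The `static_assert` turns the requirement into a compile-time check, so
adding a throwing-copy member later would fail the build.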

This change is motivated by #3021.

[1] http://en.cppreference.com/w/cpp/error/exception/exception

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <7706df0c701b90e7cb309c84a86d9f813461e801.1512501024.git.jhaberku@scylladb.com>
2017-12-06 09:42:45 +01:00
Vladimir Krivopalov
1fc0c60fdc Support "CREATE TABLE WITH id" command.
Fixes #2059

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <92874a2bf1b4e79ef9f05875b3fa42804d17833c.1512508924.git.vladimir@scylladb.com>
2017-12-06 09:39:56 +01:00
Takuya ASADA
8f02967a3b dist/debian: install CA certificates before install repo GPG key
Since the pbuilder chroot environment does not install CA certificates by default,
accessing https://download.opensuse.org will cause a certificate verification
error.
So we need to install them before installing the 3rdparty repo GPG key.

Also, checking for the existence of gpgkeys_curl is not needed, since it is
never installed when the script runs in a clean chroot environment.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512517001-27524-1-git-send-email-syuu@scylladb.com>
2017-12-06 08:57:01 +01:00
Avi Kivity
3501c147b7 Merge "Use new recommended classes from JsonCpp instead of deprecated ones" from Vladimir
"This fix for the issue #2989 first adds unit tests for caching_options which
is the only class that uses the helpers from json.hh. This is done to
have regression tests in place for the main change.
The second commit adds conditional use of new recommended JsonCpp API
where available. For older versions of the library, it uses the old
code."

* 'issues/2989/v1' of https://github.com/argenet/scylla:
  Use CharReaderBuilder/CharReader and StreamWriterBuilder from JsonCpp.
  tests: Add unit tests for caching_options.
2017-12-06 09:11:40 +02:00
Avi Kivity
601a03dda7 Merge "Make sstable tests use flat_mutation_reader" from Paweł
"This series makes sstable tests use flat stream interface. The main
motivation is to allow eventual removal of mutation_reader and
streamed_mutation and ensuring that the conversion between the
interfaces doesn't hide any bugs that would be otherwise found."

* tag 'flat_mutation_reader-sstable-tests/v1' of https://github.com/pdziepak/scylla:
  sstables: drop read_range_rows()
  tests/mutation_reader: stop using read_range_rows()
  incremental_reader_selector: do not use read_range_rows()
  tests/sstable: stop using read_range_rows()
  sstables: drop read_row()
  tests/sstables: use read_row_flat() instead of read_row()
  database: use read_row_flat() instead of read_row()
  tests/sstable_mutation_test: get flat_mutation_readers from mutation sources
  tests/sstables: make sstable_reader return flat_mutation_reader
  sstable: drop read_row() overload accepting sstable::key
  tests/sstable: stop using read_row() with sstable::key
  tests/flat_mutation_reader_assertions: add has_monotonic_positions()
  tests/flat_mutation_reader_assertions: add produces(Range)
  tests/flat_mutation_reader_assertions: add produces(mutation)
  tests/flat_mutation_reader_assertions: add produces(dht::decorated_key)
  tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind)
  tests/flat_mutation_reader_assertions: fix fast forwarding
2017-12-05 18:10:43 +02:00
Vladimir Krivopalov
b35c2fe177 Attach backtrace to marshal_exception-s thrown from generic functions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <06ad18c3563855771dd3ea8d0ec99533642e1919.1511931828.git.vladimir@scylladb.com>
2017-12-05 16:14:55 +01:00
Paweł Dziepak
0d8f964a79 sstables: drop read_range_rows()
It has been deprecated by read_range_rows_flat().
2017-12-05 14:53:14 +00:00
Paweł Dziepak
0c50f113c8 tests/mutation_reader: stop using read_range_rows() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
ce9a890940 incremental_reader_selector: do not use read_range_rows() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
15ad148604 tests/sstable: stop using read_range_rows()
read_range_rows() is deprecated by read_range_rows_flat().
2017-12-05 14:53:14 +00:00
Paweł Dziepak
e739ad98e5 sstables: drop read_row() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
de8ebd6752 tests/sstables: use read_row_flat() instead of read_row() 2017-12-05 14:53:14 +00:00
Paweł Dziepak
bccca90207 database: use read_row_flat() instead of read_row() 2017-12-05 14:52:57 +00:00
Paweł Dziepak
582bacbd81 tests/sstable_mutation_test: get flat_mutation_readers from mutation sources 2017-12-05 14:52:32 +00:00
Paweł Dziepak
74e1c38f80 tests/sstables: make sstable_reader return flat_mutation_reader 2017-12-05 14:52:32 +00:00
Paweł Dziepak
7fce7a9e3a sstable: drop read_row() overload accepting sstable::key
sstable::key needs to be converted to a dht::decorated_key which needs
to be kept alive until the returned reader dies.
2017-12-05 14:49:25 +00:00
Paweł Dziepak
77a4231147 tests/sstable: stop using read_row() with sstable::key 2017-12-05 14:47:46 +00:00
Paweł Dziepak
52c1e9fcf4 tests/flat_mutation_reader_assertions: add has_monotonic_positions()
has_monotonic_positions() verifies that the stream is monotonic.
Based on streamed_mutation_assertions::has_monotonic_positions().
2017-12-05 14:47:46 +00:00
Paweł Dziepak
5b6f680b45 tests/flat_mutation_reader_assertions: add produces(Range)
The assertions already have produces(mutation) and
produces(dht::decorated_key) overloads. An additional overload that accepts
a range of elements makes it possible to check whether a range of mutations or
decorated keys is produced.
The same interface is exposed by mutation_reader_assertions.
2017-12-05 14:47:46 +00:00
Paweł Dziepak
ef4fa1a8c1 tests/flat_mutation_reader_assertions: add produces(mutation) 2017-12-05 14:47:31 +00:00
Gleb Natapov
16964de1f3 storage_proxy: fail read/write requests early if it cannot be completed due to errors
If errors make reaching CL impossible a request can be aborted earlier
without waiting for timeout.
2017-12-05 16:46:25 +02:00
Gleb Natapov
0be3bd383b storage_service: add WRITE_FAILURE_REPLY_FEATURE feature
Presence of the flag indicates that the node is ready to process
negative mutation write replies.
2017-12-05 16:46:25 +02:00
Calle Wilund
8af0b501a2 gossip: wait for stabilized gossip on bootstrap
Fixes #2866

Instead of a raw 30s sleep waiting for gossip to stabilize/set up
ranges on bootstrap, use logic similar to 'wait_for_gossip_to_settle'
and loop for said 30s or more until we neither grow/shrink the ep set nor
are processing ACKs.
2017-12-05 14:28:34 +00:00
Calle Wilund
1c8302e692 gossiper: Prevent race condition in propagation
Fixes #2894

Allow applying certain application states as monotonic sets,
i.e. allow a set of states as input, and ensure the values are
re-versioned and all applied together.
Then do so for certain states that are coupled by design
(status/tokens).

Similar to Origin's solution, as the issue is a copy of the same one.
2017-12-05 14:28:34 +00:00
Calle Wilund
2095cb82a5 utils::to_string: Add printers for pairs+maps 2017-12-05 14:28:34 +00:00
Calle Wilund
f4362a5289 utils::in: Add helper type for perfect forwarding initializer lists
wrapper type (courtesy of
http://cpptruths.blogspot.se/2013/09/21-ways-of-passing-parameters-plus-one.html#inTidiom)
to enable move semantics in initializer lists. Useful as an engineering
overkill to retain nice call sites.
2017-12-05 14:28:34 +00:00
Paweł Dziepak
d2dfca458f tests/flat_mutation_reader_assertions: add produces(dht::decorated_key)
There is an equivalent member function in mutation_reader assertions.
2017-12-05 13:11:55 +00:00
Paweł Dziepak
28caa76c8c tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind)
produces(mutation_fragment::kind) is provided by
streamed_mutation_assertions and is going to be needed in order to
fully convert tests to the flat mutation readers.
2017-12-05 13:04:16 +00:00
Paweł Dziepak
21886b7a3f tests/flat_mutation_reader_assertions: fix fast forwarding
Both fast_forward_to() overloads return a future which should be waited
for. Additionally, fast_forward_to(const dht::partition_range&) expects
the range to remain valid at least until the next call to
fast_forward_to(). The original mutation_reader_assertions guaranteed
that and so should flat_mutation_reader_assertions.
2017-12-05 13:04:16 +00:00
Gleb Natapov
fb8a626813 gossiper: add node_has_feature() function
The function allows checking whether an endpoint supports a certain feature.
2017-12-05 15:02:17 +02:00
Gleb Natapov
6ef26a4a4a cql: add read/write failure exceptions
Those errors were added by cql protocol v4 and are translated to
timeout exception if earlier protocol is negotiated.
2017-12-05 15:02:17 +02:00
Gleb Natapov
6a85cae707 storage_proxy: fix data presence reporting in read timeout error during
_responses variable is never updated, so remove it. response_count() was
meant to be used.
2017-12-05 15:02:17 +02:00
Gleb Natapov
f392bd6db7 storage_proxy: remove inheritance from enable_shared_from_this for abstract_write_response_handler
No code uses shared_from_this() on abstract_write_response_handler
object, so remove the inheritance.
2017-12-05 15:02:17 +02:00
Gleb Natapov
d974c26eeb storage_proxy: remove unneeded field in abstract_write_response_handler 2017-12-05 15:02:17 +02:00
Gleb Natapov
e7cfe2dd1b storage_proxy: fix pending endpoint accounting for EACH_QUORUM
_total_block_for should account for pending endpoints, but for EACH_QUORUM
it did not.
2017-12-05 15:01:37 +02:00
Takuya ASADA
b492a1e1b1 dist/redhat: fix typo on build_rpm.sh
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512466884-18383-2-git-send-email-syuu@scylladb.com>
2017-12-05 13:40:03 +02:00
José Guilherme Vanz
5261eb7225 build_rpm.sh: command line argument not used
The command line argument `--configure-user` of the build_rpm.sh script
is not used anywhere. Thus, this commit removes all code related to
this flag.

Signed-off-by: José Guilherme Vanz <guilherme.sft@gmail.com>
Message-Id: <20171205025920.401-1-guilherme.sft@gmail.com>
2017-12-05 13:24:17 +02:00
Gleb Natapov
357c77a333 consistency_level: constify quorum_for() and local_quorum_for() 2017-12-05 13:01:20 +02:00
Avi Kivity
eea768180b Merge seastar upstream
* seastar dc44656...ac78eec (3):
  > json formatter: Add unsigned support to the json formatter
  > Add missing usual smart-pointer methods to foreign_ptr
  > future-util: remove use of forward references in some primitives
2017-12-05 11:12:12 +02:00
Raphael S. Carvalho
de19e7d942 tests:perf: make perf_sstable write mode work again
Recently, memtable flush in test requires storage service for tests,
or it fails with "Assertion `local_is_initialized()' failed".
storage_service_for_tests needs to run in a thread, that's why
flush_memtable was flattened.
Last but not least, we need to revert flushed memory account because
same memtable is used for all sstables in the perf test so as not
to trigger `_mt._flushed_memory <= _mt.occupancy().used_space()'

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171205012853.21559-1-raphaelsc@scylladb.com>
2017-12-05 10:18:53 +02:00
Vladimir Krivopalov
76775ddf26 Use CharReaderBuilder/CharReader and StreamWriterBuilder from JsonCpp.
In version 1.8.3 of JsonCpp shipped with Fedora 27, old FastWriter and
Reader classes from JsonCpp have been deprecated in favour of
newer/better ones: CharReaderBuilder/CharReader and
StreamWriterBuilder/StreamWriter.
This fix uses the new classes where available or resorts to old ones for
older versions of the library.

Fixes #2989

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2017-12-04 21:03:05 -08:00
Vladimir Krivopalov
114c71dcd8 tests: Add unit tests for caching_options.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2017-12-04 17:42:23 -08:00
Paweł Dziepak
046991b0b7 Merge "Flatten combined_mutation_reader" from Botond
"Convert combined_mutation_reader into a flat_mutation_reader impl. For
now - in the name of incremental progress - all consumers are updated to
use the combined reader through the
mutation_reader_from_flat_mutation_reader adaptor. The combined reader also
uses all its sub mutation_readers through the
flat_mutation_reader_from_mutation_reader adaptor."

* 'bdenes/flatten-combined-reader-v8' of https://github.com/denesb/scylla:
  Add unit tests for the combined reader - selector interactions
  Add flat_mutation_reader overload of make_combined_reader
  Flatten the implementation of combined_mutation_reader
  Add mutation_fragment_merger
  mutation_fragment::apply(): handle partition start and end too
  Add non-const overload of partition_start::partition_tombstone()
  Make combined_mutation_reader a flat_mutation_reader
  Move the mutation merging logic to combined_mutation_reader
  Remove the unnecessary indirection of mutation_reader_merger::next()
  Move the implementation of combined_mutation_reader into mutation_reader_merger
  Remove unused mutation_and_reader::less_compare and operator<
2017-12-04 13:19:05 +00:00
Avi Kivity
a25b5e30f8 Merge "enable secure-apt for Ubuntu/Debian pbuilder" from Takuya
* 'debian-secure-apt-3rdparty-v3' of https://github.com/syuu1228/scylla:
  dist/debian: support Ubuntu 18.04LTS
  dist/debian: disable ALLOWUNTRUSTED
  dist/debian: enable secure-apt for Debian
  dist/debian: enable secure-apt for Ubuntu
2017-12-04 14:46:42 +02:00
Takuya ASADA
4ea3daede9 dist/debian: support Ubuntu 18.04LTS
Ubuntu 18.04LTS is not released yet, but it's already usable so we can prepare
for it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
b4695611ed dist/debian: disable ALLOWUNTRUSTED
We have enabled secure-apt for 3rdparty repos, so we don't need ALLOWUNTRUSTED
anymore.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
92f8743f97 dist/debian: enable secure-apt for Debian
Enable secure-apt for Debian as well.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
a9fef02f9c dist/debian: enable secure-apt for Ubuntu
Our external repos are already signed, so let's enable secure-apt.
It seems that more recent versions of Ubuntu (tested on 18.04) do not accept
skipping the GPG check, so we will need it anyway in the near future.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2017-12-04 19:49:19 +09:00
Takuya ASADA
531c2e4e89 dist/ami: support AMI cross build
Now we can cross build our .rpm/.deb packages, so let's extend AMI build script
to support cross build, too.

Also, Ubuntu 16.04 support is added, since it's the latest Ubuntu LTS release.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1510247204-2899-1-git-send-email-syuu@scylladb.com>
2017-12-04 12:33:24 +02:00
Botond Dénes
956b3519dd Add unit tests for the combined reader - selector interactions
There are a few edge cases that were untested and as this patch-series
reworks completely how the combined-reader works these should be tested
as well to ensure they keep working.
2017-12-04 07:57:43 +02:00
Botond Dénes
e7535f5e88 Add flat_mutation_reader overload of make_combined_reader 2017-12-04 07:57:43 +02:00
Botond Dénes
8731c1bc66 Flatten the implementation of combined_mutation_reader
In fact flatten mutation_reader_merger and adjust combined_mutation_reader
accordingly.
2017-12-04 07:57:43 +02:00
Botond Dénes
217740c608 Add mutation_fragment_merger
This is the mutation fragment level equivalent of mutation_merger.
It merges fragments produced by different sources. Mutation
fragments are not as self-contained as streamed mutations, they have
external context, e.g. the partition they belong to. To support this
mutation_fragment_merger operates on a producer instead of a vector of
fragments. Producer can have internal state and can do side-actions as
fragments are consumed.
2017-12-04 07:57:43 +02:00
Botond Dénes
f6d11a3cfc mutation_fragment::apply(): handle partition start and end too 2017-12-04 07:57:43 +02:00
Botond Dénes
e47791810b Add non-const overload of partition_start::partition_tombstone()
And make the const version return a const reference so that code
mutating the returned value won't compile if the partition_start object
is const.
2017-12-04 07:57:43 +02:00
Botond Dénes
3f8110b5b6 Make combined_mutation_reader a flat_mutation_reader
For now only the interface is converted, behind the scenes the previous
implementation remains; its output is simply converted by
flat_mutation_reader_from_mutation_reader. The implementation will be
converted in the following patches.
2017-12-04 07:57:43 +02:00
Botond Dénes
c011747c30 Move the mutation merging logic to combined_mutation_reader
This is the second step in splitting the combined reader's logic into
two parts as outlined in the previous patch.
2017-12-04 07:57:43 +02:00
Botond Dénes
3681e17555 Remove the unnecessary indirection of mutation_reader_merger::next() 2017-12-04 07:57:43 +02:00
Botond Dénes
c5e57e0961 Move the implementation of combined_mutation_reader into mutation_reader_merger
This simple code-movement patch lays the groundwork for splitting
the logic in combined_mutation_reader into two blocks:
* one that takes care of moving the readers in lockstep and emits their
    output as a non-decreasing stream of streamed_mutations and
* one that takes care of merging the above stream into
    a strictly-increasing stream of streamed_mutations.

This in turn is preparation-work to the transformation of
combined_mutation_reader into a flat_mutation_reader::impl.
2017-12-04 07:57:43 +02:00
Botond Dénes
85b5ded670 Remove unused mutation_and_reader::less_compare and operator< 2017-12-04 07:57:43 +02:00
Avi Kivity
f3d5674108 Merge "auth: Retry delayed task in case of error" from Duarte
"A delayed task can fail to execute, for example if the consistency
level the task required can't be achieved, so we should ensure it is
retried.

Fixes #3038"

* 'auth-retry/v2' of https://github.com/duarten/scylla:
  auth/standard_role_manager: Extend exception handling
  auth/common: Add exception handling and retry to task scheduling
  auth/standard_role_manager: Lift async block to caller
2017-12-03 12:08:03 +02:00
Vladimir Krivopalov
41eb278899 Only allow DISTINCT SELECT queries with partition key restrictions.
Fixes #2049

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <75e69626d797e63fb1e93a9120f135d4959fad1c.1512162540.git.vladimir@scylladb.com>
2017-12-03 11:59:11 +02:00
Duarte Nunes
7434d21023 auth/standard_role_manager: Extend exception handling
Also handle exceptions thrown by has_existing_roles(), and print a
similar message to Apache Cassandra in case of error.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-02 22:40:13 +00:00
Duarte Nunes
01e2c7b614 auth/common: Add exception handling and retry to task scheduling
This follows the implementation in Apache Cassandra. The auth tasks
executed by delay_until_system_ready() usually perform a query with
QUORUM consistency level, which can fail if some nodes are
unavailable. So, we provide both exception handling and a retry
mechanism.
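
The general pattern can be sketched as below. This is a blocking,
plain C++ illustration, not the Seastar future-based code in this
patch; run_with_retry() and its parameters are hypothetical names
chosen for the example.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>

// Runs `task` until it succeeds or `max_attempts` is exhausted, sleeping
// `delay` between attempts. Returns the number of attempts made on
// success and rethrows the last exception once the attempts run out.
int run_with_retry(const std::function<void()>& task,
                   int max_attempts,
                   std::chrono::milliseconds delay) {
    for (int attempt = 1;; ++attempt) {
        try {
            task();
            return attempt;
        } catch (const std::exception&) {
            if (attempt >= max_attempts) {
                throw; // give up and let the caller see the failure
            }
            std::this_thread::sleep_for(delay);
        }
    }
}
```

A task that fails while some nodes are unavailable is simply run again
after the delay, instead of being dropped on the first exception.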

Fixes #3038

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-02 22:40:06 +00:00
Duarte Nunes
82206f966d auth/standard_role_manager: Lift async block to caller
has_existing_roles() creates a seastar thread, but that can be
lifted to the caller for prettier code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-02 20:15:09 +00:00
Takuya ASADA
8c403ea4e0 dist/debian: disable entire pybuild actions
Even after 25bc18b was committed, we still see a build error similar to #3036 in
some environments, not in dh_auto_install but in dh_auto_test (see #3039).

So we need to disable entire pybuild actions, not just dh_auto_install.

Fixes #3039

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512185097-23828-1-git-send-email-syuu@scylladb.com>
2017-12-02 19:36:43 +02:00
Vladimir Krivopalov
7f7bf8f23a test.py: Fix a typo in role_manager_test name.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e80ef188c024f178c1c94fe3739b77a2c2448bd4.1512162655.git.vladimir@scylladb.com>
2017-12-01 21:25:08 +00:00
Takuya ASADA
25bc18b8ff dist/debian: skip running dh_auto_install on pybuild
We are getting package build error on dh_auto_install which is invoked by
pybuild.
But since we handle all installation on debian/scylla-server.install, we can
simply skip running dh_auto_install.

Fixes #3036

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512065117-15708-1-git-send-email-syuu@scylladb.com>
2017-12-01 16:06:25 +02:00
Duarte Nunes
9694bee0d4 Merge 'Improvements to mutation printout' from Tomasz
"This series makes it easier to comprehend assertion failures which
involve printing mutation contents."

* 'tgrabiec/mutation-printout' of github.com:scylladb/seastar-dev:
  tests: Introduce mutation_diff script
  mutation: Make printout more concise
  mutation_partition: Don't print absent elements
  mutation_partition: Make row_marker printout similar to other partition elements
  database: Move operator<<() overloads to appropriate source files
  mutation_partition: Use multi-line printout
  position_in_partition: Improve printout
2017-12-01 11:02:02 +00:00
Tomasz Grabiec
c3276451af tests: Introduce mutation_diff script
Converts assertion failure messages which spit out mutation contents
into a human-readable diff.
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
66990867b8 mutation: Make printout more concise
Before:

{ks.cf key {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} data {mutation_partition:

After:

{ks.cf {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} {mutation_partition:
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
05a6c67804 mutation_partition: Don't print absent elements
Makes printout shorter and thus easier to parse.
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
d8b54a57aa mutation_partition: Make row_marker printout similar to other partition elements 2017-12-01 10:52:37 +01:00
Tomasz Grabiec
fd7ab5fe99 database: Move operator<<() overloads to appropriate source files 2017-12-01 10:52:37 +01:00
Tomasz Grabiec
7bde3090b4 mutation_partition: Use multi-line printout
Convert to a multi-line output, which is easier for a human to read.

After:

{ks.cf key {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} data {mutation_partition: {tombstone: none},
 range_tombstones: {},
 static: cont=1 {row: },
 clustered: {
    {rows_entry: cont=true dummy=false {position: clustered,ckp{000c636b30303030303030303030},0} {deletable_row: {row: }}},
    {rows_entry: cont=true dummy=true {position: clustered,ckp{000c636b30303030303030303031},0} {deletable_row: {row: }}}}}}
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
36caf0f9db position_in_partition: Improve printout
Before:

 {position: type clustered, bound_weight -1, key ckp{000c636b30303030303030303033}}

After:

 {position: clustered,ckp{000c636b30303030303030303033},-1}

Benefits:

  - most significant parts appear first.
    bound_weight, which is least significant, was in the middle before.

  - shorter, so a bit easier to parse assertion failures.
2017-12-01 10:52:37 +01:00
Jesse Haber-Kucharsky
cc19545f20 auth/standard_role_manager: Fix initialization
Checking for existing roles requires that the system is "settled" first.
This is consistent with the existing code for user-management, but not
with the initial introduction of the role manager.

Fixes #3028.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <57157a0df92dba6bf9a95960b9c8261a45acb1ad.1512093477.git.jhaberku@scylladb.com>
2017-12-01 10:20:16 +01:00
Duarte Nunes
1b4ca6aadf auth/standard_role_manager: Add exception handling for background task
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171130233851.32827-1-duarte@scylladb.com>
2017-12-01 10:20:16 +01:00
Duarte Nunes
ab6f0de6e7 auth/service: Stop role manager instead of starting
Fixes #3028

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171130232032.31924-1-duarte@scylladb.com>
2017-12-01 10:20:16 +01:00
Avi Kivity
f56c8415d8 Merge seastar upstream
* seastar b2a3ea3...dc44656 (1):
  > Update dpdk submodule
2017-11-30 10:37:23 +02:00
Avi Kivity
ca4abb1bbf Merge seastar upstream
* seastar 3b09bad...b2a3ea3 (5):
  > dependency: use new gcc c++ boost
  > test.py: remove unused black_hole
  > util: Add throw_with_backtrace helper to add backtraces to exceptions.
  > tests: add vruntime to scheduling_group_demo
  > Fix Clang build for recently added io_tester app.
2017-11-30 10:31:48 +02:00
Vladimir Krivopalov
6d76ac8043 Lift checks on list and map values to allow values of length > 64K.
Fixes #3007

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <7b232a655b5531d4bfa2be3d9611f8b1ba0349b0.1512021011.git.vladimir@scylladb.com>
2017-11-30 10:31:19 +02:00
Amos Kong
bfc055fedc install different dependence for fedora and centos
The packages installed by install-dependencies.sh don't satisfy the
requirements of configuration on CentOS. This patch switches to using
newer packages from the scylla-3rdparty repo.

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <9bca7b08704f68c604560e5ec7ce0c0358d328da.1511965492.git.amos@scylladb.com>
2017-11-29 17:05:47 +02:00
Duarte Nunes
cda3ddd146 compound_compact: Change universal reference to const reference
The universal reference was introduced so we could bind an rvalue to
the argument, but it would have sufficed to make the argument a const
reference. This is also more consistent with the function's other
overload.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171129132758.19654-1-duarte@scylladb.com>
2017-11-29 14:41:35 +01:00
Tomasz Grabiec
e9cce59b85 Merge "compact_storage serialization fixes" from Duarte
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.

* git@github.com:duarten/scylla.git  rt-compact-fixes/v1:
  compound_compact: Allow rvalues in size()
  sstables/sstables: Convert non-compound clustering element to compound
  tests/sstable_mutation_test: Verify we can write/read non-correct RTs
  service/storage_service: Export non-compound RT feature
2017-11-29 14:17:50 +01:00
Duarte Nunes
2f513514cc service/storage_service: Export non-compound RT feature
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:50 +01:00
Duarte Nunes
13fc26214e tests/sstable_mutation_test: Verify we can write/read non-correct RTs
Add test to verify we can write and read non-compound tombstones and
compound ones for backward compatibility.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:50 +01:00
Duarte Nunes
013659506b sstables/sstables: Convert non-compound clustering element to compound
576ea421dc introduced a regression
as it didn't change the assumption that all clustering elements were
compound when writing a range tombstone, compound or non-compound, as
compound. Thus, we serialized a non-compound element while we should
have serialized a compound one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:50 +01:00
Duarte Nunes
ec8ce3388e compound_compact: Allow rvalues in size()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-29 14:17:49 +01:00
Paweł Dziepak
586b61d57d size_estimates: convert reader to flat mutation readers
Message-Id: <20171129105909.27084-1-pdziepak@scylladb.com>
2017-11-29 12:14:05 +00:00
Amos Kong
c2bdb3bdbc test.py: remove unused black_hole
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <2e79a58906e8f3ba512586fe4ea4a662fa1a3d35.1511944232.git.amos@scylladb.com>
2017-11-29 11:07:24 +02:00
Amos Kong
fd71405465 auth/transitional: use defined package name prefix
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f3337b00a9209a9af4918a25145d661488387fa8.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:33 +01:00
Amos Kong
46541d400e test.py: fix test runner description
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <9b6febecc18376e774611322119a6300dc7363e2.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:22 +01:00
Amos Kong
edfaeb40d9 storage_service: fix trace msg in get_ring_delay()
The trace log in get_ring_delay() is wrong.

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <2556583ec160d0417ed669fe3322a16ffda37ce7.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:12 +01:00
Amos Kong
d5caaee0cc main: move messaging service notify to right position
Commit eb13f65949 adjusted the start time
of messaging service, but the notify message wasn't moved together.

Signed-off-by: Amos Kong <amos@scylladb.com>
Cc: Pekka Enberg <penberg@scylladb.com>
Message-Id: <1073f285189686619bb4870ef1be20f0f24e8532.1511945338.git.amos@scylladb.com>
2017-11-29 09:59:01 +01:00
Amos Kong
4be66f8498 main: remove repeat register of storage service API
We register the storage service API twice. The first registration
happens before the storage service is started, so let's remove it.

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <8bb09c2acfed57bf74a81d189fa08ba34a594294.1511945338.git.amos@scylladb.com>
2017-11-29 09:58:50 +01:00
Raphael S. Carvalho
f699cf17ae sstables: fix data_consume_context's move operator and ctor
After 7f8b62bc0b, its move operator and ctor broke. That potentially
leads to errors because data_consume_context dtor moves sstable ref
to continuation when waiting for in-flight reads from input stream.
Otherwise, sstable can be destroyed meanwhile and file descriptor
would be invalid, leading to EBADF.

Fixes #3020.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171129014917.11841-1-raphaelsc@scylladb.com>
2017-11-29 09:53:47 +01:00
Avi Kivity
4cfcd8055e Merge "Drop reversible apply() from mutation_partition" from Tomasz
"This simplifies implementation of mutation_partition merging by relaxing
exception guarantees it needs to provide. This allows reverters to be dropped.

Direct motivation for this is to make it easier to implement new semantics
for merging of clustering range continuity.

Implementation details:

We only need strong exception guarantees when applying to the memtable, which is
using MVCC. Instead of calling apply() with strong exception guarantees on the latest
version, we will move the incoming mutation to a new partition_version and then
use monotonic apply() to merge them. If that merging fails, we attach the version with
the remainder, which cannot fail. This way apply() always succeeds if the allocation
of partition_version object succeeds.

Results of `perf_simple_query_g -c1 -m1G --write` (high overwrite rate):

Before:

 101011.13 tps
 102498.07 tps
 103174.68 tps
 102879.55 tps
 103524.48 tps
 102794.56 tps
 103565.11 tps
 103018.51 tps
 103494.37 tps
 102375.81 tps
 103361.65 tps

After:

 101785.37 tps
 101366.19 tps
 103532.26 tps
 100834.83 tps
 100552.11 tps
 100891.31 tps
 101752.06 tps
 101532.00 tps
 100612.06 tps
 102750.62 tps
 100889.16 tps

Fixes #2012."

* tag 'tgrabiec/drop-reversible-apply-v1' of github.com:scylladb/seastar-dev:
  mutation_partition: Drop apply_reversibly()
  mutation_partition: Relax exception guarantees of apply()
  mutation_partition: Introduce apply_weak()
  tests: mvcc: Add test for atomicity of partition_entry::apply()
  tests: Move failure_injecting_allocation_strategy to a header
  tests: mutation_partition: Test exception guarantees of apply_monotonically()
  mvcc: Use apply_monotonically() where sufficient
  mvcc: partition_version: Use apply_monotonically() to provide atomicity
  mvcc: Extract partition_entry::add_version()
  mutation_partition: Introduce apply_monotonically()
  mutation_partition: Introduce row::consume_with()
2017-11-28 16:35:06 +02:00
Tomasz Grabiec
70e14f78a7 mutation_partition: Drop apply_reversibly() 2017-11-28 13:03:06 +01:00
Tomasz Grabiec
091e10fc70 mutation_partition: Relax exception guarantees of apply()
The uses which needed strong or weak exception guarantees were
switched to a solution involving apply_monotonically(). All remaining
uses don't need any exception guarantees.
2017-11-28 13:03:06 +01:00
Tomasz Grabiec
988d3c67b4 mutation_partition: Introduce apply_weak()
Intended to be used by code which doesn't need any exception
guarantees.  Currently just delegates to apply_monotonically().
2017-11-28 13:03:03 +01:00
Tomasz Grabiec
ad37826fcb tests: mvcc: Add test for atomicity of partition_entry::apply() 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
e5532bd644 tests: Move failure_injecting_allocation_strategy to a header 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
1b5f2b0473 tests: mutation_partition: Test exception guarantees of apply_monotonically() 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
376cddb212 mvcc: Use apply_monotonically() where sufficient 2017-11-28 12:38:28 +01:00
Tomasz Grabiec
49c0705409 mvcc: partition_version: Use apply_monotonically() to provide atomicity
This patch drops the use of apply_reversibly(). We move the mutation
to be applied into a new version and then use apply_monotonically() to
merge it (if no snapshot) with the current version. This guarantees
that apply() is atomic even if apply_monotonically() throws.

Fixes #2012.
2017-11-28 12:38:28 +01:00
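The atomicity argument in the commit above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Scylla's implementation: `partition`, `entry`, the `fail_after` failure-injection parameter, and the "larger value wins" merge rule are all hypothetical stand-ins (C++17). The incoming data is first moved into a new version; the merge may then fail part-way, but the unmerged remainder stays attached as its own version, so a reader always sees the whole write.

```cpp
#include <cstddef>
#include <iterator>
#include <limits>
#include <list>
#include <map>
#include <new>
#include <string>
#include <utility>

// Hypothetical stand-in for a partition: key -> value, larger value wins.
using partition = std::map<std::string, int>;

// Moves entries from src into dst one at a time. If it throws part-way,
// dst and src together still hold every entry, so nothing is lost.
// fail_after simulates an allocation failure after that many entries.
void apply_monotonically(partition& dst, partition& src, std::size_t fail_after) {
    std::size_t moved = 0;
    while (!src.empty()) {
        if (moved == fail_after) {
            throw std::bad_alloc();
        }
        auto node = src.extract(src.begin());
        auto it = dst.find(node.key());
        if (it == dst.end()) {
            dst.insert(std::move(node));  // re-links the node, no allocation
        } else if (node.mapped() > it->second) {
            it->second = node.mapped();
        }
        ++moved;
    }
}

// Hypothetical stand-in for a versioned partition entry.
struct entry {
    std::list<partition> versions;  // front() is the newest version

    void apply(partition incoming,
               std::size_t fail_after = std::numeric_limits<std::size_t>::max()) {
        // Creating the new version is the only step that may fail with no effect.
        versions.emplace_front(std::move(incoming));
        if (versions.size() == 1) {
            return;
        }
        auto newest = versions.begin();
        auto prev = std::next(newest);
        try {
            apply_monotonically(*prev, *newest, fail_after);
            versions.erase(newest);  // fully merged, drop the empty version
        } catch (const std::bad_alloc&) {
            // The remainder stays attached as its own version.
        }
    }

    // Readers merge all versions, oldest first.
    partition read() const {
        partition out;
        for (auto it = versions.rbegin(); it != versions.rend(); ++it) {
            for (const auto& [k, v] : *it) {
                auto [pos, inserted] = out.emplace(k, v);
                if (!inserted && v > pos->second) {
                    pos->second = v;
                }
            }
        }
        return out;
    }
};
```

In this sketch a failed merge leaves two versions behind, and `read()` still returns the combined result, which is the property the commit message describes as apply() being atomic even if apply_monotonically() throws.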
Tomasz Grabiec
52cabe343c mvcc: Extract partition_entry::add_version() 2017-11-28 12:38:27 +01:00
Tomasz Grabiec
97ebf51d3a mutation_partition: Introduce apply_monotonically()
Has weaker exception guarantees than apply(), which allows for simpler
implementation. Intended to replace the apply() with strong exception
guarantees.
2017-11-28 12:28:51 +01:00
Paweł Dziepak
c0253d683b remove partition_snapshot_reader
All uses of partition_snapshot_reader have already been replaced by
partition_snapshot_flat_reader.

Message-Id: <20171128103929.16614-1-pdziepak@scylladb.com>
2017-11-28 12:49:38 +02:00
Tomasz Grabiec
978b874065 mutation_partition: Introduce row::consume_with() 2017-11-28 11:20:03 +01:00
Duarte Nunes
1fbe9dc851 message/messaging_service: Close all server sockets
We were stopping the loop prematurely.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171127181417.8167-1-duarte@scylladb.com>
2017-11-28 11:08:08 +02:00
Jesse Haber-Kucharsky
fb0866ca20 Move thread_local declarations out of main.cc
Since `disk-error-handler.hh` declares these global variables `extern`,
it makes sense to define them in `disk-error-handler.cc` instead of
`main.cc`.

This means that test files don't have to define them.

Fixes #2735.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <1eed120bfd9bb3647e03fe05b60c871de2df2a86.1511810004.git.jhaberku@scylladb.com>
2017-11-27 20:27:42 +01:00
Tomasz Grabiec
04106b4c96 Merge "Convert memtable flush reader to flat streams" from Paweł
This series converts memtable flush reader to the new flat mutation
readers. Just like the scanning reader, flush reader concatenates
multiple partition snapshot readers in order to provide a stream
of all partitions in the memtable.

* https://github.com/pdziepak/scylla.git flat_mutation_reader-memtable-flush/v1
   tests/flat_mutation_reader_assertion: add produces_partition()
   memtable: make make_flush_reader() return flat_mutation_reader
   flat_mutation_reader: add optimised flat_mutation_reader_opt
   memtable: switch flush reader implementation to flat streams
   tests/memtable: add test for flush reader
2017-11-27 20:07:23 +01:00
Paweł Dziepak
87b600cad8 tests/memtable: add test for flush reader 2017-11-27 20:07:23 +01:00
Paweł Dziepak
9dc566c64b memtable: switch flush reader implementation to flat streams 2017-11-27 20:07:22 +01:00
Paweł Dziepak
9c5acaa823 flat_mutation_reader: add optimised flat_mutation_reader_opt 2017-11-27 20:07:22 +01:00
Paweł Dziepak
32eb6437fd memtable: make make_flush_reader() return flat_mutation_reader 2017-11-27 20:07:22 +01:00
Paweł Dziepak
15099a0e8c tests/flat_mutation_reader_assertion: add produces_partition() 2017-11-27 20:07:22 +01:00
Avi Kivity
b7c96b8bd3 Merge "Dormant role-management and CQL" from Jesse
"This series adds the role-management interface, the primary implementation, and the corresponding CQL.

Importantly, this series does not integrate the system with roles, nor does it remove user-based access control. Several new CQL statements are available and should function, but these modify metadata only and have no functional impact on the actual system.

The new statements are:

- CREATE ROLE
- ALTER ROLE
- DROP ROLE
- GRANT ROLE
- REVOKE ROLE
- LIST ROLES

The security model of the role manager is simple at this point: only superusers can create and drop roles. The next patch series will introduce fine-grained role permissions and also slightly change the CQL syntax to be more consistent with the rest of the grammar. This patch series is a starting point for evolving the roles feature and integrating it.

Fixes #2987."

* 'jhk/role_management/v5' of https://github.com/hakuch/scylla:
  auth: Add `alter_role_statement`
  auth: Add `create_role_statement`
  auth: Add `drop_role_statement`
  auth: Add 'revoke_role_statement'
  auth: Add `grant_role_statement`
  auth: Add `list_roles_statement`
  auth: Add dormant role manager to `service`
  auth/service.cc: Remove redundant declarations
  cql3: Add `role_name` and parser rules
  auth: Add role manager
  auth: Unconditionally create the `system_auth` keyspace
  unimplemented.hh: Use [[noreturn]] instead of GCC attribute
  New `unimplemented` feature: roles
2017-11-27 20:01:34 +02:00
Jesse Haber-Kucharsky
9638c2b822 auth: Add alter_role_statement
As with `create_role_statement`, until roles are integrated with the
rest of the system, authentication-related options are ignored.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
ba5dfe0a76 auth: Add create_role_statement
Unlike Apache Cassandra, the role manager does not write data related to
password authentication in the metadata tables, and the rest of the
system does not yet integrate with the role manager.

Therefore, executing `CREATE ROLE` currently ignores all
authentication-related options (`PASSWORD` and `OPTIONS`).
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
7524607f26 auth: Add drop_role_statement
Dropping a role removes all references to it from other roles.

As with the other role-management statements, executing this statement
updates metadata but has no functional impact yet.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
2594eb0a11 auth: Add 'revoke_role_statement'
As with `grant_role_statement`, executing this statement updates
metadata but has no functional effect.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
7679d45fb6 auth: Add grant_role_statement
While granting a role updates the necessary metadata, since roles do not
interact with the rest of the system yet, there is no functional impact
of doing so.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
b024d40d51 auth: Add list_roles_statement 2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
460f3c7065 auth: Add dormant role manager to service
The role manager still does not interact with the rest of the system,
but the role manager is now sharded on all cores and metadata is
created.

The following metadata tables are created:

- `system_auth.roles`
- `system_auth.role_members`

The default superuser, "cassandra", is also created, but has no function.
2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
27420fa189 auth/service.cc: Remove redundant declarations 2017-11-27 12:14:24 -05:00
Jesse Haber-Kucharsky
7e78e1ebdc cql3: Add role_name and parser rules
The `userOrRoleName` parser rule is important for future CQL
role-related statements.

`cql3::role_name` is a small utility for role-related CQL statements
that enforces an important property of role names: that they are always
lower-case unless quoted appropriately.
2017-11-27 12:14:24 -05:00
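The case rule described in the commit above can be sketched as follows. This is a hedged illustration only: `normalize_role_name` and `preserve_case` are hypothetical names introduced here, not the actual `cql3::role_name` interface, and the sketch assumes ASCII names.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical flag: whether the name was quoted in the CQL statement.
enum class preserve_case { no, yes };

// Hypothetical helper: unquoted role names are folded to lower case,
// quoted names are kept exactly as written.
std::string normalize_role_name(std::string name, preserve_case p) {
    if (p == preserve_case::no) {
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    }
    return name;
}
```

Under these assumptions, an unquoted `Admin` and an unquoted `admin` name the same role, while a quoted `"Admin"` stays distinct.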
Jesse Haber-Kucharsky
b266b4b687 auth: Add role manager
The role manager is responsible for creating, removing, querying for,
granting, and revoking roles.

The role manager does not yet run in production, and is not connected to
the rest of the system.

Included in this patch is the definition of the abstract role management
interface, and also the implementation of the standard role manager.

The standard role manager is tested fully in the `role_manager_test`.
2017-11-27 12:14:20 -05:00
Jesse Haber-Kucharsky
8b23d32bb1 auth: Unconditionally create the system_auth keyspace
The `system_auth` keyspace is used to store tables for authentication
and authorization metadata.

Previously, this keyspace would only be created if either of the
non-default authenticator or authorizer were activated in configuration.

The upcoming role-management system is enabled unconditionally and also
uses the `system_auth` keyspace for its metadata.
2017-11-27 10:01:52 -05:00
Jesse Haber-Kucharsky
832072d1d9 unimplemented.hh: Use [[noreturn]] instead of GCC attribute 2017-11-27 10:01:52 -05:00
Jesse Haber-Kucharsky
b58914feb8 New unimplemented feature: roles 2017-11-27 10:01:52 -05:00
Duarte Nunes
922f095f22 tests: Initialize storage service for some tests
These tests now require having the storage service initialized, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
2017-11-26 17:41:06 +02:00
Duarte Nunes
15fbb8e1ca cql3/delete_statement: Allow non-range deletions on non-compound schemas
This patch fixes a regression introduced in
1c872e2ddc.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126102333.3736-1-duarte@scylladb.com>
2017-11-26 12:29:09 +02:00
Takuya ASADA
7380a6088b dist/debian: link libgcc dynamically
As we discussed on the thread (https://github.com/scylladb/scylla/issues/2941),
since we override symbols on libgcc, we need to link libgcc dynamically for
Ubuntu/Debian too (CentOS already does it).

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511542866-21486-2-git-send-email-syuu@scylladb.com>
2017-11-25 20:09:51 +02:00
Takuya ASADA
df6546d151 dist/debian: switch to our PPA verions of gcc-72
Now that we have gcc-7.2 on our PPA for Ubuntu 16.04/14.04, let's switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511542866-21486-1-git-send-email-syuu@scylladb.com>
2017-11-25 20:09:51 +02:00
Avi Kivity
757d0243a0 Merge seastar upstream
* seastar 7f87529...3b09bad (7):
  > Extend Travis CI to cover Clang 5.0 builds.
  > fair_queue: disallow zeroed shares.
  > Multiple fixes to io_tester to make it compile with GCC 5:
  > transformers: Create tuple explicitely for older compiler support
  > core/sstring: Add construction from `string_view`
  > io_tester: enhanced fair queue tester
  > fstream: do not ignore dma_write return value
2017-11-25 19:50:42 +02:00
Duarte Nunes
4a6ffa3f5c tests/sstable_mutation_test: Change make_reader to make_flat_reader
A merge conflict between 596ebaed1f and
bd1efbc25c caused the test to fail to
build.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-25 15:23:36 +00:00
Tomasz Grabiec
596ebaed1f Merge "Convert sstable writers to flat mutation readers" from Paweł
The following patches convert sstable writers to use flat mutation
readers instead of the legacy mutation_reader interface.
Writers were already using flat consumer interface and used
consume_flattened_in_thread(), so most of the work was limited to
providing an appropriate equivalent for flat mutation readers.

* https://github.com/pdziepak/scylla.git flat_mutation_reader-sstable-write/v1:
  flat_mutation_reader: move consumer_adapter out of consume()
  flat_mutation_reader: introduce consume_in_thread()
  tests/flat_mutation_reader: test consume_in_thread()
  sstables: switch write_components() to flat_mutation_reader
  streamed_mutation: drop streamed_mutation_returning()
  sstables: convert compaction to flat_mutation_reader
  mutation_reader: drop consume_flattened_in_thread()
2017-11-24 16:05:21 +01:00
Tomasz Grabiec
bd1efbc25c Merge "Fixes to sstable files for non-compound schemas" from Duarte
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.

We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.

Fixes #2995
Fixes #2986
Fixes #2979
Fixes #2992
Fixes #2993

* git@github.com:duarten/scylla.git  promoted-index-serialization/v3:
  sstables/sstables: Unify column name writers
  sstables/sstables: Don't write index entry for a missing row maker
  sstables/sstables: Reuse write_range_tombstone() for row tombstones
  sstables/sstables: Lift index writing for row tombstones
  sstables/sstables: Leverage index code upon range tombstone consume
  sstables/sstables: Move out tombstone check in write_range_tombstone()
  sstables/sstables: A schema with static columns is always compound
  sstables/sstables: Lift column name writing logic
  sstables/sstables: Use schema-aware write_column_name() for
    collections
  sstables/sstables: Use schema-aware write_column_name() for row marker
  sstables/sstables: Use schema-aware write_column_name() for static row
  sstables/sstables: Writing promoted index entry leverages
    column_name_writer
  sstables/sstables: Add supported feature list to sstables
  sstables/sstables: Don't use incorrectly serialized promoted index
  cql3/single_column_primary_key_restrictions: Implement is_inclusive()
  cql3/delete_statement: Constrain range deletions for non-compound
    schemas
  tests/cql_query_test: Verify range deletion constraints
  sstables/sstables: Correctly deserialize range tombstones
  service/storage_service: Add feature for correct non-compound RTs
  tests/sstable_*: Start the storage service for some cases
  sstables/sstable_writer: Prepare to control range tombstone
    serialization
  sstables/sstables: Correctly serialize range tombstones
  tests/sstable_assertions: Fix monotonicity check for promoted indexes
  tests/sstable_assertions: Assert a promoted index is empty
  tests/sstable_mutation_test: Verify promoted index serializes
    correctly
  tests/sstable_mutation_test: Verify promoted index repeats tombstones
  tests/sstable_mutation_test: Ensure range tombstone serializes
    correctly
  tests/sstable_datafile_test: Add test for incorrect promoted index
  tests/sstable_datafile_test: Verify reading of incorrect range
    tombstones
  sstables/sstable: Rename schema-oblivious write_column_name() function
  sstables/sstables: No promoted index without clustering keys
  tests/sstable_mutation_test: Verify promoted index is not generated
  sstables/sstables: Optimize column name writing and indexing
  compound_compat: Don't assume compoundness
2017-11-24 16:03:49 +01:00
Tomasz Grabiec
35e404b1a2 tests: sstable: Make tombstone_purge_test more reliable
A TTL of 1 second may cause the cell to expire right after we write it,
if the seconds component of the current time changes right after the
write. Use a larger TTL to avoid spurious failures due to this.
Message-Id: <1511463392-1451-1-git-send-email-tgrabiec@scylladb.com>
2017-11-24 10:52:26 +00:00
Vladimir Krivopalov
fb7d46fc2e Allow COUNT(*) and COUNT(1) to be queried with other aggregations or columns
Fixes #2218

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <c387d34969d5bcfb8b2bf42806e6e05a9b8a067c.1511487356.git.vladimir@scylladb.com>
2017-11-24 10:01:21 +00:00
Duarte Nunes
576ea421dc compound_compat: Don't assume compoundness
This patch changes some factory functions so that they don't assume
the schema is compound.

This enables some code simplification in
sstables::write_column_name().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 19:14:15 +00:00
Duarte Nunes
8597e1c3f9 sstables/sstables: Optimize column name writing and indexing
Instead of serializing the column name twice, serialize it once into a
buffer which gets used for index bookkeeping and to write to disk.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 19:14:08 +00:00
Paweł Dziepak
6a1fe70a72 mutation_reader: drop consume_flattened_in_thread() 2017-11-23 18:14:31 +00:00
Paweł Dziepak
b64dd21751 sstables: convert compaction to flat_mutation_reader 2017-11-23 18:14:31 +00:00
Paweł Dziepak
9b39d3b023 streamed_mutation: drop streamed_mutation_returning() 2017-11-23 18:14:31 +00:00
Paweł Dziepak
11b32276e6 sstables: switch write_components() to flat_mutation_reader 2017-11-23 18:14:31 +00:00
Paweł Dziepak
2660a43290 tests/flat_mutation_reader: test consume_in_thread() 2017-11-23 18:14:31 +00:00
Paweł Dziepak
cea5778fee flat_mutation_reader: introduce consume_in_thread()
flat_mutation_reader provides a replacement for the old
consume_flattened*() interface and therefore an 'in-thread' variant is
also necessary. It expects to be executed in a seastar::thread context
and guarantees that the consumer member functions will be invoked inside
that thread as well (which is why it cannot be easily replaced by
non-thread version).

In addition, just like the old consume_flattened_in_thread(), its
replacement allows specifying a filter function that causes selected
partitions to be skipped entirely and never reach the consumer.
2017-11-23 18:14:31 +00:00
Duarte Nunes
5aa5780701 tests/sstable_mutation_test: Verify promoted index is not generated
Verify we don't generate a promoted index if the schema lacks
clustering keys.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
10dea07ab7 sstables/sstables: No promoted index without clustering keys
We don't need to generate a promoted index if the schema lacks
clustering keys.

Fixes #2995

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
66df2e41fc sstables/sstable: Rename schema-oblivious write_column_name() function
This function is now called write_compound_non_dense_column_name() so
callers are aware of the cases where it can be called.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
338f038e7a tests/sstable_datafile_test: Verify reading of incorrect range tombstones
Add a test to verify that we can still read incorrectly written range
tombstones for non-compound schemas, written by previous Scylla versions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
766ca8dff4 tests/sstable_datafile_test: Add test for incorrect promoted index
Ensure we don't load incorrectly generated promoted indexes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
f9a76084e9 tests/sstable_mutation_test: Ensure range tombstone serializes correctly
This patch ensures range tombstones are correctly serialized for dense
non-compound schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
e612f71ed6 tests/sstable_mutation_test: Verify promoted index repeats tombstones
Both for compact and non-compact storage schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
d8af9ffe5a tests/sstable_mutation_test: Verify promoted index serializes correctly
For different types of schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
32cb8b6dc0 tests/sstable_assertions: Assert a promoted index is empty
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
ffaa3341c3 tests/sstable_assertions: Fix monotonicity check for promoted indexes
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
24b867adda sstables/sstables: Correctly serialize range tombstones
This patch ensures we correctly serialize range tombstones for dense
non-compound schemas, which until now assumed the bounds were compound
composite. We also fix the reading function, which assumed the same
thing. This affected Apache Cassandra compatibility.

Fixes #2986

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
3368411e03 sstables/sstable_writer: Prepare to control range tombstone serialization
This patch adds support to sstable_writer to be able to control
correct range tombstone serialization.

When range tombstone serialization is fixed in subsequent patches, the
fix will only be enabled when the whole cluster supports the feature,
to allow for rollbacks.

The feature needs to be enabled for an sstable as a whole, to prevent
problems with it being enabled during an sstable write.

Thus, the sstable writer will pass on this information to the sstable
methods that carry out the actual file writing.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:54 +00:00
Duarte Nunes
19cd65a681 tests/sstable_*: Start the storage service for some cases
We will need to check the cluster's enabled features when writing
range tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
ae3a58d7ec service/storage_service: Add feature for correct non-compound RTs
This patch adds a cluster feature to enable correct serialization of
non-compound range tombstones. We thus support rollbacks during an
upgrade, as we will only change range tombstone serialization when the
cluster is fully upgraded and all nodes are capable of reading the new
format.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
eeacef3089 sstables/sstables: Correctly deserialize range tombstones
This patch changes the range tombstone read path to deal with
correctly written non-compound range tombstones, while also
maintaining backward compatibility and reading old Scylla-generated
range tombstones.

The fix for the write path will activate an sstable feature which will
connect with this patch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
e51fc2096b tests/cql_query_test: Verify range deletion constraints
Test that unsupported range deletions against non-compound schemas are
rejected.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
1c872e2ddc cql3/delete_statement: Constrain range deletions for non-compound schemas
We cannot represent ranged deletions with non-inclusive bounds on our
current storage format for schemas that are non-compound, since the
clustering key won't include the EOC byte.

Refs #2986

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
eea4e349ea cql3/single_column_primary_key_restrictions: Implement is_inclusive()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
f217dcc0ce sstables/sstables: Don't use incorrectly serialized promoted index
Promoted indexes generated before this patch by Scylla are considered
incorrect if they belong to a non-compound schema, due to #2993.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
8cdd8e2431 sstables/sstables: Add supported feature list to sstables
This patch adds additional metadata to the scylla sstable component.
Namely, it adds a list of features that the current sstable supports.
The upcoming usages of the feature list are meant for backward
compatibility, but the implementation makes no such assumptions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
e81a3d487d sstables/sstables: Writing promoted index entry leverages column_name_writer
This patch refactors writing a promoted index entry to leverage the
column_name_writer. It not only reduces code duplication, but also
solves two important bugs:

1) Column names for schema types other than compound non-dense were
   not correctly serialized, as the wrong overload of
   write_column_name() was being called, which assumed the specified
   composite to be compound.

2) Before, for some schema types we were passing an empty
   clustering_key to maybe_flush_pi_block(), which caused it to bypass
   appending open range tombstones to the data file, causing wrong
   query results to be returned.

Fixes #2979
Fixes #2992
Fixes #2993

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
51eed140d2 sstables/sstables: Use schema-aware write_column_name() for static row
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
b7624afca6 sstables/sstables: Use schema-aware write_column_name() for row marker
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
42f125c1ef sstables/sstables: Use schema-aware write_column_name() for collections
Eventually all current callers of write_column_name() will move to the
schema-aware one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
96daf17f8c sstables/sstables: Lift column name writing logic
This patch lifts the logic to write a column name depending on the
schema's denseness and compoundness into a function, so that it may
later be reused in other places. We still duplicate the same logic
when writing a clustered row because the index writer requires it for
now.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
594ed2d02a sstables/sstables: A schema with static columns is always compound
A schema can only have static columns if it has at least one
clustering column. A schema with a clustering column is always
compound, unless it is created with compact storage. A schema created
with compact storage cannot have static columns, so we can remove dead
code from the sstable write path.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
8011f3393e sstables/sstables: Move out tombstone check in write_range_tombstone()
We were performing superfluous checks, as they were already performed
by some of the callers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
1e0f155447 sstables/sstables: Leverage index code upon range tombstone consume
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
989dc1d8c0 sstables/sstables: Lift index writing for row tombstones
This will allow code reuse in the following patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
5dfbdbaa04 sstables/sstables: Reuse write_range_tombstone() for row tombstones
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
d1e1fc928e sstables/sstables: Don't write index entry for a missing row marker
Encapsulate the decision to write the row_marker and to write a
corresponding entry in the promoted index. We now avoid writing the
index entry if there is no row marker, and just start indexing the row
at the first cell.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Duarte Nunes
8907c1dfb2 sstables/sstables: Unify column name writers
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-23 16:45:53 +00:00
Paweł Dziepak
7936a55836 flat_mutation_reader: move consumer_adapter out of consume()
Making consumer_adapter a member of flat_mutation_reader::impl instead
of being a local class in consume() will make it possible to reuse that
helper in other functions.
2017-11-23 14:25:31 +00:00
Glauber Costa
881a859b21 transport: enhance reporting of requests blocked in the transport layer
It's hard to make sense of the metric transport.requests_blocked_memory
because it shows a queue size. Especially in production setups that scrape
every 15 seconds, that doesn't tell us much.

We solve that in other layers that record blocking by providing both
requests_blocked_memory and requests_blocked_memory_current.

Fixes #3010

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171123033329.32596-1-glauber@scylladb.com>
2017-11-23 12:37:16 +02:00
Amnon Heiman
3f8d9a87ee estimated_histogram: update the sum and count when merging
When merging histograms the count and the sum should be updated.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20171122154822.23855-1-amnon@scylladb.com>
2017-11-22 16:55:55 +01:00
Glauber Costa
6c4e8049a0 estimated_histogram: also fill up sum metric
Prometheus histograms have 3 embedded metrics: count, buckets, and sum.
Currently we fill up count and buckets but sum is left at 0. This is
particularly bad, since according to the prometheus documentation, the
best way to calculate histogram averages is to write:

  rate(metric_sum[5m]) / rate(metric_count[5m])

One way of keeping track of the sum is to add the value we sampled
every time we sample. However, the estimated histogram interface has a
method, add_nano(), that adds a metric while also adjusting the count
for missing metrics.

That makes accumulating a sum inaccurate, as we have no values for
the points that were not sampled. To overcome that, when we call add_nano(),
we pretend we are introducing new_count - _count metrics, all with the
same value.

Long term, doing away with sampling may help us provide more accurate
results.

After this patch, we are able to correctly calculate latency averages
through the data exported in prometheus.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171122144558.7575-1-glauber@scylladb.com>
2017-11-22 16:10:12 +01:00
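The sum bookkeeping described in commit 6c4e8049a0 can be sketched as follows. This is a simplified stand-in, not Scylla's estimated_histogram: the type `sampled_histogram` and its members are hypothetical, buckets are omitted, and the signature of `add_nano()` is an assumption (a sampled latency plus the new total count). The `merge()` member illustrates the point of commit 3f8d9a87ee, that merging should carry over both the count and the sum.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a sampled latency histogram.
// Buckets are omitted; only the count and sum bookkeeping is shown.
struct sampled_histogram {
    uint64_t _count = 0;  // number of operations accounted for so far
    uint64_t _sum = 0;    // running sum of (estimated) latencies

    // Records one sampled latency and raises the count to new_count.
    // Operations that were not sampled are assumed to share this latency.
    void add_nano(uint64_t latency_ns, uint64_t new_count) {
        if (new_count <= _count) {
            return;  // nothing new to account for
        }
        _sum += latency_ns * (new_count - _count);
        _count = new_count;
    }

    // Merging two histograms carries over both the count and the sum.
    void merge(const sampled_histogram& other) {
        _count += other._count;
        _sum += other._sum;
    }

    // Average latency, the quantity sum / count is meant to provide.
    double mean() const {
        return _count ? static_cast<double>(_sum) / _count : 0.0;
    }
};
```

With both fields kept up to date, an average computed as sum divided by count stays meaningful even when only a fraction of the operations is sampled.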
Tomasz Grabiec
e9ffe36d65 Merge "Remove sstable::read_rows" from Piotr
* seastar-dev.git haaawk/flat_reader_remove_read_rows:
  sstable_mutation_test: use read_rows_flat instead of read_rows
  perf_sstable: use read_rows_flat instead of read_rows
  Remove sstable::read_rows
2017-11-22 15:50:59 +01:00
Piotr Jastrzebski
0fdfd2c5bc Remove sstable::read_rows
It's no longer used. read_rows_flat is used everywhere instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:48:57 +01:00
Piotr Jastrzebski
571bac7336 perf_sstable: use read_rows_flat instead of read_rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:48:57 +01:00
Piotr Jastrzebski
da2f2164e9 sstable_mutation_test: use read_rows_flat instead of read_rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:48:57 +01:00
Tomasz Grabiec
aa8c2cbc16 Merge "Migrate sstables to flat_mutation_reader" from Piotr
Introduce sstable::read_row_flat and sstable::read_range_rows_flat methods
and use them in sstable::as_mutation_source.

* https://github.com/scylladb/seastar-dev/tree/haaawk/flat_reader_sstables_v3:
  Introduce conversion from flat_mutation_reader to streamed_mutation
  Add sstables::read_rows_flat and sstables::read_range_rows_flat
  Turn sstable_mutation_reader into a flat_mutation_reader
  sstable: add getter for filter_tracker
  Move mp_row_consumer methods implementations to the bottom
  Remove unused sstable_mutation_reader constructor
  Replace "sm" with "partition" in get_next_sm and on_sm_finished
  Move advance_to_upper_bound above sstable_mutation_reader
  Store sstable_mutation_reader pointer in mp_row_consumer
  Stop using streamed_mutation in consumer and reader
  Stop using streamed_mutation in sstable_data_source
  Delete sstable_streamed_mutation
  Introduce sstable::read_row_flat
  Migrate sstable::as_mutation_source to flat_mutation_reader
  Remove single_partition_reader_adaptor
  Merge data_consume_context::impl into data_consume_context
  Create data_consume_context_opt.
  Merge on_partition_finished into mark_partition_finished
  Check _partition_finished instead of _current_partition_key
  Merge sstable_data_source into sstable_mutation_reader
  Remove sstable_data_source
  Remove get_next_partition and partition_header
2017-11-22 15:45:21 +01:00
Calle Wilund
912d29e79b storage_service: don't use potentially stale iterator in log
Message-Id: <20171121115119.29642-2-calle@scylladb.com>
2017-11-22 15:26:56 +01:00
Piotr Jastrzebski
df110e8b4d Remove get_next_partition and partition_header
Handle next_partition in on_next_partition instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
a3b69235e3 Remove sstable_data_source
It's not used any more and can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
4b9a34a854 Merge sstable_data_source into sstable_mutation_reader
There's no need for sstable_data_source to be separated any more.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
f2191e0984 Check _partition_finished instead of _current_partition_key
to check whether partition is finished. In next patch
_current_partition_key will be merged with sstable_data_source::_key
and won't be cleared any more.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
be0c9040a6 Merge on_partition_finished into mark_partition_finished
This simplifies code quite a bit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:49 +01:00
Piotr Jastrzebski
8afbe0ead0 Create data_consume_context_opt.
This will be used in sstable_mutation_reader before
first fill_buffer is called and a proper data_consume_context
is created.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-22 15:24:22 +01:00
Duarte Nunes
3d24eed39e service/storage_service: Remove outdated FIXME
Thrift server is now a bit more graceful on shutdown.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171121214341.31165-1-duarte@scylladb.com>
2017-11-22 10:48:46 +02:00
Piotr Jastrzebski
7f8b62bc0b Merge data_consume_context::impl into data_consume_context
There's no reason to use pimpl in data_consume_context

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-21 20:22:38 +01:00
Takuya ASADA
c1b97d11ea dist/redhat: avoid hardcoding GPG key file path on scylla-epel-7-x86_64.cfg
Since we want to support cross building, we shouldn't hardcode the GPG file path,
even though these files are provided by recent versions of mock.

This fixes a build error on some older build environments such as CentOS-7.2.

Fixes #3002

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511277722-22917-1-git-send-email-syuu@scylladb.com>
2017-11-21 17:25:39 +02:00
Vladimir Krivopalov
61b1988aa1 Use meaningful error messages when throwing a marshal_exception
Fixes #2977

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20171121005108.23074-1-vladimir@scylladb.com>
2017-11-21 16:05:43 +02:00
Daniel Fiala
21ea05ada1 utils/big_decimal: Fix compilation issue with conversion of cpp_int to uint64_t.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171121134854.16278-1-daniel@scylladb.com>
2017-11-21 15:51:29 +02:00
Tomasz Grabiec
6969a235f3 Merge "Convert queries to flat mutation readers" from Paweł
These patches convert queries (data, mutation and counter) to flat
mutation readers. All of them already use consume_flattened() to
consume a flat stream of data, so the only major missing thing
was adding support for reversed partitions to
flat_mutation_reader::consume().

* pdziepak flat_mutation_reader-queries/v3-rebased:
  flat_mutation_reader: keep reference to decorated key valid
  flat_mutation_reader: support consuming reversed partitions
  tests/flat_mutation_reader: add test for
    flat_mutation_reader::consume()
  mutation_partition: convert queries to flat_mutation_readers
  tests/row_cache_stress_test: do not use consume_flattened()
  mutation_reader: drop consume_flattened()
  streamed_mutation: drop reverse_streamed_mutation()
2017-11-21 12:55:57 +01:00
Paweł Dziepak
8baf682216 streamed_mutation: drop reverse_streamed_mutation() 2017-11-21 11:37:04 +00:00
Paweł Dziepak
5753e85c6b mutation_reader: drop consume_flattened()
consume_flattened() has been fully replaced by
flat_mutation_reader::consume()
2017-11-21 11:37:04 +00:00
Paweł Dziepak
5851b86369 tests/row_cache_stress_test: do not use consume_flattened() 2017-11-21 11:37:04 +00:00
Paweł Dziepak
48c3db54c9 mutation_partition: convert queries to flat_mutation_readers 2017-11-21 11:37:04 +00:00
Paweł Dziepak
00c8b38a88 tests/flat_mutation_reader: add test for flat_mutation_reader::consume() 2017-11-21 11:37:04 +00:00
Paweł Dziepak
cdb30f74a8 flat_mutation_reader: support consuming reversed partitions
Some queries may need the fragments that belong to partition to be
emitted in the reversed order. Current support for that is very limited
(see #1413), but should work reasonably well for small partitions.
2017-11-21 11:37:04 +00:00
Paweł Dziepak
c817adc809 flat_mutation_reader: keep reference to decorated key valid
consume_flattened() guarantees that partition key (passed by reference)
will be valid until the end of partition.
flat_mutation_reader::consume() provides the same interface for consumer
so it also should make sure that the key remains valid.
2017-11-21 11:37:04 +00:00
Paweł Dziepak
1b936876b7 streamed_mutation: make emit_range_tombstone() exception safe
For a time, a range tombstone that was already removed from a tree
is owned by a raw pointer. This doesn't end well if creation of
a mutation fragment or a call to push_mutation_fragment() throws.
Message-Id: <20171121105749.16559-1-pdziepak@scylladb.com>
2017-11-21 12:28:20 +01:00
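The hazard fixed in commit 1b936876b7 is a general one: an object unlinked from a container and held by a raw pointer leaks if a later step throws. A minimal sketch of the usual remedy, assuming C++17, is shown below. The names `tombstone_node`, `make_fragment` and `emit_unlinked` are hypothetical and do not reflect the actual patch.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical node type that counts live instances so leaks are visible.
struct tombstone_node {
    static inline int live = 0;
    tombstone_node() { ++live; }
    ~tombstone_node() { --live; }
    tombstone_node(const tombstone_node&) = delete;
    tombstone_node& operator=(const tombstone_node&) = delete;
};

// Stand-in for a step that may fail, such as building a fragment.
inline void make_fragment(bool fail) {
    if (fail) {
        throw std::runtime_error("allocation failed");
    }
}

// Takes ownership of a node that was just unlinked from its container.
// The unique_ptr frees the node on every exit path, including a throw.
inline void emit_unlinked(tombstone_node* raw, bool fail) {
    std::unique_ptr<tombstone_node> owner(raw);
    make_fragment(fail);
    // ... hand the fragment to the consumer here ...
}
```

Wrapping the pointer before any throwing operation runs means no explicit cleanup code is needed on the error path.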
Avi Kivity
c6fa727af0 tracing: add missing include
The IDE doesn't understand what lw_shared_ptr<> means without it,
though it does compile.
2017-11-21 13:24:07 +02:00
Piotr Jastrzebski
ae3259c9be position_in_partition: support _type in operator<<
It is useful to print position_in_partition::_type together
with other fields to have a full view of what the position
represents.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <d2d25155a656aa6c2cefcd4964abccfa31cc4c45.1511252093.git.piotr@scylladb.com>
2017-11-21 12:35:32 +02:00
Vlad Zolotarov
941aa20252 cql_transport::cql_server: fix the distributed prepared statements cache population
Don't std::move() the "query" string inside the parallel_for_each() lambda.
parallel_for_each is going to invoke the given callback object for each element of the range
and as a result the first call of lambda that std::move()s the "query" is going to destroy it for
all other calls.

Fixes #2998

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1511225744-1159-1-git-send-email-vladz@scylladb.com>
2017-11-21 10:37:49 +02:00
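The bug class fixed in commit 941aa20252 can be illustrated without Seastar. In the sketch below a plain loop stands in for parallel_for_each, and `shard_cache` and `prepare_on_all` are hypothetical names. The point is only that a callback invoked once per element must not move from a value that every invocation needs.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-shard cache of prepared statements.
struct shard_cache {
    std::vector<std::string> prepared;
    void prepare(std::string query) { prepared.push_back(std::move(query)); }
};

// Populates every shard with the same query. The lambda runs once per
// shard, so it must copy `query`; moving it would leave later shards
// with a moved-from (typically empty) string.
inline void prepare_on_all(std::vector<shard_cache>& shards, const std::string& query) {
    auto per_shard = [&query](shard_cache& shard) {
        shard.prepare(query);  // copy, not std::move(query)
    };
    for (auto& shard : shards) {
        per_shard(shard);
    }
}
```

Taking `query` by const reference in the sketch also makes an accidental move inside the lambda degrade to a copy rather than empty the string.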
Piotr Jastrzebski
644f9d9883 Remove single_partition_reader_adaptor
It is no longer used anywhere.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
eb31ec00a2 Migrate sstable::as_mutation_source to flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
11a354b144 Introduce sstable::read_row_flat
This will be used together with sstables::read_range_rows
to migrate sstables::as_mutation_source().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
65c6f339d6 Delete sstable_streamed_mutation
It's no longer used so can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
e241b0c2de Stop using streamed_mutation in sstable_data_source
Use a partition_header instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
375c321e9d Stop using streamed_mutation in consumer and reader
Don't use streamed_mutation in mp_row_consumer
and sstable_mutation_reader.

Also use sstable_mutation_reader in sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:22:57 +01:00
Avi Kivity
ff19cdc092 Merge seastar upstream
* seastar 78cd87f...7f87529 (3):
  > exception: use phdr hash on reactor threads only
  > tests: httpd use noncopyable_function
  > Merge "fixes of issues found by seastar's unit tests" (ppc) from Vlad

Fixes #2967.
2017-11-20 17:16:52 +02:00
Botond Dénes
f059e71056 Add fast-forwarding with no data test to mutation_source_test
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9cb630bf9441e178b2040709f92767d4a740a875.1511180262.git.bdenes@scylladb.com>
2017-11-20 13:36:14 +01:00
Botond Dénes
a1a0d445d6 flat_mutation_reader_assertions: add fast_forward_to(position_range)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7b530909cf188887377aec3985f9f8c0e3b9b1e8.1511180262.git.bdenes@scylladb.com>
2017-11-20 13:35:57 +01:00
Botond Dénes
8065dca4a1 flat_mutation_reader_from_mutation_reader(): make ff more resilient
Currently flat_mutation_reader_from_mutation_reader()'s
converting_reader will throw std::runtime_error if fast_forward_to() is
called when its internal streamed_mutation_opt is disengaged. This can
create problems if this reader is a sub-reader of a combined reader as the
latter has no way to determine the source of a sub-reader EOS. A reader
can be in EOS either because it reached the end of the current
position_range or because it doesn't have any more data.
To avoid this, instead of throwing we just silently ignore the fact that
the streamed_mutation_opt is disengaged and set _end_of_stream to true
which is still correct.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <83d309b225950bdbbd931f1c5e7fb91c9929ba1c.1511180262.git.bdenes@scylladb.com>
2017-11-20 13:35:42 +01:00
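The behavioural change in commit 8065dca4a1 can be sketched roughly as below, assuming C++17. `converting_reader_sketch` is a hypothetical, heavily simplified type: an `std::optional<int>` stands in for the streamed_mutation_opt, and positions are plain integers.

```cpp
#include <optional>

// Hypothetical, simplified reader; `current` stands in for the
// converting reader's optional streamed mutation.
struct converting_reader_sketch {
    std::optional<int> current;  // disengaged when there is no partition
    bool end_of_stream = false;

    // Fast-forwards within the current partition. With no partition to
    // forward in, report end of stream instead of throwing.
    void fast_forward_to(int position) {
        if (!current) {
            end_of_stream = true;
            return;
        }
        current = position;
        end_of_stream = false;
    }
};
```

A combining reader that drives several such sub-readers can then forward all of them unconditionally, without knowing which ones have already run out of data.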
Duarte Nunes
34a0b85982 thrift/server: Handle exception within gate
The exception handling code inspects server state, which could be
destroyed before the handle_exception() task runs since it runs after
exiting the gate. Move the exception handling inside the gate and
avoid scheduling another accept if the server has been stopped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171116122921.21273-1-duarte@scylladb.com>
2017-11-20 13:55:14 +02:00
Takuya ASADA
f26cde582f configure.py: suppress 'nonnull-compare' warning on antlr3
We get the following warning from the antlr3 header when we compile Scylla with gcc-7.2:
/opt/scylladb/include/antlr3bitset.inl: In member function 'antlr3::BitsetList<AllocatorType>::BitsetType* antlr3::BitsetList<AllocatorType>::bitsetLoad() [with ImplTraits = antlr3::TraitsBase<antlr3::CustomTraitsBase>]':
/opt/scylladb/include/antlr3bitset.inl:54:2: error: nonnull argument 'this' compared to NULL [-Werror=nonnull-compare]

To make it compile we need to specify '-Wno-nonnull-compare' in cflags.

Message-Id: <1510952411-20722-2-git-send-email-syuu@scylladb.com>
2017-11-20 13:07:09 +02:00
Takuya ASADA
ab9d7cdc65 dist/debian: switch Debian 3rdparty packages to external build service
Switch Debian 3rdparty packages to our OBS repo
(https://build.opensuse.org/project/subprojects/home:scylladb).

We no longer use the 3rdparty packages in dist/debian/dep, so they are dropped.
We also switch Debian to gcc-7.2/boost-1.63 at the same time.

Due to packaging issues, the following packages do not follow our 3rdparty
package naming rule for now:
 - gcc-7: renamed as 'xxx-scylla72', instead of scylla-xxx-72.
 - boost1.63: not renamed, and its prefix is not changed to /opt/scylladb

Message-Id: <1510952411-20722-1-git-send-email-syuu@scylladb.com>
2017-11-20 13:07:04 +02:00
Tomasz Grabiec
cec5b0a5b8 Merge "Fix reversed queries with range tombstones" from Paweł
This series reworks handling of range tombstones in reversed queries
so that they are applied to correct rows. Additionally, the concept
of flipped range tombstones is removed, since it only made it harder
to reason about the code.

Fixes #2982.

* https://github.com/pdziepak/scylla fix-reverse-query-range-tombstone/v2:
  streamed_mutation: fix reversing range tombstones
  range_tombstone: drop flip()
  tests/cql_query_test: test range tombstones and reverse queries
  tests/range_tombstone_list: add test for range_tombstone_accumulator
2017-11-17 16:31:34 +01:00
Tomasz Grabiec
2113299b61 sstables: index_reader: Reset lower bound for promoted index lookups from advance_to_next_partition()
_current_pi_idx was not reset from advance_to_next_partition(), which
is used when we skip to the next partition before fully consuming
it. As a result, if we try to skip to a clustering position which is
before the index block used by the last skip in the previous
partition, we would not skip assuming that the new position is in the
current block. This may result in more data being read from the
sstable than necessary.

Fixes #2984
Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>
2017-11-17 11:00:26 +00:00
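The stale lower bound described in commit 2113299b61 can be sketched as follows. `pi_cursor_sketch` is a hypothetical, simplified cursor over promoted-index blocks with integer positions; it is not the real index_reader, and only `current_idx` is meant to mirror the role of `_current_pi_idx`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical cursor over one partition's promoted-index blocks.
struct pi_cursor_sketch {
    std::vector<int> block_starts;  // first position of each block, ascending
    std::size_t current_idx = 0;    // lower bound for the next lookup

    // Advances to the last block whose start is <= pos. The search never
    // moves backwards, so it begins at current_idx.
    std::size_t advance_to(int pos) {
        while (current_idx + 1 < block_starts.size() &&
               block_starts[current_idx + 1] <= pos) {
            ++current_idx;
        }
        return current_idx;
    }

    // Switching partitions must reset the lower bound; otherwise a lookup
    // in the new partition would start from the old partition's block.
    void advance_to_next_partition(std::vector<int> next_block_starts) {
        block_starts = std::move(next_block_starts);
        current_idx = 0;
    }
};
```

Without the reset, a lookup for an early position in the new partition would be answered from a later block, which is the kind of mismatch the commit message describes.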
Piotr Jastrzebski
f7bf782a41 Store sstable_mutation_reader pointer in mp_row_consumer
The reader will be used by mp_row_consumer instead of streamed_mutation
in next patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
145fcf846e Move advance_to_upper_bound above sstable_mutation_reader
It will be used in sstable_mutation_reader when the reader
will be used to implement sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
1c7938c44d Replace "sm" with "partition" in get_next_sm and on_sm_finished
Streamed mutation won't be used any more so get_next_partition
and on_partition_finished are more suitable names.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
4943f52ad7 Remove unused sstable_mutation_reader constructor
The constructor is never used so it can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
c7971eb8e3 Move mp_row_consumer methods implementations to the bottom
Those methods have to be below sstable_mutation_reader because
they will be using the reader instead of streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
19fcf8accf sstable: add getter for filter_tracker
This will be needed to use sstable_mutation_reader for
sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
537b42e153 Turn sstable_mutation_reader into a flat_mutation_reader
This is the first step which still uses streamed_mutation.
Next step will be to get rid of streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:00 +01:00
Paweł Dziepak
81b9595dcc tests/range_tombstone_list: add test for range_tombstone_accumulator 2017-11-16 17:15:36 +00:00
Paweł Dziepak
774fcc8c66 tests/cql_query_test: test range tombstones and reverse queries
Reproducer for #2982.
2017-11-16 17:15:36 +00:00
Paweł Dziepak
bb54af66a9 range_tombstone: drop flip()
Flipped range tombstones violated the assumption that position() <
end_position() and therefore could only be used in some specific cases.
2017-11-16 17:15:36 +00:00
Paweł Dziepak
5f08831192 streamed_mutation: fix reversing range tombstones
Right now a reversed streamed mutation emits range tombstones after the
mutation fragments affected by them. This breaks the queries.

This patch reworks the way range tombstones are handled in reversed
streams:
 - range tombstones are no longer flipped -- invariant that start bound
   is smaller than the end bound always holds
 - in reversed streams they are ordered by their end_position()

Fixes #2982.
2017-11-16 17:15:36 +00:00
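The ordering rule from commit 5f08831192 can be sketched with integer positions. `range_tombstone_sketch` and `order_for_reversed_stream` are hypothetical names, and bound inclusiveness is ignored here. The sketch only shows that tombstones keep start < end and are sorted by their end position, largest first, when the stream is reversed.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified range tombstone over integer positions.
// The invariant start < end holds in both stream directions.
struct range_tombstone_sketch {
    int start;
    int end;
};

// Orders tombstones for a reversed stream: largest end position first,
// so each tombstone precedes the rows it covers when rows are emitted
// in descending order.
inline void order_for_reversed_stream(std::vector<range_tombstone_sketch>& rts) {
    std::sort(rts.begin(), rts.end(),
              [](const range_tombstone_sketch& a, const range_tombstone_sketch& b) {
                  return a.end > b.end;
              });
}
```

Because the tombstones are never flipped, code that relies on position() being smaller than end_position() keeps working for both directions.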
Avi Kivity
6950389a3f Update seastar submodule
* seastar 11ad0b1...78cd87f (3):
  > Merge "http: Use output stream for files" from Amnon
  > tutorial: a section about when_all() and when_all_succeed()
  > Merge "Power8 related changes (what's left of them)" from Vlad
2017-11-16 16:31:46 +02:00
Avi Kivity
f18b3928d0 Merge 2017-11-16 14:55:45 +02:00
Avi Kivity
beffe469af index_entry: add move constructor, assignment operators
As can be seen in one of the traces in #2958, the copy constructor
of index_entry is called in response to std::vector<index_entry>::push_back(index_entry&&).

This is wasteful. Fix by providing the full suite of constructors/assignment operators.
Message-Id: <20171116121608.5580-1-avi@scylladb.com>
2017-11-16 13:54:05 +01:00
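The effect described in commit beffe469af follows from a C++ rule: a class with a user-declared copy constructor gets no implicit move constructor, so push_back(T&&) falls back to copying. A minimal sketch, assuming C++17, is below. `copy_only_entry` and `movable_entry` are hypothetical types that only count calls and do not resemble index_entry.

```cpp
#include <utility>
#include <vector>

// User-declared copy operations suppress the implicit move operations,
// so rvalues of this type are copied.
struct copy_only_entry {
    static inline int copies = 0;
    copy_only_entry() = default;
    copy_only_entry(const copy_only_entry&) { ++copies; }
    copy_only_entry& operator=(const copy_only_entry&) { ++copies; return *this; }
};

// The full suite of constructors and assignment operators.
struct movable_entry {
    static inline int copies = 0;
    static inline int moves = 0;
    movable_entry() = default;
    movable_entry(const movable_entry&) { ++copies; }
    movable_entry(movable_entry&&) noexcept { ++moves; }
    movable_entry& operator=(const movable_entry&) { ++copies; return *this; }
    movable_entry& operator=(movable_entry&&) noexcept { ++moves; return *this; }
};

// Pushes one temporary into a vector with enough capacity reserved,
// so no reallocation adds extra copies or moves to the counters.
template <typename Entry>
void push_one_rvalue() {
    std::vector<Entry> v;
    v.reserve(1);
    v.push_back(Entry{});
}
```

Marking the move operations noexcept also lets std::vector move elements during reallocation instead of copying them.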
Avi Kivity
bbcfc57cb4 Merge "Free auth and its use from global variables" from Jesse
"This patch series addresses #2929. The objective is to eliminate global
state from the implementation and use of all access-control functionality.

I've made every effort to make these patches logically independent and
incremental, but the final patch is big: this was necessary because
eliminating the global instances themselves is an atomic change."

* 'jhk/non_global_auth/v2' of https://github.com/hakuch/scylla:
  auth: Switch to sharded service
  tracing/trace_keyspace_helper: Use internal `client_state`
  auth: Make the QP an explicit dependency
  auth: Unify Java class name attributes
  auth: Make life-time control more consistent
  auth: Move metadata constants
  auth: Don't expose internal constant
  auth: Extract `permissions_cache`
  utils/loading_cache: Include necessary dependency
  auth: Fix static constant initialization
  auth: Extract `delayed_tasks` from `auth.cc`
2017-11-16 14:52:34 +02:00
Jesse Haber-Kucharsky
ba6a41d397 auth: Switch to sharded service
This change appears quite large, but is logically fairly simple.

Previously, the `auth` module was structured around global state in a
number of ways:

- There existed global instances for the authenticator and the
  authorizer, which were accessed pervasively throughout the system
  through `auth::authenticator::get()` and `auth::authorizer::get()`,
  respectively. These instances needed to be initialized before they
  could be used with `auth::authenticator::setup(sstring type_name)`
  and `auth::authorizer::setup(sstring type_name)`.

- The implementation of the `auth::auth` functions and the authenticator
  and authorizer depended on resources accessed globally through
  `cql3::get_local_query_processor()` and
  `service::get_local_migration_manager()`.

- CQL statements would check for access and manage users through static
  functions in `auth::auth`. These functions would access the global
  authenticator and authorizer instances and depended on the necessary
  systems being started before they were used.

This change eliminates global state from all of these.

The specific changes are:

- Move out `allow_all_authenticator` and `allow_all_authorizer` into
  their own files so that they're constructed like any other
  authenticator or authorizer.

- Delete `auth.hh` and `auth.cc`. Constants and helper functions useful
  for implementing functionality in the `auth` module have moved to
  `common.hh`.

- Remove silent global dependency in
  `auth::authenticated_user::is_super()` on the auth* service in favour
  of a new function `auth::is_super_user()` with an explicit auth*
  service argument.

- Remove global authenticator and authorizer instances, as well as the
  `setup()` functions.

- Expose dependency on the auth* service in
  `auth::authorizer::authorize()` and `auth::authorizer::list()`, which
  is necessary to check for superuser status.

- Add an explicit `service::migration_manager` argument to the
  authenticators and authorizers so they can announce metadata tables.

- The permissions cache now requires an auth* service reference instead
  of just an authorizer since authorizing also requires this.

- The permissions cache configuration can now easily be created from the
  DB configuration.

- Move the static functions in `auth::auth` to the new `auth::service`.
  Where possible, previously static resources like the `delayed_tasks`
  are now members.

- Validating `cql3::user_options` requires an authenticator, which was
  previously accessed globally.

- Instances of the auth* service are accessed through `external`
  instances of `client_state` instead of globally. This includes several
  CQL statements including `alter_user_statement`,
  `create_user_statement`, `drop_user_statement`, `grant_statement`,
  `list_permissions_statement`, `permissions_altering_statement`, and
  `revoke_statement`. For `internal` `client_state`, this is `nullptr`.

- Since the `cql_server` is responsible for instantiating connections
  and each connection gets a new `client_state`, the `cql_server` is
  instantiated with a reference to the auth* service.

- Similarly, the Thrift server is now also instantiated with a reference
  to the auth* service.

- Since the storage service is responsible for instantiating and
  starting the sharded servers, it is instantiated with the sharded
  auth* service which it threads through. All relevant factory functions
  have been updated.

- The storage service is still responsible for starting the auth*
  service it has been provided, and shutting it down.

- The `cql_test_env` is now instantiated with an instance of the auth*
  service, and can be accessed through a member function.

- All unit tests have been updated and pass.

Fixes #2929.
2017-11-15 23:22:42 -05:00
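One item in commit ba6a41d397 replaces a silent global dependency with a function that takes the service explicitly. The shape of that change can be sketched as below. Everything in the `auth_sketch` namespace is hypothetical and far simpler than the real `auth::service`; it only illustrates passing the dependency as an argument.

```cpp
#include <set>
#include <string>

namespace auth_sketch {

// Hypothetical, minimal service: just a set of superuser names.
struct service {
    std::set<std::string> superusers;
};

struct authenticated_user {
    std::string name;
};

// The service is an explicit argument, so callers choose the instance
// and tests can construct as many independent services as they need.
inline bool is_super_user(const service& auth, const authenticated_user& user) {
    return auth.superusers.count(user.name) > 0;
}

}  // namespace auth_sketch
```

Since no instance is reachable globally, two services with different contents can coexist in one process, which is what makes the component testable in isolation.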
Jesse Haber-Kucharsky
1dd181bd7b tracing/trace_keyspace_helper: Use internal client_state 2017-11-15 23:19:18 -05:00
Jesse Haber-Kucharsky
41612ee577 auth: Make the QP an explicit dependency
Rather than have all uses of the QP in auth reference global variables,
we supply a QP reference to both the authenticator and authorizer on
construction.

The caller still references a global variable when constructing the
instances, but fixing this problem is a much larger task that is out of
scope of this change.
2017-11-15 23:19:13 -05:00
Jesse Haber-Kucharsky
157e22a4f0 auth: Unify Java class name attributes 2017-11-15 23:19:00 -05:00
Jesse Haber-Kucharsky
9aff5d9a77 auth: Make life-time control more consistent 2017-11-15 23:18:44 -05:00
Jesse Haber-Kucharsky
5825e37310 auth: Move metadata constants
This change is motivated partly by aesthetics, but more significantly
due to the future work to refactor `auth` into a sharded service. Since
doing so will require writing `auth::auth` from scratch, these
constants (and other common functionality) need a new home.
2017-11-15 23:18:42 -05:00
Jesse Haber-Kucharsky
22670cae82 auth: Don't expose internal constant 2017-11-15 23:17:52 -05:00
Jesse Haber-Kucharsky
20b7f92b9c auth: Extract permissions_cache
In addition to improving clarity, this makes the cache testable.

There shouldn't be any functional changes.
2017-11-15 23:17:41 -05:00
Jesse Haber-Kucharsky
6f4241574c utils/loading_cache: Include necessary dependency 2017-11-15 23:17:05 -05:00
Jesse Haber-Kucharsky
5c39a2cc15 auth: Fix static constant initialization
Using "Meyers singletons" eliminates the problem of static constant
initialization order because static variables inside functions are
initialized only the first time control flow passes over their
declaration.

Fixes #2966.
2017-11-15 23:16:52 -05:00
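The pattern named in commit 5c39a2cc15 can be sketched as follows. The function names and string values are hypothetical and are not the constants touched by the patch.

```cpp
#include <string>

// Hypothetical constants; the names and values are illustrative only.
inline const std::string& keyspace_name() {
    static const std::string name = "example_auth";
    return name;
}

// Depends on keyspace_name(). Because both are function-local statics,
// keyspace_name() is initialized on first use, before it is read here.
inline const std::string& users_table_name() {
    static const std::string name = keyspace_name() + ".users";
    return name;
}
```

Had both strings been namespace-scope constants in different translation units, the order of their initialization would be unspecified, and the second could read the first before it was constructed.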
Jesse Haber-Kucharsky
507e1ef8d5 auth: Extract delayed_tasks from auth.cc
This simple task scheduler is used by the auth module to delay metadata
creation until the system is settled.

Extracting it out allows the `auth` module to be refactored into a
sharded service and for other components of `auth` to make use of it.

Fixes #2965.
2017-11-15 23:16:46 -05:00
Piotr Jastrzebski
74f0c01865 Add sstables::read_rows_flat and sstables::read_range_rows_flat
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 15:33:23 +01:00
Piotr Jastrzebski
3f70dfc939 Introduce conversion from flat_mutation_reader to streamed_mutation
Allows splitting migration into small steps.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 15:33:23 +01:00
Botond Dénes
7f60fb19b4 flat_mutation_reader::fast_forward_buffer_to: remove schema parameter
e7a0732f72 added the schema to
flat_mutation_reader::impl so the schema doesn't need to be provided
externally anymore.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <04933512d3485d85629a9945b8ecb211aa2aab50.1510732121.git.bdenes@scylladb.com>
2017-11-15 10:40:02 +01:00
Tomasz Grabiec
a061be688d Merge "Prepare sstables read path for flat_mutation_reader" from Piotr
This patchset prepares sstables read path for flat_mutation_reader.
It cuts some dependencies between classes and replaces
sstables::mutation_reader with ::mutation_reader.  This will make it
possible to gradually convert the code to flat_mutation_reader because
we have converters between flat_mutation_reader and ::mutation_reader.

* seastar-dev.git haaawk/flat_reader_prepare_sstables_rebased
  Reduce dependencies from mp_row_consumer to sstable_streamed_mutation
  Replace sstables::mutation_reader with ::mutation_reader
  Remove range_reader_adaptor
  Remove sstable_range_wrapping_reader
2017-11-15 10:40:02 +01:00
Piotr Jastrzebski
6cd4b6b09c Remove sstable_range_wrapping_reader
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:02 +01:00
Piotr Jastrzebski
6d85e4fb0c Remove range_reader_adaptor
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:02 +01:00
Piotr Jastrzebski
ea449c9cce Replace sstables::mutation_reader with ::mutation_reader
This will make migration to flat_mutation_reader much
easier and sstables::mutation_reader is going away with
this migration anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:01 +01:00
Piotr Jastrzebski
228f0737f4 Reduce dependencies from mp_row_consumer to sstable_streamed_mutation
Before this patch mp_row_consumer was using sstable_streamed_mutation
in two ways:

1. Populate sstable_streamed_mutation's buffer with mutation_fragments
2. Advance sstable_streamed_mutation's sstable_data_source to new position.

We can easily reduce those dependencies only to the first one.
This will reduce the coupling between those classes and simplify
the flow of execution.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:01 +01:00
Takuya ASADA
07c191af41 dist/common/scripts/scylla_dev_mode_setup: include scylla_lib.sh
To use the verify_args function we require scylla_lib.sh, so include it.

Fixes #2945

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1510154173-18017-1-git-send-email-syuu@scylladb.com>
2017-11-15 11:31:14 +02:00
Avi Kivity
7caf3a543e Merge "Respect size-tiered options in strategies that rely on its functionality" from Raphael
"Otherwise, such strategies couldn't behave as expected when they need to do STCS."

* 'respecting_stcs_options_v2' of github.com:raphaelsc/scylla:
  tests: enable twcs test that relied on size-tiered properties
  twcs: respect stcs options by forwarding them to stcs method
  lcs: forward stcs options to respect them
  stcs: make most_interesting_bucket respect size-tiered options
  stcs: make most_interesting_bucket respect thresholds
  compaction: make size_tiered_most_interesting_bucket static method of stcs class
  stcs: introduce new ctor
  stcs: make header self contained
  stcs: inline function definition so as not to break one definition rule
2017-11-14 17:57:57 +02:00
Raphael S. Carvalho
1f478d5daa tests: enable twcs test that relied on size-tiered properties
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:27 -02:00
Raphael S. Carvalho
8165af1d08 twcs: respect stcs options by forwarding them to stcs method
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:27 -02:00
Raphael S. Carvalho
9cdc047a4c lcs: forward stcs options to respect them
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:27 -02:00
Raphael S. Carvalho
2b7f87474b stcs: make most_interesting_bucket respect size-tiered options
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:25 -02:00
Raphael S. Carvalho
d8ec913c34 stcs: make most_interesting_bucket respect thresholds
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:26:04 -02:00
Raphael S. Carvalho
cb6d060d8e compaction: make size_tiered_most_interesting_bucket static method of stcs class
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:24:03 -02:00
Raphael S. Carvalho
b69dbf8b99 stcs: introduce new ctor
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:21:59 -02:00
Avi Kivity
0dc888f963 Merge 2017-11-14 15:59:52 +02:00
Tomasz Grabiec
7323fe76db gossiper: Replicate endpoint_state::is_alive()
Broken in f570e41d18.

Not replicating this may cause coordinator to treat a node which is
down as alive, or vice versa.

Fixes regression in dtest:

  consistency_test.py:TestAvailability.test_simple_strategy

which was expected to get "unavailable" exception but it was getting a
timeout.

Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
2017-11-14 15:58:00 +02:00
Vlad Zolotarov
c6c41aa877 tests: loading_cache_test: make it more robust
Make sure loading_cache::stop() is always called where appropriate:
regardless of whether the test failed or there was an exception during the test.
Otherwise a false-alarm use-after-free error may occur.

Fixes #2955

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1510625736-3109-1-git-send-email-vladz@scylladb.com>
2017-11-14 11:35:49 +00:00
Avi Kivity
09e730f9f2 Merge "Fix bugs in cache related to handling of bad_alloc" from Tomasz
"Fixes #2944."

* tag 'tgrabiec/cache-exception-safety-fixes-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add test for exception safety of multi-partition scans
  tests: row_cache: Add test for exception safety of single-partition reads
  tests: mutation_source_tests: Always print the seed
  tests: Disable alloc failure injection in test assertions
  tests: Avoid needless copies
  row_cache: Fix exception safety of cache_entry::read()
  row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data
  row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh()
  cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation
  mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors
  cache_streamed_mutation: Make advancing to the next range exception-safe
  cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe
  cache_streamed_mutation: Make drain_tombstones() exception-safe
  cache_streamed_mutation: Return void from start_reading_from_underlying()
  cache_streamed_mutation: Document invariants related to exception-safety
  streamed_mutation: Add reserve_one()
  lsa: Guarantee invalidated references on allocating section retry
  mvcc: partition_snapshot_row_cursor: Mark allocation points
2017-11-14 11:42:13 +02:00
Raphael S. Carvalho
f6574412a3 stcs: make header self contained
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-13 18:07:31 -02:00
Raphael S. Carvalho
2b45aa3593 stcs: inline function definition so as not to break one definition rule
The goal is to allow the header to be included in multiple translation units.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-13 18:07:30 -02:00
Tomasz Grabiec
638d23025b tests: row_cache: Add test for exception safety of multi-partition scans 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
084e1861c8 tests: row_cache: Add test for exception safety of single-partition reads 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
a968a84ec5 tests: mutation_source_tests: Always print the seed
BOOST_TEST_MESSAGE() is not logged by default, and for some tests we
don't want to enable that because it's too noisy. But we need to know
the seed to reproduce a failure, so we had better always print it.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
e868929faf tests: Disable alloc failure injection in test assertions
Injecting failures to assertions doesn't add much value but slows down
test execution by adding extra iterations.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
5cf7f9d1bb tests: Avoid needless copies 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
1971332195 row_cache: Fix exception safety of cache_entry::read()
When we fail, we need to return streamed_mutation back, so that
the operation can be retried.

Otherwise, a bad_alloc causes a SIGSEGV on nullptr.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
11a195c403 row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data
If assignment to _lower_bound in the "_secondary_in_progress = true;"
case in do_read_from_primary() throws due to allocation failure, the
update section will be retried and we will take the not_moved path,
skipping the range which was discontinuous and was supposed to be read
from underlying.

Fix by redoing lookup using _lower_bound in case the section is
retried. When we retry, _primary.valid() will be false. We need to
ensure now that _lower_bound is always valid.

Fixes #2944.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
5dc1ee41e4 row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh() 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
09c49b2db3 cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
f60cfa34f4 mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors
Assignment to _pos (position_in_partition) may throw. noexcept is a
remnant from the version which didn't have _pos.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
bd7b68f877 cache_streamed_mutation: Make advancing to the next range exception-safe
Changing _ck_ranges_curr and _lower_bound should be atomic, either
both fail or both succeed.  Currently it could happen that if
position_in_partition::for_range_start() fails, _ck_ranges_curr would
be advanced but _lower_bound not.
2017-11-13 20:55:14 +01:00
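The "either both fail or both succeed" requirement described in bd7b68f877 can be illustrated with a minimal sketch. This is not the Scylla code: range_cursor, bound_for() and the fail flag are hypothetical names invented for the illustration, and std::string stands in for position_in_partition. The idea shown is to compute the value that may throw first, and only then commit both members with operations that cannot throw.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a reader that tracks a current range and a lower bound.
struct range_cursor {
    std::vector<std::string> ranges;
    std::size_t curr = 0;
    std::string lower_bound;

    // Stand-in for a bound computation that may allocate and therefore throw.
    // The 'fail' flag only exists so the failure path can be exercised.
    static std::string bound_for(const std::string& range, bool fail) {
        if (fail) {
            throw std::bad_alloc();
        }
        return range;
    }

    void advance(bool fail = false) {
        const std::size_t next = curr + 1;
        // Step 1: everything that may throw happens before any member changes.
        std::string new_bound = next < ranges.size()
                ? bound_for(ranges[next], fail)
                : std::string();
        // Step 2: commit. Neither statement can throw
        // (std::string move assignment is noexcept).
        curr = next;
        lower_bound = std::move(new_bound);
    }
};
```

If bound_for() throws, neither curr nor lower_bound has been touched, so a retry starts from a consistent state.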
Tomasz Grabiec
081deec731 cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe
We need to maintain the following invariants:
 (1) no fragment with position >= _lower_bound was pushed yet
 (2) If _lower_bound > mf.position(), mf was emitted

Before this patch (1) could be violated if drain_tombstones() failed
in the middle. (2) could be violated if push_mutation_fragment()
failed.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
d1b844737a cache_streamed_mutation: Make drain_tombstones() exception-safe
If push_mutation_fragment() failed, mfo which we got from get_next()
would be lost. Fix by making sure push_mutation_fragment() won't fail.
2017-11-13 20:55:14 +01:00
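The fix described in d1b844737a (make sure the push cannot fail before taking the next item) can be sketched as follows. This is an illustration only: std::deque and std::vector stand in for the tombstone source and the reader buffer, and this reserve_one() is a hypothetical helper modelled on the name of the one added in 53f4452b47, not its actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical helper: guarantee room for one more element.
// Grows geometrically so repeated calls stay amortized O(1).
template <typename T>
void reserve_one(std::vector<T>& buf) {
    if (buf.size() == buf.capacity()) {
        buf.reserve(std::max<std::size_t>(1, buf.capacity() * 2)); // may throw
    }
}

// Move every element from 'source' into 'buffer' without losing any element
// if an allocation fails part-way through.
template <typename T>
void drain(std::deque<T>& source, std::vector<T>& buffer) {
    while (!source.empty()) {
        // May throw, but nothing has been removed from 'source' yet.
        reserve_one(buffer);
        // Capacity is already available, so push_back does not reallocate;
        // for types with a non-throwing move constructor it cannot fail.
        buffer.push_back(std::move(source.front()));
        source.pop_front();
    }
}
```

If reserve_one() throws, the element is still at the front of the source, so the operation can be retried without skipping data.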
Tomasz Grabiec
875fc93956 cache_streamed_mutation: Return void from start_reading_from_underlying()
The return value is no longer used.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
5fb319bbb9 cache_streamed_mutation: Document invariants related to exception-safety 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
53f4452b47 streamed_mutation: Add reserve_one() 2017-11-13 20:55:13 +01:00
Tomasz Grabiec
8d69d217af lsa: Guarantee invalidated references on allocating section retry
There is existing code (e.g. use of partition_snapshot_row_cursor in
cache_streamed_mutation) which assumes that references will be
invalidated when bad_alloc is thrown from allocating_section. That is
currently the case because on retry we will attempt memory reclamation
which will invalidate references either through compaction or
eviction. Make this guarantee explicit.
2017-11-13 20:55:13 +01:00
Tomasz Grabiec
6bf1c6014f mvcc: partition_snapshot_row_cursor: Mark allocation points
This marks places which may allocate, but do not always do so, as
allocation points to increase the effectiveness of testing.
2017-11-13 20:55:13 +01:00
Raphael S. Carvalho
cfd2343689 sstables: fix report in integrity check file interposer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113185842.11018-1-raphaelsc@scylladb.com>
2017-11-13 20:07:09 +01:00
Raphael S. Carvalho
cf8e12c760 checked_file_impl: remove unneeded variant of open_checked_file_dma
like in integrity_checked_file_impl, we don't need a variant of
open for default file open options.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113185412.10880-1-raphaelsc@scylladb.com>
2017-11-13 20:06:58 +01:00
Raphael S. Carvalho
564046a135 thrift: fix compilation error
thrift/server.cc:237:6:   required from here
thrift/server.cc:236:9: error: cannot call member function ‘void thrift_server::maybe_retry_accept(int, bool, std::__exception_ptr::exception_ptr)’ without object
         maybe_retry_accept(which, keepalive, std::move(ex));

gcc version: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113184537.10472-1-raphaelsc@scylladb.com>
2017-11-13 20:05:33 +01:00
Avi Kivity
9cd5bc4eb8 Merge "Convert streaming to flat mutation readers" from Paweł
"The following patches convert streaming and repair code to the new
flat mutation reader interface. In particular this involves changing:
 - fragment_and_freeze() -- a consumer that fragments and freezes mutations
 - checksum computation for repair which until now was using two-level
   mutation_reader/streamed_mutation interface
 - multi_range_reader -- a mutation reader that automatically fast
   forwards between the given partition ranges"

* tag 'flat_mutation_reader-streaming/v2' of https://github.com/pdziepak/scylla: (24 commits)
  mutation_reader: drop multi_range_reader
  db: convert make_streaming_reader() to flat_mutation_reader
  tests/flat_mutation_reader: add test for multi range reader
  tests/flat_mutation_reader_assertions: add fast_forward_to()
  tests/simple_schema: add to_ring_positions() helper
  flat_mutation_reader: convert flat_multi_range_mutation_reader
  flat_mutation_reader: add partition_range_forwarding
  flat_mutation_reader: make pop_mutation_fragment() public
  flat_mutation_reader: copy multi_range_mutation_reader
  streamed_mutation: drop mutation_hasher
  tests/flat_mutation_reader: add test for partition checksum
  repair: convert partition_checksum::compute_streamed() to flat streams
  repair: make partition_hasher consume flat mutation streams
  mutation_hasher: copy mutation_hasher to repair.cc
  partition_start: make partition_tombstone() const
  partition_checksum: introduce compute() for flat_mutation_reader
  db: drop single-range make_streaming_reader()
  fragment_and_freeze: drop streamed_mutation overload
  stream_transfer_task: switch to flat_mutation_reader
  tests/flat_mutation_reader: add test for fragment_and_freeze
  ...
2017-11-13 18:56:59 +02:00
Paweł Dziepak
97767963a0 mutation_reader: drop multi_range_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
dca93bea23 db: convert make_streaming_reader() to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
98965add5b tests/flat_mutation_reader: add test for multi range reader
Based on mutation_reader.cc:test_multi_range_reader.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
d23813cd41 tests/flat_mutation_reader_assertions: add fast_forward_to() 2017-11-13 16:49:52 +00:00
Paweł Dziepak
8fc9d250c5 tests/simple_schema: add to_ring_positions() helper
Based on mutation_reader_test.cc:to_ring_position()
2017-11-13 16:49:52 +00:00
Paweł Dziepak
a9ec01d5a5 flat_mutation_reader: convert flat_multi_range_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
11e8866aee flat_mutation_reader: add partition_range_forwarding
flat_mutation_reader::partition_range_forwarding and
mutation_reader::forwarding are aliases of the same type. The change was
necessary in order to make mutation_reader::forwarding available in
flat_mutation_reader.hh even though it is included by mutation_reader.hh
2017-11-13 16:49:52 +00:00
Paweł Dziepak
009785a178 flat_mutation_reader: make pop_mutation_fragment() public
The flat_mutation_reader public interface already exposes the low-level
is_buffer_empty() and is_buffer_full(). Adding pop_mutation_fragment()
will make the implementation of intermediate readers more straightforward.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
d9a2b00d4a flat_mutation_reader: copy multi_range_mutation_reader
multi_range_mutation_reader for flat mutation readers is going to be
based on the original one.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
7866e5b4a9 streamed_mutation: drop mutation_hasher 2017-11-13 16:49:52 +00:00
Paweł Dziepak
aa64b711d1 tests/flat_mutation_reader: add test for partition checksum
Based on streamed_mutation_test:test_mutation_hash
2017-11-13 16:49:52 +00:00
Paweł Dziepak
f690e2e80b repair: convert partition_checksum::compute_streamed() to flat streams 2017-11-13 16:49:52 +00:00
Paweł Dziepak
d71a14b943 repair: make partition_hasher consume flat mutation streams 2017-11-13 16:49:52 +00:00
Paweł Dziepak
2b774119a1 mutation_hasher: copy mutation_hasher to repair.cc
Repair is the exclusive user of mutation_hasher. Moving it there will
make integration with partition_checksum easier.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
af4fa6152b partition_start: make partition_tombstone() const 2017-11-13 16:49:52 +00:00
Paweł Dziepak
f648f94464 partition_checksum: introduce compute() for flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
37640f223b db: drop single-range make_streaming_reader() 2017-11-13 16:49:52 +00:00
Paweł Dziepak
e2481a89e1 fragment_and_freeze: drop streamed_mutation overload 2017-11-13 16:49:52 +00:00
Paweł Dziepak
6f1e0d3ed8 stream_transfer_task: switch to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
50a1d76c1f tests/flat_mutation_reader: add test for fragment_and_freeze
Based on streamed_mutation_test:test_fragmenting_and_freezing_streamed_mutations
2017-11-13 16:49:52 +00:00
Paweł Dziepak
f5c40e0861 flat_mutation_reader_from_mutations: take vector by value 2017-11-13 16:49:51 +00:00
Paweł Dziepak
9854b8a450 fragment_and_freeze: work on flat_mutation_readers 2017-11-13 16:49:47 +00:00
Paweł Dziepak
8bb672502d fragment_and_freeze: allow callback to stop iteration
There is a user of fragment_and_freeze() (streaming) that will need
to be able to break the loop. Right now, it does that between
streamed_mutations, but that won't be possible after we switch to flat
readers.
2017-11-13 16:44:33 +00:00
Paweł Dziepak
73b8f54cf4 test/mutation_source_test: generate sets of mutations 2017-11-13 16:42:56 +00:00
Tomasz Grabiec
3536d2156c tests: row_cache: Add reproducer for issue #2948
Message-Id: <1510229584-14398-2-git-send-email-tgrabiec@scylladb.com>
2017-11-13 15:20:21 +00:00
Tomasz Grabiec
8402728747 row_cache: Call open_version() under region's allocator
partition_entry::read() calls open_version() under standard allocator,
but it may allocate a new partition version if a snapshot already
exists which was created in an earlier phase. Versions are supposed to
be allocated using region's allocator, they will be freed using
region's allocator. LSA will delegate free() to the standard allocator
correctly in this case, but it will subtract from its
_non_lsa_occupancy, assuming the allocation was done through it. This
will corrupt occupancy() for cache region.

Fixes #2948.
Message-Id: <1510229584-14398-1-git-send-email-tgrabiec@scylladb.com>
2017-11-13 15:20:08 +00:00
Avi Kivity
061f6830fa Merge "thrift/server: Ensure stop() waits for accepts" from Duarte
"Ensure stop() waits for the accept loop to complete to avoid crashes
during shutdown."

* 'thrift-server-stop/v4' of https://github.com/duarten/scylla:
  thrift/server: Restore code format
  thrift/server: Stopping the server waits for connection shutdown
  thrift/server: Abort listeners on stop()
  thrift/server: Avoid manual memory management
  thrift/server: Add move ctor for connection
  thrift/server: Extract retry logic
  thrift/server: Retry with backoff for some error types
  thrift/server: Retry accept in case of error
2017-11-13 12:48:05 +02:00
Duarte Nunes
049fbb58f3 thrift/server: Restore code format
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:21:54 +01:00
Duarte Nunes
7b25e3200a thrift/server: Stopping the server waits for connection shutdown
This patch ensures the future returned from stop() resolves only when
all connections and listeners are no longer in use.

Fixes #2657
Fixes #2942

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:21:53 +01:00
Duarte Nunes
f523a0f845 thrift/server: Abort listeners on stop()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:44 +01:00
Duarte Nunes
8e0e2363e9 thrift/server: Avoid manual memory management
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:44 +01:00
Duarte Nunes
75d04be96f thrift/server: Add move ctor for connection 2017-11-13 11:19:44 +01:00
Duarte Nunes
9d3322ff1a thrift/server: Extract retry logic
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:43 +01:00
Duarte Nunes
b5cf1a152f thrift/server: Retry with backoff for some error types
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:19 +01:00
Duarte Nunes
f367dbe1ed thrift/server: Retry accept in case of error
In case of errors like ECONNABORTED, we want to retry accepting
connections. Right now we immediately retry the accept, but in
subsequent patches we introduce a backoff for other types of errors.

We also consider fatal errors like EBADFD, which should not trigger a
retry.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:03 +01:00
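The retry policy described in f367dbe1ed and b5cf1a152f amounts to classifying the accept error. The sketch below is hypothetical and not taken from thrift/server.cc: accept_action and classify_accept_error() are invented names, and the set of errors that get a backoff (EMFILE, ENFILE, ENOMEM) is an assumption, since the commit messages do not list them.

```cpp
#include <cerrno>

// Hypothetical classification of accept() failures.
enum class accept_action { retry_now, retry_with_backoff, give_up };

accept_action classify_accept_error(int err) {
    switch (err) {
    case ECONNABORTED:
        // The peer went away before we accepted it: retry immediately.
        return accept_action::retry_now;
    case EMFILE: // assumption: resource exhaustion may clear up after a pause
    case ENFILE:
    case ENOMEM:
        return accept_action::retry_with_backoff;
    default:
        // Fatal errors (the commit message names EBADFD) end up here
        // and should not trigger a retry.
        return accept_action::give_up;
    }
}
```

Keeping the classification in a pure function makes the policy easy to test separately from the accept loop itself.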
Avi Kivity
d57395dce9 cql: prevent overflow when computing averages
Currently, we use the type of the column as the accumulator when we
average it. This can easily overflow, e.g. (2^31-1)+(3) = overflow.

Fix by using __int128 for the accumulator.  It's not standard, but
it's way more efficient and simpler than the alternatives.

Inspired by CASSANDRA-12417, but much simpler due to the availability
of __int128.
Message-Id: <20171112173529.30764-1-avi@scylladb.com>
2017-11-13 08:53:59 +01:00
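The overflow described in d57395dce9 and the wide-accumulator fix can be sketched as below. This is a minimal illustration, not the cql3 aggregate code: avg_int32 is a hypothetical type, and __int128 is a GCC/Clang extension, as the commit message notes.

```cpp
#include <cstdint>

// Hypothetical averaging accumulator for a 32-bit integer column.
struct avg_int32 {
    __int128 sum = 0;       // much wider than the column type, so sums do not overflow
    std::int64_t count = 0;

    void add(std::int32_t v) {
        sum += v;
        ++count;
    }

    // The average of int32 values always fits back into int32.
    std::int32_t result() const {
        return count ? static_cast<std::int32_t>(sum / count) : 0;
    }
};
```

With an int32 accumulator, adding 2^31-1 and 3 would overflow before the division; with the wide accumulator the sum is 2147483650 and the average is 1073741825.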
Piotr Jastrzebski
acfc6fef55 Simplify flat_mutation_reader wrappers
If a wrapper takes a flat_mutation_reader in a constructor
then it does not have to take schema_ptr because it can obtain
it from the inner flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <88c3672df08d2ac465711e9138d426e43ae9c62b.1510331382.git.piotr@scylladb.com>
2017-11-13 08:53:34 +01:00
Avi Kivity
f8af4f507b Merge "Support for varint and decimal in aggregate functions" from Daniel
"This patch adds support for varint and decimal to aggregate functions.
Some other types (like byte or smallint) weren't supported and they are
supported by C*. So their aggregate functions were added as well.

To allow aggregate functions for big_decimal, following methods were added
to big_decimal type:
  * Division by int64_t that preserves the number of decimal digits.
  * Operator += .
  * Comparison operators.

Fixes #2842."

* 'danfiala/scylla-2842-send-002' of https://github.com/hagrid-the-developer/scylla:
  tests: Add tests for aggregate functions.
  tests: Add tests for big_decimal type.
  cql3/functions: Add aggregate functions for big_decimal.
  utils/big_decimal: Added necessary operators and methods for aggregate functions.
  cql3/functions: Add aggregate functions for types for which it is trivial.
2017-11-12 17:11:33 +02:00
Daniel Fiala
bc20484c47 tests: Add tests for aggregate functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:53:22 +01:00
Daniel Fiala
ee1d69502b tests: Add tests for big_decimal type.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:53:22 +01:00
Daniel Fiala
74c5f70b0a cql3/functions: Add aggregate functions for big_decimal.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:53:13 +01:00
Daniel Fiala
ce2f010859 utils/big_decimal: Added necessary operators and methods for aggregate functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:51:29 +01:00
Daniel Fiala
115668fe70 cql3/functions: Add aggregate functions for types for which it is trivial.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 13:56:20 +01:00
Tomasz Grabiec
484dde692f Merge "make sure that cache updates don't overflow dirty memory" from Glauber
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).

However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box's
memory.

This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).

After that is done, we need to make sure that dirty memory is not seen
as freed until the cache update is done. Until a particular partition is
moved to the cache it is not evictable. As a result we can OOM the
system if we have a lot of pending cache updates as the writes will not
be throttled and memory won't be made available.

This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the meantime it
gradually releases memory of the partitions that are being moved to
cache.

I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.

Fixes #1942

* git@github.com:glommer/scylla.git glommer/update-cache-v4:
  row_cache: modernize use of seastar threads
  mutation_partition: estimate size of partition
  memtable: factor out calculation of memtable_entry memory size
  memtable: add a method to export memtable's dirty memory manager
  dirty_memory_manager: block if we hit the real dirty limit
  dirty_memory_manager: add functions to manipulate real dirty
  partition: add method to calculate memory size of a partition
  row cache: pin real dirty during cache updates.
2017-11-10 13:55:12 +01:00
Piotr Jastrzebski
e7a0732f72 Add schema_ptr to flat_mutation_reader
It is useful to have a schema inside a flat reader
the same way we had schema inside a streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b37e0dbf38810c00bd27fb876b69e1754c16a89f.1510312137.git.piotr@scylladb.com>
2017-11-10 13:54:55 +01:00
Pekka Enberg
0c192c835c cql3: Fix 'DROP INDEX' to also drop index view
This patch fixes 'DROP INDEX' CQL statement to also drop the underlying
index view automatically so that we don't leave unused materialized
views behind.
Message-Id: <1510303421-15945-1-git-send-email-penberg@scylladb.com>
2017-11-10 10:52:08 +01:00
Duarte Nunes
73f6c9a612 Merge seastar upstream
* seastar 8040cab...11ad0b1 (7):
  > alloc_failure_injector: Fix compilation error with gcc 7.1
  > core/gate: Add is_closed() function
  > doc: code formatting and fix function call
  > doc: tutoral code formatting
  > build: adjust -Wno-error=cpp for clang
  > build: don't error out on preprocessor #warning
  > Merge 'Enhancements of allocation failure injector' from Tomasz

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-09 14:42:06 +01:00
Takuya ASADA
f607a01cc5 dist/debian: link boost statically
Since we switched to scylla-boost163, which isn't provided by the distribution
repo, we need to link it statically.

Fixes #2946

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1510229553-29801-1-git-send-email-syuu@scylladb.com>
2017-11-09 14:51:00 +02:00
Glauber Costa
1d7617723d row cache: pin real dirty during cache updates.
Right now, once a region is moved to the cache it is no longer visible to
the dirty memory system, neither as real dirty nor as virtual dirty.

The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.

This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the meantime it
gradually releases memory of the partitions that are being moved to
cache.

I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.

Fixes #1942

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 19:46:36 -05:00
Glauber Costa
c2f49da609 partition: add method to calculate memory size of a partition
Once that is added, also add a method to a memtable entry to calculate
the entire size of a memtable entry. Right now we only have one method
to calculate the size minus rows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
b02ab991b9 dirty_memory_manager: add functions to manipulate real dirty
There are times in which we want to add and remove real dirty memory
without impacting virtual dirty. One such example is the cache update
process, where real dirty is the limiting factor.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
a6b2226562 dirty_memory_manager: block if we hit the real dirty limit
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).

However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box's
memory.

This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).

A next step is to add a controller that will increase the priority of
the tasks involved in releasing real dirty memory if we get dangerously
close to the threshold.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
b98a48657e memtable: add a method to export memtable's dirty memory manager
It will be used by the cache update process to gradually return real
dirty memory to the manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
ec36b9eddc memtable: factor out calculation of memtable_entry memory size
The total size is the sum of two components. Add a method that
does that sum so this code gets easier to reuse.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
d49ecae201 mutation_partition: estimate size of partition
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition at a point at which we
are not reading it. One such example is the cache update process.

This patch enhances the mutation_partition adding a method that returns
the total size for this partition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
b836005555 row_cache: modernize use of seastar threads
For a while now we have had an async() function that simplifies the code by not
needing to issue an explicit join. This patch converts the row cache to use
async() as well, which most of our code already does. Doing so will make
it easier to make changes to update_cache.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Paweł Dziepak
b69f94fece Merge "Implement flat_mutation_reader::consume" from Piotr
"Implement flat_mutation_reader::consume and add tests for it.
For that implement flat_mutation_reader_from_mutations and
read_mutation_from_flat_mutation_reader."

* 'haaawk/flat_reader_consume_v3' of github.com:scylladb/seastar-dev:
  Add tests for flat_mutation_reader::consume
  Add tests for flat_mutation_reader utils
  Introduce read_mutation_from_flat_mutation_reader
  Make mutation_rebuilder streamed_mutation independent
  flat_mutation_reader_from_mutation: support multiple mutations
  Introduce flat_mutation_reader::consume
  Move FlattenedConsumer concept to flat_mutation_reader.hh
2017-11-08 15:08:47 +00:00
Paweł Dziepak
0373f357a8 Merge "Make memtable::make_reader return flat_mutation_reader" from Piotr
"This patchset introduces memtable::make_flat_reader that returns
flat_mutation_reader and converts internal memtable readers into
flat_mutation_readers.

It also introduces some utility methods like make_forwardable and
make_partition_snapshot_flat_reader."

* 'haaawk/flat_reader_memtable_v4' of github.com:scylladb/seastar-dev:
  Turn scanning_reader into flat_mutation_reader
  Change memtable_entry::read to return flat_mutation_reader
  Make iterator_reader independent from mutation_reader
  Introduce make_partition_snapshot_flat_reader
  Prepare partition_snapshot_flat_reader
  Introduce flat_mutation_reader_from_mutation
  Prepare flat_mutation_reader_from_mutation
  Introduce make_forwardable
  Prepare make_forwardable for flat_mutation_reader
  Introduce empty_flat_reader
  memtable: Introduce make_flat_reader
2017-11-08 14:24:26 +00:00
Piotr Jastrzebski
29d409de2f Add tests for flat_mutation_reader::consume
Make sure that flat_mutation_reader::consume stops
when the consumer asks it to.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
d42e53982d Add tests for flat_mutation_reader utils
Test flat_mutation_reader_from_mutations and
read_mutation_from_flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
4b58a05053 Introduce read_mutation_from_flat_mutation_reader
This helper method reads a single mutation from
a flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
6718ecab82 Make mutation_rebuilder streamed_mutation independent
mutation_rebuilder will be used not only with streamed_mutations
but also with flat_mutation_readers so it's better for it to be
independent from streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
aa16cd7eef flat_mutation_reader_from_mutation: support multiple mutations
Rename flat_mutation_reader_from_mutation to
flat_mutation_reader_from_mutations.

Make it work with std::vector<mutation> instead of a single
mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
bcd5415413 Introduce flat_mutation_reader::consume
This is equivalent to consume_flattened for old
mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:25:28 +01:00
Piotr Jastrzebski
9233ee7309 Move FlattenedConsumer concept to flat_mutation_reader.hh
This concept will be used both in flat_mutation_reader.hh
and mutation_reader.hh. mutation_reader.hh includes
flat_mutation_reader.hh so we have to move the concept to
make it accessible in both files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:14:51 +01:00
Piotr Jastrzebski
864d02e795 Turn scanning_reader into flat_mutation_reader
This will make memtable::make_reader more efficient.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:08:53 +01:00
Tomasz Grabiec
9e115fb7e2 Merge addition of mutation_source::make_flat_mutation_reader() from Piotr
Make it possible for a mutation_source to be created both for sources
that use old mutation_reader and new flat_mutation_reader.

Add tests for flat_mutation_reader::next_partition to run_mutation_source_tests.

* seastar-dev.git 'dev/haaawk/flat_reader_mutation_source_v3':
  Remove mutation_reader.hh dependency from flat_mutation_reader.hh
  Prepare mutation_source for more than one implementation
  Add flat reader mutation source implementation
  Add mutation_source::make_flat_mutation_reader
  Use mutation_source::make_flat_mutation_reader in tests
  Add flat_mutation_reader_assertions
  Add test for flat_mutation_reader::next_partition
2017-11-08 14:05:25 +01:00
Piotr Jastrzebski
68505a5065 Change memtable_entry::read to return flat_mutation_reader
This is the first step to move scanning_reader to
be flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
7b016527bf Make iterator_reader independent from mutation_reader
iterator_reader will be used also in flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
f499949645 Introduce make_partition_snapshot_flat_reader
This allows creation of flat_mutation_reader from MVCC.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
ced64d7571 Prepare partition_snapshot_flat_reader
This commit creates a copy of partition_snapshot_reader
and names it partition_snapshot_flat_reader.
This new class will be turned into a flat_mutation_reader
in the next commit.

The purpose of this commit is to make it easier to review the next
commit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
ed074a4f56 Introduce flat_mutation_reader_from_mutation
This is a utility method that will be handy in conversion
from mutation_reader to flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
c3a4ce842a Prepare flat_mutation_reader_from_mutation
This commit copies streamed_mutation_from_mutation
from streamed_mutation to flat_mutation_reader
and renames it to streamed_mutation_from_mutation_copy.
This copy will be used as a base for
flat_mutation_reader_from_mutation.

The purpose of this commit is to make it easier to review the next
commit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
decefe6eaa Introduce make_forwardable
It will add the ability to fast_forward_to on position_range
to flat_mutation_reader that does not have this ability.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
6da8caf26f Prepare make_forwardable for flat_mutation_reader
This commit copies make_forwardable from streamed_mutation
to flat_mutation_reader and renames it to make_forwardable_copy.
This copy will be used as a base for make_forwardable implementation
for flat_mutation_reader.

The purpose of this commit is to make it easier to review the next
commit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
647dd7f86a Introduce empty_flat_reader
This is an implementation of flat_mutation_reader
that returns nothing.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
0a9ab7ff80 memtable: Introduce make_flat_reader
This method creates a flat_mutation_reader
instead of mutation_reader. All users will be gradually
converted to the new interface. make_reader is implemented
using make_flat_reader and will be removed once all users
are migrated.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
3661aca7ee Add test for flat_mutation_reader::next_partition
This is added to the run_mutation_source_tests suite.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
1c9e4ba04c Add flat_mutation_reader_assertions
This will be useful in tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
4bca2210bf Use mutation_source::make_flat_mutation_reader in tests
Use the new call in run_conversion_to_mutation_reader_tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
6efda10790 Add mutation_source::make_flat_mutation_reader
This will be used as an intermediate state of migration
from mutation_reader to flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
93e8b43e7b Add flat reader mutation source implementation
This will be used by sources that are migrated to
flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:41:12 +01:00
Piotr Jastrzebski
1a7936561e Prepare mutation_source for more than one implementation
There will be a second implementation that will be used by
sources that are converted to flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:41:12 +01:00
Piotr Jastrzebski
e80007559b Remove mutation_reader.hh dependency from flat_mutation_reader.hh
It's not needed and causes a cyclic dependency when we need
flat_mutation_reader in mutation_source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:41:12 +01:00
Duarte Nunes
f50b7c240f tests/view_schema_test: Wrap view queries in eventually()
...instead of wrapping the base table queries, since those will
immediately succeed. This fixes occasional failures.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171107170221.4309-1-duarte@scylladb.com>
2017-11-08 09:20:43 +02:00
Duarte Nunes
328b908574 tests/view_schema_test: Avoid non-pk restrictions
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Change some test cases to not rely on them.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171107165138.3176-1-duarte@scylladb.com>
2017-11-08 09:20:11 +02:00
Duarte Nunes
cb9daec8fd thrift: Preserve query order for some verbs
f44131226a introduced a regression where for some verbs we would
return partitions in their natural sort order, but since thrift
partition ranges can wrap-around, what we need to preserve is query
order.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171103201118.18175-1-duarte@scylladb.com>
2017-11-07 17:00:48 +00:00
Paweł Dziepak
5a4b46f555 Merge "Fix exception safety related to range tombstones in cache" from Tomasz
Fixes #2938.

* 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev:
  tests: range_tombstone_list: Add test for exception safety of apply()
  tests: Introduce range_tombstone_list assertions
  cache: Make range tombstone merging exception-safe
  range_tombstone_list: Introduce apply_monotonically()
  range_tombstone_list: Make reverter::erase() exception-safe
  range_tombstone_list: Fix memory leaks in case of bad_alloc
  mutation_partition: Fix abort in case range tombstone copying fails
  managed_bytes: Declare copy constructor as allocation point
  Integrate with allocation failure injection framework
2017-11-07 15:30:52 +00:00
Pekka Enberg
b515cca5e2 tests/view_schema_test: Disable non-PK restriction tests
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Disable the tests, but don't remove them because they
will be resurrected once CASSANDRA-13832 is fixed.

Message-Id: <1510052422-3478-1-git-send-email-penberg@scylladb.com>
2017-11-07 16:42:00 +02:00
Tomasz Grabiec
0123eb876c tests: range_tombstone_list: Add test for exception safety of apply() 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
dedb8a6a15 tests: Introduce range_tombstone_list assertions 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
bbca83d4c0 cache: Make range tombstone merging exception-safe
range_tombstone_list::apply() has no exception safety guarantees about
the logical state. The target mutation_partition in cache should be
assumed to be left in unspecified state. In particular, some of the
preexisting overlapping tombstones may be removed and not reinserted,
so the cache would be missing some of the range tombstone information
in case the whole allocating section fails.

Use apply_monotonically() which provides the needed guarantees.

Fixes #2938.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
9c620e0246 range_tombstone_list: Introduce apply_monotonically() 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
2fe53ac617 range_tombstone_list: Make reverter::erase() exception-safe
erase_undo_op() constructor takes ownership of *it, and destroys it
when it goes out of scope. If emplace_back() fails, *it would be
destroyed before being removed from its container (_dst._tombstones).
Fix by making sure _ops.emplace_back() won't fail.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
6190f9fc63 range_tombstone_list: Fix memory leaks in case of bad_alloc
If insert() fails, the allocated range_tombstone would not be freed.
Use alloc_strategy_unique_ptr.
2017-11-07 15:33:24 +01:00
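
The leak described in 6190f9fc63 follows a common pattern: an object is allocated, and the call that would hand ownership to a container throws before ownership is transferred. A minimal sketch of the pattern and the smart-pointer fix is below. It uses std::unique_ptr in place of alloc_strategy_unique_ptr, and the tombstone and tombstone_list types are hypothetical stand-ins for illustration, not the Scylla classes.

```cpp
#include <memory>
#include <new>
#include <vector>

// Hypothetical stand-in for a range tombstone; counts live instances
// so that a leak is observable.
struct tombstone {
    static inline int live = 0;
    int start;
    explicit tombstone(int s) : start(s) { ++live; }
    ~tombstone() { --live; }
};

// Hypothetical container that owns raw pointers once insert() succeeds.
// fail_next simulates an allocation failure inside insert().
struct tombstone_list {
    std::vector<tombstone*> items;
    bool fail_next = false;

    void insert(tombstone* t) {
        if (fail_next) {
            throw std::bad_alloc();
        }
        items.push_back(t);
    }

    ~tombstone_list() {
        for (auto* t : items) {
            delete t;
        }
    }
};

// Leaky variant: if insert() throws, nobody owns t any more.
void add_leaky(tombstone_list& list, int start) {
    auto* t = new tombstone(start);
    list.insert(t);
}

// Safe variant: the unique_ptr keeps ownership until insert() has
// succeeded, so a throwing insert() frees the object during unwinding.
void add_safe(tombstone_list& list, int start) {
    auto t = std::make_unique<tombstone>(start);
    list.insert(t.get());
    t.release();
}
```

The release() call comes only after insert() returns, which is what makes the failure path clean.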
Tomasz Grabiec
ca3e72266f mutation_partition: Fix abort in case range tombstone copying fails
If an exception is thrown from _row_tombstones.apply(), _rows will be
left uncleared. This will trigger an assertion in the bi::set_member_hook
destructor, which asserts that the hook is not linked.

Always clear _rows.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
5348d9f596 managed_bytes: Declare copy constructor as allocation point
Because of the small size optimization, not all copies will call the
allocator, so allocation failure injection may miss this site if the
value is not large enough. Make the testing more effective by marking
this place explicitly as an allocation point.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
34ccf234ea Integrate with allocation failure injection framework 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
2d3f3ab2b8 Update seastar submodule
* seastar d71922c...8040cab (4):
  > util: Introduce support for allocation failure injection
  > Adding dpdk-port-index as a command line option with default value of 0
  > core/sharded: Introduce invoke_on_others()
  > noncopyable_function: improve support for capturing mutable lambdas
2017-11-07 15:29:45 +01:00
Paweł Dziepak
80cfcc357f Merge "Config fixes" from Calle
"Fixes #2933

Fixes regressions introduced by config restructuring.
Allows "base" config to handle errors by warning, while
other uses can opt otherwise."

[pdziepak: resolved merge conflict]

* 'calle/cfgfix' of github.com:scylladb/seastar-dev:
  config_test: Use error handler (ignore errors) + add error test
  config: Resurrect command line aliases that were lost
  main: Use error handler for config parse
  config_file: Add optional "error_handler" to yaml parse functions
2017-11-06 11:40:37 +00:00
Amos Kong
76ab8bf292 scylla_setup: parse --no-ec2-check option
The option was introduced by commit e645b0f ("dist/common/scripts: move
EC2 configuration verification to 'scylla_ec2_check'"), but scylla_setup
doesn't parse the option at all.

Fixes #2934

Signed-off-by: Amos Kong <amos@scylladb.com>
Acked-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <df21356e528c6e161f73f4408a201fedef8f52d9.1509744954.git.amos@scylladb.com>
2017-11-06 12:06:54 +02:00
Calle Wilund
21b2c2e310 config_test: Use error handler (ignore errors) + add error test
Fixes #2933

Uses handler on main test, ignoring the invalid option present.
Also adds test to verify error handling works as expected.
2017-11-06 09:58:16 +00:00
Calle Wilund
959d729428 config: Resurrect command line aliases that were lost 2017-11-06 09:54:46 +00:00
Calle Wilund
f1dd698600 main: Use error handler for config parse
Treat all errors as loggable errors/warnings. Preserving previous
behaviour.
2017-11-06 09:54:09 +00:00
Calle Wilund
287b6fd8bd config_file: Add optional "error_handler" to yaml parse functions
Allowing parse errors / unknown options to be ignored.
2017-11-06 09:53:05 +00:00
Amos Kong
d9f16cf23c trivial: fix a typo in warning message
> std::invalid_argument: Option memtable_allocation_typeis not applicable

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <37d293a4eadfb8a58acaf96f80b1d2e943530c6b.1509947604.git.amos@scylladb.com>
2017-11-06 09:41:07 +02:00
Duarte Nunes
d8e0b47e75 Merge 'CQL secondary index queries' from Pekka
"This patch series adds support for secondary index queries using the
backing index view that's created when CREATE INDEX statement is
executed.

Example:

  -- Create keyspace and table:

  CREATE KEYSPACE ks WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};

  CREATE TABLE ks.users (
    userid uuid,
    name text,
    email text,
    country text,
    PRIMARY KEY (userid)
  );

  -- Create secondary indexes:

  CREATE INDEX ON ks.users (email);

  CREATE INDEX ON ks.users (country);

  -- Insert some data:

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Bondie Easseby', 'beassebyv@house.gov', 'France');

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Demetri Curror', 'dcurrorw@techcrunch.com', 'France');

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Langston Paulisch', 'lpaulischm@reverbnation.com', 'United States');

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Channa Devote', 'cdevote14@marriott.com', 'Denmark');

  -- Query on the secondary-index backed non-primary keys:

  SELECT * FROM ks.users WHERE email = 'beassebyv@house.gov';

   userid                               | country | email               | name
  --------------------------------------+---------+---------------------+----------------
   022238c8-5213-44b5-959e-4e3e1b032f85 |  France | beassebyv@house.gov | Bondie Easseby

  (1 rows)

  SELECT * FROM ks.users WHERE country = 'France';

   userid                               | country | email                   | name
  --------------------------------------+---------+-------------------------+----------------
   2152d85a-61f6-4eab-af4d-e7e7d0872319 |  France |     beassebyv@house.gov | Bondie Easseby
   59fddb6d-bfc9-4636-a9a0-85383fd815ee |  France | dcurrorw@techcrunch.com | Demetri Curror

Known limitations:

- Only regular column indexes return results. Indexing primary key
  components like clustering keys returns an empty result set because of
  index view query partition key serialization issues that will be fixed
  in subsequent patches.

- Secondary index queries are not paginated, which can cause problems
  for queries that return a large number of rows.

- Multiple restrictions don't work correctly if one of them is backed by
  a secondary-index.

- Only one secondary-indexed restriction per query is supported -- other
  restrictions are ignored.

- Compound partition keys are not supported.

- ALLOW FILTERING on non-primary key columns does not work correctly
  without secondary index (see issue #2200)."

* 'penberg/cql-2i-queries/v2' of github.com:penberg/scylla:
  tests/cql_query_test: Add test case for secondary index queries
  cql3: Secondary-index backed select statements
  index: Fix index view schema when primary key component is indexed
  tests/cql_query_test: Fix view creation in test_duration_restrictions()
  cql3/restrictions: Add statement_restrictions::index_restrictions() helper
  index: Implement index::supports_expression() for EQ operator
  cql3: Make operator_type class non-copyable
  index: Fix index::supports_expression() operator parameter type
  cql3: Implement statement_restriction index validation
2017-11-04 01:51:55 +01:00
Tomasz Grabiec
2e96069f2f tests: perf_cache_eviction: Switch to time-series like workload
Before the patch we appended and queried at the front. Insert at the
front instead, so that writes and reads overlap. Stresses eviction and
population more.
Message-Id: <1506369562-14892-1-git-send-email-tgrabiec@scylladb.com>
2017-11-03 13:45:41 +00:00
Tomasz Grabiec
92e3449d59 mutation_reader: Do not call fast_forward_to() on a reference to a capture
The range reference is supposed to be valid as long as the reader is
used, not just around fast_forward_to().

Introduced in a6b9186cab
Message-Id: <1509710642-12713-1-git-send-email-tgrabiec@scylladb.com>
2017-11-03 12:09:42 +00:00
Amos Kong
4762326f35 dist/redhat: Fix dependence version issue
The spec file requires two different versions, which causes a conflict.

The problem was introduced in commit
6893ad46b8 ("dist/redhat: Switch to
g++-7/boost-1.63 on CentOS7").

Fixes #2931

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f0d74c4ae0d325d7e2bd827f56a36330b9ef19eb.1509703504.git.amos@scylladb.com>
2017-11-03 13:40:22 +02:00
Paweł Dziepak
0de651d617 Merge "Mark whole query range continuous in cache" from Tomasz
"We currently can't insert row entries at any position_in_partition,
but only at full keys and after all keys. If a query range has bounds
such that we have to insert a dummy entry at non-representable position
then information about range continuity will not be fully populated.
In particular, single-row queries of a row which is not present in sstables
will miss when repeated again.

The series fixes the problem by marking the whole query range as continuous
by inserting dummy entries at boundaries when necessary.

Refs #2579."

* tag 'tgrabiec/cache-range-continuity-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add test for population of single rows
  tests: Add test for population of continuity
  tests: mutation_reader_assertions: Introduce produces_compacted()
  mutation: Introduce apply(mutation_fragment)
  cache: Document invariants of cache_streamed_mutation::_lower_bound
  cache_streamed_mutation: Special-case population for singular ranges
  query: Introduce is_single_row()
  cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction
  cache_streamed_mutation: Override continuity of older versions when populating
  cache_streamed_mutation: Mark whole query range as continuous
  tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition
  cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows
  cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases
  row_cache: Make read_context::key() valid before reading from underlying starts
  mutation_partition: Allow creating rows_entry at any clustered position_in_partition
  position_in_partition: Do not use -2 and +2 weights
  clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range
  mutation_partition: Remove delegating_compare()
  mvcc: Print iterators in operator<< for partition_snapshot_row_cursor
  mvcc: Introduce partition_snapshot_row_weakref
  mvcc: Make the null state of partition_snapshot::change_mark explicit
  mvcc: Add partition_snapshot::region() getter
  mvcc: Add partition_snapshot::schema() getter
  position_in_partition: Introduce before_key()
  position_in_partition: Introduce min()
  position_in_partition: Introduce for_static_row()
2017-11-03 11:05:49 +00:00
Paweł Dziepak
ab12981491 test.py: make sure that tests/memory_footprint is being run
While not a real unit test, memory_footprint can be quite a useful
tool, and running it among the other tests will ensure that we notice
when it gets broken.
Message-Id: <20171102160233.6756-2-pdziepak@scylladb.com>
2017-11-03 11:46:30 +01:00
Paweł Dziepak
4cda3170d6 tests/memory_footprint: do not create two cache instances
When created, the cache registers several metrics. Since an attempt to create
an already existing metric results in an exception being thrown, it is no
longer possible to have two cache instances at the same time. This is
exactly what happens in memory_footprint: one (useless) cache object is
created through a call to do_with_cql_env() and then memory_footprint
explicitly creates another one (not a useless one).

The test itself doesn't really need a full cql environment and the only
reason it was added is so that storage_service is initialised and various
code paths can query for the available cluster features. This can be
done in a much more lightweight way using storage_service_for_tests.

Fixes memory_footprint failure (until next time we decide there is
nothing wrong with globals).

Message-Id: <20171102160233.6756-1-pdziepak@scylladb.com>
2017-11-03 11:46:30 +01:00
Amos Kong
f2ff431b75 dist/redhat: Fix baseurl of 3rdparty repo
$basearch isn't parsed as expected, so the final baseurl is wrong.
We only have the x86_64 arch in the external 3rdparty repository, and
the conf file is only for x86_64, so it's fine to hardcode
x86_64.

The problem was introduced by commit
b5e83ebd94 ("dist/redhat: switch 3rdparty
packages to external build service").

Fixes #2930

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <708f46a7c36623e86fee278462c80db1eff3b820.1509700430.git.amos@scylladb.com>
2017-11-03 11:51:34 +02:00
Pekka Enberg
aeea83172b tests/cql_query_test: Add test case for secondary index queries 2017-11-03 10:12:58 +02:00
Pekka Enberg
9048f741ad cql3: Secondary-index backed select statements
This patch adds support for secondary-index backed select statements.
Current select_statement class is split into two separate classes:
primary_key_select_statement that retains regular query behavior and
indexed_table_select_statement that introduces the new secondary-index
backed query logic. One of the two behaviors is selected at query
preparation time to minimize overhead for non-indexed queries.
2017-11-03 10:12:58 +02:00
Pekka Enberg
3150962cb7 index: Fix index view schema when primary key component is indexed
This fixes index view schema to exclude indexed column when a primary
key component like clustering key is indexed. This fixes a server crash
when CREATE INDEX statement is executed on a clustering key column.
2017-11-03 10:12:58 +02:00
Pekka Enberg
3c90607988 tests/cql_query_test: Fix view creation in test_duration_restrictions()
The materialized view created in test_duration_restrictions() restricts
on a non-PK column. Since Scylla's ALLOW FILTERING and secondary index
validation path is broken, once we start to do secondary index queries,
the query processor thinks there's a secondary index backing that non-PK
column and fails because it's unable to find such a column.

Fix up the view to only trigger the duration type validation error we're
interested in here.
2017-11-03 10:12:58 +02:00
Pekka Enberg
c243a0c8fc cql3/restrictions: Add statement_restrictions::index_restrictions() helper 2017-11-03 09:10:43 +02:00
Pekka Enberg
678a6f6e2f index: Implement index::supports_expression() for EQ operator 2017-11-03 09:10:43 +02:00
Pekka Enberg
04b482146c cql3: Make operator_type class non-copyable
The operator_type class is really an enumeration, which is not supposed
to be copied.
2017-11-03 09:10:43 +02:00
Pekka Enberg
1ae9343f68 index: Fix index::supports_expression() operator parameter type
The cql3::operator_type is supposed to be passed around as const
reference, not by value; otherwise equality won't work.
2017-11-03 09:10:43 +02:00
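
The two operator_type commits above (04b482146c and 1ae9343f68) fit together if equality is defined by object identity, which is an assumption here rather than something the log states. Under that assumption a by-value parameter would compare a copy against the canonical instance and never match, so deleting the copy constructor turns the mistake into a compile error. A minimal sketch, with a hypothetical op_type class standing in for cql3::operator_type:

```cpp
// Hypothetical enumeration-like class: a fixed set of named instances,
// compared by address rather than by value.
class op_type {
    const char* _name;
    explicit op_type(const char* name) : _name(name) {}
public:
    // Copying would create a new object with a different address,
    // which would break identity-based equality.
    op_type(const op_type&) = delete;
    op_type& operator=(const op_type&) = delete;

    static const op_type EQ;
    static const op_type LT;

    bool operator==(const op_type& other) const { return this == &other; }
    bool operator!=(const op_type& other) const { return this != &other; }
    const char* name() const { return _name; }
};

inline const op_type op_type::EQ{"="};
inline const op_type op_type::LT{"<"};

// Taking the operator by const reference preserves its identity.
// A by-value parameter would not compile, because the copy is deleted.
bool supports_expression(const op_type& op) {
    return op == op_type::EQ;
}
```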
Pekka Enberg
3e3c580f74 cql3: Implement statement_restriction index validation 2017-11-03 09:10:43 +02:00
Botond Dénes
ce03a4d2c7 test.py: print failed test summary if there are failed tests
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3a913b111552276ab94dfb83738244699550f929.1507894597.git.bdenes@scylladb.com>
2017-11-02 11:49:14 +00:00
Tomasz Grabiec
16d6222a96 tests: row_cache: Add test for population of single rows 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
bbf8ccb709 tests: Add test for population of continuity 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
3ad5666098 tests: mutation_reader_assertions: Introduce produces_compacted() 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
749f5770df mutation: Introduce apply(mutation_fragment) 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
a76202df4f cache: Document invariants of cache_streamed_mutation::_lower_bound
(cherry picked from commit b52813279d30782270ac83856233f18787b28b7e)
2017-11-02 12:16:17 +01:00
Tomasz Grabiec
328faf695e cache_streamed_mutation: Special-case population for singular ranges
This is an optimization which avoids creating dummy entries around row
entry when populating a singular range.
2017-11-02 12:16:09 +01:00
Tomasz Grabiec
90796893ee query: Introduce is_single_row() 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
0fd57cdff5 cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
8c41a3eb43 cache_streamed_mutation: Override continuity of older versions when populating
Fixes the case of continuity not being populated when the row which is
the upper bound of the population range belongs to a non-latest
version. In such case we wouldn't mark the range as continuous,
because we can't modify rows of non-latest versions. To fix this,
create an empty entry in latest version which will just override the
continuity flag of the old entry.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
65ed490e1c cache_streamed_mutation: Mark whole query range as continuous
Before this patch only ranges between returned row fragments were
marked as continuous.  In the extreme case, there could be no such
fragments, in which case next read would miss as well. To avoid this,
mark whole query range as continuous by inserting dummy entries when
necessary.

Refs #2579.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
552d7a683a tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
d4928eb1b7 cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows
Current code will not mark the range as continuous if the previous
entry does not come in the latest version. Fix that by switching
to partition_snapshot_row_pointer, which is capable of checking
in older versions as necessary.

Also, we avoid the key comparison if we know that the iterator
is still valid.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
835d17ee37 cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
af4a9a4a30 row_cache: Make read_context::key() valid before reading from underlying starts
So that we can call cache_streamed_mutation::can_populate() before
we start reading from underlying. Will be needed in upcoming changes
which insert dummy entries when falling back to underlying.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
72028bb048 mutation_partition: Allow creating rows_entry at any clustered position_in_partition
In preparation for supporting setting continuity of arbitrary clustering range.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
9ac1b515e1 position_in_partition: Do not use -2 and +2 weights
::weight() is using those values for excl_end and excl_start in order
to be able to represent non-overlapping ranges. In their model the end
bound is inclusive. We don't need this, since position_range has end
bound exclusive.

This change ensures that:

  position_in_partition::after_key(y)
    == position_in_partition::for_range_end(clustering_range::make({x}, {y}))
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
4b25fa1130 clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range
position_range is end-exclusive. The reader might have returned a tombstone
which is not really relevant for the range.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
409adc045a mutation_partition: Remove delegating_compare()
It can't work with rows_entry at any position_in_partition,
so we need to drop it.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
b4954f55b9 mvcc: Print iterators in operator<< for partition_snapshot_row_cursor 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
ad156b5986 mvcc: Introduce partition_snapshot_row_weakref 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
967cabcaf2 mvcc: Make the null state of partition_snapshot::change_mark explicit 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
4b7933543d mvcc: Add partition_snapshot::region() getter 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
9cf30f19ae mvcc: Add partition_snapshot::schema() getter 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
34cb13939f position_in_partition: Introduce before_key() 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
cc06c328ef position_in_partition: Introduce min() 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
d05f130b09 position_in_partition: Introduce for_static_row() 2017-11-02 11:05:19 +01:00
Calle Wilund
8c257c40b4 storage_service: Only replicate token metadata iff modified in on_change
Fixes #2869

Message-Id: <20171101105629.22104-1-calle@scylladb.com>
2017-11-01 14:56:55 +02:00
Jesse Haber-Kucharsky
da5c486e49 Add coding-style.md referencing Seastar
While it would be nice if we could reference the file corresponding to
the exact version of Seastar pinned as a Scylla submodule, GitHub does
not support this.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <b7565acd1ccb8bce42b0edf00221922e78e1c9ef.1508274655.git.jhaberku@scylladb.com>
2017-10-30 16:52:29 -07:00
Duarte Nunes
74a4cf8bb1 thrift/handler: multiget_{slice, count} always returns queried keys
This patch changes the way the multiget_{slice, count} verbs return
their results, by ensuring a queried key that produced no results is
still present in the returned map, associated with an empty list.

This is not required by the thrift interface, and it is a performance
step back, but matches the behavior of Apache Cassandra.

Said behavior is relied upon by projects like JanusGraph, whose
integration with Scylla motivated this patch.

Fixes #2900

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-2-duarte@scylladb.com>
2017-10-30 16:48:58 -07:00
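The behavior described in 74a4cf8bb1 (every queried key appears in the returned map, with an empty list when it produced no results) can be sketched roughly as below. This is an illustration only: `result_map` and `with_all_queried_keys` are hypothetical names, and plain strings stand in for the thrift key and column types.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the map returned by multiget_slice:
// key -> list of columns (plain strings here, for illustration only).
using result_map = std::map<std::string, std::vector<std::string>>;

// Make sure every queried key is present in the result map.
// Keys that already have results are left untouched, because
// std::map::emplace does nothing when the key already exists.
result_map with_all_queried_keys(const std::vector<std::string>& queried_keys,
                                 result_map results) {
    for (const auto& key : queried_keys) {
        results.emplace(key, std::vector<std::string>{});
    }
    return results;
}
```

The extra insertions are the cost the commit message refers to as a performance step back: the map is populated even for keys that contribute nothing.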
Duarte Nunes
f44131226a thrift/handler: Use map for column_visitor aggregation
Most common operations, like multiget_count and multiget_slice, return
maps. So, instead of keeping a vector internally in column_visitor
that we later transform into a map, keep a map that we transform into
a vector for the uncommon operations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-1-duarte@scylladb.com>
2017-10-30 16:48:55 -07:00
Takuya ASADA
a56dd79d69 dist/redhat: moving gcc-7.1 to gcc-7.2
We mistakenly merged the patch which compiles Scylla using gcc-7.1; it needs to
be fixed to use the correct version (gcc-7.2).

Message-Id: <1508875618-31659-1-git-send-email-syuu@scylladb.com>
2017-10-30 14:26:43 -07:00
Tomasz Grabiec
b4e3c0946a cache_streamed_mutation: Avoid copy of decorated_key
Message-Id: <1509060503-17483-1-git-send-email-tgrabiec@scylladb.com>
2017-10-26 16:51:27 -07:00
Pekka Enberg
9ba3920fd7 Merge "cql3/query_processor: Clean-up" from Jesse
"This series cleans-up the query processor header and source file,
 including deleting dead Java code.

 There are no functional or interface changes.

 I've run all unit tests and observed no failures."

* 'jhk/clean_up_qp/v2' of github.com:hakuch/scylla:
  cql3/query_processor: Fix formatting
  cql3/query_processor: Organize headers
  cql3/query_processor.hh: Consolidate `public` and `private` sections
  cql3/query_processor: Remove dead Java code
2017-10-21 21:48:06 +03:00
Jesse Haber-Kucharsky
66c4abe4fb cql3/query_processor: Fix formatting
Lines are now less than 120 columns and formatting conforms to the Seastar coding standards document.
2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky
edb83c0014 cql3/query_processor: Organize headers 2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky
ed6a3179a1 cql3/query_processor.hh: Consolidate public and private sections 2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky
50cfa8a7b8 cql3/query_processor: Remove dead Java code 2017-10-21 13:53:03 -04:00
Avi Kivity
ef8587a910 Merge seastar upstream
* seastar 8babd1f...d71922c (11):
  > configure.py: add -Wno-sign-compare to compile Boost.Test with gcc-7
  > log: Print nested exceptions
  > reactor: do not account non idle activity for total idle time calculation
  > execution_stage: defer execution less aggressively
  > Fix -Wreturn-type warnings
  > cpu scheduler: make _reciprocal_shares_times_2_32 wider to avoid overflow problems
  > noncopyable_function add bool operator
  > execution_stage: make make_execution_stage return a named type
  > memory: support overriding the default allocator page size
  > memory: fix crash during startup with large page_size
  > core: io_destroy is missing when destructing reactor, which causes io_context leak
2017-10-21 16:38:32 +03:00
Takuya ASADA
bc76d34e34 dist/debian: handle python scripts correctly on package builder
We are failing to build the .deb package on pbuilder due to missing build-time
dependencies, so we need to add those packages to Build-Depends. We also need to
follow Debian packaging style for packages that contain python scripts.

Fixes #2918

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457215-11552-1-git-send-email-syuu@scylladb.com>
2017-10-21 16:38:20 +03:00
Takuya ASADA
6893ad46b8 dist/redhat: Switch to g++-7/boost-1.63 on CentOS7
Switch to g++-7/boost-1.63 on CentOS7, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457509-13122-2-git-send-email-syuu@scylladb.com>
2017-10-21 12:28:40 +03:00
Takuya ASADA
7f38634080 dist/debian: Switch to g++-7/boost-1.63 on Ubuntu 14.04/16.04
Switch to g++-7/boost-1.63 for Ubuntu 14.04/16.04, which are newly provided via
our 3rdparty PPA.

To make Scylla compilable with boost-1.63/g++-7, we need to disable the
following warnings:
 - misleading-indentation
 - overflow
 - noexcept-type
Compile error message:
https://gist.github.com/syuu1228/96acc640c56c3316df5ce6911d60beea

Seastar also has a similar problem: it needs to disable 'sign-compare'. Details
are in a patch for Seastar.

This update also fixes the current Ubuntu 14.04/16.04 compilation errors,
since the errors came from too old g++/boost.

Fixes #2902
Fixes #2903

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457509-13122-1-git-send-email-syuu@scylladb.com>
2017-10-21 12:28:38 +03:00
Avi Kivity
d6cd44a725 Revert "Merge 'Single key sstable reader optimization' from Botond"
This reverts commit 5e9cd128ad, reversing
changes made to 1f4e6759a7. Tomek found
some serious issues.
2017-10-19 12:47:21 +03:00
Botond Dénes
9bd4d7cbb2 Readd x right to configure.py (removed by 05db87e06)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <41d505277fe29b275aba65477cafc9275b393a64.1508395961.git.bdenes@scylladb.com>
2017-10-19 11:30:06 +03:00
Duarte Nunes
5e9cd128ad Merge 'Single key sstable reader optimization' from Botond
"When reading a single row it is possible that the read will be satisfied
by just reading from one of the data source candidates. To exploit this
an optimization is employed which sorts data source candidates by their
timestamp and reads mutations from the most recent to the oldest. When
all needed cells are present and their earliest timestamp is still
later than the latest one of the remaining data source the read can be
terminated early.
However this optimization also has the possibility to backfire as the
data sources are read sequentially, so if all of them have to be read
eventually then we will end up worse than without it.
Thus the optimization can be disabled up-front or enabled to only run
until its efficiency degrades below a certain threshold.
Also counters are added to column-families to make it possible to
observe how well it performs.

Benchmarking

Benchmarking was done with disabled cache and at a constant op rate of
4k (1/3 of the max op rate on my box), against 3 sstables containing the
same 10000 rows.

1) Optimization turned off (all sstables read in parallel)
latency mean              : 1.3 [simple:1.3]
latency median            : 1.0 [simple:1.0]
latency 95th percentile   : 2.4 [simple:2.4]
latency 99th percentile   : 2.9 [simple:2.9]
latency 99.9th percentile : 8.0 [simple:8.0]
latency max               : 13.5 [simple:13.5]

2) Optimization turned on, best case (1 of 3 sstables read)
latency mean              : 0.6 [simple:0.6]
latency median            : 0.6 [simple:0.6]
latency 95th percentile   : 1.0 [simple:1.0]
latency 99th percentile   : 1.2 [simple:1.2]
latency 99.9th percentile : 4.4 [simple:4.4]
latency max               : 13.4 [simple:13.4]

3) Optimization turned on, best case, IN query (1 of 3 sstables read)
latency mean              : 0.7 [simple_in:0.7]
latency median            : 0.6 [simple_in:0.6]
latency 95th percentile   : 1.1 [simple_in:1.1]
latency 99th percentile   : 1.4 [simple_in:1.4]
latency 99.9th percentile : 5.4 [simple_in:5.4]
latency max               : 16.8 [simple_in:16.8]

4) Optimization turned on, worst case (3 of 3 sstables read sequentially)
latency mean              : 2.8 [simple:2.8]
latency median            : 2.3 [simple:2.3]
latency 95th percentile   : 5.4 [simple:5.4]
latency 99th percentile   : 6.5 [simple:6.5]
latency 99.9th percentile : 13.5 [simple:13.5]
latency max               : 19.2 [simple:19.2]

5) Optimization turned on, mid case (2 of 3 sstables read sequentially)
latency mean              : 1.4 [simple:1.4]
latency median            : 1.1 [simple:1.1]
latency 95th percentile   : 2.7 [simple:2.7]
latency 99th percentile   : 3.2 [simple:3.2]
latency 99.9th percentile : 7.7 [simple:7.7]
latency max               : 15.1 [simple:15.1]"

Ref #324

* 'bdenes/optimize_single_row_read_v6' of github.com:denesb/scylla:
  Add unit tests for single_key_sstable_reader
  Add counters for the single-key reader optimization
  Add single_key_parallel_scan_threshold option
  single_key_sstable_reader: optimize single-row queries
  single_key_sstable_reader: move reading code into it's own method
  Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns()
2017-10-18 16:38:53 +01:00
Duarte Nunes
1f4e6759a7 tests: Fix compile errors introduced in c468e5981
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1508337315-8224-1-git-send-email-duarte@scylladb.com>
2017-10-18 16:38:18 +01:00
Botond Dénes
c3bd89ad63 Add unit tests for single_key_sstable_reader 2017-10-18 17:24:03 +03:00
Botond Dénes
dfe312ca3a Add counters for the single-key reader optimization
Add two counters, one to determine how many of the reads fall into the
optimization, and a second one to determine its effectiveness.

The first one is single_key_reader_optimization_hit_rate. It contains
the rate of reads that the optimization applies to out of all the reads
that go into the single_key_sstable_reader.

The second one, single_key_reader_optimization_extra_read_proportion is
a histogram of the efficiency of the optimization. It contains the
proportion of extra data-sources read. It's a number between 0 and 1,
where 0 is the best case (only one data-source was read) and 1 is the
worst case (all data-sources were read eventually). This is the same
number that is used for the threshold option (see previous patch).
Each of the histogram's buckets covers a chunk of 0.1 from the [0, 1]
range.
Note that single_key_parallel_scan_threshold effectively provides an
upper bound for the proportion as the optimization is turned off as soon
as it goes above that number.

The counters are disabled if single_key_parallel_scan_threshold is set
to 0 disabling the optimization entirely.
2017-10-18 17:24:03 +03:00
Botond Dénes
08502f2d48 Add single_key_parallel_scan_threshold option
This option regulates when exactly the single-key optimization is
considered ineffective and turned off.
The threshold is the proportion of the extra data source candidates that
can be read before the optimization is considered ineffective and
disabled. The proportion is calculated as follows:
    (read_data_sources - 1) / (total_data_sources - 1)

We subtract 1 from the read_data_sources and total_data_sources to
effectively measure the rate of *extra* data sources we read. This
makes sure that the proportion is meaningful even if e.g. we have
a total of only 2 data-sources and we read only 1 (best case).

Whenever this number goes above the threshold the optimization is
disabled. The threshold is a number between 0 and 1; 0 forces the
optimization off and 1 forces it on. Increase the threshold to favor
throughput over latency for single-row reads, decrease the threshold to
improve latency at the expense of throughput.

If the threshold is > 0 (it's not force disabled) and the optimization
is disabled due to a read crossing the threshold, we will issue
"probing" reads (every 100th read) to determine if the optimization is
worth re-enabling. Probing reads are allowed to run through the
optimization path and if they go below the threshold the optimization is
re-enabled.
2017-10-18 17:24:03 +03:00
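The proportion formula from 08502f2d48 can be written out as a small sketch. The function names are hypothetical, the handling of a single data source is an assumption (the message does not cover it), and the special meaning of thresholds 0 and 1 (force off / force on) is not modelled here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not Scylla code: the proportion of *extra* data
// sources read, i.e. (read_data_sources - 1) / (total_data_sources - 1).
// Precondition: 1 <= read_sources <= total_sources.
inline double extra_read_proportion(std::size_t read_sources,
                                    std::size_t total_sources) {
    if (total_sources <= 1) {
        // Assumption: with a single candidate nothing extra can be read.
        return 0.0;
    }
    return static_cast<double>(read_sources - 1) /
           static_cast<double>(total_sources - 1);
}

// Hypothetical policy check: the optimization is considered ineffective
// once the proportion goes above the configured threshold.
inline bool above_threshold(double proportion, double threshold) {
    return proportion > threshold;
}
```

With 3 candidate sources, reading 1, 2 or 3 of them gives proportions 0, 0.5 and 1 respectively, which matches the best, mid and worst cases in the benchmark of the merge commit.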
Botond Dénes
3c1fa3ecc1 single_key_sstable_reader: optimize single-row queries
For single-row queries that only query atomic cells one can put a lower
bound on the timestamps which may affect the query results and thus rule
out entire data sources. This allows the query to read only those
sstables that actually contribute to the result.
To do this we incrementally move through the sstables overlapping with
the query range, checking after each read mutation whether we already
have a value for all required cells and whether the lower-bound of their
timestamps is higher than the upper-bound of the timestamps of all the
remaining data-sources. When this condition is met we terminate the
read.
2017-10-18 17:24:03 +03:00
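The termination condition described in 3c1fa3ecc1 can be illustrated with a simplified model. Everything below is a sketch under assumptions: `cell`, `data_source` and `sources_read` are hypothetical, each source is reduced to its maximum timestamp plus the cells it holds for the queried row, and the sources are assumed to be sorted newest first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <vector>

// Hypothetical, heavily simplified model of a data source (e.g. an sstable).
struct cell { int column; std::int64_t timestamp; };
struct data_source { std::int64_t max_timestamp; std::vector<cell> cells; };

// Returns how many sources had to be read for the wanted columns.
// Precondition: `sources` is sorted by max_timestamp, newest first.
std::size_t sources_read(const std::vector<data_source>& sources,
                         const std::vector<int>& wanted) {
    std::map<int, std::int64_t> newest;  // column -> newest timestamp seen
    for (std::size_t i = 0; i < sources.size(); ++i) {
        for (const cell& c : sources[i].cells) {
            auto res = newest.emplace(c.column, c.timestamp);
            if (!res.second) {
                res.first->second = std::max(res.first->second, c.timestamp);
            }
        }
        if (i + 1 == sources.size()) {
            break;  // nothing left to skip
        }
        // Lower bound of the timestamps of the cells collected so far.
        bool have_all = true;
        std::int64_t lowest = std::numeric_limits<std::int64_t>::max();
        for (int col : wanted) {
            auto it = newest.find(col);
            if (it == newest.end()) { have_all = false; break; }
            lowest = std::min(lowest, it->second);
        }
        // Because of the sort order, the next source's max_timestamp is the
        // upper bound for all remaining sources.
        if (have_all && lowest > sources[i + 1].max_timestamp) {
            return i + 1;  // remaining sources cannot affect the result
        }
    }
    return sources.size();
}
```

The sketch only counts sources; a real reader would also have to merge the cell values and deal with tombstones, which is outside what this model covers.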
Botond Dénes
5fc44c4307 single_key_sstable_reader: move reading code into it's own method 2017-10-18 17:24:03 +03:00
Botond Dénes
6cdeca1846 Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns() 2017-10-18 17:24:03 +03:00
Botond Dénes
7aceb14395 Fix compile errors in tests/config_test.cc introduced by c468e5981
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <2700ac3987c3a229eb7083ce6f5d390012a3b66c.1508336217.git.bdenes@scylladb.com>
2017-10-18 15:20:45 +01:00
Paweł Dziepak
c28e31eac4 database: fix build (auto shards&) 2017-10-18 13:10:00 +01:00
Duarte Nunes
446e5f53db database: Avoid superfluous shards_for_this_sstable vector copies
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171018112643.40411-1-duarte@scylladb.com>
2017-10-18 15:00:52 +03:00
Duarte Nunes
044b8deae4 Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having up to date view on the
ring, and serve incorrect data.

Fixes #2855."

* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
  gms/gossiper: Remove periodic replication of endpoint state map
  gossiper: Check for features in the change listener
  gms/gossiper: Replicate changes incrementally to other shards
  gms/gossiper: Document validity of endpoint_state properties
  storage_service: Update token_metadata after changing endpoint_state
  gms/gossiper: Process endpoints in parallel
  gms/gossiper: Serialize state changes and notifications for given node
  utils/loading_shared_values: Allow Loader to return non-future result
  gms/gossiper: Encapsulate lookup of endpoint_state
  storage_service: Batch token metadata and endpoint state replication
  utils/serialized_action: Introduce trigger_later()
  gossiper: Add and improve logging
  gms/gossiper: Don't fire change listeners when there is no change
  gms/gossiper: Allow parallel apply_state_locally()
  gms/gossiper: Avoid copies in endpoint_state::add_application_state()
  gms/failure_detector: Ignore short update intervals
2017-10-18 10:13:25 +01:00
Duarte Nunes
c468e59817 Merge 'Extract config file mechanism + allow additional' from Calle
"Extracts the yaml/boost-po aspects of the "self-describing" db::config
into an abstract type.

db::config is then reimplemented in said type, removing some of the
slightly cumbersome entanglement with seastar opts (log).

Adds a main hook for additional configuration files (options + file)"

* 'calle/config' of github.com:scylladb/seastar-dev:
  main/init: Add registerable configuration objects
  db::config: Re-implement on utils/config_file.
  utils::config_file: Abstract out config file to external type
2017-10-18 09:50:53 +01:00
Tomasz Grabiec
f570e41d18 gms/gossiper: Remove periodic replication of endpoint state map
For large clusters the map can be big and cause latency problems.
Since we now actively replicate changes, this is no longer needed.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
84c7b63c51 gossiper: Check for features in the change listener
In preparation for removal of periodic replication
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
2d5fb9d109 gms/gossiper: Replicate changes incrementally to other shards
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
28c9609370 gms/gossiper: Document validity of endpoint_state properties 2017-10-18 08:49:53 +02:00
Tomasz Grabiec
cf113ed295 storage_service: Update token_metadata after changing endpoint_state
There is a requirement that whatever is present in token_metadata,
should also be present in endpoint_state. Because of that, we should update
endpoint_state first (set_gossip_tokens).

Apache Cassandra switched to this order as well in commit
b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
5cc83b9b3c gms/gossiper: Process endpoints in parallel
Makes state application faster due to increased parallelism.

Refs #2855.

Bootstrap of 11th node, ignoring apply_state_locally() calls which complete instantly:

Before:

DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms

After:

DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
8f01e08690 gms/gossiper: Serialize state changes and notifications for given node
It's possible that a change listener for a later state will run before
the change listener for the previous state completes, in which case the
node's state can be corrupted. For example, the previous change listener
may override system.peers with an old value.

This patch fixes the problem by serializing state changes and
listeners for each node.

The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropriate, because entries can be removed from
it while listeners are still running.  There is code in the gossiper
which anticipates that an entry may be gone across deferring points in
some places.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
f7a7e97095 utils/loading_shared_values: Allow Loader to return non-future result 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
6fccf7f4d0 gms/gossiper: Encapsulate lookup of endpoint_state 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
6263b0ebb6 storage_service: Batch token metadata and endpoint state replication
Replication needs to be serialized. We can batch replication requests
which are waiting to start. Use serialized_action, which does this.
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
2e2ae4671e utils/serialized_action: Introduce trigger_later()
Can be used instead of trigger() to improve batching.
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
41ffefd194 gossiper: Add and improve logging 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
0ed84710d9 gms/gossiper: Don't fire change listeners when there is no change
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal().  We
should avoid calling them more than necessary.

Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if an exception is thrown in the middle of state
application, change listeners would not be run. Later we would not
detect the change for states which were already applied, and not run
the change listeners.

Fixes #2867
2017-10-18 08:49:52 +02:00
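The defer() reasoning in 0ed84710d9 can be illustrated with a minimal scope guard. This is a sketch, not the gossiper code: `scope_guard` is a self-contained stand-in for a deferred action, and `apply_states` with its `fail_at` parameter is a hypothetical function that only counts notifications.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Minimal stand-in for a deferred action: runs `fn` when the scope is left,
// whether normally or through an exception.
class scope_guard {
    std::function<void()> _fn;
public:
    explicit scope_guard(std::function<void()> fn) : _fn(std::move(fn)) {}
    ~scope_guard() { if (_fn) { _fn(); } }
    scope_guard(const scope_guard&) = delete;
    scope_guard& operator=(const scope_guard&) = delete;
};

// Hypothetical: applies `count` states and throws when reaching index
// `fail_at`. `notified` is increased by the number of states that were
// applied, even if the function exits with an exception.
int apply_states(int count, int fail_at, int& notified) {
    int applied = 0;
    // Declared after `applied`, so it is destroyed (and runs) first.
    scope_guard notify([&] { notified += applied; });
    for (int i = 0; i < count; ++i) {
        if (i == fail_at) {
            throw std::runtime_error("state application failed");
        }
        ++applied;
    }
    return applied;
}
```

Without the guard, the two states applied before the failure would never be reported, and a later pass would see them as already applied and skip their listeners.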
Tomasz Grabiec
c780a74b58 gms/gossiper: Allow parallel apply_state_locally()
It has been serialized since e428d06f40. This causes a regression in
the performance of application state propagation due to reduced
parallelism.

Processing states for each node has high latency due to memtable
flushes triggered by update_tokens() and commitlog syncs done by
system.peers updates, if commitlog sync mode is set to "batch".  We
have high internal concurrency for these, so increasing parallelism
significantly reduces time to process all states.

Fixes #2855.
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
f20a805eca gms/gossiper: Avoid copies in endpoint_state::add_application_state() 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
a71624d58d gms/failure_detector: Ignore short update intervals
Failure detector decides that a node is down if it hasn't received a change of
its heartbeat for longer than ~11 times the average of past intervals between
updates.

If there are multiple incoming ACKs containing information about the
same node, we may detect and report a change for each of them. This
will cause failure_detector to establish that the average report
period is in milliseconds. After the update storm is over, it will
claim the node failure very soon, because report period will now be a
large multiple of the average.

Fix by not counting short updates into the calculation of average
arrival time.

Fixes #2861.
2017-10-18 08:49:52 +02:00
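The fix in a71624d58d (not counting short updates into the average arrival time) can be sketched as below. The class name and the cutoff are assumptions: the message does not say what counts as a "short" interval, so the minimum is a constructor parameter here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical sketch: running mean of heartbeat intervals that ignores
// intervals shorter than a configurable minimum.
class interval_average {
    std::chrono::milliseconds _min_interval;
    std::chrono::milliseconds _sum{0};
    std::size_t _count = 0;
public:
    explicit interval_average(std::chrono::milliseconds min_interval)
        : _min_interval(min_interval) {}

    // Record one interval between two heartbeat updates.
    void add(std::chrono::milliseconds interval) {
        if (interval < _min_interval) {
            return;  // part of an update storm: do not skew the average
        }
        _sum += interval;
        ++_count;
    }

    // Mean of the recorded intervals, or zero if none were recorded.
    std::chrono::milliseconds mean() const {
        if (_count == 0) {
            return std::chrono::milliseconds(0);
        }
        return _sum / static_cast<std::chrono::milliseconds::rep>(_count);
    }
};
```

With a 100 ms minimum, the sequence 1000 ms, 2 ms, 3 ms, 1000 ms keeps a mean of 1000 ms; counting the short intervals would pull it down to about 501 ms and make a later normal gap look like a large multiple of the average.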
Calle Wilund
12a54805ea main/init: Add registerable configuration objects
Allowing plugging in command line arguments + "parse-points"
for configs outside db/config
2017-10-18 00:52:04 +00:00
Calle Wilund
4bd98f7296 db::config: Re-implement on utils/config_file.
Re-use config abstraction, and de-couple the seastar logging 
parts a little bit more.
2017-10-18 00:51:54 +00:00
Calle Wilund
05db87e068 utils::config_file: Abstract out config file to external type
Handling all the boost::commandline + YAML stuff.
This patch only provides an external version of these functions,
it does not modify the db::config object. That is for a follow-up
patch.
2017-10-18 00:51:41 +00:00
Pekka Enberg
ae92055b52 Merge "Bring histogram closer to what Prometheus expects" from Glauber
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/

One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.

The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.

2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that fall between bins.
IOW, it is a cumulative histogram.

About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat makes it clear:

"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."

It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that."

Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>

* 'prometheus-histograms' of github.com:glommer/scylla:
  storage_proxy: change reporting of estimated histograms
  estimated_histogram: bring histogram closer to what prometheus expects.
2017-10-17 20:23:10 +03:00
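Point 2 of ae92055b52 (per-bucket counts versus the cumulative counts Prometheus expects) amounts to a prefix sum. The sketch below is an illustration with a hypothetical function name, not the estimated_histogram code.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Illustration only: convert per-bucket counts (points that fall between
// two bucket bounds) into cumulative counts (points seen up to and
// including each bucket).
std::vector<std::uint64_t> to_cumulative(const std::vector<std::uint64_t>& per_bucket) {
    std::vector<std::uint64_t> cumulative(per_bucket.size());
    std::partial_sum(per_bucket.begin(), per_bucket.end(), cumulative.begin());
    return cumulative;
}
```

For example, per-bucket counts 1, 2, 3 become cumulative counts 1, 3, 6, so each bucket also contains everything counted in the buckets before it.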
Takuya ASADA
3cab5557e5 dist/debian: fix Debian not to use new dependency package names
We moved to new dependency package names (e.g. from antlr3-c++-dev
to scylla-antlr35-c++-dev) when we moved to ppa on Ubuntu, but
Debian still uses the old dist/debian/dep packages.
So keep using the old-style package names on Debian.

Fixes #2831

Message-Id: <1508245175-2184-1-git-send-email-syuu@scylladb.com>
2017-10-17 16:39:35 +03:00
Daniel Fiala
f5629b3a23 types: Use std::pair instead of std::tuple to avoid compile-time error with explicit constructor.
Fixes #2895.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171017071316.2836-1-daniel@scylladb.com>
2017-10-17 12:32:43 +01:00
Duarte Nunes
baeec0935f Replace query::full_slice with schema::full_slice()
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.

Refs #2885

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:53 +02:00
Duarte Nunes
fbb4c9edda schema: Provide all-selecting partition slice
This patch introduces schema::full_slice(), which returns a
partition_slice selecting the full clustering range, as well as all
static and regular columns. No options aside from the default are
set in that partition_slice.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-1-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:35 +02:00
Tomasz Grabiec
6d5a0f8a98 db: Add debug-level logging related to streaming
Message-Id: <1505896395-30203-1-git-send-email-tgrabiec@scylladb.com>
2017-10-16 18:49:10 +01:00
Paweł Dziepak
d9abb75bfa tests/perf_simple_query: fix counter update query
Message-Id: <20171016125334.4423-1-pdziepak@scylladb.com>
2017-10-16 19:41:31 +02:00
Calle Wilund
cc28cf838c password_auth: Return actual generated salt from gensalt
Fixes: 2898
Typo in gensalt(): it returned only the selected hash method, not
the random salt bytes. This does not prevent the hash function from
operating, but its strength is slightly reduced.
Message-Id: <20171016130505.25593-2-calle@scylladb.com>
2017-10-16 14:07:46 +01:00
Calle Wilund
57c5f13166 password_auth: Keep crypt_data as thread local
Fixes: 2887
Speeds up password hashing ever so slightly.
Message-Id: <20171016130505.25593-1-calle@scylladb.com>
2017-10-16 14:07:42 +01:00
Paweł Dziepak
8c3b7fea81 Merge "Introduce new API and converters from/to old mutation_reader" from Piotr
"This changeset is the first step to flatten mutation_reader.
Then it introduces new mutation_fragment types for partition header and end of partition.
Using those a new flat_mutation_reader is defined.
Finally it introduces converters between new flat_mutation_reader and
old mutation_reader."

* 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev:
  Add tests for flat_mutation_reader
  Introduce conversion from flat_mutation_reader to mutation_reader
  Introduce conversion from mutation_reader to flat_mutation_reader
  Introduce flat_mutation_reader
  Extract FlattenedConsumer concept using GCC6_CONCEPT
  Introduce partition_end mutation_fragment
  Introduce a position for end of partition
  Introduce partition_start mutation_fragment
  Introduce FragmentConsumer
  Introduce a position for partition start
  streamed_mutation: Extract concepts using GCC6_CONCEPT macro
2017-10-16 12:14:23 +01:00
Gleb Natapov
bd09ce7cd4 gdb: Add new command task_histogram
The command scans a random set of objects in a small pool (or, optionally,
only objects of a certain size) for vptrs and builds a histogram, so that
the most often used vptrs can be easily found. The command is useful to find
"memory leaks" caused by creating too many tasks of a certain type,
which is usually a result of unlimited parallelism somewhere.

Message-Id: <20171015081634.GB21092@scylladb.com>
2017-10-15 12:12:42 +03:00
Piotr Jastrzebski
5f34559b78 Add tests for flat_mutation_reader
Those tests run mutation source test for all sources
using conversion to and from flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:59 +02:00
Piotr Jastrzebski
31733a7eeb Introduce conversion from flat_mutation_reader to mutation_reader
This will be used in transition from mutation_reader
to flat_mutation_reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:59 +02:00
Piotr Jastrzebski
6a66bee788 Introduce conversion from mutation_reader to flat_mutation_reader
This will be used in transition from mutation_reader
to flat_mutation_reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:59 +02:00
Piotr Jastrzebski
748205ca75 Introduce flat_mutation_reader
This reader operates on mutation_fragments instead of
streamed_mutations.

Each partition starts with a partition_header fragment
and ends with end_of_partition fragment.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:40 +02:00
Tomasz Grabiec
b74b06808e tests: row_cache: Add test for concurrent population of partition entries
Message-Id: <1507815478-20269-2-git-send-email-tgrabiec@scylladb.com>
2017-10-12 15:55:33 +01:00
Tomasz Grabiec
083b9cddef row_cache: Fix handling of concurrent partition population
This fixes a regression introduced in 27a3b4bca9 (master only).

partition_range_cursor assumes that as long as references are valid,
_end is valid as well. But if new entries were inserted before _end,
it may not be, if the new entries fall after the query range. This may
result in reads returning partitions from outside the query range.
Message-Id: <1507815478-20269-1-git-send-email-tgrabiec@scylladb.com>
2017-10-12 15:55:20 +01:00
Tomasz Grabiec
68fe1a5bee utils/loading_cache: Fix compilation on older compilers
Message-Id: <1507728312-10585-1-git-send-email-tgrabiec@scylladb.com>
2017-10-12 14:55:34 +03:00
Raphael S. Carvalho
25a4f152cd sstables: remove dead sstable method
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171012072905.12737-1-raphaelsc@scylladb.com>
2017-10-12 11:58:39 +02:00
Raphael S. Carvalho
16dd0d15fc sstables: make get_shards_for_this_sstable return const ref
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171012072850.12681-1-raphaelsc@scylladb.com>
2017-10-12 11:58:23 +02:00
Pekka Enberg
1701fc2e50 Merge "gms/gossiper: Multiple cleanups" from Duarte
"Based on the functions get_endpoint_state_for_endpoint_ptr(),
get_application_state_ptr() and
endpoint_state::get_application_state_ptr(), this series
cleans up miscellaneous functions related to the gossiper.

It not only removes duplicated code, but also omits many copies.

All pointer usages have been audited for safety."

Acked-by: Asias He <asias@scylladb.com>
Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com>

* 'gossiper-cleanup/v2' of github.com:duarten/scylla: (27 commits)
  gms/endpoint_state: Remove get_application_state()
  service/storage_service: Avoid copies in prepare_replacement_info()
  service/storage_service: Cleanup get_application_state_value()
  service/storage_service: Cleanup handle_state_removing()
  service/storage_service: Cleanup get_rpc_address()
  locator/reconnectable_snitch_helper: Avoid versioned_value copies
  locator/production_snitch_base: Cleanup get_endpoint_info()
  service/migration_manager: Avoid copies in is_ready_for_bootstrap()
  service/migration_manager: Cleanup has_compatible_schema_tables_version()
  service/migration_manager: Fix usages of get_application_state()
  cache_hit_rate: Avoid copies in get_hit_rate()
  gms/endpoint_state: Avoid copies in is_shutdown()
  service/load_broadcaster: Avoid copy in on_join()
  gms/gossiper: Cleanup get_supported_features()
  gms/gossiper: Cleanup get_gossip_status()
  gms/gossiper: Cleanup seen_any_seed()
  gms/gossiper: Cleanup get_host_id()
  gms/gossiper: Removed dead uses_vnodes() function
  gms/gossiper: Cleanup uses_host_id()
  gms/gossiper: Add get_application_state_ptr()
  ...
2017-10-11 13:45:36 +03:00
Duarte Nunes
f67a553b96 gms/endpoint_state: Remove get_application_state()
It is no longer used, as all callsites have moved to
get_application_state_ptr().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
e9358c1c83 service/storage_service: Avoid copies in prepare_replacement_info()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
674f5d8eaf service/storage_service: Cleanup get_application_state_value()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
0ccb9211d7 service/storage_service: Cleanup handle_state_removing()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
bdee795876 service/storage_service: Cleanup get_rpc_address()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2f05d7423a locator/reconnectable_snitch_helper: Avoid versioned_value copies
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
28d63a76df locator/production_snitch_base: Cleanup get_endpoint_info()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
03e6fc95ba service/migration_manager: Avoid copies in is_ready_for_bootstrap()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
72ca6b34ef service/migration_manager: Cleanup has_compatible_schema_tables_version()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
976324bbb8 service/migration_manager: Fix usages of get_application_state()
We were taking a reference to a temporary value in different places.
Fix them by using get_application_state_ptr(), which also avoids a copy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
bb89b97cbb cache_hit_rate: Avoid copies in get_hit_rate()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
9d5c6e0c72 gms/endpoint_state: Avoid copies in is_shutdown()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
25b0654312 service/load_broadcaster: Avoid copy in on_join()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
92df519b91 gms/gossiper: Cleanup get_supported_features()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
39f71f7d12 gms/gossiper: Cleanup get_gossip_status()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
db660f1e08 gms/gossiper: Cleanup seen_any_seed()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
88dd97fe8e gms/gossiper: Cleanup get_host_id()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
95079795ce gms/gossiper: Removed dead uses_vnodes() function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
7db7704edc gms/gossiper: Cleanup uses_host_id()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2984bdab29 gms/gossiper: Add get_application_state_ptr()
This patch introduces the get_application_state_ptr() function, which
allows access to a versioned_value of a particular endpoint.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
f41748af81 gms/gossiper: Cleanup notify_failure_detector()
Now that we have get_endpoint_state_for_endpoint_ptr(), which does not
return a copy and allows mutating the actual state, we can use it
instead of repeating the lookup code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2210d10552 gms/gossiper: Cleanup is_alive()
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
ceef45a6fe gms/gossiper: Const-qualify functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
955aee1588 gms/gossiper: Cleanup convict()
Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify
logging, and also protect expensive operations by checking the log
level.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
cf99a41226 gms/gossiper: Add non-const get_endpoint_state_for_endpoint_ptr()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
d0fba1a113 gms/failure_detector: Simplify alive/dead endpoint count
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
dc65cda1a3 gms/failure_detector: Fix if/else style to include braces
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Raphael S. Carvalho
67c5c8dc67 sstables: do not recompute shards for all tables after each compaction
For every finished compaction, we were calculating shards for all
existing tables. With ignore_msb set to 0, it's probably not a big
deal, but if ignore_msb is like 12 and LCS is used (meaning thousands
of tables possibly), the operation may stall the reactor for a
considerable amount of time. That's fixed by caching shards.

Fixes #2875.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171011053424.22308-1-raphaelsc@scylladb.com>
2017-10-11 11:45:01 +03:00
Tomasz Grabiec
66a15ccd18 gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr()
Message-Id: <1507642411-28680-3-git-send-email-tgrabiec@scylladb.com>
2017-10-10 18:27:43 +01:00
Gleb Natapov
36d9225e40 scylla-gdb: print number of allocated objects as an integer instead of float
Message-Id: <20171010151835.GT23527@scylladb.com>
2017-10-10 18:19:44 +03:00
Avi Kivity
4ad3900d8d Merge "gossiper: Optimize endpoint_state lookup" from Duarte
"gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive. This
series introduces a function that returns a pointer and avoids
the copy.

Fixes #764"

* 'endpoint-state/v2' of https://github.com/duarten/scylla:
  gossiper: Avoid endpoint_state copies
  endpoint_state: const-qualify functions
  storage_service: Remove duplicate endpoint state check
2017-10-10 17:29:22 +03:00
Piotr Jastrzebski
f325fef362 Extract FlattenedConsumer concept using GCC6_CONCEPT
This concept will be used in flat_mutation_reader::consume

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
46727f12e0 Introduce partition_end mutation_fragment
This type of mutation_fragment will be used in new mutation_reader
to signal the end of the current partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
adffc80619 Introduce a position for end of partition
This position will be used for mutation fragment
that represents the end of partition.

This position sorts after all other mutation fragments.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
2516b42752 Introduce partition_start mutation_fragment
This type of mutation_fragment will be used in new mutation_reader
to signal the beginning of the next partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
1f4fb6dd4a Introduce FragmentConsumer
This concept helps define StreamedMutationConsumer.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:05:44 +02:00
Duarte Nunes
ceebbe14cc gossiper: Avoid endpoint_state copies
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.

This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).

Fixes #764

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:48:02 +01:00
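The copy-avoidance idea in ceebbe14cc above can be illustrated with a minimal sketch. The names `endpoint_state_sketch` and `gossiper_sketch` are hypothetical stand-ins, not the real gms classes; the only point is the difference between returning a copy and returning a pointer into the map.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for gms::endpoint_state; the real class is larger.
struct endpoint_state_sketch {
    std::map<int, std::string> application_state;
    bool alive = false;
};

class gossiper_sketch {
    std::map<std::string, endpoint_state_sketch> _states;  // keyed by endpoint address
public:
    void set(const std::string& ep, endpoint_state_sketch st) {
        _states[ep] = std::move(st);
    }

    // Copying variant: duplicates the whole state on every call.
    std::optional<endpoint_state_sketch> get_state_copy(const std::string& ep) const {
        auto it = _states.find(ep);
        if (it == _states.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Pointer variant: no copy; returns nullptr for an unknown endpoint.
    // The pointer is only valid while the entry stays in the map.
    const endpoint_state_sketch* get_state_ptr(const std::string& ep) const {
        auto it = _states.find(ep);
        return it == _states.end() ? nullptr : &it->second;
    }
};
```

As the commit message notes, the pointer variant is only safe where the pointer neither escapes the calling function nor is held across a defer point.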
Duarte Nunes
bc976b4773 endpoint_state: const-qualify functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:30:28 +01:00
Duarte Nunes
198b1b76b5 storage_service: Remove duplicate endpoint state check
We already performed the check, so we don't need to do it again.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:25:34 +01:00
Avi Kivity
c0687a9761 sstables: replace naked new with make_lw_shared
Fallout from the sstables dependency reduction patches.
Message-Id: <20171010121134.26342-1-avi@scylladb.com>
2017-10-10 13:21:46 +01:00
Tomasz Grabiec
46c7e06e56 locator: Optimize token_metadata::is_member()
Currently it's linear in the number of tokens in the system in the
worst case. We could use the knowledge which _topology has to make it
O(1).

Fixes #2873.

Message-Id: <1507630182-13410-1-git-send-email-tgrabiec@scylladb.com>
2017-10-10 14:27:54 +03:00
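A rough sketch of the complexity change described in 46c7e06e56 above. `token_metadata_sketch` and its members are hypothetical; in particular, the `_endpoints` set only stands in for whatever knowledge `_topology` holds in the real code.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_set>

class token_metadata_sketch {
    std::map<int64_t, std::string> _token_to_endpoint;
    std::unordered_set<std::string> _endpoints;  // assumed stand-in for _topology
public:
    void add_token(int64_t token, const std::string& endpoint) {
        _token_to_endpoint[token] = endpoint;
        _endpoints.insert(endpoint);
    }

    // Old approach: scan every token, linear in the number of tokens.
    bool is_member_linear(const std::string& endpoint) const {
        return std::any_of(_token_to_endpoint.begin(), _token_to_endpoint.end(),
                           [&](const auto& kv) { return kv.second == endpoint; });
    }

    // New approach: a hash lookup, O(1) on average.
    bool is_member(const std::string& endpoint) const {
        return _endpoints.count(endpoint) != 0;
    }
};
```

Both functions answer the same question; the second only avoids walking the token map.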
Tomasz Grabiec
44faaafc29 cache_streamed_mutation: Read static row with cache region locked
_snp->static_row() allocates and needs reference stability.
Message-Id: <1507555031-11567-1-git-send-email-tgrabiec@scylladb.com>
2017-10-09 15:55:53 +01:00
Avi Kivity
8d81ec92f6 gdb: adjust 'scylla memory' command for fallback small pools
Seastar small pools can now fall back to smaller spans. Adjust
the 'scylla memory' command accordingly.
Message-Id: <20171005123935.13503-1-avi@scylladb.com>
2017-10-09 11:44:03 +02:00
Botond Dénes
dead2617ce mp_row_consumer: remove unnecessary _reasource_tracker member
Leftovers from a43901f84.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <88237d9cd97feeca47e12ec4af89c90f1a3a6bb5.1507535176.git.bdenes@scylladb.com>
2017-10-09 10:59:40 +03:00
Botond Dénes
af083d6507 Merge mutation_reader related test cases into mutation_reader_test
The following tests were merged:
* combined_mutation_reader_test
* restricted_reader_test

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <db6b5b3c2d30cfaa720fff07c859649a180cff95.1507299293.git.bdenes@scylladb.com>
2017-10-08 17:33:55 +03:00
Avi Kivity
fd1d35d4af Update seastar submodule
* seastar c62bbf9...8babd1f (9):
  > Enhanced support for Travis CI: build with and without DPDK support, use various compilers (GCC 5/6/7)
  > backtrace: Allow whitespace after the backtrace addresses
  > test.py: fix typo in noncopyable_function_test
  > utils: introduce noncopyable_function
  > Revert "utils: introduce noncopyable_function"
  > utils: introduce noncopyable_function
  > Add seastar-addr2line helper script to decode backtraces
  > execution_stage: pass scheduling_group to constructor
  > reactor: preempt tasks when a signal is received
2017-10-08 16:36:10 +03:00
Avi Kivity
98e69482bf Merge "Add support for CAST AS functions" from Daniel
"This series implements CAST AS functions in scylla.

It allows using expressions of the form CAST(x AS type) in select statements.
The primary motivation for these functions came from aggregate functions, because
avg(.) gives rounded results for integer columns. Now it is possible
to convert such a column to float/double and obtain floating-point results:

    SELECT ... avg(cast(x as double)), ...

Fixes #2280."

* 'danfiala/2280-patch-series-v2' of https://github.com/hagrid-the-developer/scylla:
  tests: Add test for CAST AS functions.
  cql3: Add support for CAST AS functions to ANTLR grammar.
  cql3/selectable: Add selectable::with_cast for CAST AS functions.
  cql3/functions: Add support for CAST AS functions.
  types: Add support for CAST AS functions.
  types: Moved code that implements conversion of types' values to string.
2017-10-08 12:55:07 +03:00
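The motivation given in 98e69482bf above (an average over an integer column losing its fractional part) can be reproduced outside CQL. This is only an analogy in C++, not the Scylla implementation; both function names are hypothetical, and returning zero for an empty input is an arbitrary choice of the sketch.

```cpp
#include <cstdint>
#include <vector>

// Average computed in integer arithmetic: the division drops the fraction.
int64_t avg_as_int(const std::vector<int64_t>& xs) {
    if (xs.empty()) {
        return 0;  // arbitrary choice for the sketch
    }
    int64_t sum = 0;
    for (auto x : xs) {
        sum += x;
    }
    return sum / static_cast<int64_t>(xs.size());
}

// Same average after converting each value to double first,
// which is what avg(cast(x as double)) asks for.
double avg_as_double(const std::vector<int64_t>& xs) {
    if (xs.empty()) {
        return 0.0;  // arbitrary choice for the sketch
    }
    double sum = 0.0;
    for (auto x : xs) {
        sum += static_cast<double>(x);
    }
    return sum / static_cast<double>(xs.size());
}
```

For the values 1 and 2, the first function yields 1 and the second yields 1.5.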
Botond Dénes
046a1f9b05 sstables: Get rid of [[deprecated]] index_reader::get_index_entries()
Change test code (the only consumers) to read index by partitions.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6111e92b5e0729bfa2e76fd848215804174067a.1507297154.git.bdenes@scylladb.com>
2017-10-08 12:18:52 +03:00
Daniel Fiala
9e11bfe8fa tests: Add test for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:05:53 +02:00
Daniel Fiala
4dd504b9ac cql3: Add support for CAST AS functions to ANTLR grammar.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
7fe653f08c cql3/selectable: Add selectable::with_cast for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
ca092a0b7d cql3/functions: Add support for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
61570e4a73 types: Add support for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
e2c0a57ecf types: Moved code that implements conversion of types' values to string.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Botond Dénes
a43901f842 row_consumer: de-virtualize io_priority() and resource_tracker()
Fixes #2830

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <448a1f739ab8c88a7a5562bce8dce5ae6efdf934.1507302530.git.bdenes@scylladb.com>
2017-10-06 18:50:12 +01:00
Botond Dénes
d2b294dc06 loading_cache: prepend this-> to method calls on captured this
To make gcc 6.3 happy.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <849402e20a1ffa6f603eff4fe295981a94b9ca79.1507282527.git.bdenes@scylladb.com>
2017-10-06 12:09:34 +02:00
Vlad Zolotarov
bc9d17963f test.py: add loading_cache_test
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1507137724-2408-3-git-send-email-vladz@scylladb.com>
2017-10-05 15:30:07 +01:00
Vlad Zolotarov
1394e781be utils + cql3: use a functor class instead of std::function
Define value_extractor_fn as a functor class instead of std::function.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1507137724-2408-2-git-send-email-vladz@scylladb.com>
2017-10-05 15:29:51 +01:00
Duarte Nunes
a011eb72c2 Merge branch 'CQL secondary index backing views' from Pekka
"This patch series adds backing materialized view for secondary indices.
When a new index is created with the 'CREATE INDEX' statement, a backing
materialized view is created automatically.

For example, assuming the following table:

  CREATE TABLE ks1.users (
    userid uuid,
    email text,
    PRIMARY KEY (userid)
  );

When the following index is created:

  CREATE INDEX user_email ON ks1.users (email);

The following materialized view is also created:

  cqlsh> DESCRIBE ks1.users;

  <snip>

  CREATE MATERIALIZED VIEW ks1.user_email_index AS
      SELECT email, userid
      FROM ks1.users
      WHERE email IS NOT NULL
      PRIMARY KEY (email, userid)
      WITH CLUSTERING ORDER BY (userid ASC)
      AND bloom_filter_fp_chance = 0.01
      AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
      AND comment = ''
      AND compaction = {'class': 'SizeTieredCompactionStrategy'}
      AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
      AND crc_check_chance = 1.0
      AND dclocal_read_repair_chance = 0.1
      AND default_time_to_live = 0
      AND gc_grace_seconds = 864000
      AND max_index_interval = 2048
      AND memtable_flush_period_in_ms = 0
      AND min_index_interval = 128
      AND read_repair_chance = 0.0
      AND speculative_retry = '99.0PERCENTILE';

CQL queries will use the backing materialized view as part of queries on
indexed columns to fetch the primary keys."

* 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev:
  schema_tables: Create backing view for indices
  database: Kill obsolete secondary index manager stub
  cql3: Wire up secondary index manager
  cql3/restrictions: Add term_slice::is_supported_by() function
  index: Add secondary_index_manager::create_view_for_index()
  index: Add target_parser::parse() helper
  cql3/statements: Add index_target::from_sstring() helper
  index: Add secondary_index_manager::get_dependent_indices()
  index: Add secondary_index_manager::reload()
  index: Add secondary_index_manager::list_indexes()
  index: Add index class
  index: Pass column_family to secondary_index_manager constructor
  database: Make secondary index manager per-column family
2017-10-05 12:08:14 +01:00
Pekka Enberg
4045e1ec09 schema_tables: Create backing view for indices
This patch wires calls to secondary index manager reload() in
merge_tables_and_views() and changes make_update_indices_mutations() to
also create mutations for the backing materialized view. After this
patch, "CREATE INDEX" CQL statement also creates a materialized view.
2017-10-05 10:07:44 +03:00
Pekka Enberg
5d30ad5e1a database: Kill obsolete secondary index manager stub 2017-10-05 10:07:44 +03:00
Pekka Enberg
3a27f2e812 cql3: Wire up secondary index manager 2017-10-05 10:07:44 +03:00
Pekka Enberg
feae924c8c cql3/restrictions: Add term_slice::is_supported_by() function 2017-10-05 10:07:44 +03:00
Pekka Enberg
ed4c96c025 index: Add secondary_index_manager::create_view_for_index()
This patch adds a create_view_for_index() function, which creates a
view_ptr for index_metadata.
2017-10-05 10:07:44 +03:00
Pekka Enberg
a809ea902e index: Add target_parser::parse() helper 2017-10-05 10:07:44 +03:00
Pekka Enberg
9f07af8224 cql3/statements: Add index_target::from_sstring() helper 2017-10-05 10:07:44 +03:00
Pekka Enberg
50943ce592 index: Add secondary_index_manager::get_dependent_indices() 2017-10-05 10:07:44 +03:00
Glauber Costa
189ef02596 storage_proxy: change reporting of estimated histograms
We are currently collapsing the histograms into 16 points, exponentially
increasing in value, starting from 1.

While reducing the number of points is a worthy goal, the current
configuration caps us at 4ms. Our latencies tend to be higher than this.

Starting from 1 is also a bit of an exaggeration: rarely are our
latencies in that range. This patch changes reporting so that we
report 20 points starting from 32.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-10-04 20:01:15 -04:00
Glauber Costa
fc4416abcc estimated_histogram: bring histogram closer to what prometheus expects.
Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/

One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.

The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
   that the highest latency we can differentiate is 4ms. After that,
   everything falls into the same bin.

2) The format that prometheus expects is that each bin will contain
   the total number of points seen *up until that bin*, while we
   currently export the total number of points that fall between bins.
   IOW, prometheus expects a cumulative histogram.

Regarding point two: granted, it is a bit hidden on their website, but it is
there. The following phrase about a caveat makes it clear:

 "Note that we divide the sum of both buckets. The reason is that the
  histogram buckets are cumulative. The le="0.3" bucket is also contained
  in the le="1.2" bucket; dividing it by 2 corrects for that."

It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that.

This patch changes the histogram format to be more in line with what
prometheus expects.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-10-04 20:01:13 -04:00
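The format difference described in point 2 of fc4416abcc above comes down to a running sum. A minimal sketch, assuming the per-bin counts are available as a plain vector (the function name `to_cumulative` is hypothetical and is not part of estimated_histogram):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Convert per-bin counts (samples that fell between two bounds) into
// cumulative counts: entry i holds every sample seen up to and
// including bin i.
std::vector<uint64_t> to_cumulative(const std::vector<uint64_t>& per_bin) {
    std::vector<uint64_t> cumulative(per_bin.size());
    std::partial_sum(per_bin.begin(), per_bin.end(), cumulative.begin());
    return cumulative;
}
```

With per-bin counts 3, 0, 5, 2 the cumulative form is 3, 3, 8, 10.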
Avi Kivity
ab65b42bb6 size_estimates: remove ambiguity in call to std::ref()
The call to std::ref() is not namespace-qualified, and so can conflict
with seastar::ref().

Fix by naming std::ref() explicitly.
Message-Id: <20171004155250.4960-1-avi@scylladb.com>
2017-10-04 18:31:40 +02:00
Duarte Nunes
953888d0d0 Merge "Auth: pluggable auth + "transitional" auth-objects" from Calle
"Makes authorizer/authenticator actually pluggable (by class name)
and adds a "Transitional" type for both, conforming to the DSE
definition of the types.

The idea is to allow a rolling upgrade of a cluster to
authentication op by first making all clients provide credentials
(ignored by non-auth), then node by node enable auth with
transitional handlers, then ensure user DB is populated and
distributed, and finally rollingly enable strict auth for
each node. Pfew."

Fixes #2836

* auth: Transitional auth wrappers
  auth: Make authenticator/authorizer use actual name based lookup
2017-10-04 12:46:16 +02:00
Calle Wilund
611e00646b auth: Transitional auth wrappers
Similar to DSE objects with similar names. Basically ignores
all authentication/authorization except "superuser" login. All other
sessions are treated as anonymous.
Note: like DSE counterparts, a client session must still _use_
authentication to be able to connect, even though the actual content of
the auth is mostly ignored.
2017-10-04 12:44:44 +02:00
Calle Wilund
b96a7ae656 auth: Make authenticator/authorizer use actual name based lookup
Allowing for pluggable auth objects.

Note: requires "class_registrator: Fix qualified name matching +
provider helpers" patch previously sent.
2017-10-04 12:44:44 +02:00
Calle Wilund
801ee44cb8 class_registrator: Fix qualified name matching + provider helpers
Should not assume namespace "org", nor should we allow "loose"
substring matching.
2017-10-04 12:43:42 +02:00
Calle Wilund
3c509e0333 class_registrator: Allow different return types
Allows registry to give back, for example, shared_ptr etc instead of
solely unique_ptr. If a registry is defined with seastar/std
shared/lw_shared/unique_ptr as "BaseType", the type will assume
this is the intended result type.
2017-10-04 12:43:42 +02:00
Avi Kivity
bdbbfe9390 Merge "Make restricting_mutation_reader more accurate" from Botond
"Currently restricting_mutation_reader restricts mutation_readers on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amount of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
are consumed.

Fixes #2692."

* 'bdenes/restricting_mutation_reader-v5' of https://github.com/denesb/scylla:
  Update reader restriction related metrics
  Add restricted_reader_test unit test
  restricted_mutation_reader: restrict based-on memory consumption
  mutation_reader.hh: Move restricted_reader related code
2017-10-04 12:43:58 +03:00
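The accounting scheme described in bdbbfe9390 above can be sketched without any Seastar machinery. `reader_memory_budget` is a hypothetical, synchronous stand-in for the restricting semaphore: where the real reader would wait, this sketch simply returns false, and the exact admission condition is an assumption.

```cpp
#include <cstdint>

class reader_memory_budget {
    std::int64_t _available;  // bytes; may go negative, see consume()
public:
    explicit reader_memory_budget(std::int64_t bytes) : _available(bytes) {}

    // Admission of a new reader: it must obtain its up-front units.
    // Returns false where the real reader would be blocked.
    bool try_admit(std::int64_t upfront) {
        if (_available < upfront) {
            return false;
        }
        _available -= upfront;
        return true;
    }

    // Buffers kept by an already admitted reader are always accounted;
    // admitted readers are never blocked, so the budget can be overdrawn.
    void consume(std::int64_t bytes) { _available -= bytes; }

    // Units are handed back when a buffer or a reader is destroyed.
    void release(std::int64_t bytes) { _available += bytes; }

    std::int64_t available() const { return _available; }
};
```

An overdrawn budget only affects readers that have not started yet: they are turned away until enough units are released.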
Paweł Dziepak
fdfa6703c3 Merge "loading_shared_values and size limited and evicting prepared statements cache" from Vlad
"
The original motivation for the "utils: introduce a loading_shared_values" series was a hinted handoff work where
I needed an on-demand asynchronously loading key-value container (a replica address to a commitlog instance map).

It turned out that we already have the classes that do almost what I needed:
   - utils::loading_cache
   - sstables::shared_index_lists

Therefore it made sense to find a common ground, unify this functionality and reuse the code both in the classes above and in the
new hinted handoff code.

This series introduces the utils::loading_shared_values that generalizes the sstables::shared_index_lists
API on top of bi::unordered_set with the rehashing logic from the utils::loading_cache triggered by an addition
of an entry to the set (PATCH1).

Then it reworks the sstables::shared_index_lists and utils::loading_cache on top of the new class (PATCH2 and PATCH3).

PATCH4 optimizes the loading_cache for the long timer period use case.

But then we have discovered that we have another "customer" for the loading_cache. Apparently our prepared statements cache
had a birth flaw - it was unlimited in size - unless the corresponding keyspace and/or table are modified/dropped the entries
are never evicted. We clearly need to limit its size and it would also make sense to evict the cache entries that haven't been
used long enough.

This seems like a perfect match for a utils::loading_cache, except that prepared statements don't need to be reloaded after
they are created.

Patches starting from PATCH5 deal with adding the missing functionality to utils::loading_cache (like making the "reloading"
conditional and adding synchronous methods like find(key)) and then transitioning the CQL and Thrift prepared statements
caches to utils::loading_cache.

This also fixes #2474."

* 'evict_unused_prepared-v5' of https://github.com/vladzcloudius/scylla:
  tests: loading_cache_test: initial commit
  cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
  cql3: prepared statements cache on top of loading_cache
  utils::loading_cache: make the size limitation more strict
  utils::loading_cache: added static_asserts for checking the callbacks signatures
  utils::loading_cache: add a bunch of standard synchronous methods
  utils::loading_cache: add the ability to create a cache that would not reload the values
  utils::loading_cache: add the ability to work with not-copy-constructable values
  utils::loading_cache: add EntrySize template parameter
  utils::loading_cache: rework on top of utils::loading_shared_values
  sstables::shared_index_list: use utils::loading_shared_values
  utils: introduce loading_shared_values
2017-10-04 09:13:32 +01:00
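The size limit that fdfa6703c3 above adds to the prepared statements cache can be pictured as a small least-recently-used map. This is a generic LRU sketch under that assumption, not utils::loading_cache itself: the class name is hypothetical, values are plain strings, and time-based eviction and asynchronous loading are left out.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

class prepared_cache_sketch {
    using entry = std::pair<std::string, std::string>;  // key, statement text
    std::size_t _max_entries;
    std::list<entry> _lru;  // front = most recently used
    std::unordered_map<std::string, std::list<entry>::iterator> _index;
public:
    explicit prepared_cache_sketch(std::size_t max_entries)
        : _max_entries(max_entries) {}

    void insert(const std::string& key, std::string value) {
        auto it = _index.find(key);
        if (it != _index.end()) {  // replace an existing entry
            _lru.erase(it->second);
            _index.erase(it);
        }
        _lru.emplace_front(key, std::move(value));
        _index[key] = _lru.begin();
        if (_lru.size() > _max_entries) {  // evict the least recently used
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
    }

    // Synchronous lookup; a hit marks the entry as recently used.
    std::optional<std::string> find(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            return std::nullopt;
        }
        _lru.splice(_lru.begin(), _lru, it->second);
        return it->second->second;
    }

    std::size_t size() const { return _lru.size(); }
};
```

With a limit of two entries, inserting a third evicts whichever of the first two was touched least recently.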
Daniel Fiala
1133838b9f types: Add data_type_for for varint and decimal, data_value constructor for simple_date_type.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171004044040.21631-1-daniel@scylladb.com>
2017-10-04 10:52:57 +03:00
Tomasz Grabiec
f506339582 tests: perf_fast_forward: Auto-create test directory
To avoid exception due to missing directory.

Message-Id: <1506081627-12933-1-git-send-email-tgrabiec@scylladb.com>
2017-10-03 15:36:37 +03:00
Botond Dénes
fea6214a0a Update reader restriction related metrics
Update description of existing reader count metrics, add memory
consumption metrics. Use labels to distinguish between system, user and
streaming reads related metrics.
2017-10-03 12:44:17 +03:00
Botond Dénes
3280fbc4d4 Add restricted_reader_test unit test 2017-10-03 12:44:17 +03:00
Botond Dénes
47e07b787e restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emitted by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account for their own memory consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-10-03 12:44:12 +03:00
Botond Dénes
0a07e9e7c7 mutation_reader.hh: Move restricted_reader related code
In preparation of make_restricted_reader taking a mutation_source as
its argument.
2017-10-03 12:39:22 +03:00
Avi Kivity
78eae8bf48 Revert "Merge "Make restricting_mutation_reader more accurate" from Botond"
This reverts commit c6e5dcc556, reversing
changes made to 19b21a0ab2. Fails to build,
plus author has more changes.
2017-10-03 11:58:59 +03:00
Pekka Enberg
641f28da02 cql3/statements: Clean up select_statement class definition
We have some historical #ifdef'd code that really ought to be removed by now...

Message-Id: <1507015932-8165-1-git-send-email-penberg@scylladb.com>
2017-10-03 11:17:32 +03:00
Avi Kivity
c6e5dcc556 Merge "Make restricting_mutation_reader more accurate" from Botond
"Currently restricting_mutation_reader restricts mutation_readers on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amount of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
are consumed."

Fixes #2692.

* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
  Update reader restriction related metrics
  Add restricted_reader_test unit test
  restricted_mutation_reader: restrict based-on memory consumption
  mutation_reader.hh: Move restricted_reader related code
2017-10-03 11:15:34 +03:00
Daniel Fiala
19b21a0ab2 types: Allow 'T' as a date-time separator in timestamps.
* Letter 'T' is specified in ISO 8601 and also in Cassandra
  documentation.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171003073558.19257-1-daniel@scylladb.com>
2017-10-03 11:10:11 +03:00
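What 19b21a0ab2 above allows can be shown with a small splitting helper. `split_timestamp` is a hypothetical function, not the parser in types.cc; it only separates the date part from the time part and does not validate either, nor does it handle time zone suffixes.

```cpp
#include <string>
#include <utility>

// Split "<date><sep><time>" at the first ' ' or 'T'.
// If no separator is present, the whole input is treated as the date.
std::pair<std::string, std::string> split_timestamp(const std::string& s) {
    auto pos = s.find_first_of(" T");
    if (pos == std::string::npos) {
        return {s, ""};
    }
    return {s.substr(0, pos), s.substr(pos + 1)};
}
```

Under this sketch, "2017-10-03 11:10:11" and "2017-10-03T11:10:11" split into the same two parts.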
Avi Kivity
3cc1c2c387 Merge seastar upstream
* seastar 899fc4e...c62bbf9 (6):
  > Merge "CPU Scheduler for seastar" from Avi
  > reactor: set SCHED_FIFO policy for timer thread
  > future: mark future::wait() as noexcept
  > shared_promise: Make get_shared_future() const-qualified
  > Remove pessimizing and redundant std::move()-s reported by Clang-tidy utility
  > Work around GCC 5 bug: scylladb/seastar#338, scylladb/seastar#339
2017-10-02 20:47:32 +03:00
Avi Kivity
dd5ab75d04 range: add missing include
Message-Id: <20171002144608.5032-1-avi@scylladb.com>
2017-10-02 16:49:24 +02:00
Avi Kivity
5ed6d1b176 dist: enable CAP_SYS_NICE
Allow scylla to use SCHED_FIFO for the timer thread for more accurate
scheduling.
Message-Id: <20171001121500.28318-1-avi@scylladb.com>
2017-10-02 16:32:00 +02:00
Avi Kivity
dbce5158a3 Update ami submodule
* dist/ami/files/scylla-ami 5ffa449...be90a3f (1):
  > amazon kernel: enable updates
2017-10-02 17:07:09 +03:00
Piotr Jastrzebski
83fd22face Add test to reproduce #2854
When memtable gets flushed, existing mutation_readers created
for it stop handling fast_forward_to correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f580ac59f3fcec53e7c78ad7a8b6374eb36958c6.1506690042.git.piotr@scylladb.com>
2017-09-29 15:17:53 +02:00
Piotr Jastrzebski
2583207d9d Fix memtable scanning_reader::fast_forward_to
If memtable is flushed then call fast_forward_to on _delegate.
Otherwise call iterator_reader::fast_forward_to.

Fixes #2854

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6bf1c8bafce845ef945698ce4d722c3c8606e632.1506690042.git.piotr@scylladb.com>
2017-09-29 15:17:39 +02:00
Asias He
c0b965ee56 gossip: Better check for gossip stabilization on startup
This is a backport of Apache CASSANDRA-9401
(2b1e6aba405002ce86d5badf4223de9751bf867d)

It is better to check that the number of nodes in the endpoint_state_map
is not changing to detect gossip stabilization.

Fixes #2853
Message-Id: <e9f901ac9cadf5935c9c473433dd93e9d02cb748.1506666004.git.asias@scylladb.com>
2017-09-29 08:57:25 +02:00
Tomasz Grabiec
d75f243a8b Update seastar submodule
Fixes #2770.
Fixes #2819.

* seastar 92fdce2...899fc4e (14):
  > scollectd: increment the metadata iterator with the values
  > Enable Travis CI builds for Seastar.
  > tests: Fix httpd test compilation error caused by unconditionally explicit tuple constructor in GCC5: scylladb/seastar#326
  > core::shared_future: add available() and failed() methods
  > rpc: make sure that _write_buf stream is always properly closed
  > log: Fail on attempt to register logger with the same name twice
  > Merge "Make backtraces useful on ASLR-enabled machines as well" from Botond
  > reactor: add option to bypass fsync
  > future-util: modernize do_until() implementation
  > future-util: fix do_until() API to not have forwarding references
  > input_stream: add rvalue variant of input_stream::consume()
  > logger: remove extra spaces after timestamp
  > tutorial: lifetime management
  > Fix broken link for fsqual failure message
2017-09-28 15:27:34 +02:00
Piotr Jastrzebski
6069bab755 Cache single queries to non-existing partitions
This way we don't need to query sstables again
when the query is repeated.

Fixes #1533

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <8f8559ff19c534dbbb7c9ef6c28271cec607ba20.1506521461.git.piotr@scylladb.com>
2017-09-27 16:15:18 +02:00
Tomasz Grabiec
b704710954 migration_manager: Make sure schema pulls eventually happen when schema_tables_v3 is enabled
We don't pull schema during rolling upgrade, that is until
schema_tables_v3 feature is enabled on all nodes.

Because features are enabled from gossiper timer, there is a race
between feature enablement and processing of endpoint states which may
trigger schema pull.  It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.

The fix is to ensure that pulls abandoned due to feature not being enabled
will be retried when it is enabled.

Fixes sporadic failure in dtest:

  repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>
2017-09-27 12:00:07 +01:00
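The retry idea in b704710954 can be illustrated with a small sketch. It assumes a simplified model in which abandoned pulls are remembered and replayed once the feature is enabled; the class and method names are hypothetical and do not come from migration_manager.

```python
class PullScheduler:
    """Defers schema pulls until a cluster feature is enabled (sketch)."""

    def __init__(self, do_pull):
        self._do_pull = do_pull          # hypothetical callback performing the pull
        self._feature_enabled = False
        self._deferred = set()           # endpoints whose pull was abandoned

    def maybe_pull(self, endpoint):
        if not self._feature_enabled:
            # Remember the request instead of dropping it, so that it is
            # not lost until the next schema change.
            self._deferred.add(endpoint)
            return False
        self._do_pull(endpoint)
        return True

    def enable_feature(self):
        self._feature_enabled = True
        pending, self._deferred = self._deferred, set()
        for endpoint in sorted(pending):
            self._do_pull(endpoint)
```

With this shape, the order in which the feature is enabled and the endpoint state is processed no longer matters: a pull requested too early is replayed later.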
Tomasz Grabiec
7a58fb5767 gossiper: Allow waiting for feature to be enabled
Message-Id: <1506428715-8182-1-git-send-email-tgrabiec@scylladb.com>
2017-09-27 11:57:06 +01:00
Raphael S. Carvalho
63eb9f61c0 db: use correct dirty memory manager for system column families
Dirty memory manager for non-system column families was being used
when applying mutations to system cfs.
That previously led to deadlock when updating history. Basically,
write disable waits on compaction, and compaction waits on a write
that would release dirty memory for updating compaction history.

Using the correct dirty memory manager alone wouldn't solve this problem
if writes are disabled for a system cf, but together with the previous
change, which updates history outside the sstable lock, the problem is
completely solved.

Refs #2769.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>
2017-09-26 19:51:31 +02:00
Raphael S. Carvalho
e34c1db642 db: update compaction history outside the sstable write lock
The reason to do that is because compaction can deadlock if refresh
disables write which waits for compaction, and compaction in turn
waits for dirty memory[1] that would be released by memtable write.

Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.

[1]: when updating compaction history.

Fixes #2769.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
2017-09-26 19:51:12 +02:00
Asias He
4b1034b9cd storage_service: Remove the stream_hints
Our hinted handoff implementation will not use the
db::system_keyspace::HINTS system table to store hints.
No need to stream them.

Acked-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <3b9190e250b54321ceb87767f4722c7458d41797.1506391500.git.asias@scylladb.com>
2017-09-26 19:05:21 +03:00
Paweł Dziepak
af1976bc30 Merge "Fix cache reader skipping rows in some cases" from Tomasz
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.

Fixes #2834."

* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
  tests: mvcc: Add test for partition_snapshot_row_cursor
  tests: row_cache: Add test for concurrent population
  tests: row_cache: Make populate_range() accept partition_range
  tests: Add simple_schema::make_ckey_range()
  cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
  mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
  mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
  mvcc: Keep track of all iterators in partition_snapshot_row_cursor
  mvcc: Make partition_snapshot_row_cursor printable
2017-09-26 15:09:58 +01:00
Tomasz Grabiec
3eb251e3a4 tests: perf_fast_forward: Fail if ran with more than one shard
The test reads only from the local shard; if run with more shards,
the current shard will miss some of the data.

Message-Id: <1506081609-12811-1-git-send-email-tgrabiec@scylladb.com>
2017-09-26 15:23:10 +03:00
Calle Wilund
dd2b8821a4 everywhere_strategy: Make get_natural_endpoints handle non-init state
Make get_natural_endpoints return the local address iff token metadata
is not yet set up (since that is the one address we already know of).

If a request has a consistency level requiring more endpoints, it
will still fail, but for calls with, for example, CL=ONE, at startup
we will succeed, and more or less act like local strategy. Yet,
further down the line, have data distributed as desired.

Acked-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20170926113512.15707-1-calle@scylladb.com>
2017-09-26 15:21:30 +03:00
Asias He
98e9049820 gossip: Print SCHEMA_TABLES_VERSION correctly
Found this when debugging gossip with debug print. The application state
SCHEMA_TABLES_VERSION was printed as UNKNOWN.
Message-Id: <d7616920d2e6516b5470a758bcf9c88f3d857381.1506391495.git.asias@scylladb.com>
2017-09-26 08:38:28 +02:00
Tomasz Grabiec
e5e9886014 tests: mvcc: Add test for partition_snapshot_row_cursor 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
e4adc9c600 tests: row_cache: Add test for concurrent population 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
a3fb7ce660 tests: row_cache: Make populate_range() accept partition_range 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
dd7af02251 tests: Add simple_schema::make_ckey_range() 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
e83cd508f6 cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
We were checking if the cursor is up_to_date(), but this is not enough
to guarantee that the cursor is valid, merely that its iterators are
valid. The cursor may be invalidated even if its iterators are valid
if there was an insertion after cursor's position.

Fixes #2834.
2017-09-25 11:21:58 +02:00
Tomasz Grabiec
2f8d91043d mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
The cursor maintains a heap of iterators in all versions. If rows were
inserted before the latest version's iterator, cursor would not see
them. Fix by redoing the lookup for iterators not in the current row
in maybe_refresh().

Refs #2834.
2017-09-25 11:21:58 +02:00
Tomasz Grabiec
09d99b0358 mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid() 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
4ee11641c0 mvcc: Keep track of all iterators in partition_snapshot_row_cursor
Will be needed when updating the iterator for latest version. Before
this change, such iterator could be neither in _current_row nor in
_heap.

Besides that, this will allow user to always access the iterator of
latest version, which enables some optimizations in the future of
avoiding unnecessary lookups. get_iterator_in_latest_version() is now
always valid.
2017-09-25 11:21:58 +02:00
Tomasz Grabiec
a8cbd34dde mvcc: Make partition_snapshot_row_cursor printable 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
8e46d15f91 storage_service: Register features before joining
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registered, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on schema sync which will not happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>
2017-09-25 09:13:02 +01:00
Tomasz Grabiec
b92dcb0284 storage_service: Extract register_features()
Message-Id: <1505893837-27876-1-git-send-email-tgrabiec@scylladb.com>
2017-09-25 09:12:46 +01:00
Tomasz Grabiec
d11d696072 tests: mutation_source_tests: Fix use-after-scope on partition range
Message-Id: <1506096881-3076-1-git-send-email-tgrabiec@scylladb.com>
2017-09-22 19:13:47 +02:00
Botond Dénes
015ac042a8 combined_mutation_reader_test: remove unneeded includes
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a388efa6fc93049f4d69c049764cc9225a04bce4.1506098363.git.bdenes@scylladb.com>
2017-09-22 18:45:04 +02:00
Botond Dénes
a7984a9908 combined_mutation_reader_test: remove leftover debug logging
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <96e61fcd2543ec84921f1b2188d7248e55e7efe0.1506097635.git.bdenes@scylladb.com>
2017-09-22 18:44:47 +02:00
Tomasz Grabiec
5def901a92 sstables: Don't register logger with the same name twice
There can be only one logger with a given name. This was causing
--logger-log-level sstable=trace to not work for the majority of log
points.

Message-Id: <1505902259-4561-1-git-send-email-tgrabiec@scylladb.com>
2017-09-20 16:40:06 +03:00
Piotr Jastrzebski
98c359d7de Introduce a position for partition start
This position will be used for mutation fragment
that represents the start of a partition.

This position sorts before static row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-09-20 11:34:03 +02:00
Piotr Jastrzebski
e1f7d1f25d streamed_mutation: Extract concepts using GCC6_CONCEPT macro
It makes it easier to actually use those concepts.

Lambdas passed to mutation_fragment::visit have to declare a
return type, otherwise the compiler fails with:

internal compiler error: Segmentation fault

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-09-20 11:34:03 +02:00
Tomasz Grabiec
02d41864af Merge "Fix miss opportunity to update gossiper features" from Asias
The gossiper checks if features should be enabled from its timer
callback when it detects that endpoint_state_map changed, that is,
when it differs from shadow_endpoint_state_map.

shadow_endpoint_state_map is also assigned from endpoint_state_map in
storage_service::replicate_tm_and_ep_map(), called from
storage_service::on_change()

Call gossiper::maybe_enable_features() in replicate_tm_and_ep_map so
that we won't miss a gossip feature update.

Fixes #2824

* git@github.com:scylladb/seastar-dev asias/gossip_miss_feature_update_v1:
  gossip: Move the _features_condvar signal code to
    maybe_enable_features
  gossip: Make maybe_enable_features public
  storage_service: Check gossip feature update in
    replicate_tm_and_ep_map
2017-09-20 11:16:37 +02:00
Asias He
ebc3bada12 storage_service: Check gossip feature update in replicate_tm_and_ep_map
This is another place we can update endpoint_state_map in addition to
gossiper::run().

Call gossiper::maybe_enable_features() so that we won't miss a gossip
feature update.
2017-09-20 16:58:33 +08:00
Asias He
6022b7423a gossip: Make maybe_enable_features public
It will be needed by storage_service.
2017-09-20 16:58:33 +08:00
Asias He
68c7a391b5 gossip: Move the _features_condvar signal code to maybe_enable_features
This makes it easier to call the feature update logic from outside the gossiper.
2017-09-20 16:58:32 +08:00
Asias He
173cba67ba storage_service: Remove rpc client on all shards in on_dead
We should close connections to nodes that are down on all shards instead
of the shard which runs the on_dead gossip callback.

Found by Gleb.
Message-Id: <527a14105a07218066e9f1da943693d9de6993e5.1505894260.git.asias@scylladb.com>
2017-09-20 10:23:31 +02:00
Botond Dénes
43dba8f173 Update reader restriction related metrics
Update description of existing reader count metrics, add memory
consumption metrics.
2017-09-20 11:16:21 +03:00
Botond Dénes
b2db29dc65 Add restricted_reader_test unit test 2017-09-20 11:15:45 +03:00
Botond Dénes
33e97e7457 restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emitted by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account for their own memory consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-09-20 11:14:35 +03:00
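The accounting scheme described in 33e97e7457 can be sketched as below. The 16k up-front cost is taken from the commit message; the class and method names are hypothetical, and letting the budget go negative for buffers of already-admitted readers is an assumption made for illustration.

```python
class ReaderMemoryBudget:
    """Sketch of memory-based reader admission."""

    NEW_READER_COST = 16 * 1024  # up-front units per new reader (from the commit message)

    def __init__(self, total_units):
        self.available = total_units

    def try_admit_reader(self):
        # Only *new* readers are gated on the remaining budget.
        if self.available < self.NEW_READER_COST:
            return False
        self.available -= self.NEW_READER_COST
        return True

    def on_buffer_allocated(self, size):
        # Existing readers are never blocked; the budget may go negative
        # (an assumption of this sketch).
        self.available -= size

    def on_buffer_freed(self, size):
        self.available += size

    def on_reader_closed(self):
        self.available += self.NEW_READER_COST
```

In the real code a deferred reader would be created once enough units are returned; here the caller simply retries try_admit_reader().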
Botond Dénes
e4a9e55e0d mutation_reader.hh: Move restricted_reader related code
In preparation of make_restricted_reader taking a mutation_source as
its argument.
2017-09-20 11:12:57 +03:00
Tomasz Grabiec
741ec61269 streaming: Fix streaming not streaming all ranges
It skipped one sub-range in each batch of 10 ranges, and
tried to access the range vector using the end() iterator.

Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.

Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
2017-09-20 10:33:59 +03:00
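For reference, batching that covers every range exactly once (the property restored by 741ec61269) can be sketched as follows. The helper name is hypothetical and this is not the streaming code itself.

```python
def split_into_batches(ranges, batch_size=10):
    """Split `ranges` into consecutive batches of at most `batch_size`
    items, without skipping or repeating any element."""
    return [ranges[i:i + batch_size] for i in range(0, len(ranges), batch_size)]
```

Using half-open slices avoids both off-by-one hazards named in the commit: no element is dropped at a batch boundary, and nothing is read past the end.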
Avi Kivity
5b0cb28af9 Merge "row_cache: Call fast_forward_to() outside allocating section" from Tomasz
"On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.

We shouldn't call fast_forward_to() with the cache region locked.

Fixes #2791."

* 'tgrabiec/dont-ffwd-in-alloc-section' of github.com:scylladb/seastar-dev:
  cache_streamed_mutation: De-futurize cursor movement
  cache_streamed_mutation: Call fast_forward_to() outside allocating section
  cache_streamed_mutation: Switch from flags to explicit state machine
2017-09-19 17:11:22 +03:00
Botond Dénes
96c6d54a5c incremental_reader_selector: Remove unecessary check for duplicated next_token
The next_token will never be the same as the current _selector_position,
unless they are both maximum_token, which is already handled.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9c54ae07a18d201185027c9b533bcb5256bead8a.1505826102.git.bdenes@scylladb.com>
2017-09-19 16:42:02 +03:00
Avi Kivity
12c393dd16 build: default to gold linker
Better, faster.
Message-Id: <20170919115737.12084-1-avi@scylladb.com>
2017-09-19 14:02:31 +02:00
Avi Kivity
a31ade54e0 streamed_mutation: optimize merge_mutations() if only one mutation
If we read a partition from a single sstable (a fairly common case),
we can bypass mutation_merger and just return the input.
Message-Id: <20170918181418.14021-1-avi@scylladb.com>
2017-09-19 11:00:59 +01:00
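The shortcut in a31ade54e0 amounts to skipping the merge machinery when there is nothing to merge. A minimal sketch, using sorted Python iterables in place of streamed mutations (the function name is hypothetical):

```python
import heapq


def merge_sorted_streams(streams):
    """Merge already-sorted streams; return a lone input unchanged."""
    if len(streams) == 1:
        # Fast path: a single source needs no merging at all.
        return streams[0]
    return heapq.merge(*streams)
```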
Botond Dénes
8cb953b58b incremental_reader_selector: don't create readers unconditionally on ff
When fast-forwarding check that the new position is past the selector
before attempting to create new readers. Also don't clear the set of
already created readers and don't overwrite the selector position.

Fixes #2807

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <514f69005eb29c2a3359f098d40abf588900b76f.1505811064.git.bdenes@scylladb.com>
2017-09-19 11:27:47 +02:00
Asias He
8f8273969d gossip: Do not wait for echo message in mark_alive
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before whole state is processed.

Apache (tm) Cassandra (tm) sends echos in the background.

In a large cluster, we see that at the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet. As a
result, the joining node fails to stream from some of the existing
nodes.

Fixes #2787
Fixes #2797

Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
2017-09-19 10:49:00 +03:00
Raphael S. Carvalho
1524426deb sstables: Fix compaction correctness of higher-level tables
When incremental_reader_selector is used for compaction, it will
first call incremental selector of partitioned sstable set with
minimum token that will result in first interval being skipped,
which means not everything being compacted. The interval is
skipped because iterator is incorrectly advanced when token
lies before it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918021446.15920-1-raphaelsc@scylladb.com>
2017-09-19 09:59:30 +03:00
Avi Kivity
55e0b63e65 storage_proxy: scan more nodes exponentially to achieve target result set size
The current sequential scan can take a long time on a small or empty table
with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to
exponential scan reduces the time.

Fixes #1230.
Message-Id: <20170912173803.8277-1-avi@scylladb.com>
2017-09-18 15:15:15 +02:00
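The exponential scan from 55e0b63e65 can be sketched as follows. This is a simplified model: `read_range` is a hypothetical callback, and the ranges within one round are read sequentially here, whereas a real coordinator would presumably query them concurrently.

```python
def scan_until(ranges, read_range, target_rows):
    """Read ranges in rounds of 1, 2, 4, ... until `target_rows` rows are
    collected or the ranges are exhausted. Returns (rows, rounds)."""
    rows = []
    rounds = 0
    position = 0
    batch = 1
    while position < len(ranges) and len(rows) < target_rows:
        for r in ranges[position:position + batch]:
            rows.extend(read_range(r))
        position += batch
        batch *= 2  # double the number of ranges scanned per round
        rounds += 1
    return rows[:target_rows], rounds
```

On an empty table with 1000 ranges this takes 10 rounds instead of 1000, which is where the time saving comes from.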
Avi Kivity
e44517851e untyped_result_set: reduce dependencies
Forward-declare untyped_result_set and untyped_result_set_row, and remove
the include from query_processor.hh.
Message-Id: <20170916170859.27612-3-avi@scylladb.com>
2017-09-18 15:15:15 +02:00
Avi Kivity
0317746822 untyped_result_set: make untyped_result_set::row a namespace scope class
Makes it possible to forward-declare, with the aim of reducing dependencies.
Message-Id: <20170916170859.27612-2-avi@scylladb.com>
2017-09-18 15:15:15 +02:00
Pekka Enberg
9ebd8be82b index: Add secondary_index_manager::reload()
This patch adds a reload() function, which updates the secondary index
manager index map to match underlying column family indices.
2017-09-18 14:31:35 +03:00
Duarte Nunes
16d2e4e81b Merge 'reduce sstables.hh coupling' from Glauber
"sstables.hh is already too big, and it is soon to become bigger with the
inclusion of the read_monitor, to pair it with the write_monitor.

It's a good opportunity for us to reduce sstables.hh dependencies by
moving the write monitor to its own header. One obvious caller is
already changed so we don't need to include sstables.hh anymore."

* 'progress-monitor' of https://github.com/glommer/scylla:
  sstables: do not include sstables.hh from memtable glue
  sstables: move write_monitor to its own header
2017-09-18 13:31:32 +02:00
Pekka Enberg
2ae6b141e5 index: Add secondary_index_manager::list_indexes() 2017-09-18 14:27:35 +03:00
Avi Kivity
a2f26f7b29 log_histogram: rename to log_heap
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
2017-09-18 12:44:05 +02:00
Amnon Heiman
8d668a9dc0 API: storage_service repair_async_status to return proper error code
This patch changes the implementation of storage_service
repair_async_status to throw an exception, so that a 400 return code
will be returned.

Fixes #2794

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917080533.6612-1-amnon@scylladb.com>
2017-09-18 09:08:26 +03:00
Vlad Zolotarov
cea15486c4 tests: loading_cache_test: initial commit
Test utils::loading_shared_values and utils::loading_cache.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 22:19:15 -04:00
Vlad Zolotarov
66568be969 cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
- Transition the prepared statements caches for both CQL and Thrift to the cql3::prepared_statements_cache class.
   - Add the corresponding metrics to the query_processor:
      - Evictions count.
      - Current entries count.
      - Current memory footprint.

Fixes #2474

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 22:19:15 -04:00
Vlad Zolotarov
8f912b46b1 cql3: prepared statements cache on top of loading_cache
This is a template class that implements caching of prepared statements for a given ID type:
   - Each cache instance is given 1/256 of the total shard memory. If the new entry is going to overflow
     this memory limit - the less recently used entries are going to be evicted so that the new entry could
     be added.
   - The memory consumption of a single prepared statement is defined by a cql3::prepared_cache_entry_size
     functor class that returns a number of bytes for a given prepared statement (currently returns 10000
     bytes for any statement).
   - The cache entry is going to be evicted if not used for 60 minutes or more.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 22:19:11 -04:00
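The eviction policy described in 8f912b46b1 (size-bounded, least recently used first, plus expiry after 60 minutes of disuse) can be sketched as below. This is an illustration, not the loading_cache implementation: the class name and the explicit `now` arguments are hypothetical, chosen to keep the example deterministic.

```python
from collections import OrderedDict


class SizedLruCache:
    """Size-bounded LRU cache with idle expiry (sketch)."""

    def __init__(self, max_bytes, entry_size, ttl_seconds=60 * 60):
        self._max_bytes = max_bytes
        self._entry_size = entry_size     # functor: value -> size in bytes
        self._ttl = ttl_seconds
        self._entries = OrderedDict()     # key -> (value, size, last_used), oldest first
        self._used = 0

    def _evict_oldest(self):
        _, (_, size, _) = self._entries.popitem(last=False)
        self._used -= size

    def get(self, key, now):
        item = self._entries.get(key)
        if item is None:
            return None
        value, size, _ = item
        self._entries[key] = (value, size, now)
        self._entries.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key, value, now):
        old = self._entries.pop(key, None)
        if old is not None:
            self._used -= old[1]
        size = self._entry_size(value)
        # Evict least recently used entries until the new one fits.
        while self._entries and self._used + size > self._max_bytes:
            self._evict_oldest()
        self._entries[key] = (value, size, now)
        self._used += size

    def expire(self, now):
        # Entries are ordered by last use, so stop at the first fresh one.
        while self._entries:
            _, _, last_used = next(iter(self._entries.values()))
            if now - last_used < self._ttl:
                break
            self._evict_oldest()
```

With a fixed 10000-byte estimate per statement, as the commit describes, `entry_size` would simply be `lambda stmt: 10000`.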
Vlad Zolotarov
9a43398d6a utils::loading_cache: make the size limitation more strict
Ensure that the size of the cache is never bigger than the "max_size".

Before this patch the size of the cache could have been indefinitely bigger than
the requested value during the refresh time period which is clearly an undesirable
behaviour.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
4e72a56310 utils::loading_cache: added static_asserts for checking the callbacks signatures
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
a13362e74b utils::loading_cache: add a bunch of standard synchronous methods
Add a few standard synchronous methods to the cache, e.g. find(), remove_if(), etc.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
fa2f8162a5 utils::loading_cache: add the ability to create a cache that would not reload the values
Sometimes we don't want the cached values to be periodically reloaded.
This patch adds the ability to control this using a ReloadEnabled template parameter.

In case the reloading is not needed the "loading" function is not given to the constructor
but rather to the get_ptr(key, loader) method (currently it's the only method that is used, we may add
the corresponding get(key, loader) method in the future when needed).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
a60a77dfc8 utils::loading_cache: add the ability to work with not-copy-constructable values
The current get(...) interface restricts the cache to work only with copy-constructible
values (it returns future<Tp>).
To make it work with non-copyable values we need to introduce an interface that would
return something like a reference to the cached value (like regular containers do).

We can't return future<Tp&> since the caller would have to ensure somehow that the underlying
value is still alive. A much safer and easier-to-use way would be to return a shared_ptr-like
pointer to that value.

"Luckily" for us, the value we actually store in the cache is already wrapped in an lw_shared_ptr,
and we may simply return an object that presents itself as a smart_pointer<Tp> value while
it keeps a "reference" to the object stored in the cache.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
c24d85f632 utils::loading_cache: add EntrySize template parameter
Allow a variable entry size parameter.
Provide an EntrySize functor that would return a size for a
specific entry.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
6024014f92 utils::loading_cache: rework on top of utils::loading_shared_values
Get rid of the "proprietary" solution for asynchronous values on-demand loading.
Use utils::loading_shared_values instead.

We would still need to maintain intrusive set and list for efficient shrink and invalidate
operations but their entry is not going to contain the actual key and value anymore
but rather a loading_shared_values::entry_ptr which is essentially a shared pointer to a key-value
pair value.

In general, we added another level of dereferencing in order to get the key value but since
we use the bi::store_hash<true> in the hook and the bi::compare_hash<true> in the bi::unordered_set
this should not translate into an additional set lookup latency.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
d56684b1a5 sstables::shared_index_list: use utils::loading_shared_values
Since utils::loading_shared_values API is based on the original shared_index_list
this change is mostly a drop-in replacement of the corresponding parts.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
ec3fed5c4d utils: introduce loading_shared_values
This class implements a key-value container that is populated
using the provided asynchronous callback.

The value is loaded when there are active references to the value for the given key.

Container ensures that only one entry is loaded per key at any given time.

The returned value is a lw_shared_ptr to the actual value.

The value for a specific key is immediately evicted when there are no
more references to it.

The container is based on the boost::intrusive::unordered_set and is rehashed (grown) if needed
every time a new value is added (asynchronously loaded).

The container has a rehash() method that would grow or shrink the container as needed
in order to get the load factor into the [0.25, 0.75] range.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:06 -04:00
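The rehash policy mentioned in ec3fed5c4d keeps the load factor within [0.25, 0.75]. A rough sketch of how a target bucket count could be chosen is shown below; the doubling and halving steps, the function name, and the minimum bucket count are assumptions of this sketch, not details from the patch.

```python
def target_bucket_count(num_entries, buckets, low=0.25, high=0.75, min_buckets=16):
    """Return a bucket count that brings num_entries / buckets into
    [low, high], never going below `min_buckets`."""
    # Grow while the table is too dense.
    while num_entries / buckets > high:
        buckets *= 2
    # Shrink while the table is too sparse.
    while buckets > min_buckets and num_entries / buckets < low:
        buckets //= 2
    return buckets
```

Because halving at most doubles a load factor that was below 0.25, shrinking never overshoots the upper bound, so the two loops do not fight each other.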
Glauber Costa
2227ae3f19 sstables: do not include sstables.hh from memtable glue
There is no need to include the whole sstables.hh file in
memtable-sstable.hh anymore. All we need is the shared_sstable
definition and the progress monitor.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-09-15 14:16:35 -04:00
Glauber Costa
51829f528d sstables: move write_monitor to its own header
Soon I am about to introduce a read monitor, and pairing infrastructure
to manage it. Having it all live in sstables.hh forces us to include it
every time, even in places that don't really need it.

Move to its own header.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-09-15 14:09:07 -04:00
Tomasz Grabiec
804722b6c8 tests: perf_fast_forward: Fix use-after-scope on partition range
Message-Id: <1505489249-16806-1-git-send-email-tgrabiec@scylladb.com>
2017-09-15 16:34:41 +01:00
Tomasz Grabiec
7b5b461067 cache_streamed_mutation: De-futurize cursor movement
start_reading_from_underlying() doesn't return future<>
any more, so we can simplify this.
2017-09-15 15:41:55 +02:00
Tomasz Grabiec
22019577cc cache_streamed_mutation: Call fast_forward_to() outside allocating section
On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.

We shouldn't call fast_forward_to() with the cache region locked.

Fixes #2791.
2017-09-15 15:41:55 +02:00
Tomasz Grabiec
3b790a1e80 cache_streamed_mutation: Switch from flags to explicit state machine
We're in one state at a time, so it's better to express it as a single
variable rather than N independent flags.

In preparation before adding more states.
2017-09-15 15:41:55 +02:00
Glauber Costa
eb93d5f8ad database: pass a monitor as a parameter to memtable writer
Right now we pass a permit to the memtable writer and that permit is
used inside write_memtable_to_sstable to compose a write_monitor.

We would like to extend the write_monitor to include other things, that
right now are not available as parameters to write_memtable_to_sstable -
and which are possibly too specialized to be.

The solution for that is to pass the write_monitor instead of the permit
to the writer. Conceptually, that also makes sense because the
write_monitor is something the sstable writer is aware of. Permits, on
the other hand, are a database concept that is alien to the sstable
writer.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170915032836.21154-1-glauber@scylladb.com>
2017-09-15 12:26:56 +02:00
Duarte Nunes
8378fe190a Merge 'Fix schema version mismatch during rolling upgrade from 1.7' from Tomasz
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:

    storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d

The situation should heal after the whole cluster is upgraded.

Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.

Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which causes all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.

The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."

* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
  tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
  tests: cql_test_env: Enable all features in tests
  schema_tables: Make make_scylla_tables_mutation() visible
  migration_manager: Disable pulls during rolling upgrade from 1.7
  storage_service: Introduce SCHEMA_TABLES_V3 feature
  schema_tables: Don't alter tables which differ only in version
  schema_mutations: Use mutation_opt instead of stdx::optional<mutation>
2017-09-15 10:27:47 +02:00
Tomasz Grabiec
c657eec4cf tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case 2017-09-14 20:26:31 +02:00
Tomasz Grabiec
f0fdf75e7c tests: cql_test_env: Enable all features in tests 2017-09-14 20:26:31 +02:00
Tomasz Grabiec
571cac95ed schema_tables: Make make_scylla_tables_mutation() visible
For tests.
2017-09-14 20:26:31 +02:00
Tomasz Grabiec
5a92c18e63 migration_manager: Disable pulls during rolling upgrade from 1.7
If there is a schema pull during rolling upgrade between two 2.0 nodes,
then schema merge will delete the persisted schema version. When the node
loads that table again, e.g. on restart, it will generate a version
which is different than the one which 1.7 nodes use. This will
cause reads and writes to fail.

To avoid this, disable pulls until all nodes are upgraded.

Fixes #2802.
2017-09-14 20:26:31 +02:00
Tomasz Grabiec
713d75fd51 storage_service: Introduce SCHEMA_TABLES_V3 feature 2017-09-14 20:26:31 +02:00
Tomasz Grabiec
f943d2efbf schema_tables: Don't alter tables which differ only in version
We apply deletion of scylla_tables.version to the incoming schema
mutations so that table schema version is recalculated after merge.
The mutations which we read from local schema tables may not have it
deleted in which case all tables would be considered as differing on
the presence of the version field. Avoid this by deleting the field
from old mutations as well.
2017-09-14 20:26:31 +02:00
Tomasz Grabiec
99272087e6 schema_mutations: Use mutation_opt instead of stdx::optional<mutation> 2017-09-14 20:26:31 +02:00
Takuya ASADA
7662271fc9 dist/ami: show correct message when scylla-ami-setup.service failed
After the service is started, its state may become
"failed", "active" or "activating".
But our script does not handle scylla-ami-setup.service entering the "failed" state,
and as a result it shows the wrong message.
So handle all three states correctly.

Fixes #2759

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1504589079-1986-1-git-send-email-syuu@scylladb.com>
2017-09-14 12:40:05 +03:00
Tomasz Grabiec
0911fbbdef row_cache: Fix row_cache::update_invalidating()
evict() doesn't guarantee that the whole partition is discontinuous.
In particular, partition tombstone cannot be marked as discontinuous.
The parts which are still continuous must be updated.

Broken after c78047fa5b.

Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>
2017-09-14 10:58:25 +03:00
Asias He
0ec574610d locator: Get rid of assert in token_metadata
In commit 69c81bcc87 (repair: Do not allow repair until node is in
NORMAL status), we saw a coredump due to an assert in
token_metadata::first_token_index.

Throw an exception instead of aborting the whole scylla process.
Message-Id: <c110645cee1ee3897e30a3ae1b7ab3f49c97412c.1504752890.git.asias@scylladb.com>
2017-09-14 10:33:02 +03:00
Gleb Natapov
31e803a36c storage_proxy: wire up percentile speculative read properly
Collect coordinator-side read statistics per CF and use them in the percentile
speculative read executor. Getting a percentile from an estimated_histogram
object is rather expensive, so cache it and recalculate only once per
second (or if requested percentile changes).

Fixes #2757

Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
2017-09-14 10:31:26 +03:00
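The caching rule described in 31e803a36c (recompute the percentile at most once per second, or sooner if the requested percentile changes) can be sketched as follows. This is only an illustration, not the code from the patch: the class name cached_percentile, the injected compute callback and the explicit "now" argument are all assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical helper, not part of Scylla: caches the result of an
// expensive percentile computation.
class cached_percentile {
public:
    using clock = std::chrono::steady_clock;
private:
    std::function<double(double)> _compute;          // the expensive call
    std::optional<clock::time_point> _computed_at;   // empty until first use
    double _for_percentile = 0;
    double _value = 0;
public:
    explicit cached_percentile(std::function<double(double)> compute)
        : _compute(std::move(compute)) {}

    // "now" is passed in so that callers (and tests) control the clock.
    double get(double percentile, clock::time_point now) {
        bool stale = !_computed_at
            || percentile != _for_percentile
            || now - *_computed_at >= std::chrono::seconds(1);
        if (stale) {
            _value = _compute(percentile);
            _for_percentile = percentile;
            _computed_at = now;
        }
        return _value;
    }
};
```

Passing the time point explicitly keeps the sketch deterministic; real code would more likely read a clock internally.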
Gleb Natapov
0842faecef estimated_histogram: fix overflow error handling
Currently overflow values are stored in the wrong bucket (the last one
instead of the special "overflow" one) and the percentile() function throws
if there is an overflow value. The patch fixes the code to store overflow
values in the corresponding bucket and makes percentile() take them into
account instead of throwing.

Message-Id: <20170911131752.27369-2-gleb@scylladb.com>
2017-09-14 10:31:21 +03:00
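As an illustration of the behaviour 0842faecef describes, here is a minimal sketch of a bucketed histogram with a dedicated overflow slot whose percentile() accounts for overflow instead of throwing. It is not Scylla's estimated_histogram: the class tiny_histogram, its bucket layout, and the choice to report an overflowing percentile as the maximum int64_t are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in, not Scylla's estimated_histogram.
class tiny_histogram {
    std::vector<int64_t> _offsets;   // sorted inclusive upper bounds of regular buckets
    std::vector<uint64_t> _buckets;  // one per offset, plus a trailing overflow bucket
public:
    explicit tiny_histogram(std::vector<int64_t> offsets)
        : _offsets(std::move(offsets)), _buckets(_offsets.size() + 1, 0) {}

    void add(int64_t v) {
        // For values above the last offset lower_bound() returns end(),
        // whose index is exactly the overflow slot.
        auto it = std::lower_bound(_offsets.begin(), _offsets.end(), v);
        ++_buckets[it - _offsets.begin()];
    }

    uint64_t overflow_count() const { return _buckets.back(); }

    // Upper bound of the bucket holding the requested percentile (p in [0, 1]).
    // If it falls into the overflow bucket, report the largest int64_t
    // rather than throwing.
    int64_t percentile(double p) const {
        assert(p >= 0.0 && p <= 1.0);
        uint64_t total = std::accumulate(_buckets.begin(), _buckets.end(), uint64_t(0));
        if (total == 0) {
            return 0;
        }
        auto target = static_cast<uint64_t>(std::ceil(p * total));
        uint64_t seen = 0;
        for (std::size_t i = 0; i < _offsets.size(); ++i) {
            seen += _buckets[i];
            if (seen >= target) {
                return _offsets[i];
            }
        }
        return std::numeric_limits<int64_t>::max();
    }
};
```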
Asias He
5ff0b113c9 gossip: Fix indentation in apply_state_locally
Message-Id: <2bdefa8d982ad8da7452b41e894f41d865b83b0b.1505356245.git.asias@scylladb.com>
2017-09-14 10:09:50 +03:00
Tomasz Grabiec
65ca8eebb8 mutation_partition: Print rows_entry's position instead of key
For dummy rows, _key doesn't reflect the right position.

Message-Id: <1505317040-6783-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 20:49:28 +03:00
Avi Kivity
ca8e3c4a78 Merge "Evict from partition snapshots in cache" from Tomasz
"This series fixes the problem of active reads causing OOM due to the fact that
partition snapshots they hold are not evictable. In particular, a single scan
of a partition larger than memory will bad_alloc due to itself.

After this, when partition entry is evicted from cache, data in all the snapshots
is also evicted. We still don't have row-level eviction, but this series lays some
groundwork for it by preparing cache readers for the possibility of rows
being evicted.

Fixes #2775.
Fixes #2730."

* tag 'tgrabiec/snapshot-evicition-in-cache-v1' of github.com:scylladb/seastar-dev:
  tests: Add test for partition_entry::evict()
  mutation_partition: Introduce range continuity checking methods
  mutation_partition: Enable rows_entry::compare() on position_in_partition_views
  tests: Extract mvcc tests to separate file
  tests: row_cache: Add eviction tests
  tests: simple_schema: Add new_tombstone() helper
  tests: streamed_mutation_assertions: Introduce produces(mutation&)
  streamed_mutation: Allow setting buffer capacity
  row_cache: Evict partition snapshots
  mvcc: Introduce partition_entry::evict()
  row_cache: Handle eviction in partition reader
  tests: row_cache_test: Don't assume mvcc snapshots are not evictable
  row_cache: Reuse allocation_strategy::invalidate_references()
  row_cache: Don't invalidate references on insertion
  lsa: Move reclaim counter concept to allocation_strategy level
  mvcc: Ensure partition_snapshot always destroys versions using proper allocator
  mvcc: Encapsulate reference stability check in partition_snapshot
  mvcc: Store LSA region reference in partition_snapshot
2017-09-13 20:48:33 +03:00
Tomasz Grabiec
a45b1ef4bc sstables: Make atomic_deletion_manager logger static
So that it's visible to the framework at boot and --logger-log-level can
be used on it.

Message-Id: <1505286578-21904-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 20:35:41 +03:00
Tomasz Grabiec
b8f62e86de tests: Add test for partition_entry::evict() 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
455a1b0d24 mutation_partition: Introduce range continuity checking methods 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
abc489e99d mutation_partition: Enable rows_entry::compare() on position_in_partition_views
For full symmetry with existing overloads.
2017-09-13 17:47:04 +02:00
Tomasz Grabiec
d76b141b34 tests: Extract mvcc tests to separate file 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
2dfb3b95a5 tests: row_cache: Add eviction tests 2017-09-13 17:47:03 +02:00
Tomasz Grabiec
204ec9c673 tests: simple_schema: Add new_tombstone() helper 2017-09-13 17:47:03 +02:00
Tomasz Grabiec
5b1adfa542 tests: streamed_mutation_assertions: Introduce produces(mutation&) 2017-09-13 17:47:03 +02:00
Tomasz Grabiec
cb16b038ef streamed_mutation: Allow setting buffer capacity
Needed in tests to limit amount of prefetching done by readers, so
that it's easier to test interleaving of various events.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
c78047fa5b row_cache: Evict partition snapshots
If snapshots are not evicted, they may pin an unbounded amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.

Fixes #2775.
Fixes #2730.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
b6ae5783cd mvcc: Introduce partition_entry::evict()
The operation frees as much memory as possible, marking affected
mutation elements as discontinuous.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
fa2c26342c row_cache: Handle eviction in partition reader 2017-09-13 17:38:08 +02:00
Tomasz Grabiec
99aa3d1964 tests: row_cache_test: Don't assume mvcc snapshots are not evictable
The test was not updating the underlying mutation source but still
expecting to get the right data after calling invalidate().  If
snapshots are evictable, that's not guaranteed. Apply to underlying as
well, so data is read from underlying if necessary.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
adb159d51b row_cache: Reuse allocation_strategy::invalidate_references()
Modification count in the tracker is redundant, we can rely on
allocator's invalidation counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
27a3b4bca9 row_cache: Don't invalidate references on insertion
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.

Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
87be474c19 lsa: Move reclaim counter concept to allocation_strategy level
So that generic code can detect invalidation of references. Also, to
allow reusing the same mechanism for signalling external reference
invalidation.
2017-09-13 17:38:08 +02:00
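The idea behind 87be474c19, a counter on the allocator that generic code can compare against to detect that its references may have been invalidated, can be sketched roughly as below. The types allocator_sketch and checked_ref are hypothetical and exist only for this example; the real allocation_strategy interface is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical allocator exposing an invalidation counter.
class allocator_sketch {
    uint64_t _invalidate_counter = 0;
public:
    uint64_t invalidate_counter() const { return _invalidate_counter; }
    // Called whenever objects may have been moved or freed (e.g. on reclaim).
    void invalidate_references() { ++_invalidate_counter; }
};

// A reference that remembers the counter value seen at creation time and
// refuses to hand out the pointer once the counter has moved on.
template <typename T>
class checked_ref {
    const allocator_sketch* _alloc;
    T* _ptr;
    uint64_t _seen;
public:
    checked_ref(const allocator_sketch& alloc, T& obj)
        : _alloc(&alloc), _ptr(&obj), _seen(alloc.invalidate_counter()) {}

    bool valid() const { return _seen == _alloc->invalidate_counter(); }
    T* get() const { return valid() ? _ptr : nullptr; }
};
```

A holder of a checked_ref would re-look-up the object whenever valid() returns false, instead of dereferencing a possibly stale pointer.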
Tomasz Grabiec
4053c801e2 mvcc: Ensure partition_snapshot always destroys versions using proper allocator
partition_snapshot is managed by lw_shared_ptr. Currently it is
assumed that before it dies, maybe_merge_versions() is called on it,
which destroys it in the right allocator context. It's not very
safe. This patch improves safety by using the right allocator in
snapshot's destructor.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
cda86abdbc mvcc: Encapsulate reference stability check in partition_snapshot 2017-09-13 17:38:08 +02:00
Tomasz Grabiec
2df6f356b1 mvcc: Store LSA region reference in partition_snapshot
Will be useful for improving encapsulation.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
4c920c9891 tests: cql_test_env: Use cancel_prior_atomic_deletions()
This fixes a failure in view_schema_test, which starts many instances
of single_node_cql_env. cancel_atomic_deletions() causes later
deletions to fail, which causes some of the test cases to fail.
Message-Id: <1505311250-3118-2-git-send-email-tgrabiec@scylladb.com>
2017-09-13 17:11:34 +03:00
Tomasz Grabiec
dc0860ac70 sstables: Introduce cancel_prior_atomic_deletions()
Like cancel_atomic_deletions() but doesn't fail later deletions.
Message-Id: <1505311250-3118-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 17:11:33 +03:00
Tomasz Grabiec
8a425cedc6 tests: cql_test_env: Cancel pending sstable deletions on shutdown
Fixes a hang on shutdown with --smp 2 in perf_fast_forward. The hang
is in sstables::await_background_jobs_on_all_shards(), which is
waiting on sstable deletions. Not all shards agree to delete certain
sstables, because e.g. not all shards decide to compact them
yet. Cancel those deletes after database is stopped on all shards,
like we do in main.cc

Fixes #2796.

Message-Id: <1505292239-26032-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 11:56:48 +03:00
Asias He
c84dcabb8f gossip: Use boost::copy_range in apply_state_locally
boost::copy_range is better because the vector is allocated with the
correct size instead of growing when the inserter is called.

[avi: also crashes less]

Message-Id: <b19ca92d56ad070fca1e848daa67c00c024e3a4d.1505291199.git.asias@scylladb.com>
2017-09-13 11:33:15 +03:00
Tomasz Grabiec
b3a8ba5af6 gdb: Introduce "scylla find" command
Finds live objects on seastar heap of current shard which contain
given value. Prints results in 'scylla ptr' format.

Example:

  (gdb) scylla find 0x600005321900
  thread 1, small (size <= 512), live (0x6000000f3800 +48)
  thread 1, small (size <= 56), live (0x6000008a1230 +32)

Message-Id: <1505284614-19577-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 11:22:23 +03:00
Asias He
fa9d47c7f3 gossip: Fix a log message typo in compare_endpoint_startup
Message-Id: <c4958950e1108082b63e08ab81ee2177edc9b232.1505286843.git.asias@scylladb.com>
2017-09-13 09:54:56 +02:00
Glauber Costa
ecad1be161 compaction_strategy: add missing header
compaction_strategy.hh throws an exception, but it doesn't add the
exception header. It is working in-tree because of inclusion order,
but it broke one of my yet-out-of-tree changes.

In any case, it is best to add the headers we will need to the files,
and that is what this patch does.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170912233326.26114-1-glauber@scylladb.com>
2017-09-13 08:40:15 +02:00
Raphael S. Carvalho
ef18b1162b sstables/compaction_manager: rename and better explain reshard function
'submit' doesn't properly describe the function. Also improve the explanation
of the relationship between the function itself and its job parameter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170912032034.23043-1-raphaelsc@scylladb.com>
2017-09-12 12:25:17 +03:00
Avi Kivity
1bd207a306 sstables: merge filter.cc into sstables.cc
filter.cc has just two smallish functions, which are part of the sstable
class. Move them to sstables.cc where the rest of the class members are defined.
Message-Id: <20170912080541.7836-1-avi@scylladb.com>
2017-09-12 10:06:52 +02:00
Tomasz Grabiec
423142ec81 tests: row_cache_test: Fix abort in debug mode
The test used apply() variant which assumed that it was invoked in a
seastar thread, which is no longer the case after commit d22fdf4. Fix
by copying outside the cache update, and using the non-deferring apply() variant
for cache update.

Message-Id: <1505200142-3650-1-git-send-email-tgrabiec@scylladb.com>
2017-09-12 10:57:36 +03:00
Tomasz Grabiec
3f527e028d Merge "Reduce dependencies on sstables.hh" from Avi
This patchset reduces includes of sstables.hh, reducing compile time
by both reducing the amount of code compiled, and the amount of
needless recompiles caused by false dependencies.  It does so by
replacing lw_shared_ptr<sstable>, which requires a complete class,
with a new custom type shared_sstable, which allows an incomplete
sstable class definition.

* https://github.com/avikivity/scylla deps2/v2.1
  database: change truncate() to flush while compaction is disabled
  database: make run_with_compaction_disabled() a non-template
  database: add indirection to compaction_manager instance
  database: remove dependency on compaction.hh and compaction_manager.hh
  size_estimates_virtual_reader.hh: add missing include
  system_keyspace: add missing include
  main: add missing include
  storage_service: add missing include
  repair: add missing include
  compaction.hh: add missing include and forward declaration
  compaction_manager: add missing include
  shared_index_lists.hh: add missing include
  perf_fast_forward: add missing include
  sstable_mutation_test: add missing include
  sstables: extract version and format enum into a separate header file
  database.hh: add missing forward declaration for
    foreign_sstable_open_info
  cql_test_env: add forward declaration
  database: make column_family::disable_sstable_write() out-of-line
  sstables: introduce make_sstable() as a shortcut for
    make_lw_shared<sstable>
  treewide: use shared_sstable, make_sstable in place of
    lw_shared_ptr<sstable>
  sstables: use support for lw_shared_ptr with incomplete type for
    shared_sstable
  sstables: reduce dependencies
  streaming: remove unneeded includes
2017-09-12 09:56:46 +02:00
Tomasz Grabiec
ee1e7732a6 database: Create tables with continuous cache
When a table is created, it doesn't contain any data, so we can mark the whole
data range as continuous in cache. This way reads will immediately hit, and
flushes will populate. If sstables are later attached, the attaching process
is supposed to invalidate affected ranges (and it does).

Fixes #2536.

Message-Id: <1505200269-4031-1-git-send-email-tgrabiec@scylladb.com>
2017-09-12 10:53:07 +03:00
Avi Kivity
85a6a2b3cb streaming: remove unneeded includes 2017-09-12 10:43:39 +03:00
Avi Kivity
578bf55371 sstables: reduce dependencies
Use shared_sstable.hh instead of sstables.hh.
2017-09-12 10:43:36 +03:00
Avi Kivity
07feaf9c4c sstables: use support for lw_shared_ptr with incomplete type for shared_sstable
Use the lw_shared_ptr deleter support to define shared_sstable without
pulling the definition of class sstable, reducing compile time and
dependencies if only shared_sstable is needed.
2017-09-12 10:43:05 +03:00
Avi Kivity
f7023501d6 treewide: use shared_sstable, make_sstable in place of lw_shared_ptr<sstable>
Since shared_sstable is going to be its own type soon, we can't use the old alias.
2017-09-12 10:43:05 +03:00
Avi Kivity
1a3cdffbc1 sstables: introduce make_sstable() as a shortcut for make_lw_shared<sstable>
shared_sstable will soon not be an alias for lw_shared_ptr<sstable>, so we need
another factory function.
2017-09-12 10:43:05 +03:00
Avi Kivity
88b91c84a1 database: make column_family::disable_sstable_write() out-of-line
Reduces dependencies.
2017-09-12 10:43:05 +03:00
Avi Kivity
02028df9b1 cql_test_env: add forward declaration
Not worthwhile to add a new #include for this.
2017-09-12 10:43:05 +03:00
Avi Kivity
02e5bf1c20 database.hh: add missing forward declaration for foreign_sstable_open_info
Supplied by an incidental include now, but it will be gone soon.
2017-09-12 10:43:05 +03:00
Avi Kivity
c4bafd912c sstables: extract version and format enum into a separate header file
This allows removing a dependency on sstables.hh later on.
2017-09-12 10:43:05 +03:00
Avi Kivity
5ebb15b9d4 sstable_mutation_test: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
fdab47ab32 perf_fast_forward: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
ca2d0b4efb shared_index_lists.hh: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
eb62b2c00d compaction_manager: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
0efa444a56 compaction.hh: add missing includes 2017-09-12 10:42:45 +03:00
Avi Kivity
7ca029c8f1 database_fwd.hh: add column_family forward declaration 2017-09-12 10:41:28 +03:00
Avi Kivity
4751402709 build: disable -fsanitize-address-use-after-scope on CqlParser.o
The parser generator somehow confuses the use-after-scope sanitizer, causing it
to use large amounts of stack space. Disable that sanitizer on that file.
Message-Id: <20170905110628.18047-1-avi@scylladb.com>
2017-09-11 19:42:26 +02:00
Avi Kivity
43a72254ff repair: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
aebab377d9 storage_service: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
a3b8089bd4 main: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
0aaefe665b system_keyspace: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
d3cde2e2be size_estimates_virtual_reader.hh: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
9b540eccb0 database: remove dependency on compaction.hh and compaction_manager.hh 2017-09-11 20:09:45 +03:00
Avi Kivity
f9c8c1ddc2 database: add indirection to compaction_manager instance
Allows making it forward-declared later on, reducing dependencies.
2017-09-11 20:09:45 +03:00
Avi Kivity
9d0aaa941a database: make run_with_compaction_disabled() a non-template
Allows reducing dependencies down the line, and un-templating
non-performance-critical functions is a good thing.
2017-09-11 20:09:45 +03:00
Avi Kivity
6b5514a3df database: change truncate() to flush while compaction is disabled
In preparation to make run_with_compaction_disabled() a non-template,
we want to remove any non-copyable captures (so the function can be
an std::function, which requires copyability). Move the flush within
the compaction disabled region. This changes the behavior, but it shouldn't
matter.
2017-09-11 20:09:45 +03:00
Avi Kivity
14fd4168dc Merge seastar upstream
* seastar 31b925d...92fdce2 (3):
  > shared_ptr: allow incomplete classes in lw_shared_ptr<>
  > Update DPDK to 17.05
  > future: pass func as mutable to lambda arg of handle_exception[_type]
2017-09-11 20:09:04 +03:00
Tomasz Grabiec
95b3eaac97 debug: Allow running scylla_row_cache_report.stp script against a running process
Message-Id: <1504776359-16424-1-git-send-email-tgrabiec@scylladb.com>
2017-09-11 14:17:30 +03:00
Avi Kivity
fe019ad84d Merge "Refuse to load non-Scylla counter sstables" from Paweł
"These patches make Scylla refuse to load counter sstables that may
contain unsupported counter shards. They are recognised by the lack of
the Scylla component.

Fixes #2766."

* tag 'reject-non-scylla-counter-sstables/v1' of https://github.com/pdziepak/scylla:
  db: reject non-Scylla counter sstables in flush_upload_dir
  db: disallow loading non-Scylla counter sstables
  sstable: add has_scylla_component()
2017-09-11 13:28:44 +03:00
Tzach Livyatan
83eab5c8d7 Remove comment about Too high number of concurrent compactions from scylla_compaction_manager_compactions help
It should never happen, and it's not clear what "too high" stands for.

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170911085645.21222-1-tzach@scylladb.com>
2017-09-11 13:27:35 +03:00
Gleb Natapov
d0d8bdf615 storage_proxy: remove unused parameter from get_restricted_ranges() function
Message-Id: <20170911084653.GH24167@scylladb.com>
2017-09-11 11:58:44 +02:00
Gleb Natapov
f66e9377d4 storage_proxy: do not keep reference to a keyspace during write
A keyspace can be deleted while write is ongoing, so the object cannot
be used after defer point. The keyspace reference is only used to check
how many replies a write operation should wait for and this can be
precalculated during write handler creation.

Fixes #2777

Message-Id: <20170911084436.GG24167@scylladb.com>
2017-09-11 11:57:00 +02:00
Asias He
bb9dbc5ade storage_service: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <d2f15007a54554b58e29fd05331c06ae030d582f.1504832296.git.asias@scylladb.com>
2017-09-10 18:10:24 +03:00
Botond Dénes
9ebeb9d5ce Fix -Wreturn-type warnings in tests: use abort() instead of assert(0)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <95927f933411302e84d57d169ee0147def7bc643.1504890922.git.bdenes@scylladb.com>
2017-09-10 17:09:53 +03:00
Gleb Natapov
9137446109 api: uses correct statistics for storage proxy range histograms.
Message-Id: <20170910073458.GB1870@scylladb.com>
2017-09-10 16:18:36 +03:00
Pekka Enberg
d2632ddf1d Merge "gossip: optimize apply_state_locally for large cluster" from Asias
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies the gossip node state. In #2404, the joining node
failed to bootstrap, because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in the gossip ack/ack2 message before applying the next one,
and we apply the state of the seed node earlier than non-seed node so we can
have the seed node's state faster. We also add some randomness to the order of
applying gossip node state so that some nodes' state is not always
applied earlier than the others'.

This series improves apply_state_locally for large clusters:

 - Tune the order of applying endpoint_state
 - Serialize apply_state_locally
 - Avoid copying of the gossip state map

Fixes #2404"

* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
  gossip: Avoid copying with apply_state_locally
  gossip: Serialize apply_state_locally
  gossip: Tune the order of applying endpoint_state in apply_state_locally
  gossip: Introduce is_seed helper
  gossip: Pass const endpoint_state& in notify_failure_detector
  gossip: Pass reference in notify_failure_detector
2017-09-08 11:41:43 +03:00
Asias He
57dd3cb2c5 gossip: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <52c24d7dfe082ee926f065a6268d83fcb31ddc28.1504832289.git.asias@scylladb.com>
2017-09-08 10:59:42 +03:00
Asias He
e98ce7887b gossip: Avoid copying with apply_state_locally
Move the std::map<inet_address, endpoint_state> map from the gossip
ack/ack2 message directly and move it around in apply_state_locally to
avoid copying the map.
2017-09-08 15:19:48 +08:00
Asias He
fd879b4e09 gossip: Serialize apply_state_locally
apply_state_locally will be called when gossip ack/ack2
message is received. It will use the std::map<inet_address,
endpoint_state>& map to update the endpoint state.

However, we can receive multiple such gossip ack/ack2 messages from
multiple peer nodes in parallel. Currently, we process them in parallel.
It is better to apply all the states from one node and then move on to apply
all the states from another node than to interleave them, because it is more
important to have the state of the whole cluster than to have a bit
newer state from another peer (if it is newer), especially when the node
boots up and runs its first round of gossip exchange.

After this patch, we apply the whole gossip state contained in the
gossip ack/ack2 message before applying the next one.
2017-09-08 15:19:47 +08:00
Asias He
9ccba950ba gossip: Tune the order of applying endpoint_state in apply_state_locally
We currently always apply the endpoint_state in the order of the
endpoint ip address. This is not good because some of the endpoint's
state is always applied earlier than the others.

In a large cluster, the number of endpoints can be large, and it takes time to
apply all of them. To make it fairer, we apply the endpoint_state in
random order.

Apply the seed node's state earlier because in bootstrap, we will check
if we have seen the seed node in storage_service::bootstrap. In #2404,
the bootstrap failed because the joining node hadn't applied the seed
node's state when storage_service::bootstrap ran.
2017-09-08 15:19:47 +08:00
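The ordering policy from 9ccba950ba (seed nodes first, everything else in random order) can be sketched like this. It is an illustration under simplifying assumptions, not the gossiper code: endpoints are plain strings rather than inet_address, and application_order is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <random>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the order in which endpoint states should be
// applied. Seeds come first; within each group the order is randomized so
// that no endpoint is systematically favoured by its address.
template <typename State, typename Rng>
std::vector<std::string> application_order(const std::map<std::string, State>& states,
                                           const std::set<std::string>& seeds,
                                           Rng& rng) {
    std::vector<std::string> order;
    order.reserve(states.size());
    for (const auto& kv : states) {
        order.push_back(kv.first);
    }
    std::shuffle(order.begin(), order.end(), rng);
    // stable_partition moves seeds to the front while keeping the shuffled
    // relative order inside the seed and non-seed groups.
    std::stable_partition(order.begin(), order.end(),
                          [&seeds](const std::string& ep) { return seeds.count(ep) != 0; });
    return order;
}
```

Shuffling before partitioning means the seeds are also applied in random order among themselves.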
Asias He
c5456ed38f gossip: Introduce is_seed helper
To check if an endpoint is a seed node.
2017-09-08 15:19:47 +08:00
Asias He
32edd95241 gossip: Pass const endpoint_state& in notify_failure_detector 2017-09-08 15:19:47 +08:00
Asias He
46e562cbfa gossip: Pass reference in notify_failure_detector
In a large cluster, the map can be large. Pass a reference to avoid copying.
2017-09-08 15:19:47 +08:00
Glauber Costa
db846326f8 compaction: remove dead code
This code has no more users. Bury it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170908005305.29925-1-glauber@scylladb.com>
2017-09-08 08:17:15 +02:00
Tomasz Grabiec
57dc988475 Update seastar submodule
* seastar 85ca12d...31b925d (19):
  > net/byteorder: fix 64 bit ntohq and htonq on big endian machines
  > core, util: fix compilation on non-x86 processors
  > core/memory: Fix SIGSEGV in small_pool::add_more_objects()
  > log: remove debug leftovers
  > Merge "TLS state machine fixes" from Calle
  > logger: allow adjusting the timestamp style for stdout logs
  > thread: make thread_context::s_main portable
  > core: add seastar::cache_line_size constant
  > Add detach() to input_stream and output_stream
  > Install dependencies for Arch Linux.
  > tls: Guard non-established sockets in sesrefs + more explicit close + states
  > tls: Make vec_push fully exception safe
  > basic_sstring: resize uses sstring
  > Merge "Add and correct unit tests" from Jesse
  > tcp: enforce 1-byte maximum segment invariant with zero window
  > tcp: verify 1-byte maximum segment invariant during send with zero window
  > memory: reduce small_pool vulnerability to fragmentation further
  > Prometheus: avoid merging all metrics family
  > net: Fix possible NULL pointer dereference.
2017-09-07 10:34:27 +02:00
Avi Kivity
d9ee2ad9f0 chunked_vector: avoid boost::small_vector with old boost versions
Apparently older boost versions have a bug resulting in a double-free
in boost::container::small_vector. Use std::vector instead.

Fixes #2748.

Tested-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20170903170207.21635-1-avi@scylladb.com>
2017-09-07 09:32:51 +03:00
Tomasz Grabiec
121cd8cb6c tests: Fix cql_query_test.cc::test_duration_restrictions
validate_request_failure() assumed that the future returned by execute_cql()
is always ready, which doesn't have to be the case, and caused aborts
in debug mode build.

Message-Id: <1504701342-13300-1-git-send-email-tgrabiec@scylladb.com>
2017-09-06 15:49:03 +03:00
Tomasz Grabiec
3986486cb3 tests: cql_test_env: Avoid exceptions to make debugging easier
Message-Id: <1504701375-13491-1-git-send-email-tgrabiec@scylladb.com>
2017-09-06 15:48:59 +03:00
Paweł Dziepak
e401d2d50b db: reject non-Scylla counter sstables in flush_upload_dir
Scylla already refuses to load counter sstables that do not have the Scylla
component. However, if this happens because of the 'nodetool refresh'
command, the existing protection will trigger after the sstables have been
moved to the data directory. This is too late, so an additional check
is added when the upload directory is scanned.
2017-09-06 12:04:26 +01:00
Paweł Dziepak
6a5e8bace1 db: disallow loading non-Scylla counter sstables
Scylla does not support local and remote counter shards. This means that
it is unsafe to directly load sstables that may contain them.
2017-09-06 12:03:58 +01:00
Paweł Dziepak
ebc538f4a3 sstable: add has_scylla_component()
has_scylla_component() is going to be used to verify that an sstable has
been generated by a recent version of Scylla. This would make it
possible to reject sstables that may be unsafe to load (e.g. sstables
containing legacy counter shards).
2017-09-06 12:03:45 +01:00
Avi Kivity
a59e375aad Merge "Support termination of repair jobs" from Asias
"This series implements the missing API to terminate all repairs.

For example:

$ curl -X POST  --header "Accept: application/json"
"http://127.0.0.1:10000/storage_service/force_terminate_repair"

With the new stream_plan::abort() api we can now abort the stream
session associated with the repair as well.

On top of this, we can support termination of a single repair job instead of all
jobs.

Fixes #2105"

* tag 'asisas/repair_abort_v4' of github.com:scylladb/seastar-dev:
  repair: Support termination of repair jobs
  repair: Track repair_info
  repair: Intorduce repair id to repair_info map
  api: Add force_terminate_repair API
  streaming: Add abort to stream_plan
  streaming: Add abort_all_stream_sessions for stream_coordinator
  streaming: Introduce streaming::abort()
  streaming: Make stream_manager and coordinator message debug level
  streaming: Check if _stream_result is valid
  streaming: Log peer address in on_error
  streaming: Introduce received_failed_complete_message
2017-09-06 12:58:05 +03:00
Avi Kivity
31706ba989 Merge "Fix Scylla upgrades when counters are used" from Paweł
"Scylla 1.7.4 and older use incorrect ordering of counter shards, this
was fixed in 0d87f3dd7d ("utils::UUID:
operator< should behave as comparison of hex strings/bytes"). However,
that patch was not backported to 1.7 branch until very recently. This
means that versions 1.7.4 and older emit counter shards in an incorrect
order and expect them to be so. This is particularly bad when dealing
with imported correct sstables in which case some shards may become
duplicated.

The solution implemented in this patch is to allow any order of counter
shards and automatically merge all duplicates. The code is written in a
way so that the correct ordering is expected in the fast path in order
not to excessively punish unaffected deployments.

A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available, Scylla still writes sstables and sends on-wire counters using
the old ordering so that they can be correctly understood by 1.7.4. Once
the flag becomes available, Scylla switches to the correct order.

Fixes #2752."

* tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla:
  tests/counter: verify counter_id ordering
  counter: check that utils::UUID uses int64_t
  mutation_partition_serializer: use old counter ordering if necessary
  mutation_partition_view: do not expect counter shards to be sorted
  sstables: write counter shards in the order expected by the cluster
  tests/sstables: add storage_service_for_tests to counter write test
  tests/sstables: add test for reading wrong-order counter cells
  sstables: do not expect counter shards to be sorted
  storage_service: introduce CORRECT_COUNTER_ORDER feature
  tests/counter: test 1.7.4 compatible shard ordering
  counters: add helper for retrieving shards in 1.7.4 order
  tests/counter: add tests for 1.7.4 counter shard order
  counters: add counter id comparator compatible with Scylla 1.7.4
  tests/counter: verify order of counter shards
  tests/counter: add test for sorting and deduplicating shards
  counters: add function for sorting and deduplicating counter cells
  counters: add counter_id::operator>
2017-09-05 14:20:55 +03:00
Paweł Dziepak
ed68a75b75 tests/counter: verify counter_id ordering 2017-09-05 10:52:54 +01:00
Paweł Dziepak
cdf7ba76f1 counter: check that utils::UUID uses int64_t 2017-09-05 10:46:03 +01:00
Paweł Dziepak
4aa72c6454 mutation_partition_serializer: use old counter ordering if necessary
Until the cluster is fully upgraded from a version that uses the
incorrect counter shard ordering it is essential to keep using it lest
the old nodes corrupt the data upon receiving mutations with a counter
shard ordering they do not expect.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
b540516e5e mutation_partition_view: do not expect counter shards to be sorted 2017-09-05 10:32:48 +01:00
Paweł Dziepak
84edb5a1f2 sstables: write counter shards in the order expected by the cluster
If the feature signaling that we have switched to the correct ordering
of counter shards is not enabled it means that the user still can do a
rollback to a version that expects wrong ordering. In order to avoid any
disasters when that happens write sstables using the 1.7.4 order until
we know for sure that it is no longer needed.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
2b614201a7 tests/sstables: add storage_service_for_tests to counter write test
Writing counters to an sstable is going to require cluster feature
information, which requires accessing some singletons.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
5007c9290a tests/sstables: add test for reading wrong-order counter cells 2017-09-05 10:32:48 +01:00
Paweł Dziepak
3e1d09e71d sstables: do not expect counter shards to be sorted 2017-09-05 10:32:48 +01:00
Paweł Dziepak
ecd2bf128b storage_service: introduce CORRECT_COUNTER_ORDER feature
Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix
this problem a new feature is introduced that will be used to determine
when nodes with that bug fixed can start sending counter shards in the
correct order.
2017-09-05 10:32:48 +01:00
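As a rough illustration of the feature gating described in ecd2bf128b, the sketch below selects a shard comparator depending on whether the cluster-wide feature is enabled. The type `id128`, both comparators and `pick_shard_order` are hypothetical names, and the log does not say what the 1.7.4 ordering actually was; the signed versus unsigned split is only an assumed stand-in for "legacy" versus "corrected".

```cpp
#include <cstdint>

// Hypothetical 128-bit id split into two 64-bit halves.
struct id128 {
    std::int64_t msb;
    std::int64_t lsb;
};

// Placeholder for the legacy (1.7.4-compatible) order: halves compared as signed.
// ASSUMPTION: the real legacy order is not described in the commit log.
inline bool legacy_order_less(const id128& a, const id128& b) {
    if (a.msb != b.msb) return a.msb < b.msb;
    return a.lsb < b.lsb;
}

// Placeholder for the corrected order: halves compared as unsigned.
inline bool corrected_order_less(const id128& a, const id128& b) {
    auto u = [](std::int64_t v) { return static_cast<std::uint64_t>(v); };
    if (a.msb != b.msb) return u(a.msb) < u(b.msb);
    return u(a.lsb) < u(b.lsb);
}

using id_less = bool (*)(const id128&, const id128&);

// Keep using the legacy order until every node advertises the feature.
inline id_less pick_shard_order(bool correct_counter_order_enabled) {
    return correct_counter_order_enabled ? corrected_order_less : legacy_order_less;
}
```

The point of the pattern is that the switch is driven by cluster state rather than by the local binary version, so a partially upgraded cluster keeps exchanging data in the order the old nodes expect.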
Paweł Dziepak
1e03c4acbe tests/counter: test 1.7.4 compatible shard ordering 2017-09-05 10:32:47 +01:00
Paweł Dziepak
067e429881 counters: add helper for retrieving shards in 1.7.4 order 2017-09-05 10:32:47 +01:00
Paweł Dziepak
fd25a09db2 tests/counter: add tests for 1.7.4 counter shard order 2017-09-05 10:32:47 +01:00
Paweł Dziepak
a93e8ce185 counters: add counter id comparator compatible with Scylla 1.7.4 2017-09-05 10:32:47 +01:00
Paweł Dziepak
b0f67c1680 tests/counter: verify order of counter shards 2017-09-05 10:32:47 +01:00
Paweł Dziepak
27397b5dad tests/counter: add test for sorting and deduplicating shards 2017-09-05 10:32:47 +01:00
Paweł Dziepak
e0c2379f26 counters: add function for sorting and deduplicating counter cells
Due to a bug in an implementation of UUID less compare some Scylla
versions sort counter shards in an incorrect order. Moreover, when
dealing with imported correct data the inconsistencies in ordering
caused some counter shards to become duplicated.
2017-09-05 10:32:39 +01:00
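The sort-and-deduplicate step named in e0c2379f26 can be sketched as follows. The `shard` struct and `sort_and_dedup` are hypothetical simplifications (the real shard id is UUID-based), and keeping the entry with the highest logical clock for a duplicated id is an assumed resolution rule, not one stated in the log.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified counter shard.
struct shard {
    std::uint64_t id;             // stand-in for counter_id
    std::int64_t value;
    std::int64_t logical_clock;
};

// Sort shards by id; for duplicated ids keep the one with the highest
// logical clock (ASSUMED rule, chosen only for illustration).
inline void sort_and_dedup(std::vector<shard>& shards) {
    std::sort(shards.begin(), shards.end(), [](const shard& a, const shard& b) {
        if (a.id != b.id) return a.id < b.id;
        return a.logical_clock > b.logical_clock;  // newest first within an id
    });
    // std::unique keeps the first element of each run of equal ids,
    // which is the newest one thanks to the ordering above.
    auto last = std::unique(shards.begin(), shards.end(),
                            [](const shard& a, const shard& b) { return a.id == b.id; });
    shards.erase(last, shards.end());
}
```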
Paweł Dziepak
74af818eaf counters: add counter_id::operator> 2017-09-04 18:25:47 +01:00
Avi Kivity
4b06a2e95d Merge "Fix exception safety in cache update related paths" from Tomasz
* 'tgrabiec/make-row-cache-update-exception-safe' of github.com:scylladb/seastar-dev:
  row_cache: Improve safety of cache updates
  row_cache: Extract invalidate_sync()
  memtable: Mark mark_flushed() as noexcept
  database: Add non-throwing try_trigger_compaction()
  database: Make add_sstable() have strong exception guarantees
  row_cache: Don't require presence checker to be supplied externally
  database: Supply presence checker in sstable snapshots
  mutation_source: Introduce mutation_source::make_partition_presence_checker()
  mutation_reader: Move definitions up in the header
  mutation_reader: Use constructor delegation to reduce code duplication
  row_cache: Make populate() preserve continuity
  row_cache: Allow marking as fully continuous on construction
  database: Add missing serialization of sstable set update and cache invalidation
2017-09-04 18:37:42 +03:00
Tomasz Grabiec
d22fdf4261 row_cache: Improve safety of cache updates
Cache imposes requirements on how updates to the on-disk mutation source
are made:
  1) each change to the on-disk mutation source must be followed
     by cache synchronization reflecting that change
  2) The two must be serialized with other synchronizations
  3) must have strong failure guarantees (atomicity)

Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.

Normally cache synchronization achieves the no-failure guarantee by wiping
the cache (which is noexcept) in case a failure is detected. There are some
setup steps, however, which cannot be skipped, e.g. taking a lock
followed by switching cache to use the new snapshot. That truly cannot
fail.  The lock inside cache synchronizers is redundant, since the
user needs to take it anyway around the combined operation.

In order to make ensuring strong exception guarantees easier, and to
make the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modification of the underlying mutation
source when invoked.

This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.

The failure semantics of update() and other synchronizers needed to
change due to strong exception guarantees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.

The database::_cache_update_sem goes away, serialization is done
internally by the cache.

The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.

Fixes #2754.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
2017-09-04 10:04:29 +02:00
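The "update goes through the cache" pattern from d22fdf4261 might look roughly like the sketch below. `tiny_cache` and its members are hypothetical; the real code is asynchronous and built on Seastar primitives, whereas this sketch uses a plain `std::mutex` purely to show the ordering: take the lock, run the external updater, then synchronize the cache, falling back to a wipe if synchronization fails.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical miniature of a cache that owns the combined update.
class tiny_cache {
    std::mutex _update_lock;     // serializes all synchronizations
    std::vector<int> _cached;    // stand-in for cached partitions
public:
    // external_updater must either fully apply its change to the
    // underlying source or throw without side effects.
    void update(const std::function<void()>& external_updater, int new_entry) {
        std::lock_guard<std::mutex> guard(_update_lock);
        external_updater();      // if this throws, neither side has changed
        try {
            _cached.push_back(new_entry);
        } catch (...) {
            _cached.clear();     // an empty cache is always consistent
        }
    }
    std::size_t cached_entries() const { return _cached.size(); }
};
```

Because the callback runs under the cache's own lock, callers no longer need a separate semaphore around the combined operation, which is the role database::_cache_update_sem played before this patch.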
Tomasz Grabiec
b0f3efa577 row_cache: Extract invalidate_sync() 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
673a22f8e1 memtable: Mark mark_flushed() as noexcept
Callers rely on that.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
bf75b882ae database: Add non-throwing try_trigger_compaction() 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
116d4ae02b database: Make add_sstable() have strong exception guarantees
If insert() fails, we left the database with stats updated, but
sstable not being attached.
2017-09-04 10:04:29 +02:00
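The ordering issue fixed in 116d4ae02b can be illustrated with a small sketch: perform the step that may throw first, and only then touch the statistics. `tiny_table` and this `add_sstable` signature are hypothetical and much simpler than the real database code.

```cpp
#include <cstdint>
#include <set>

// Hypothetical stand-in for a table's sstable set and its statistics.
struct tiny_table {
    std::set<int> sstables;             // sstable generations
    std::uint64_t live_disk_space = 0;  // statistic derived from the set
};

// Strong exception guarantee: the insert, which may throw on allocation
// failure, happens before any statistic is modified.
inline void add_sstable(tiny_table& t, int generation, std::uint64_t bytes) {
    t.sstables.insert(generation);  // may throw; nothing else changed yet
    t.live_disk_space += bytes;     // plain arithmetic, cannot throw
}
```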
Tomasz Grabiec
56e3ce05db row_cache: Don't require presence checker to be supplied externally
The API is simpler and safer this way.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
df787afe6a database: Supply presence checker in sstable snapshots 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
8a9f0f86e7 mutation_source: Introduce mutation_source::make_partition_presence_checker()
Every mutation source can have a presence checker. By default all
answer "maybe contains".

Having this on mutation_source level will be useful for simplifying
cache update flow. The cache can ask the right snapshot for a presence
checker rather than relying on database to know when and how to make
the right one which preserves all invariants.

This will be especially useful once all updates of the underlying
mutation source of cache (e.g. sstable list) will have to go through
cache for safety reasons.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
065feb1b7b mutation_reader: Move definitions up in the header 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
4e4839082b mutation_reader: Use constructor delegation to reduce code duplication 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
1a2f17d42c row_cache: Make populate() preserve continuity 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
bc3112a187 row_cache: Allow marking as fully continuous on construction
Will be needed in tests.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
ab8632b225 database: Add missing serialization of sstable set update and cache invalidation
Commit e3ad676433 missed a few places.

It is required to serialize sstable list update and cache synchronization
in order to preserve partition update isolation.

Fixes #2746.
2017-09-04 10:04:29 +02:00
Piotr Jastrzebski
dd5dc75605 Stop calling _local_cache.stop in at_exit.
This removes a race condition that was causing #2721

Fixes #2721

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ad060fab43d63c17db9f811c421d7ab26e5e57c8.1503933021.git.piotr@scylladb.com>
2017-09-03 15:55:48 +03:00
Asias He
e14bb7b1d5 repair: Remove #if'ed code in repair_ranges
It is unlikely we will use parallel_for_each version in repair_ranges.
Get rid of the dead code.

Message-Id: <31a9366adfe0262512a326ef9703aa0bba05e1fb.1503996138.git.asias@scylladb.com>
2017-09-03 11:13:02 +03:00
Avi Kivity
0524cbbd72 Merge db/config.cc cleanups from Jesse
* 'jhk/config_hygiene/v1' of https://github.com/hakuch/scylla:
  db/config.cc: Clarify documentation for `typed_value_ex`
  db/config.cc: Fix formatting and warnings
  db/config.cc: Remove unnecessary `mutable` on lambdas
  db/config.cc: Remove unused variables
2017-09-03 11:08:53 +03:00
Botond Dénes
a980ff6463 Use abort() instead of assert + throw in unreachable code
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <393c3730111dfe090c44d8fc2e31602956a7d008.1504022425.git.bdenes@scylladb.com>
2017-09-03 11:07:27 +03:00
Raphael S. Carvalho
22701346de sstables/stcs: avoid needless copy of bucket in get_buckets()
In addition, remove bucket by iterator which is faster.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170903000315.16338-1-raphaelsc@scylladb.com>
2017-09-03 10:46:48 +03:00
Avi Kivity
551eb75eb0 Update AMI submodule
* dist/ami/files/scylla-ami b41e5eb...5ffa449 (3):
  > amzn-main.repo: stick to Amazon Linux 2017.03 kernel (4.9.x)
  > Prevent dependency error on 'yum update'
  > scylla_create_devices: don't raise error when no disks found
2017-08-31 15:13:52 +03:00
Glauber Costa
e642aee3f7 database: wait for asynchronous operations to end before closing CF
This was part of "add gate for generic async operations to column family" but
somehow didn't make it into the final patch.

Add the missing piece.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170830164205.4497-1-glauber@scylladb.com>
2017-08-31 11:16:30 +03:00
Avi Kivity
23d3ca56a1 Merge "optional integrity checker of sstable component writes" from Raphael
"optional interposer that will check integrity of writes to sstable components.
The option name is enable_sstable_data_integrity_check, it's disabled by default
and can be enabled via config file.
It will provide enough details that will help to find the root of the issue.

If a disk failed, for example, something like the following would be reported:
ERROR 2017-08-17 09:18:11,577 [shard 0] sstable - integrity check failed for ./data/data/system_schema/aggregates-924c55872e3a345bb10c12f37c1ba895/system_schema-aggregates-ka-111-Scylla.db, stage: read after write verification, write: 4096 bytes to offset 0, reason: data read from underlying storage isn't the same as written, mismatch at byte 0:
 data written sample:   10000001000000010000001a00000001
 date read sample:      00000000000000000000000000000000"

* 'integrity_check_interposer_v3' of github.com:raphaelsc/scylla:
  sstables: optionally check integrity of sstable component writes
  sstables: remove unneeded new_sstable_component_file variant
  db/config: add sstable_data_integrity_check option
  sstables: introduce file interposer for integrity check
2017-08-31 11:08:12 +03:00
Raphael S. Carvalho
a84fbde8c8 sstables: optionally check integrity of sstable component writes
If config file's sstable_data_integrity_check option is enabled,
new integrity check interposer will be used in addition to the
existing one. Performance is expected to drop because of all
the integrity checks for every write. This new interposer will
provide detailed info when integrity check fails.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-31 02:27:50 -03:00
Raphael S. Carvalho
04ea4daa7e sstables: remove unneeded new_sstable_component_file variant
can get rid of it because file_open_options is optional in
reactor::open_file_dma()

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-31 02:17:34 -03:00
Raphael S. Carvalho
0218d6fd8f db/config: add sstable_data_integrity_check option
If enabled, interposer for checking integrity of sstable component
writes will be used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-30 13:57:08 -03:00
Raphael S. Carvalho
f76b609cf5 sstables: introduce file interposer for integrity check
optional interposer that will check integrity when writing to
sstable components. It will provide enough details that will
help to find the root of the issue, which may come from lower
level layers.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-30 11:55:36 -03:00
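The "read after write verification" stage mentioned in the sample error message boils down to comparing what was written with what is read back and reporting the first differing byte. The helper below is a hypothetical in-memory sketch of that comparison only; it does not reproduce the interposer's file I/O or its actual interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the first byte that differs between the written
// buffer and the buffer read back, or -1 if they are identical.
// A length difference counts as a mismatch at the end of the shorter buffer.
inline std::ptrdiff_t first_mismatch(const std::vector<unsigned char>& written,
                                     const std::vector<unsigned char>& read_back) {
    const std::size_t common = std::min(written.size(), read_back.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (written[i] != read_back[i]) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    if (written.size() != read_back.size()) {
        return static_cast<std::ptrdiff_t>(common);
    }
    return -1;
}
```

With the leading bytes of the samples quoted in the merge message (0x10 written, 0x00 read back), this reports a mismatch at byte 0, matching the quoted log line.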
Jesse Haber-Kucharsky
eddf34d005 test.py: Add missing tests
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <6fc5e810495801e646ccc41c16b581c8eceeda22.1504030666.git.jhaberku@scylladb.com>
2017-08-30 09:58:12 +01:00
Asias He
471e8b341f repair: Support termination of repair jobs
This patch implements the missing API to terminate all repairs.

For example:

$ curl -X POST --header "Accept: application/json" \
  "http://127.0.0.2:10000/storage_service/force_terminate_repair"

With the new stream_plan::abort() api we can now abort the stream
session associated with the repair as well.

Fixes #2105
2017-08-30 15:19:52 +08:00
Asias He
07d9dc03ec repair: Track repair_info
Make repair_info a shared pointer and store them in _repairs map so we
can find by the repair id and access them later.
2017-08-30 15:19:52 +08:00
Asias He
5c9732c645 repair: Introduce repair id to repair_info map
The maps are stored in a vector. The vector has smp::count elements, each
element will be accessed by only one shard.

The add_repair_info, remove_repair_info and get_repair_info helpers
are added.
2017-08-30 15:19:51 +08:00
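The per-shard layout described in 5c9732c645 (one map per shard, held in a vector with one element per shard) can be sketched like this. The helper names come from the commit message, but their signatures, the `repair_info` fields and the `repair_tracker` wrapper are assumptions made for illustration.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, minimal repair_info.
struct repair_info {
    int id = 0;
    bool aborted = false;
};

// One map per shard; each map is meant to be touched only by its own shard,
// so no locking is needed in this model.
class repair_tracker {
    std::vector<std::unordered_map<int, std::shared_ptr<repair_info>>> _repairs;
public:
    explicit repair_tracker(std::size_t shard_count) : _repairs(shard_count) {}

    void add_repair_info(std::size_t shard, std::shared_ptr<repair_info> ri) {
        const int id = ri->id;
        _repairs.at(shard).emplace(id, std::move(ri));
    }
    void remove_repair_info(std::size_t shard, int id) {
        _repairs.at(shard).erase(id);
    }
    std::shared_ptr<repair_info> get_repair_info(std::size_t shard, int id) const {
        const auto& m = _repairs.at(shard);
        auto it = m.find(id);
        if (it == m.end()) {
            return nullptr;
        }
        return it->second;
    }
};
```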
Asias He
6dc62c6215 api: Add force_terminate_repair API
The api /storage_service/force_terminate is supposed to be
/storage_service/force_terminate_repair.

scylla-jmx uses /storage_service/force_terminate api.
So instead of renaming it, it is better to add a new name for it.
2017-08-30 15:19:51 +08:00
Asias He
9c8da2cc56 streaming: Add abort to stream_plan
It can be used by the user of stream_plan to abort the stream sessions.
Repair will be the first user when aborting the repair.
2017-08-30 15:19:51 +08:00
Asias He
475b7a7f1c streaming: Add abort_all_stream_sessions for stream_coordinator
It will abort all the sessions within the stream_coordinator.
It will be used by stream_plan soon.
2017-08-30 15:19:51 +08:00
Asias He
fad34801bf streaming: Introduce streaming::abort()
It will be used soon by stream_plan::abort() to abort a stream session.
2017-08-30 15:19:50 +08:00
Asias He
7fba7cca01 streaming: Make stream_manager and coordinator message debug level
When we abort a session, it is possible that:

node 1 aborts the session by user request
node 1 sends the complete_message to node 2
node 2 aborts the session upon receipt of the complete_message
node 1 sends one more stream message to node 2 and the stream_manager
for the session cannot be found.

It is fine for node 2 to be unable to find the stream_manager, so make the
log on node 2 less verbose to confuse the user less.
2017-08-30 15:19:50 +08:00
Asias He
be573bcafb streaming: Check if _stream_result is valid
If on_error() was called before init() was executed, the
_stream_result can be invalid.
2017-08-30 15:19:44 +08:00
Asias He
8a3f6acdd2 streaming: Log peer address in on_error 2017-08-30 15:18:27 +08:00
Asias He
eace5fc6e8 streaming: Introduce received_failed_complete_message
It is the handler for the failed complete message. Add a flag to
remember whether we received such a message from the peer; if so, do not
send the failed complete message back to the peer when running
close_session with failed status.
2017-08-30 15:18:27 +08:00
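The flag described in eace5fc6e8 prevents a failed complete message from being echoed back to the peer that sent it. A minimal model of that logic, with a hypothetical `tiny_session` class and a counter standing in for the actual network send:

```cpp
// Hypothetical model of a stream session's failure handshake.
class tiny_session {
    bool _received_failed_complete_message = false;
    int _failed_messages_sent = 0;  // stand-in for sending over the network
public:
    // Handler invoked when the peer tells us the session failed.
    void received_failed_complete_message() {
        _received_failed_complete_message = true;
    }
    // Notify the peer of a failure only if the peer did not already tell us.
    void close_session(bool failed) {
        if (failed && !_received_failed_complete_message) {
            ++_failed_messages_sent;
        }
    }
    int failed_messages_sent() const { return _failed_messages_sent; }
};
```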
Asias He
cc18da5640 Revert "gossip: Make bootstrap more robust"
This reverts commit b56ba02335.

After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), streaming verb in rpc does not check if a
node is in gossip membership since all the retry logic is removed.

Remove the extra wait before removing the joining node from gossip
membership.

Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>
2017-08-30 10:14:11 +03:00
Avi Kivity
48b9e47f7d Revert "row_cache: Add missing handling for failures happening outside the updating thread"
This reverts commit f9feb310ab (requested by author).
2017-08-29 19:26:02 +03:00
Tomasz Grabiec
f9feb310ab row_cache: Add missing handling for failures happening outside the updating thread
Thread stack allocation may fail, in which case we did not do the
necessary invalidation. Fix by hoisting the scope of the cleanup function.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b.

Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:22 +03:00
Tomasz Grabiec
5d2f2bc90b lsa: Mark region::merge() as noexcept
It seems to satisfy this, and row_cache::do_update() will rely on it to simplify
error handling.

Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:17 +03:00
Asias He
8fa35d6ddf messaging_service: Get rid of timeout and retry logic for streaming verb
With the "Use range_streamer everywhere" (7217b7ab36) series, all
the users of streaming now stream relatively small ranges and
can retry streaming at a higher level.

There are problems with timeout and retry at RPC verb level in streaming:
1) Timeout can be false negative.
2) We can not cancel the send operations which are already called. When
user aborts the streaming, the retry logic keeps running for a long
time.

This patch removes all the timeout and retry logic for streaming verbs.
After this, the timeout is the job of TCP, the retry is the job of the
upper layer.

Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>
2017-08-29 17:20:00 +03:00
Botond Dénes
d1209c548a Fix -Wreturn-type warnings
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <99f7a006daaa78eb87720ac51c394093398bc868.1504013915.git.bdenes@scylladb.com>
2017-08-29 16:41:09 +03:00
Tomer Sandler
f1eb6a8de3 node_health_check: Various updates
- Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore).
- Removed curl command (no longer using the api_address), instead using scylla --version
- Added -v flag in iptables command, for more verbosity
- Added support for OEL (Oracle Enterprise Linux) - minor fix
- Some text changes - minor
- OEL support indentation fix + collecting all files under /etc/scylla
- Added line separation under cp output message

Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828131429.4212-1-tomer@scylladb.com>
2017-08-29 15:15:10 +03:00
Paweł Dziepak
90c77c89ae test.py: add missing compress_test
Message-Id: <20170829105331.27078-1-pdziepak@scylladb.com>
2017-08-29 13:05:11 +02:00
Paweł Dziepak
d5fa07f6df Merge "sstables: switch from deque<> to a custom container" from Avi
Large deques require contiguous storage, which may not be available (or may
be expensive to obtain).  Switch to new custom container instead, which allocates
less contiguous storage.

Allocation problems were observed with the summary and compression info. While
there is work to reduce compression info contiguous space use, this solves
all std::deque problems (and should not conflict with that work).

Fixes #2708

* tag '2708/v6' of https://github.com/avikivity/scylla:
  sstables: switch std::deque to chunked_vector
  tests: add test for chunked_vector
  utils: add a new container type chunked_vector
2017-08-29 11:11:01 +01:00
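The idea behind the container introduced in d5fa07f6df is to store elements in fixed-size chunks so that no single allocation grows with the total element count. The class below is a hypothetical toy (`tiny_chunked_vector`, with a deliberately small default chunk size) and not the interface of the real utils chunked_vector; it assumes `T` is default-constructible and copy-assignable.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy chunked container: elements live in fixed-size chunks, so the largest
// contiguous allocation for element storage is ChunkSize elements.
template <typename T, std::size_t ChunkSize = 4>
class tiny_chunked_vector {
    std::vector<std::unique_ptr<T[]>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size % ChunkSize == 0) {
            // All existing chunks are full (or there are none): add one.
            _chunks.push_back(std::make_unique<T[]>(ChunkSize));
        }
        _chunks[_size / ChunkSize][_size % ChunkSize] = v;
        ++_size;
    }
    T& operator[](std::size_t i) { return _chunks[i / ChunkSize][i % ChunkSize]; }
    const T& operator[](std::size_t i) const { return _chunks[i / ChunkSize][i % ChunkSize]; }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
};
```

The index of chunk pointers is still contiguous, but it holds one pointer per chunk rather than one slot per element, so it stays far smaller than the data it indexes.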
Avi Kivity
5224ab9c92 Merge "Fix sstable reader not working for empty set of clustering ranges" from Tomasz
"Fixes #2734."

* 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev:
  tests: Introduce clustering_ranges_walker_test
  tests: simple_schema: Add missing include
  sstables: reader: Make clustering_ranges_walker work with empty range set
  clustering_ranges_walker: Make adjacency more accurate
2017-08-29 10:28:49 +03:00
Asias He
a36141843a gossip: Switch to seastar::lowres_system_clock
The newly added lowres_system_clock is good enough for gossip
resolution. Switch to use it.

Message-Id: <fe0e7a9ef1ea0caffaa8364afe5c78b6988613bf.1503971833.git.asias@scylladb.com>
2017-08-29 10:16:25 +03:00
Asias He
2701bfd1f8 gossip: Use unordered_map for _unreachable_endpoints and _shadow_unreachable_endpoints
The _unreachable_endpoints will be accessed in fast path soon by the
hinted hand off code.

Message-Id: <500d9cbb2117ab7b070fd1bd111c5590f46c3c3a.1503971826.git.asias@scylladb.com>
2017-08-29 10:15:55 +03:00
Tomasz Grabiec
05e0ca6546 tests: Introduce clustering_ranges_walker_test 2017-08-28 21:08:55 +02:00
Tomasz Grabiec
dcbc1282a9 tests: simple_schema: Add missing include 2017-08-28 21:00:06 +02:00
Tomasz Grabiec
48dabc8262 sstables: reader: Make clustering_ranges_walker work with empty range set
Such queries can be issued by counter updates which involve only
static row.

Causes failure in test_query_only_static_row invoked from
sstable_mutation_test. See commit 6572f38, which fixed the problem in
cache reader.

Fixes #2734.
2017-08-28 21:00:06 +02:00
Tomasz Grabiec
071badce3b clustering_ranges_walker: Make adjacency more accurate
The current check considered some adjacent range tombstones as overlapping
with the ranges. Making this more accurate will become more important
once we rely on putting p_i_p::after_all_clustered_rows() in
_current_start in out-of-range state.
2017-08-28 21:00:06 +02:00
Jesse Haber-Kucharsky
abf4c1688d db/config.cc: Clarify documentation for typed_value_ex 2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky
7374f9d86f db/config.cc: Fix formatting and warnings 2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky
90666e5744 db/config.cc: Remove unnecessary mutable on lambdas 2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky
449bd60480 db/config.cc: Remove unused variables 2017-08-28 10:08:29 -04:00
Botond Dénes
eec451bcf8 segmented_offsets: use _current_bucket_segment_index consistently
Previously _current_bucket_segment_index was used differently depending on
whether update_position_trackers() is used in a random or sequential
access. In the former case it was used as the absolute index of the segment
(independent of the buckets) and in the latter as the relative index of
the segment within its bucket. This caused problems when there was a
switch between random and sequential access, meaning one could get different
results for an at() call depending on what the previous at() call was.
Fix this by consistently using _current_bucket_segment_index as, like its
name suggests, the bucket-relative segment index.

Ref #1946.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7f68ac1d32c80e8dea6dfa11be02acaa961bce2a.1503924927.git.bdenes@scylladb.com>
2017-08-28 16:14:25 +03:00
Avi Kivity
fa8d0fe4d0 Revert "Revert "Revert "Revert "Merge "Compress in-memory compression-info" from Botond""""
This reverts commit 238877a0c6.  A fix was found and will be committed
shortly.
2017-08-28 16:14:13 +03:00
Tomer Sandler
83f249c15d node_health_check: added line separation under cp output message
Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828124307.2564-1-tomer@scylladb.com>
2017-08-28 15:44:13 +03:00
Tomasz Grabiec
16c1b0fb6b Merge "Reduce dependencies on types.hh" from Avi
* 'deps1/v1' of https://github.com/avikivity/scylla:
  types.hh: extract marshal_exception from types.hh into a new file
  utils: remove dependency on types.hh
  locator: add missing include "log.hh"
  supervisor: remove dependency on init.hh
  tracing: add missing include "log.hh"
  gms: remove unneeded #include "types.hh"
2017-08-28 13:58:46 +02:00
Avi Kivity
4e67bc9573 Merge "Fixes for skipping in sstable reader" from Tomasz
* 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Add more tests for fast forwarding across partitions
  sstables: Fix abort in mutation reader for certain skip pattern
  sstables: Fix reader returning partition past the query range in some cases
  sstables: Introduce data_consume_context::eof()
2017-08-28 12:48:02 +03:00
Tomasz Grabiec
3241018c79 tests: mutation_source_test: Add more tests for fast forwarding across partitions 2017-08-28 10:30:08 +02:00
Tomasz Grabiec
65e488c150 sstables: Fix abort in mutation reader for certain skip pattern
The problem happens for the following sequence of events:

 1) reader stops in the middle of some partition before it
    skips to another partition range

 2) reader is fast forwarded to a partition range which has no data in
    the sstable. There are some partitions between the previous
    partition range and the one we skip to

 3) the reader is asked for next partition

The problem was that mutation_reader::fast_forward_to() was putting
the reader in _read_enabled == false state in step 2, but
data_consume_context was not fast forwarded to the range. When in step
3 we were asked for the next partition, we attempted to skip using
index (because of 1). The result of the skip was some position which
is outside of the current range of data_consume_context, which causes
it to abort. To fix, add a check for _read_enabled before we try to
skip.
2017-08-28 10:28:15 +02:00
Tomasz Grabiec
dc3c8863f3 sstables: Fix reader returning partition past the query range in some cases
If index was used to skip to the next partition (because the current
partition wasn't consumed in full) and reader's partition range ends
before the data file ends, we did not detect that we're out of range
before returning a streamed_mutation. Fix by checking _context.eof()
before doing that.

Refs #2733.
2017-08-28 10:16:27 +02:00
Tzach Livyatan
12fb975282 Fix typos in metrics description
Fixes #2658

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803121732.19640-1-tzach@scylladb.com>
2017-08-28 10:48:28 +03:00
Takuya ASADA
437931f499 dist/redhat: fix dependency package name typo
scylla-libstdc++-static53 -> scylla-libstdc++53-static

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1503306027-7316-1-git-send-email-syuu@scylladb.com>
2017-08-28 10:44:40 +03:00
Tomasz Grabiec
6baad2c2e6 sstables: Introduce data_consume_context::eof() 2017-08-28 09:19:43 +02:00
Avi Kivity
171fe67a64 gms: remove unneeded #include "types.hh" 2017-08-27 15:18:57 +03:00
Avi Kivity
a9f19e37b5 tracing: add missing include "log.hh"
It's currently made available via another include, which is going away.
2017-08-27 15:18:41 +03:00
Avi Kivity
471ae5b22b supervisor: remove dependency on init.hh
Replace with a simpler dependency on log.hh
2017-08-27 15:17:55 +03:00
Avi Kivity
27d3ab20a9 locator: add missing include "log.hh"
It's currently made available via another include, which is going away.
2017-08-27 15:17:05 +03:00
Avi Kivity
7234f0f0a0 utils: remove dependency on types.hh
Replace with dependency on much smaller marshal_exception.hh.
2017-08-27 15:16:21 +03:00
Avi Kivity
93317e2f4a types.hh: extract marshal_exception from types.hh into a new file
For better or worse, marshal_exception is used from utils/, and it's not good
to have utils/ depend on types.hh. Extract marshal_exception to make it possible
to remove the dependency.
2017-08-27 15:14:55 +03:00
Avi Kivity
238877a0c6 Revert "Revert "Revert "Merge "Compress in-memory compression-info" from Botond"""
This reverts commit 9d27455744. It's still broken.

To reproduce:

  ./tools/bin/cassandra-stress write -schema compression=LZ4Compressor

(on a clean database)

.0  0x00007ffff32aa69b in raise () from /lib64/libc.so.6
.1  0x00007ffff32ac4a0 in abort () from /lib64/libc.so.6
.2  0x000000000054a0e8 in seastar::memory::abort_on_underflow (size=<optimized out>) at core/memory.cc:1189
.3  seastar::memory::allocate_large (size=<optimized out>) at core/memory.cc:1194
.4  0x000000000054b305 in seastar::memory::allocate (size=size@entry=18446744073702885265) at core/memory.cc:1227
.5  0x000000000054b45e in malloc (n=n@entry=18446744073702885265) at core/memory.cc:1452
.6  0x00000000006013e4 in seastar::temporary_buffer<char>::temporary_buffer (this=0x6010195fc800, size=18446744073702885265) at /home/avi/urchin/seastar/core/temporary_buffer.hh:72
.7  0x0000000000a3908b in seastar::input_stream<char>::read_exactly (this=0x6010053d0248, n=18446744073702885265) at /home/avi/urchin/seastar/core/iostream-impl.hh:189
.8  0x0000000000a9c77f in compressed_file_data_source_impl::get (this=0x6010053d0240) at sstables/compress.cc:499
.9  0x0000000000aa1b01 in seastar::data_source::get (this=<optimized out>) at /home/avi/urchin/seastar/core/iostream.hh:63
.10 seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}::operator()() const (__closure=__closure@entry=0x6010195fcab0) at /home/avi/urchin/seastar/core/iostream-impl.hh:204
.11 0x0000000000aa22f0 in seastar::futurize<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >::apply<seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}&>(sstables::data_consume_rows_context&&) (func=...) at /home/avi/urchin/seastar/core/future.hh:1312
.12 seastar::repeat<seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}>(sstables::data_consume_rows_context&&) (action=...) at /home/avi/urchin/seastar/core/future-util.hh:203
.13 0x0000000000a9e730 in seastar::input_stream<char>::consume<sstables::data_consume_rows_context> (consumer=..., this=<optimized out>) at /home/avi/urchin/seastar/core/iostream-impl.hh:237
.14 data_consumer::continuous_data_consumer<sstables::data_consume_rows_context>::consume_input<sstables::data_consume_rows_context> (c=..., this=<optimized out>) at sstables/consumer.hh:226
.15 sstables::data_consume_context::impl::read (this=<optimized out>) at sstables/row.cc:411
.16 sstables::data_consume_context::read (this=<optimized out>) at sstables/row.cc:437
.17 0x0000000000aafbae in sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const (__closure=<optimized out>) at sstables/partition.cc:843
.18 seastar::apply_helper<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}, std::tuple<>&&, std::integer_sequence<unsigned long> >::apply({lambda()#2}&&, std::tuple) (args=..., func=...) at ./seastar/core/apply.hh:36
.19 seastar::apply<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&, std::tuple<>&&) (args=..., func=...)
    at ./seastar/core/apply.hh:44
.20 seastar::futurize<seastar::future<> >::apply<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&, std::tuple<>&&) (args=...,
    func=...) at ./seastar/core/future.hh:1302
.21 seastar::future<>::then<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}, seastar::future<> >(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&) (
    this=this@entry=0x6010195fcbb0, func=...) at ./seastar/core/future.hh:890
.22 0x0000000000ac273f in sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const (__closure=0x6010195fcc28) at sstables/partition.cc:843
.23 seastar::do_until_continued<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}&&, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}&&, seastar::promise<>) (stop_cond=..., action=..., p=...) at /home/avi/urchin/seastar/core/future-util.hh:155
.24 0x0000000000ac29c3 in seastar::do_until<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}&&, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}&&) (action=..., stop_cond=..., this=<optimized out>) at /home/avi/urchin/seastar/core/future-util.hh:330
.25 sstables::sstable_streamed_mutation::fill_buffer (this=<optimized out>) at sstables/partition.cc:844
.26 0x0000000000ad3d2b in streamed_mutation::fill_buffer (this=0x6010195fcd10) at ./streamed_mutation.hh:489
.27 consume_flattened_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer> >, std::function<bool (streamed_mutation const&)> >(mutation_reader&, stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer> >&, std::function<bool (streamed_mutation const&)>&&) (

(gdb) p addr
$1 = {
  chunk_start = 13330037,
  chunk_len = 18446744073702885265,
  offset = 0
}
2017-08-27 13:32:37 +03:00
Avi Kivity
576e33149f Merge seastar upstream
* seastar 0083ee8...85ca12d (1):
  > Merge "Run-time logging configuration" from Jesse

Includes patch from Jesse:

"Switch to Seastar for logging option handling

In addition to updating the abstraction layer for Seastar logging in `log.hh`,
the configuration system (`db/config.{hh,cc}`) has been updated in two ways:

- The string-map type for Boost.program_options is now defined in Seastar.

- A configuration value can be marked as `UsedFromSeastar`. This is like `Used`,
  except the option is expected to be defined in the Boost.Program_options
  description for Seastar. If the option is not defined in Seastar, or it is
  defined with a different type, then a run-time exception is thrown early in
  Scylla's initialization. This is necessary because logging options which are
  now defined in Seastar were previously defined in Scylla and support for these
  options in the YAML file cannot be dropped. In order to be able to verify that
  options marked `UsedFromSeastar` are actually defined in Seastar, the
  interface for adding options to `db::config` has changed from taking a
  `boost::program_options::options_description_easy_init` (which is a handle into
  a `boost::program_options::options_description` which only allows adding
  options) to taking a `boost::program_options::options_description`
  directly (which also allows querying existing options).

Scylla also fully defers to Seastar's support for run-time logging
configuration."
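
The `UsedFromSeastar` rule above can be sketched as follows. This is a Python illustration, not the C++ code in `db/config.{hh,cc}`; the function name and the two dictionaries are hypothetical stand-ins for `db::config` and for the Seastar options description, which has to be queryable for the check to work.

```python
def validate_used_from_seastar(scylla_options, seastar_options):
    """Fail early if a UsedFromSeastar option is missing or typed differently.

    scylla_options: name -> (status, type), hypothetical stand-in for db::config.
    seastar_options: name -> type, hypothetical stand-in for the options
    already registered by Seastar.
    """
    for name, (status, value_type) in scylla_options.items():
        if status != "UsedFromSeastar":
            continue  # plain `Used` options are defined by Scylla itself
        if name not in seastar_options:
            raise RuntimeError(f"option '{name}' is not defined in Seastar")
        if seastar_options[name] is not value_type:
            raise RuntimeError(f"option '{name}' has a different type in Seastar")
```

Doing the check once during initialization means a renamed or retyped Seastar option is reported at start-up rather than being silently ignored in the YAML file.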

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <ef26cffb91bef1ae95d508187a6dd861a6c4fc84.1503344007.git.jhaberku@scylladb.com>
2017-08-27 13:11:33 +03:00
Avi Kivity
4f5b5bc8e6 Merge seastar upstream
* seastar b9f1eb7...0083ee8 (1):
  > http: Add MIME type support for JSON
2017-08-27 13:09:04 +03:00
Jesse Haber-Kucharsky
af95d3baa7 db/config.cc: Remove unused function
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5a4e4e153c2d87e838d1cf6def7a494a92a72f63.1503344007.git.jhaberku@scylladb.com>
2017-08-27 13:08:19 +03:00
Vlad Zolotarov
9b9f19606f scylla_cpuset_setup: add a description near the perftune.yaml removal
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1503600250-25169-1-git-send-email-vladz@scylladb.com>
2017-08-27 12:51:12 +03:00
Asias He
68346f7e53 repair: Use with_semaphore for sp_parallelism_semaphore
Instead of calling semaphore.signal() manually.
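
The idea, sketched in Python with the standard `threading.Semaphore` rather than Seastar's `with_semaphore`: a scoped acquire returns the unit on every exit path, including exceptions, which a manual signal() call can miss. `run_limited` is a hypothetical helper name.

```python
import threading


def run_limited(sem, func):
    """Run func while holding one unit of sem; the unit is always returned."""
    with sem:  # acquire on entry, release on exit, even if func raises
        return func()
```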

Message-Id: <51b7ecdebac91763a2340fe00959742810614845.1503648936.git.asias@scylladb.com>
2017-08-27 12:50:38 +03:00
Avi Kivity
2b3ee4b0a7 Merge "make cf drop more robust" from Glauber
"We have recently found two problems with the drop_column_family code
that need addressing. The first is that exceptions in truncate() may
lead to stop() being skipped, which can cause Scylla to crash.

The other is that a truncate() issued before drop_column_family may get
the chance to execute only after the column family is already dropped
and also crash (That is issue 2726).

The second problem is the classic problem of asynchronous execution on
an object that may terminate, which we have been traditionally solving
with a gate. We add a gate to the column family that will be closed
during CF stop(), and we will require all asynchronous operations to
enter it.

The immediate fix is for truncate(), where we have seen a real, concrete
problem. But it would be good to audit other code paths to make sure
that they are sane.

The most obvious ones, flush, compaction and sstable deletion are
already sane, since they are waited on explicitly during stop()."

Fixes #2726.

* 'issue-2726-v2-master' of github.com:glommer/scylla:
  database: add gate for generic async operations to column family
  database: make sure that column family is always stopped when dropped
2017-08-27 12:42:20 +03:00
Avi Kivity
1f66940134 sstables: switch std::deque to chunked_vector
Reduce susceptibility to memory fragmentation.
2017-08-26 16:44:47 +03:00
Avi Kivity
204659ef40 tests: add test for chunked_vector 2017-08-26 16:44:47 +03:00
Avi Kivity
3ba2c0652d utils: add a new container type chunked_vector
We currently use std::deque<> when we need large random-access containers,
but deque<> requires nr_items * sizeof(T) / 64 bytes of contiguous memory, which can
exceed our 256k fragmentation unit with large sstables.  The new
container, which is a cross between deque and vector, needs far less
contiguous memory.

Like deque, we allocate chunks of contiguous items, but they are
128k in size instead of 512 bytes. The last chunk can be smaller to avoid
allocating 128k for a really small vector.
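
A rough Python sketch of the layout described above, not the actual utils::chunked_vector. `ChunkedVector` and `index_bytes` are illustrative names. The 8-byte pointer size is an assumption; it is consistent with the "/ 64" factor quoted for deque (512-byte chunks, one pointer per chunk).

```python
class ChunkedVector:
    """Random-access container built from bounded-size chunks (illustrative)."""

    def __init__(self, chunk_capacity):
        self._cap = chunk_capacity   # maximum number of items per chunk
        self._chunks = []            # small index of chunk references
        self._size = 0

    def append(self, item):
        if self._size % self._cap == 0:
            self._chunks.append([])  # start a new chunk only when needed
        self._chunks[-1].append(item)
        self._size += 1

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._chunks[i // self._cap][i % self._cap]

    def __len__(self):
        return self._size


def index_bytes(nr_items, item_size, chunk_bytes, pointer_size=8):
    """Contiguous bytes needed for the array of chunk pointers (assumed model)."""
    nr_chunks = -(-nr_items * item_size // chunk_bytes)  # ceiling division
    return nr_chunks * pointer_size
```

Comparing `index_bytes(n, size, 512)` with `index_bytes(n, size, 128 * 1024)` shows why larger chunks shrink the one allocation that must stay contiguous.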
2017-08-26 16:44:45 +03:00
Tomasz Grabiec
2ca99be27d ring_position_view: Print token instead of token pointer
Broken in e989d65539.
Message-Id: <1503667158-7544-1-git-send-email-tgrabiec@scylladb.com>
2017-08-25 14:25:21 +01:00
Glauber Costa
83323e155e database: add gate for generic async operations to column family
run_with_compaction_disabled(), which is called by truncate, has a
pretty large defer point in remove(). When the code gets to finally
execute, we can't guarantee that the column family will still be alive.

That is true in particular if we issued a drop table command following
truncate: by the time truncate gets to resume, the CF will be gone.
Before the column family is dropped, it will always call its stop()
method, which means we have an opportunity to do some waiting there. We
already wait for flushes and current compactions to end.

Traditionally, we have been solving similar problems by adding a gate
that will catch asynchronous operations and making sure that potentially
asynchronous operations will enter the gate before executing. Let's do
the same thing here. We will close() the gate during stop().
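
The gate pattern can be sketched with Python's asyncio. This is an illustration of the idea only, not Seastar's gate; `Gate`, `GateClosed`, `with_gate` and `demo` are hypothetical names.

```python
import asyncio


class GateClosed(Exception):
    pass


class Gate:
    """Counts in-flight operations; close() waits for them and blocks new ones."""

    def __init__(self):
        # Construct inside a running event loop so the Event binds to it.
        self._count = 0
        self._closed = False
        self._drained = asyncio.Event()

    def enter(self):
        if self._closed:
            raise GateClosed()
        self._count += 1

    def leave(self):
        self._count -= 1
        if self._closed and self._count == 0:
            self._drained.set()

    async def close(self):
        self._closed = True
        if self._count == 0:
            self._drained.set()
        await self._drained.wait()


async def with_gate(gate, op):
    gate.enter()
    try:
        return await op()
    finally:
        gate.leave()


async def demo():
    gate = Gate()
    events = []

    async def truncate():
        await asyncio.sleep(0.01)   # stands in for the long defer point
        events.append("truncate finished")

    task = asyncio.create_task(with_gate(gate, truncate))
    await asyncio.sleep(0)          # let truncate enter the gate
    await gate.close()              # stop(): waits for the pending truncate
    events.append("stopped")
    await task
    try:
        await with_gate(gate, truncate)
    except GateClosed:
        events.append("late truncate rejected")
    return events
```

Running `asyncio.run(demo())` exercises both cases: an operation already inside the gate finishes before stop() completes, and one arriving afterwards is refused instead of touching a dead object.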

Fixes #2726

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-08-24 13:12:57 -04:00
Glauber Costa
d090e7be35 database: make sure that column family is always stopped when dropped
truncate can throw exceptions. If it does, cf->stop() will never be
called because it is contained in a .then clause instead of finally.

One of the things that truncate does - in a finally block of its own -
is initiate a final compaction. If it returns an exception nobody will
wait for that compaction to finish (since cf->stop() is the one doing
that) and we'll crash.
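
The difference between chaining stop() with .then and with finally, sketched in synchronous Python. The function names are hypothetical; `truncate` and `stop` are any callables.

```python
def drop_with_then(truncate, stop):
    """stop() is chained after success only, like a .then continuation."""
    truncate()
    stop()  # skipped when truncate() raises


def drop_with_finally(truncate, stop):
    """stop() runs on both success and failure, like a finally continuation."""
    try:
        truncate()
    finally:
        stop()
```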

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-08-24 13:01:47 -04:00
Avi Kivity
40aeb00151 Merge "consider the pre-existing cpuset.conf when configuring networking mode" from Vlad
"Preserve the networking configuration mode during the upgrade by generating the /etc/scylla.d/perftune.yaml
file and using it."

Fixes #2725.

* 'dist_respect_cpuset_conf-v3' of https://github.com/vladzcloudius/scylla:
  scylla_prepare: respect the cpuset.conf when configuring the networking
  scylla_cpuset_setup: rm perftune.yaml
  scylla_cpuset_setup: add a missing "include" of scylla_lib.sh
2017-08-24 18:53:22 +03:00
Vlad Zolotarov
c72eb34b89 scylla_prepare: respect the cpuset.conf when configuring the networking
Choose the networking configuration mode according to the current contents of /etc/scylla.d/cpuset.conf.

If it doesn't exist - use the default mode.
If it exists - use the mode that has been used for generation of the CPU set.

Store the configuration into the /etc/scylla.d/perftune.yaml

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-08-24 09:09:40 -04:00
Vlad Zolotarov
89285a13ac scylla_cpuset_setup: rm perftune.yaml
scylla_setup resets our configuration and perftune.yaml is a part of it.
perftune.yaml is generated based on the contents of cpuset.conf therefore we should reset
these together.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-08-24 09:09:40 -04:00
Vlad Zolotarov
d0ccfe34b9 scylla_cpuset_setup: add a missing "include" of scylla_lib.sh
The scylla_cpuset_setup uses a verify_args() function that is defined in the scylla_lib.sh.

Fixes #2716

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-08-24 09:09:40 -04:00
Paweł Dziepak
1006a946e8 mvcc: allow invoking maybe_merge_versions() inside allocating section
Message-Id: <20170823083544.4225-1-pdziepak@scylladb.com>
2017-08-24 14:30:38 +02:00
Pekka Enberg
870de26e35 index: Add index class
Add a simple index class, which represents an instantiated index.
2017-08-24 14:00:02 +03:00
Pekka Enberg
d63a650b3f index: Pass column_family to secondary_index_manager constructor
We need column family for various secondary index manager operations.
2017-08-24 14:00:02 +03:00
Pekka Enberg
981e320d54 database: Make secondary index manager per-column family
Make the secondary index manager per-column family like in Apache
Cassandra to keep CQL front-end similar between the two codebases.
2017-08-24 14:00:02 +03:00
Botond Dénes
839d1db4d3 parse(compression): add missing reinterpret_cast<char*>
std::copy_n was using value as uint64_t*, smashing the stack.
Also remove unused variable.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4e2d71fc74326965dfd98bed2347100fb6ebe43b.1503568210.git.bdenes@scylladb.com>
2017-08-24 13:38:03 +03:00
Avi Kivity
9d27455744 Revert "Revert "Merge "Compress in-memory compression-info" from Botond""
This reverts commit 9656fd79a0. A fix is now
available.
2017-08-24 13:37:35 +03:00
Tomasz Grabiec
9656fd79a0 Revert "Merge "Compress in-memory compression-info" from Botond"
This reverts commit ef85cf1cb3, reversing
changes made to de011ece52.

Vlad reports that this causes SIGSEGV on cluster restarts.

seastar::backtrace_buffer::append_backtrace() at /home/vladz/work/urchin/seastar/core/reactor.cc:274
 (inlined by) print_with_backtrace at /home/vladz/work/urchin/seastar/core/reactor.cc:289
seastar::print_with_backtrace(char const*) at /home/vladz/work/urchin/seastar/core/reactor.cc:296
sigsegv_action at /home/vladz/work/urchin/seastar/core/reactor.cc:3512
 (inlined by) operator() at /home/vladz/work/urchin/seastar/core/reactor.cc:3498
 (inlined by) _FUN at /home/vladz/work/urchin/seastar/core/reactor.cc:3494
?? ??:0
operator()<seastar::temporary_buffer<char> > at /home/vladz/work/urchin/sstables/sstables.cc:870
 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/apply.hh:44
 (inlined by) do_void_futurize_apply_tuple<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/future.hh:1270
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/future.hh:1290
 (inlined by) then<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)> > at /home/vladz/work/urchin/seastar/core/future.hh:890
 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:873
 (inlined by) do_until_continued<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>, sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>&> at /home/vladz/work/urchin/seastar/core/future-util.hh:155
do_until<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>, sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>&> at /home/vladz/work/urchin/seastar/core/future-util.hh:330
 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:874
 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/apply.hh:44
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:1302
then<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:890
 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:875
 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()> > at /home/vladz/work/urchin/seastar/core/apply.hh:44
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:1302
operator()<seastar::future_state<> > at /home/vladz/work/urchin/seastar/core/future.hh:900
 (inlined by) run at /home/vladz/work/urchin/seastar/core/future.hh:395
seastar::reactor::run_tasks(seastar::circular_buffer<std::unique_ptr<seastar::task, std::default_delete<seastar::task> >, std::allocator<std::unique_ptr<seastar::task, std::default_delete<seastar::task> > > >&) at /home/vladz/work/urchin/seastar/core/reactor.cc:2317
seastar::reactor::run() at /home/vladz/work/urchin/seastar/core/reactor.cc:2775
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/vladz/work/urchin/seastar/core/app-template.cc:142
2017-08-24 11:44:14 +02:00
Alexys Jacob
a133290694 scylla_io_setup: migrate away from deprecated string.atoi
Python 2.0 deprecated string.atoi and we should move away from it
as stated here: https://docs.python.org/2/library/string.html#string.atoi
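
For reference, the built-in `int()` covers both forms of `string.atoi`. The helper names below are hypothetical and are not taken from scylla_io_setup.

```python
def parse_count(text):
    """Replacement for string.atoi(text): parse a decimal integer."""
    return int(text)


def parse_hex(text):
    """Replacement for string.atoi(text, 16): parse with an explicit base."""
    return int(text, 16)
```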

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817134002.28124-1-ultrabug@gentoo.org>
2017-08-24 12:36:34 +03:00
Avi Kivity
dcac7125fe Merge seastar upstream
* seastar e96881a...b9f1eb7 (9):
  > httpd: indentation patch
  > httpd: handle exception when shutting down
  > stall-detector: Allow backtrace throttling to be configured
  > stall-detector: Fix messages about suppresssion not appearing
  > scripts: posix_net_conf.sh: allow passing a perftune.py configuration file as a parameter
  > scripts: perftune.py: add the possibility to pass the parameters in a configuration file and print the YAML file with the current configuration
  > scripts: perftune.py: actually use the number of Rx queues when comparing to the number of CPU threads
  > core: make current_backtrace() noexcept
  > memory: add large allocation detector stubs for default allocator
2017-08-24 11:35:28 +03:00
Piotr Jastrzebski
477068d2c3 Make streamed_mutation more exception safe
Make sure that push_mutation_fragment leaves
_buffer_size with a correct value if an exception
is thrown from emplace_back.
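
One way to get this guarantee, sketched in Python: update the size counter only after the insertion has succeeded. This is an assumed arrangement for illustration, not necessarily how the patch does it; `Buffer` and `FailingStore` are hypothetical names.

```python
class Buffer:
    """Keeps buffer_size in step with the stored fragments."""

    def __init__(self, store):
        self._items = store      # any object with append(); may raise
        self.buffer_size = 0

    def push(self, fragment):
        size = len(fragment)
        self._items.append(fragment)   # may raise; no state changed yet
        self.buffer_size += size       # only reached on success


class FailingStore:
    """Simulates an allocation failure during insertion."""

    def append(self, item):
        raise MemoryError("simulated allocation failure")
```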

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <83398412aa78332d88d91336b79140aecc988602.1503474403.git.piotr@scylladb.com>
2017-08-23 09:37:04 +01:00
Avi Kivity
2f41ed8493 Merge "repair: Do not allow repair until node is in NORMAL status" from Asias
Fixes #2723.

* tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev:
  repair: Do not allow repair until node is in NORMAL status
  gossip: Add is_normal helper
2017-08-23 09:44:45 +03:00
Asias He
69c81bcc87 repair: Do not allow repair until node is in NORMAL status
The following backtrace was reported by a user who ran repair while repeatedly restarting the node.

 #0  0x00007eff077281d7 in raise () from /lib64/libc.so.6
 #1  0x00007eff07729a08 in abort () from /lib64/libc.so.6
 #2  0x00007eff07721146 in __assert_fail_base () from /lib64/libc.so.6
 #3  0x00007eff077211f2 in __assert_fail () from /lib64/libc.so.6
 #4  0x00000000010ef2c2 in locator::token_metadata::first_token_index (this=0x641000214e98, start=...) at locator/token_metadata.cc:133
 #5  0x00000000010ef2d9 in locator::token_metadata::first_token (this=0x641000214e98, start=...) at locator/token_metadata.cc:143
 #6  0x00000000010e329d in locator::abstract_replication_strategy::get_natural_endpoints (this=0x641000494000, search_token=...)
     at locator/abstract_replication_strategy.cc:66
 #7  0x0000000001481186 in get_neighbors (hosts=std::vector of length 0, capacity 0, data_centers=std::vector of length 0, capacity 0,
     range=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ksname=..., db=...) at repair/repair.cc:196
 #8  repair_range<nonwrapping_range<dht::token> > (range=..., ri=...) at repair/repair.cc:781
 #9  <lambda(auto:99&)>::<lambda(auto:100&&)>::<lambda(auto:101&)>::<lambda()>::operator() (__closure=0x7efec07f7460) at repair/repair.cc:1005
 #10 futurize<future<bool_class<stop_iteration_tag> > >::apply<repair_ranges(repair_info)::<lambda(auto:99&)>::

It is reproduced with

1) while true; do curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/ks3"; done

2) start node 127.0.0.1, stop node 127.0.0.1 in a loop

The problem is that, during boot, the token_metadata is not replicated to all shards until
the node goes into NORMAL status.

To fix, do not allow repair until the node is in NORMAL status.

Fixes #2723
2017-08-23 14:40:04 +08:00
Asias He
65912dd1ac gossip: Add is_normal helper
It will be used by repair to check if a node is in NORMAL status.
2017-08-23 14:40:04 +08:00
Amnon Heiman
abbd78367c Add configuration to disable per keyspace and column family metrics
The number of keyspace and column family metrics reported is
proportional to the number of shards times the number of keyspaces/column
families.

This can cause a performance issue both on the reporting system and on
the collecting system.

This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.

Fixes #2701

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
2017-08-22 19:19:54 +03:00
Botond Dénes
4f42acc956 abstract_marker::raw::prepare: add missing return statement
The function doesn't return a value in the all-false branch.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3c1976682ffc190d741c066d942b83be4463cae8.1503402721.git.bdenes@scylladb.com>
2017-08-22 15:06:18 +03:00
Paweł Dziepak
9d82a1ebfd abstract_read_executor: make make_requests() exception safe
Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Paweł Dziepak
31afc2f242 shared_index_lists: restore indentation
Message-Id: <20170821162934.25386-4-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Paweł Dziepak
93eaa95378 sstables: make shared_index_lists::get_or_load exception safe
Message-Id: <20170821162934.25386-3-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Avi Kivity
ef85cf1cb3 Merge "Compress in-memory compression-info" from Botond
"Overly large metadata can hog memory which especially hurts in setups
with a bad disk/memory ratio. To ease the pain, compress the in-memory
compression-info.

The compression is implemented based on Avi's idea, which is to group n
offsets together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segment being
relative offsets (and thus of reduced size). Also, offsets are allocated
only just enough bits to store their maximum value. The offsets are thus
packed in a buffer like so:
    arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.
This of course means that stored offsets will not be aligned, not even
on a byte boundary, but the size reduction is pretty convincing.
In addition, segments are stored in buckets, where each bucket has its
own base offset. Segments in a bucket are optimized to
address as large a chunk of the data as possible for a given chunk
size."
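
The "arrr" grouping can be sketched in Python as follows. The sketch leaves out the bucket level and the real bit-level packing and only counts the bits a packed form would need; `encode`, `lookup` and `packed_bits` are hypothetical names.

```python
def encode(offsets, n):
    """Group offsets into segments: one absolute base plus n-1 relative offsets."""
    segments = []
    for start in range(0, len(offsets), n):
        group = offsets[start:start + n]
        base = group[0]
        segments.append((base, [o - base for o in group[1:]]))
    return segments


def lookup(segments, n, i):
    """Recover the i-th original offset from the segmented form."""
    base, relative = segments[i // n]
    j = i % n
    return base if j == 0 else base + relative[j - 1]


def packed_bits(segments):
    """Bits needed if each field gets just enough bits for its maximum value."""
    abs_bits = max(base for base, _ in segments).bit_length()
    rel_bits = max((r for _, rel in segments for r in rel), default=0).bit_length()
    return sum(abs_bits + rel_bits * len(rel) for _, rel in segments)
```

Comparing `packed_bits(encode(offsets, n))` with `64 * len(offsets)` gives a feel for the saving over storing one 64-bit integer per offset.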

Ref #1946.

* 'bdenes/compress-compression-v3' of https://github.com/denesb/scylla:
  Add unit test for compress::offsets
  Optimise the storage of compression chunk offsets
  Add script to precompute segmented compression parameters
2017-08-22 10:30:58 +03:00
Botond Dénes
62c18da35c Add unit test for compress::offsets 2017-08-21 17:06:20 +03:00
Botond Dénes
028c7a0888 Optimise the storage of compression chunk offsets
To reduce the memory footprint of compression-info, n offsets are
grouped together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size). Also offsets are
allocated only just enough bits to store their maximum value. The
offsets are thus packed in a buffer like so:
     arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.

The optimal value of n can be calculated for a given file_size (f) and
chunk_size (c), by finding the minima of the following function:

f(n) = (f/c)/n * (log2(f) + (n - 1)*log2((n-1)*(c + 64)))

This is done in an empirical way, using a script (see below).

Furthermore segments are stored in buckets, where each bucket has its
own base offset. Each bucket therefore can address an equal chunk of the
file and furthermore each segment in a bucket can address an equal
sub-chunk of this area.
The value of a given offset i is thus:
    bucket_base_offset_for(i) + segment_base_offset_for(i) + offset(i)

To account for the bucketed storage we calculate a local_f, which is
optimized so that a bucketful of segmented offsets can address the
largest possible chunk of f. As the value of this local_f depends only on
the bucket_size (b) and c, the value of n can be made independent of f
and therefore depends on only one dynamic value, c. This makes life much
simpler as we don't need to know the size of the file up-front; we can
just append buckets to the storage on demand, while the required storage
is still less than a third [1] of the original storage requirements
(std::deque<uint64>).

The table with the minima(f(n)) for different f and c values is
pre-computed by gen_segmented_compress_params.py and
stored in sstables/segmented_compress_params.hh. This script also
creates a table with the best values of local_f for the given
bucket_size. At runtime we only select the best params based on c.

[1] This was calculated for c=4K and b=4K
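
The empirical search for n can be sketched as a brute-force scan over the cost function given in this message. This is not gen_segmented_compress_params.py. The names and the search bound `max_n` are assumptions, and the bucket and local_f refinement is left out.

```python
import math


def storage_bits(n, file_size, chunk_size):
    """Evaluate the cost function from the message for a group size n."""
    nr_offsets = file_size / chunk_size
    absolute_bits = math.log2(file_size)
    if n == 1:
        relative_bits = 0.0  # a segment of one offset has no relative part
    else:
        relative_bits = (n - 1) * math.log2((n - 1) * (chunk_size + 64))
    return nr_offsets / n * (absolute_bits + relative_bits)


def best_group_size(file_size, chunk_size, max_n=64):
    """Return the n in [1, max_n] that minimizes storage_bits()."""
    return min(range(1, max_n + 1),
               key=lambda n: storage_bits(n, file_size, chunk_size))
```

For example, `best_group_size(2**30, 4096)` scans n for a 1 GiB file with 4K chunks, and `storage_bits()` at that n can be compared against 64 bits per offset.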
2017-08-21 17:06:12 +03:00
Avi Kivity
de011ece52 main: deprecate non-murmur3 partitioners more forcefully
Some (most?) users don't read logs or release notes, so they won't notice
that the ByteOrdered and Random partitioners were deprecated in 2.0. Make
them notice by refusing to start with a deprecated partitioner, unless a
switch is explicitly enabled.
Message-Id: <20170820073424.8331-1-avi@scylladb.com>
2017-08-21 14:32:22 +02:00
Avi Kivity
9f415ef870 sstables: accurate summary entry size calculation
Calculate the summary entry size correctly, so we don't end up with oversize
summaries.
Message-Id: <20170819184255.14181-2-avi@scylladb.com>
2017-08-21 14:28:57 +02:00
Avi Kivity
17c372bf0e sstables: get rid of 64kB minimum index advance to generate summary
Limiting summary entry generation to at most one summary entry
per 64k of index data can lead to large index pages, with thousands
of index entries per summary entry. These are slow to parse, and there
is no real gain from the limit, since we already enforce a size
limit on the summary.

Remove the limit and allow summary entry generation based solely on
spanned data size.

Fixes #2711.
Message-Id: <20170819184255.14181-1-avi@scylladb.com>
2017-08-21 14:26:44 +02:00
Avi Kivity
81a33df25d dht: reduce split_range_to_single_shard contiguous memory demand
split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The total of 400k can
cause defragmentation if memory is fragmented.

Fix by using a deque.

Fixes #2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
2017-08-21 14:25:45 +02:00
Piotr Jastrzebski
c602ffd610 Make Scylla ttl expiration behave like in Cassandra
Fixes #2497

[tgrabiec: reworked the title]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2f5a99dce6ef11fe0ef135c9fa0592078fc9a056.1502886874.git.piotr@scylladb.com>
2017-08-21 14:25:45 +02:00
Botond Dénes
eae33a1f19 Add script to precompute segmented compression parameters
The script generates sstables/segmented_compress_params.hh which
contains a list with the optimal number of grouped offsets for
different data and chunk sizes as well as a list with the best
nominal data sizes for different chunk sizes, given a bucket size.
Data sizes are in the range of [2**4,2**50] and chunks in the
range of [2**4, 2**30]. Data sizes that are not used with the current
bucket_size are omitted.
See next commit for details of how the calculated values are used.
2017-08-21 10:44:08 +03:00
Avi Kivity
5a2439e702 main: check for large allocations
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.

Message-Id: <20170819135212.25230-1-avi@scylladb.com>
2017-08-21 10:25:40 +03:00
Pekka Enberg
318423d50b Merge seastar upstream
* seastar 2d16aca...e96881a (4):
  > memory: add detector for large allocations
  > memory: reduce large allocations for small pools
  > net: Fix potential NULL pointer dereference in udp.cc
  > Update dpdk submodule
2017-08-21 10:24:08 +03:00
Tomasz Grabiec
8f2ca52740 tests: Run test_query_only_static_row test case on all mutation sources
The test checks behavior common to all mutation readers, so it's
better to run it against all mutation sources rather than only for
cache reader.

Message-Id: <1503072333-17995-1-git-send-email-tgrabiec@scylladb.com>
2017-08-20 12:23:28 +03:00
Raphael S. Carvalho
10eaa2339e compaction: Make resharding go through compaction manager
Two reasons for this change:
1) Every compaction should be multiplexed through the manager, which in turn
decides when to schedule it. Improvements to the manager will
immediately benefit every existing compaction type.
2) The active tasks metric will now track ongoing reshard jobs.

Fixes #2671.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
2017-08-20 11:35:14 +03:00
Takuya ASADA
38b2ff617f dist/redhat: follow the change on libgcc/libstdc++ package name
Since we moved to the external 3rdparty repository, the gcc packages
carry a '53' suffix, so follow the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170819092039.1090-2-syuu@scylladb.com>
2017-08-19 16:01:28 +03:00
Takuya ASADA
f1b5401d1f dist/redhat: Change g++ command name on CentOS
Starting with scylla-gcc53-c++-5.3.1-2.2, the g++ command carries a '-5.3' suffix;
follow the change in the scylla build script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170819092039.1090-1-syuu@scylladb.com>
2017-08-19 16:01:27 +03:00
Avi Kivity
e428805ba5 Merge "Optimize query result partition and row counts" from Duarte
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the number of partitions
and rows returned.

This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."

* 'calculate-counts/v3' of github.com:duarten/scylla:
  query-result: Send row and partition count over the wire
  query::result: Optimize calculate_counts()
2017-08-17 13:41:21 +03:00
Alexys Jacob
e5ff8efea3 dist: Fix Gentoo Linux scylla-jmx and scylla-tools packages detection
These two admin related packages will be packaged under the "app-admin"
category and not the "dev-db" one.

This fixes the detection path of the packages for scylla_setup.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>
2017-08-17 13:20:43 +03:00
Nadav Har'El
7832d8a883 get rid of unused part in configure.py
Scylla's configure.py contains stuff we copied from Seastar's
configure.py, but is no longer used. Let's get rid of some of it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170813150842.12603-1-nyh@scylladb.com>
2017-08-17 12:05:44 +03:00
Duarte Nunes
1e7f0eab82 memtable: Created readers should be fast forwardable by default
mutation_reader::forwarding defaults to yes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170816180304.2121-1-duarte@scylladb.com>
2017-08-17 10:21:01 +03:00
Botond Dénes
e70cfc8f36 incremental_reader_selector: account for possibly disengaged lower bound
In addition to the constructor (fixed previously) the check for no
sstables on the first call to select() also has to be prepared for the
lower bound of the range being disengaged.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4ab1296c71814fcd492996fa36fd00fd7bbbbc7f.1502949875.git.bdenes@scylladb.com>
2017-08-17 10:07:26 +03:00
Botond Dénes
af83b7f57b incremental_reader_selector: use lazy_deref instead of ternary operator
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f4b884c6a1f517bd654f3b27608d854b17a66e1.1502948635.git.bdenes@scylladb.com>
2017-08-17 08:45:46 +03:00
Botond Dénes
eb7eee510d combined_mutation_reader_test: use the global const objects directly
Instead of local ones.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3ec1a70e4c0198c0563dff9688bbaa7fcfcace71.1502891190.git.bdenes@scylladb.com>
2017-08-16 16:56:42 +03:00
Paweł Dziepak
784dcbf1ca sstables: initialise index metrics on all shards
Fixes #2702.

Message-Id: <20170816085454.21554-1-pdziepak@scylladb.com>
2017-08-16 15:44:26 +03:00
Avi Kivity
d7e3fbc6fe Merge seastar upstream
* seastar 2a43102...2d16aca (1):
  > fstream: do not ignore unresolved future

Fixes #2697.
2017-08-16 15:09:59 +03:00
Botond Dénes
611774b1d9 Use the incremental reader for compaction
As leveled compaction strategy stands to gain the most from
incrementally opening sstables.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <292648d3fa4ea97376c0b4360754a20132194f63.1502822066.git.bdenes@scylladb.com>
2017-08-15 21:38:04 +03:00
Takuya ASADA
0f9b095867 dist/common/scripts: prevent ignoring a flag passed after another flag that requires a parameter
When a user forgets to pass the parameter for a flag, our scripts misparse
the next flag as that parameter.
ex) Correct usage is '--ntp-domain <domain> --setup-nic', but the user passed
    '--ntp-domain --setup-nic'.
As a result, the next flag is ignored by the scripts.
To prevent such behavior, reject any parameter that starts with '--'.
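
The check can be sketched in Python. The real scripts are shell, and `take_value` is a hypothetical helper used only to show the rule.

```python
def take_value(args, i):
    """Return the parameter for the flag at args[i], or raise ValueError."""
    flag = args[i]
    # A missing value or another flag in its place is rejected outright.
    if i + 1 >= len(args) or args[i + 1].startswith("--"):
        raise ValueError(f"{flag} requires a parameter")
    return args[i + 1]
```

Rejecting the call outright avoids consuming `--setup-nic` as the NTP domain, which is the failure described above.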

Fixes #2609

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170815114751.6223-1-syuu@scylladb.com>
2017-08-15 18:27:32 +03:00
Duarte Nunes
c7aa3ea069 mutation_partition: Remove obsolete short read detection
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.

This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>
2017-08-15 12:01:55 +01:00
Avi Kivity
8df6dd1fa0 database: make incremental_reader_selector robust vs. full-range partition_range
incremental_reader_selector assumes the partition_range it receives has a lower
bound, but it was seen in mutation_test that this is not so.

Fix by checking whether the bound exists or not.
Message-Id: <20170815095852.14149-1-avi@scylladb.com>
2017-08-15 11:03:22 +01:00
Avi Kivity
a35bfb3ea9 Merge seastar upstream
* seastar 47b31f6...2a43102 (1):
  > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb

Fixes #2690
2017-08-14 16:23:02 +03:00
Avi Kivity
e892a0082a Merge "Drop exhausted mutation_readers when possible" from Duarte
"Exhausted readers belonging to a combined_mutation_reader can be fast
forwarded, so we have to keep them around. However, if the reader is
not fast forwardable, then we can drop the contained readers and their
buffers."

* 'ff-reader/v2' of github.com:duarten/scylla:
  combined_mutation_reader: Drop exhausted readers if not in FF mode
  combined_mutation_reader: Remove superfluous mutation_readers list
  memtable_snapshot_source: Created readers should be fast forwardable
2017-08-14 16:20:38 +03:00
Duarte Nunes
7fb6a74302 combined_mutation_reader: Drop exhausted readers if not in FF mode
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Duarte Nunes
0b53f88a42 combined_mutation_reader: Remove superfluous mutation_readers list
The _all_readers variable can do the same job.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Duarte Nunes
77477605c1 memtable_snapshot_source: Created readers should be fast forwardable
As they're used by the cache tests.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Avi Kivity
afff29bdb9 Merge seastar upstream
* seastar edb73ab...47b31f6 (1):
  > tls: Only recurse once in shutdown code

Fixes #2691.
2017-08-14 15:09:42 +03:00
Duarte Nunes
a17cef76b2 query-result-writer: Remove unneeded field
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811102940.22747-1-duarte@scylladb.com>
2017-08-14 12:33:33 +01:00
Duarte Nunes
ec75eac37d ring_position_exponential_vector_sharder: Take ranges by rvalue
Avoids some copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170814093310.29200-1-duarte@scylladb.com>
2017-08-14 12:55:43 +03:00
Duarte Nunes
3b9a9b7321 query-result: Send row and partition count over the wire
To avoid calculating them on the coordinator side.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:29:06 +02:00
Duarte Nunes
d7bab684ea query::result: Optimize calculate_counts()
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:28:29 +02:00
Avi Kivity
cb2c5016ea Merge seastar upstream
* seastar 7a49ae5...edb73ab (11):
  > scripts: perftune.py: change the network module mode auto selection heuristic
  > net/tls: explicitly ignore ready future during shutdown
  > Use python2 explicitly as an interpreter for Python v2 scripts
  > peering_sharded_service: prevent over-run the container
  > Add link to documentation to the README.md
  > Add guidelines for contributing to Seastar
  > sharded: fix move constructor for peering_sharded_service services
  > Provide a convenient way to lazy-convert to string the values of pointers
  > tutorial: overhaul semaphores section
  > simple-stream: Make fragmented::write_substream return simple if possible
  > simple-stream: Make simple/fragmented memory output stream top level
2017-08-14 10:29:27 +03:00
Raphael S. Carvalho
050a7019b8 sstables/index_reader: fix index reader for summary entry spanning lots of keys
quantity prevents index_reader from reading all index entries of a summary
entry that spans more than min_index_interval entries. That can happen after the
introduction of size-based sampling; consequently, the sstable will not be
able to return a key whose logical position in the summary entry is beyond
min_index_interval. It's ok not to use quantity because index_reader will
read all index entries until either the next summary entry or end of file is
reached.

Fixes test_sstable_conforms_to_mutation_source

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>
2017-08-12 09:44:16 +03:00
Duarte Nunes
08e284a07e combined_mutation_reader: Don't drop mutation readers
This patch fixes a regression introduced in a6b9186ca.

We should keep the readers around in case a subsequent call to
fast_forward() will require them.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811160444.12795-1-duarte@scylladb.com>
2017-08-11 19:17:29 +03:00
Duarte Nunes
44b6da2e90 test.py: Add combined_mutation_reader_test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811155017.9899-1-duarte@scylladb.com>
2017-08-11 18:54:11 +03:00
Avi Kivity
dbf8625ac9 Merge "size-based sampling for sstable summary" from Raphael
"Fixes #1842."

* 'size_based_sampling_v3' of github.com:raphaelsc/scylla:
  tests: test summary entry spanning more keys than min interval
  db/config: introduce sstable_summary_ratio option
  sstables: introduce size-based sampling for sstable summary
  sstables: make components_writer::offset const qualified and uint64_t
  sstables: make writer::offset const qualified and uint64_t
2017-08-11 18:41:45 +03:00
Duarte Nunes
e7d56884c0 list_reader_selector: Prevent infinite loop
In case the readers are empty.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811153142.8926-1-duarte@scylladb.com>
2017-08-11 18:34:55 +03:00
Vladimir Krivopalov
003e8cf250 Use python2 explicitly as an interpreter for Python v2 scripts
Signed-off-by: Vladimir Krivopalov <vladimir.krivopalov@gmail.com>
Message-Id: <20170811032712.4362-1-vladimir.krivopalov@gmail.com>
2017-08-11 18:08:11 +03:00
Duarte Nunes
20337053ad Don't use literal lambdas
These are only available in C++17. Fixes the build after b5460c2.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-11 13:08:42 +02:00
Duarte Nunes
b5460c2990 Merge "Support duration type" from Jesse
"This patch series adds support for the `duration` type in CQL, which
was added to Cassandra in 3.10.

As part of this work, it was necessary also to add support for the
`vint` and `unsigned vint` types to the native protocol implementation,
which are part of v5 of the specification.

To test interactively, it is necessary to use cqlsh distributed with
Cassandra, as the version we distribute does not yet support the
duration type."

* 'jhk/duration_protocol/v5' of https://github.com/hakuch/scylla:
  Support `duration` CQL native type
  CQL native protocol: Add support for `vint` serialization
  duration_test.cc: Add test for printing zero duration
  duration.cc: Remove nop `const` qualifier on return type
  Change `const` qualifier declaration order for `duration`
  duration.cc: Simplify range checking
  Rename `duration` to `cql_duration`
2017-08-11 10:56:55 +01:00
Duarte Nunes
bcf21aacc2 storage_proxy: Directly call query_nonsingular_mutations_locally
Instead of duplicating the branch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811001559.25788-1-duarte@scylladb.com>
2017-08-11 09:06:01 +03:00
Duarte Nunes
a3ee99554b service/storage_proxy: Remove out of date comment
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).

Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
2017-08-11 09:04:23 +03:00
Raphael S. Carvalho
5124f94358 tests: test summary entry spanning more keys than min interval
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 01:37:06 -03:00
Raphael S. Carvalho
872412d31a db/config: introduce sstable_summary_ratio option
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 01:36:21 -03:00
Raphael S. Carvalho
8726ee937d sstables: introduce size-based sampling for sstable summary
Currently, a summary entry is added after min_index_interval index
entries were written. Not taking into account size of index entries
becomes a problem with large partitions which may create big index
entries due to promoted indexes. Read performance is affected as a
consequence because index entries spanned by summary are all read
from disk to serve request.

What we want to do is to also add a summary entry after the index reaches
a boundary. To deal with oversampling, we want to write 1 byte to the
summary for every 2000 bytes written to the data file (this will
eventually be made into an option in the config file).
Both conditions must be met to avoid under or oversampling.
That way, the amount of data needed from the index file to satisfy the
request is drastically reduced.

Fixes #1842.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 00:30:12 -03:00
Raphael S. Carvalho
da7489720b sstables: make components_writer::offset const qualified and uint64_t
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-10 21:48:11 -03:00
Raphael S. Carvalho
881c479be8 sstables: make writer::offset const qualified and uint64_t
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-10 21:46:39 -03:00
Jesse Haber-Kucharsky
509626fe08 Support duration CQL native type
`duration` is a new native type that was introduced in Cassandra 3.10 [1].

Support for parsing and the internal representation of the type was added in
8fa47b74e8.

Important note: The version of cqlsh distributed with Scylla does not have
support for durations included (it was added to Cassandra in [2]). To test this
change, you can use cqlsh distributed with Cassandra.

Duration types are useful when working with time-series tables, because they can
be used to manipulate date-time values in relative terms.

Two interesting applications are:

- Aggregation by time intervals [3]:

`SELECT * FROM my_table GROUP BY floor(time, 3h)`

- Querying on changes in date-times:

`SELECT ... WHERE last_heartbeat_time < now() - 3h`

(Note: neither of these is currently supported, though columns with duration
values are.)

Internally, durations are represented as three signed counters: one each for
months, days, and nanoseconds. Each of these counters is serialized using a
variable-length encoding which is described in version 5 of the CQL native
protocol specification.

The representation of a duration as three counters means that a semantic
ordering on durations doesn't exist: Is `1mo` greater than `1mo1d`? We cannot
know, because some months have more days than others. Durations can only have a
concrete absolute value when they are "attached" to absolute date-time
references. For example, `2015-04-31 at 12:00:00 + 1mo`.
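
A minimal sketch of why no ordering exists, assuming a hypothetical `Duration`
record and an `add_duration` helper (neither is Scylla code). The rule of
clamping the day to the length of the target month is an assumption made for
the example.

```python
import calendar
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Duration:
    # Three independent signed counters, as described above.
    months: int = 0
    days: int = 0
    nanoseconds: int = 0


def add_duration(start, d):
    """Attach a duration to an absolute date-time (illustrative only)."""
    # Apply the month counter first, clamping the day to the target month.
    month_index = start.year * 12 + (start.month - 1) + d.months
    year, month = divmod(month_index, 12)
    month += 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    shifted = start.replace(year=year, month=month, day=day)
    # datetime has microsecond resolution, so sub-microsecond parts are dropped.
    return shifted + timedelta(days=d.days, microseconds=d.nanoseconds // 1000)
```

Under these assumptions, one month added to 2015-01-01 lands after thirty days
added to the same date, while from 2015-02-01 it lands before, so neither
duration is consistently the larger one.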

That duration values are not comparable presents some difficulties for the
implementation, because most CQL types are. Like in Cassandra's implementation
[2], I adopted a similar strategy to the way restrictions on the `counter` type
are checked. A type "references" a duration if it is either a duration or it
contains a duration (like a `tuple<..., duration, ...>`, or a UDT with a
duration member).
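
The "references a duration" check can be sketched as a recursive walk over a
type description. The representation used here (a type name, or a pair of kind
and member types) is a hypothetical stand-in for the real type classes.

```python
def references_duration(cql_type):
    """Return True if the type is a duration or contains one.

    cql_type is either a type name (str) or a (kind, [member types]) pair;
    this encoding is an assumption made for the example.
    """
    if isinstance(cql_type, str):
        return cql_type == "duration"
    _kind, members = cql_type
    return any(references_duration(member) for member in members)
```

For example, `("tuple", ["int", "duration"])` references a duration, while
`("map", ["text", "int"])` does not.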

The following restrictions apply on durations. Note that some of these contexts
are either experimental features (materialized views), or not currently
supported at run-time (though support exists in the parser and code, so it is
prudent to add the restrictions now):

- Durations cannot appear in any part of a primary key, either for tables or
  materialized views.

- Durations cannot be directly used as the element type of a `set`, nor can they
  be used as the key type of a `map`. Because internal ordering on durations is
  based on a byte-level comparison, this property of Cassandra was intended to
  help avoid user confusion around ordering of collection elements.

- Secondary indexes on durations are not supported.

- "Slice" relations (<=, <, >=, >) are not supported on durations with `WHERE`
   restrictions (like `SELECT ... WHERE span <= 3d`). Multi-column restrictions
   only work with clustering columns, which cannot be `duration` due to the
   first rule.

- "Slice" relations are not supported on durations with query conditions (like
  `UPDATE my_table ... IF span > 5us`).

Backwards incompatibility note:

As described in the documentation [4], duration literals take one of two
forms: either ISO 8601 formats (there are three), or a "standard" format. The ISO
8601 formats start with "P" (like "P5W"). Therefore, identifiers that have this
form are no longer supported.

Fixes #2240.

[1] https://issues.apache.org/jira/browse/CASSANDRA-11873

[2] bfd57d13b7

[3] https://issues.apache.org/jira/browse/CASSANDRA-11871

[4] http://cassandra.apache.org/doc/latest/cql/types.html#working-with-durations
2017-08-10 15:01:10 -04:00
Jesse Haber-Kucharsky
91dab1d998 CQL native protocol: Add support for vint serialization
Version 5 of the native protocol for CQL [1] adds the `vint` and `unsigned vint`
types.

An unsigned integer encoded as a `vint` has a variable size based on the
magnitude of the value. The first byte indicates the total number of bytes.

For signed integers, a "zig-zag" encoding scheme ensures that small negative
values are encoded as short-length `vint`s (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2
-> 4, etc).
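
A minimal sketch of the usual zig-zag mapping for 64-bit values. The function
names are illustrative and this is not the code added by the patch. Under this
definition the sequence continues with -2 -> 3 and 2 -> 4.

```python
MASK64 = (1 << 64) - 1


def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned 64-bit integer."""
    # Python's >> on a negative int is an arithmetic shift, so n >> 63 is
    # -1 for negative n and 0 otherwise.
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(u):
    """Inverse of zigzag_encode."""
    return (u >> 1) ^ -(u & 1)
```

Small magnitudes of either sign map to small unsigned values, which is what
lets the variable-length encoding keep them short.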

[1] https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
77489f843f duration_test.cc: Add test for printing zero duration
It's somewhat counter-intuitive, but Cassandra also formats zero-valued
duration values as an empty string.
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
d9c027c2dd duration.cc: Remove nop const qualifier on return type
These have no effect according to the Clang static analyzer.
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
54c3cf0201 Change const qualifier declaration order for duration
The vast majority of the code-base is written in left-`const` style, and
consistency is important.
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
1889b036b1 duration.cc: Simplify range checking 2017-08-10 14:11:23 -04:00
Avi Kivity
301358e440 Merge "Optimize combined_mutation_reader for disjoint sstable ranges" from Botond
"sstables will sometimes have narrow/disjoint ranges (e.g. LCS L1+).
This can be exploited when reading from a range of sstables by opening
sstables on-demand thus saving memory, processing and potentially I/O.

To achieve this combined_mutation_reader is refactored such that the
reader selection logic is moved-out into a reader_selector class.
combined_mutation_reader now takes a reader_selector instance in its
constructor and asks it for new readers for the current ring position
on every call to operator()().

At the moment two specializations of reader_selector are provided:
* list_reader_selector which implements the current logic, that is using
    a provided mutation_reader list, and
* incremental_reader_selector which implements the on-demand opening
    logic discussed above.

Fixes #1935"

* 'bdenes/optimize_combined_reader-v6' of https://github.com/denesb/scylla:
  Add combined_mutation_reader_test unit test
  Remove range_sstable_reader
  Add incremental_reader_selector
  Add reader_selector to combined_mutation_reader
  sstable_set::incremental_selector: select() now returns a selection
2017-08-10 15:16:30 +03:00
Botond Dénes
9ee9988097 Add combined_mutation_reader_test unit test 2017-08-10 12:38:10 +03:00
Botond Dénes
3e97a5cd6b Remove range_sstable_reader
range_sstable_reader is replaced with combined_mutation_reader, using
the incremental_reader_selector.
2017-08-10 12:38:10 +03:00
Botond Dénes
bfc74f1312 Add incremental_reader_selector
incremental_reader_selector is a specialization of reader_selector for
the case when sstables have narrow and/or disjoint token ranges. To
exploit this it creates new readers on-demand when their sstable's
token range intersects with the current ring position.
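
The on-demand idea can be sketched as follows. The class, the tuple layout, and
the use of plain integers as tokens are all assumptions for illustration; the
real selector works on sstables and ring positions.

```python
class IncrementalSelector:
    """Hands out sstables only once the read position reaches them."""

    def __init__(self, sstables):
        # sstables: iterable of (first_token, last_token, name) tuples,
        # kept sorted by first_token.
        self._pending = sorted(sstables)

    def select(self, position):
        """Return not-yet-opened sstables whose range covers `position`."""
        selected = []
        while self._pending and self._pending[0][0] <= position:
            sstable = self._pending.pop(0)
            # Skip sstables that end before the current position.
            if sstable[1] >= position:
                selected.append(sstable)
        return selected

    def next_token(self):
        """Smallest token at which select() would return something new."""
        return self._pending[0][0] if self._pending else None
```

A combined reader built on such a selector would ask for new readers as its
position advances, so sstables further along the ring stay closed until they
are needed.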
2017-08-10 12:38:02 +03:00
Botond Dénes
a6b9186cab Add reader_selector to combined_mutation_reader
combined_mutation_reader now accepts as a constructor argument a
reader_selector instance whose task is to create new readers on
each call to operator()() if needed and possible.
This way it is possible to control how readers are created through
different specializations of reader_selector.

The previous logic is refactored into list_reader_selector which
is using a pre-provided mutation_reader list and forwards all of them to
combined_mutation_reader at once.
2017-08-10 12:37:40 +03:00
Takuya ASADA
1cb0fff146 dist/common/scripts/scylla_raid_setup: handle '--disks' parameter correctly when disk list ends with ','
We should handle parameters correctly even if they are malformed.
Fixes #2402
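
As a rough illustration (the real script is a shell script, and the helper name
here is hypothetical), tolerating a trailing ',' amounts to dropping empty items
after splitting the list:

```python
def parse_disks(value):
    """Split a comma-separated disk list, ignoring empty items."""
    return [disk.strip() for disk in value.split(",") if disk.strip()]
```

With this, '/dev/sdb,/dev/sdc,' yields the same two disks as
'/dev/sdb,/dev/sdc'.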

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499266239-27551-1-git-send-email-syuu@scylladb.com>
2017-08-10 11:42:33 +03:00
Takuya ASADA
8e115d69a9 dist/debian: append postfix '~DISTRIBUTION' to scylla package version
We are moving to aptly to release .deb packages, which requires changes to the
Debian repository structure.
After the change, we will share the 'pool' directory between distributions.
However, our .deb package name for a specific release is exactly the same across
distributions, so the file names would conflict.
To avoid the problem, we need to append the distribution name to the package version.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502312935-22348-1-git-send-email-syuu@scylladb.com>
2017-08-10 10:53:56 +03:00
Vlad Zolotarov
1b4594b03a transport::server::process_prepare() don't ignore errors on other shards
If storing of the statement fails on any shard we should fail the whole PREPARE
request.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1502325392-31169-13-git-send-email-vladz@scylladb.com>
2017-08-10 10:32:37 +03:00
Jesse Haber-Kucharsky
352e9f60ba Rename duration to cql_duration
`std::chrono::duration` is a prolific enough name that it's best to
disambiguate.
2017-08-09 15:15:20 -04:00
Botond Dénes
94fc550e68 sstable_set::incremental_selector: select() now returns a selection
A selection contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
2017-08-09 16:27:33 +03:00
Takuya ASADA
3077416ecc dist/debian: Backport scalability fix of _Unwind_Find_FDE to our gcc for Debian 8
Since we provide a custom-built gcc only for Debian 8, the fix does not apply to
Ubuntu/Debian 9.

Fixes #2646

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502239191-12649-1-git-send-email-syuu@scylladb.com>
2017-08-09 12:19:52 +03:00
Avi Kivity
7217b7ab36 Merge "Use range_streamer everywhere" from Asias
"With this series, all the following cluster operations:

- bootstrap
- rebuild
- decommission
- removenode

will use the same code to do the streaming.

The range_streamer is now extended to support both fetching from and pushing
to a peer node. Another big change is that the range_streamer will now stream
fewer ranges at a time, so less data, per stream_plan, and range_streamer
will remember which ranges failed to stream so it can retry them later.

The retry policy is very simple at the moment: it retries at most 5 times
and sleeps 1 minute, 1.5^2 minutes, 1.5^3 minutes ....

Later, we can introduce an api for the user to decide when to stop retrying and
the retry interval.

The benefits:

 - All the cluster operations share the same code to stream
 - We can know the operation progress, e.g., we can know the total number of
   ranges that need to be streamed and the number of ranges finished in
   bootstrap, decommission, etc.
 - All the cluster operations can survive a peer node going down during the
   operation, which usually takes a long time to complete. E.g., when adding
   a new node, currently if any of the existing nodes that stream data to
   the new node has an issue sending data, the whole bootstrap
   process will fail. After this patch, we can fix the problematic node
   and restart it, and the joining node will retry streaming from that node
   again.
 - We can fail streaming early, time out early, and retry less because
   all the operations that use streaming can survive the failure of a single
   stream_plan. It is no longer that important to make a single
   stream_plan successful. Note, another user of streaming, repair, is now
   using small stream_plans as well and can rerun the repair for the
   failed ranges too.

This is one step closer to supporting resumable add/remove node
operations."

* tag 'asias/use_range_streamer_everywhere_v4' of github.com:cloudius-systems/seastar-dev:
  storage_service: Use the new range_streamer interface for removenode
  storage_service: Use the new range_streamer interface for decommission
  storage_service: Use the new range_streamer interface for rebuild
  storage_service: Use the new range_streamer interface for bootstrap
  dht: Extend range_streamer interface
2017-08-09 10:00:25 +03:00
Takuya ASADA
98fc7b376d dist/redhat: install mdadm/xfsprogs on package install time
We experienced 'Constructing RAID volume...' taking too much time on some AMIs
because the setup script gets stuck at 'yum -y install mdadm xfsprogs'.
We don't have to install these packages at AMI startup time; we should
preinstall them at AMI creation time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502192796-21040-1-git-send-email-syuu@scylladb.com>
2017-08-09 09:10:34 +03:00
Piotr Jastrzebski
4137517cdc Check arguments of table_helper::setup_keyspace
to make sure all table helpers passed as arguments are
for the right keyspace.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <10edacd509880bb18180f13e8c28593d068c5c7b.1501688729.git.piotr@scylladb.com>
2017-08-08 15:55:06 +03:00
Piotr Jastrzebski
2d8a80f211 Make table_helper constructor safer
by taking keyspace name by value and storing it inside the object.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a5dab41647348ae311e023fe5592aec650c6e32a.1501688729.git.piotr@scylladb.com>
2017-08-08 15:55:06 +03:00
Daniel Fiala
06089474c9 Print warning if user uses default cluster_name
* Configuration for cluster_name is commented-out in config file.
* Default value set to empty string and if not rewritten by user then
  warning is printed and value is reset to "ScyllaDB Cluster".

Fixes #2648.

Message-Id: <20170808113322.9313-1-daniel@scylladb.com>
2017-08-08 14:47:17 +03:00
Avi Kivity
a71138fc84 config: mark column_index_size_in_kb as Used
Fixes #2681
Message-Id: <20170808100415.16296-1-avi@scylladb.com>
2017-08-08 11:08:00 +01:00
Ultrabug
2022da2405 Add overall python code QA and guidelines with flake8
ScyllaDB loves python & python loves ScyllaDB.

It would benefit the project to start enforcing some code guidelines
and basic QA, using flake8 as a linter to check PEP8 compliance.

This patch adds a tox config to at least start with an assessment
of the work to be done on all .py files in the code base.

To reduce its noise, tests on long lines (> 80char) are ignored
for now.

Signed-off-by: Ultrabug <ultrabug@gentoo.org>
Message-Id: <20170726134242.8927-1-ultrabug@gentoo.org>
2017-08-08 11:15:45 +03:00
Raphael S. Carvalho
dddbd34b52 sstables: close index file when sstable writer fails
The index's file output stream uses write-behind, but it's not closed
when the sstable write fails, and that may lead to a crash.
This happened before for the data file (where it is easier to
reproduce) and was fixed by 0977f4fdf8.

Fixes #2673.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
2017-08-08 09:53:14 +03:00
Asias He
49360992d9 storage_service: Use the new range_streamer interface for removenode
So that removenode operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
6b8dc85f12 storage_service: Use the new range_streamer interface for decommission
So that decommission operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
24584b8509 storage_service: Use the new range_streamer interface for rebuild
So that rebuild operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:47 +08:00
Asias He
f239b11a84 storage_service: Use the new range_streamer interface for bootstrap
So that bootstrap operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:47 +08:00
Asias He
6810031ba7 dht: Extend range_streamer interface
After this patch and the following patches to use the new
range_streamer interface, all the following cluster operations:

- bootstrap
- rebuild
- decommission
- removenode

will use the same code to do the streaming.

The range_streamer is now extended to support both fetching from and pushing
to a peer node. Another big change is that the range_streamer will now stream
fewer ranges at a time, so less data, per stream_plan, and range_streamer
will remember which ranges failed to stream so it can retry them later.

The retry policy is very simple at the moment: it retries at most 5 times
and sleeps 1 minute, 1.5^2 minutes, 1.5^3 minutes ....
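
A minimal sketch of such a policy, reading the schedule above literally (1
minute, then 1.5^2, 1.5^3, ... minutes). The function names, the callback
signatures, and the exact schedule are assumptions for illustration, not the
range_streamer implementation.

```python
def retry_delays_minutes(max_retries=5, base=1.5):
    """Delay before each retry: 1, base**2, base**3, ... minutes (assumed)."""
    return [1.0 if n == 1 else base ** n for n in range(1, max_retries + 1)]


def stream_with_retry(ranges, stream_once, sleep, max_retries=5):
    """Stream ranges, retrying only the ones that failed.

    stream_once(ranges) must return the list of ranges that failed;
    sleep(seconds) is injected so the sketch can be tested without waiting.
    Returns the ranges that still failed after all retries.
    """
    failed = stream_once(list(ranges))
    for delay in retry_delays_minutes(max_retries):
        if not failed:
            break
        sleep(delay * 60)
        failed = stream_once(failed)
    return failed
```

Only the failed ranges are passed to the next attempt, which is the property
that lets an operation survive a temporarily unavailable peer.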

Later, we can introduce an api for the user to decide when to stop retrying and
the retry interval.

The benefits:

- All the cluster operations share the same code to stream
- We can know the operation progress, e.g., we can know the total number of
  ranges that need to be streamed and the number of ranges finished in
  bootstrap, decommission, etc.
- All the cluster operations can survive a peer node going down during the
  operation, which usually takes a long time to complete. E.g., when adding
  a new node, currently if any of the existing nodes that stream data to
  the new node has an issue sending data, the whole bootstrap
  process will fail. After this patch, we can fix the problematic node
  and restart it, and the joining node will retry streaming from that node
  again.
- We can fail streaming early, time out early, and retry less because
  all the operations that use streaming can survive the failure of a single
  stream_plan. It is no longer that important to make a single
  stream_plan successful. Note, another user of streaming, repair, is now
  using small stream_plans as well and can rerun the repair for the
  failed ranges too.

This is one step closer to supporting resumable add/remove node
operations.
2017-08-07 16:31:47 +08:00
Avi Kivity
86de6cc7fb Merge seastar upstream
* seastar f14d2a3...7a49ae5 (8):
  > sharded: improve support for cooperating sharded<> services
  > sharded: support for peer services
  > semaphore: add a version of with_semaphore that takes a duration timeout
  > scripts: perftune.py: fix the CPU mask generation for more than 64 CPUs
  > Revert "future-utils: make when_all() (vector variant) exception safe"
  > Revert "future-utils: fix gross compilation errors in when_all()"
  > future-utils: fix gross compilation errors in when_all()
  > future-utils: make when_all() (vector variant) exception safe

Includes change to batchlog_manager constructor to adapt it to
seastar::sharded::start() change.
2017-08-06 17:47:47 +03:00
Avi Kivity
3edec66903 Revert "repair: Make send_repair_checksum_range timeout"
This reverts commit 98757069a5. We have the
failure detector which will detect an unresponsive node and fail the RPC.
Adding a timeout can just introduce false positives.
2017-08-06 13:09:36 +03:00
Avi Kivity
621926d914 dist: debian: escape "$" character for make 2017-08-05 16:51:03 +03:00
Avi Kivity
a471851bf1 dist: debian: add /opt/scylladb/bin to PATH so antlr can be found 2017-08-05 15:46:58 +03:00
Avi Kivity
8bdc0dd471 dist: debian: search for libraries in /opt/scylladb/lib 2017-08-05 13:18:14 +03:00
Takuya ASADA
2ff3bdba5c dist/debian: switch Ubuntu 3rdparty packages to external build service
Switch Ubuntu to launchpad ppa:
https://launchpad.net/~scylladb/+archive/ubuntu/ppa/+packages

Since switching 3rdparty on Debian is not ready yet, keep Debian on the scylla
3rdparty repo, and also keep the --rebuild-dep option and dist/debian/dep.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501866678-4922-1-git-send-email-syuu@scylladb.com>
2017-08-05 11:29:13 +03:00
Glauber Costa
4a911879a3 add active streaming reads metric
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.

This patch adds this metric so we don't fly blind.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
2017-08-05 11:06:37 +03:00
Duarte Nunes
587b6be089 dirty_memory_manager: Add missing include
Allows tests/memory_footprint to build on Ubuntu 14.04.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-04 10:15:23 +02:00
Avi Kivity
4f12068e50 dist: re-add --rebuild-dep to build_rpm.sh
For compatibility with existing scripts; ignored.
2017-08-04 07:10:18 +03:00
Takuya ASADA
b5e83ebd94 dist/redhat: switch 3rdparty packages to external build service
Drop existing 3rdparty build script/3rdparty repo, switch to Fedora Copr
https://copr.fedorainfracloud.org/coprs/scylladb/scylla-3rdparty/packages/

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170803110754.22152-1-syuu@scylladb.com>
2017-08-04 06:40:09 +03:00
Pekka Enberg
90872ffa1f docker: Disable stall detector
Fixes #2162

Message-Id: <1501759957-4380-1-git-send-email-penberg@scylladb.com>
2017-08-03 14:52:49 +03:00
Takuya ASADA
91ade1a660 dist/debian: check scylla user/group existence before adding them
To prevent install failures in environments which already have the scylla
user/group, an existence check is needed.

Fixes #2389

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495023805-14905-1-git-send-email-syuu@scylladb.com>
2017-08-03 13:01:18 +03:00
Takuya ASADA
6ac254fbcb dist: change nomerges=1 on block devices during fstrim execution
We have a problem running fstrim with nomerges=2, so we need to change
the parameter to 1 during fstrim execution.
To do this, this fix changes the following things:
 - revert dropping scylla_fstrim on Ubuntu 16.04/CentOS
 - disable distribution provided fstrim script
 - enable scylla_fstrim on all distributions
 - introduce --set-nomerges on scylla-blocktune
 - scylla_fstrim runs the following steps in order:
   - 'scylla-blocktune --set-nomerges 1'
   - 'fstrim' for each device
   - 'scylla-blocktune --set-nomerges 2'

Fixes #2649

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501531393-21109-1-git-send-email-syuu@scylladb.com>
2017-08-03 13:00:34 +03:00
Botond Dénes
0b7ac01f0f Add QtCreator project file and .gdbinit to .gitignore
Message-Id: <ff662910fe1156cdde2bda4aa5bb9cfc45bddda9.1501752340.git.bdenes@scylladb.com>
2017-08-03 12:58:35 +03:00
Avi Kivity
f38e4ff3f9 database: prevent streaming reads from blocking normal reads
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.

Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.

Fixes #2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>
2017-08-03 10:23:01 +01:00
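The idea behind f38e4ff3f9 can be sketched in standalone C++ as follows. This is only an illustration under stated assumptions: `slot_counter` and `read_concurrency` are hypothetical names, and a plain non-blocking counter stands in for the real semaphore.

```cpp
#include <cassert>

// Hypothetical non-blocking slot counter standing in for a semaphore.
class slot_counter {
    int _free;
public:
    explicit slot_counter(int slots) : _free(slots) { assert(slots >= 0); }
    // Takes one slot if any is free; never blocks.
    bool try_acquire() {
        if (_free == 0) {
            return false;
        }
        --_free;
        return true;
    }
    void release() { ++_free; }
    int free_slots() const { return _free; }
};

// Hypothetical holder: each read class gets its own pool, so exhausting
// the streaming pool leaves the normal pool untouched.
struct read_concurrency {
    slot_counter normal;
    slot_counter streaming;
    read_concurrency(int normal_slots, int streaming_slots)
        : normal(normal_slots), streaming(streaming_slots) {}
};
```

With a shared counter, the streaming reads in this sketch could consume every slot; with two counters, a normal read can still be admitted when the streaming pool is empty.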
Avi Kivity
911536960a database: remove streaming read queue length limit
If we fail a streaming read due to queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.

Fixes #2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>
2017-08-03 10:21:07 +01:00
Avi Kivity
e9519ca8e5 Merge "make range selects more efficient by going through digest matching stage" from Gleb
"Currently scanning reads go directly to the reconciliation stage, which
requires asking for mutation data from all peers. This series makes
them try matching digests first, like a single partition read."

Fixes #2666.

* 'gleb/digest_scan' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: make range_slice_read_executor go through digest matching state
  storage_proxy: add capability to read data/digest for non singular ranges
  storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
2017-08-03 12:18:11 +03:00
Tzach Livyatan
d3d46a5eac Add comments on cluster_name in scylla.yaml
Fix #2316

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170730082922.21884-1-tzach@scylladb.com>
2017-08-03 12:12:15 +03:00
Gleb Natapov
d2a2a6d471 storage_proxy: make range_slice_read_executor go through digest matching state
Currently scanning reads go directly to the reconciliation stage, which
requires asking for mutation data from all peers. This patch makes
them try matching digests first, like a single partition read.

The change requires internode protocol changes, since currently it is not
possible to ask for multi-partition data/digest over RPC. This means that
the capability has to be guarded by a new gossip feature flag, which the
patch also adds.
2017-08-03 11:37:03 +03:00
Tzach Livyatan
99b2232c5d docs/docker: Add hostname parameter to examples
Using --hostname to give the container a meaningful name is a good
practice, and makes the monitoring dashboard easier to understand.

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803081027.6675-1-tzach@scylladb.com>
2017-08-03 11:14:12 +03:00
Gleb Natapov
3b7d8c8767 storage_proxy: add capability to read data/digest for non singular ranges
Currently only mutation_data read supports non singular ranges. This
patch extends data/digest reads to support them too.
2017-08-03 10:35:09 +03:00
Gleb Natapov
c619ef258b storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
never_speculating_read_executor always waits for all targets, so the
block_for parameter is always equal to targets.size(). No need
to pass it explicitly.
2017-08-03 10:08:44 +03:00
Duarte Nunes
4c9206ba2f tests/sstable_mutation_test: Don't use moved-from object
Fix a bug introduced in dbbb9e93d and exposed by gcc6 by not using a
moved-from object. Twice.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170802161033.4213-1-duarte@scylladb.com>
2017-08-03 09:45:49 +03:00
Asias He
763fa83232 repair: Fix build in repair_cf_range
The compiler does not like the mutable.
Message-Id: <83c5e8a944b72a095b8e29e9988986e6ca9cefc5.1501690749.git.asias@scylladb.com>
2017-08-02 18:57:32 +02:00
Asias He
5798625d73 repair: Singal parallelism_semaphore in case of error
If we throw after we take the semaphore and before the when_all
below runs, no one will increase the semaphore.

Fixes #2661
Message-Id: <49540ede4c8a6d84004e10e0f63690e3c21d72c7.1501686383.git.asias@scylladb.com>
2017-08-02 18:32:32 +03:00
Avi Kivity
ebff739a84 Merge "use paging for compaction history" from Amnon
"This series adds an option to use paging in internal queries and uses it for the
get compaction history function.

Internal paging is done explicitly: to use paging, you first create a
state object (that also contains the query) and use that state to get the
first page. The result contains both the query result and a new state that
can be used to get the next page.

Fixes #2366"

* 'amnon/paged_compaction_history_v5' of github.com:cloudius-systems/seastar-dev:
  system_keyspace: Use paging for get compaction history
  Add paging for internal queries
  query_options: Allows creating query_options from query_options
2017-08-02 18:15:58 +03:00
Avi Kivity
ac31abf6a4 repair: don't lambda-capture repair_tracker
It is static, so it need not be captured, and some compilers complain.
2017-08-02 18:07:31 +03:00
Avi Kivity
ce60ef59f3 Revert "repair: Singal parallelism_semaphore in case of error"
This reverts commit a548eee28c. It releases
the semaphore too early (noted by Glauber).
2017-08-02 17:13:46 +03:00
Avi Kivity
b2753b0183 Merge "Fix possible repair stuck" from Asias
"This series tries to fix possible repair stuck."

Fixes #2660, #2661, #2662.

* tag 'asias/repair_stuck_v2.1' of github.com:cloudius-systems/seastar-dev:
  repair: Make send_repair_checksum_range timeout
  repair: Singal parallelism_semaphore in case of error
  repair: Fix repair_tracker done
2017-08-02 16:51:51 +03:00
Asias He
98757069a5 repair: Make send_repair_checksum_range timeout
If the verb never returns, the repair will hang forever. Make it use the
timeout version of the send_message.

Fixes #2662
2017-08-02 21:41:50 +08:00
Asias He
a548eee28c repair: Singal parallelism_semaphore in case of error
If we throw after we take the semaphore and before the when_all
below runs, no one will increase the semaphore.

Fixes #2661
2017-08-02 21:41:45 +08:00
Asias He
abcff4c78e repair: Fix repair_tracker done
If it throws after repair_tracker.start and before the when_all below,
the repair_tracker.done will never be called for this repair id.

Fixes #2660
2017-08-02 21:40:29 +08:00
Pekka Enberg
78f68613ce dist/docker: Reduce number of layers
One of the best practices for Dockerfiles is to minimize the number of
layers because they increase the overall image size:

https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#minimize-the-number-of-layers

Consolidate our "yum install" commands to reduce the number of layers.

Suggested by Dean Hamstead.

Message-Id: <1501670572-8701-1-git-send-email-penberg@scylladb.com>
2017-08-02 15:21:05 +03:00
Takuya ASADA
ffbdacc1fa dist/debian: remove ant from prerequisite packages
These lines were mistakenly copied from scylla-tools; they are not needed for scylla-server.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498029619-1928-1-git-send-email-syuu@scylladb.com>
2017-08-02 12:12:42 +03:00
Duarte Nunes
cec41f9de6 Merge seastar upstream
* seastar fc937b8...f14d2a3 (4):
  > configure.py: Ensure tmp directory exists when getting dpdk cflags
  > checked_ptr: fix hash() compilation
  > net: fix potential use after free in posix_server_socket::accept()
  > http: removed unneeded lamda captures

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-02 10:05:08 +02:00
Asias He
cf6f4a5185 gossip: Introduce the shadow_round_ms option
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot up.
For instance, when the first node in the cluster, which lists both
itself and another node as seeds in the yaml config, boots up, it will try
to talk to the other seed nodes, which are not started yet. The gossip shadow
round will be used to fetch the feature info of the cluster. Since there
is no other seed node in the cluster, the shadow round will fail. The user
can lower the shadow_round_ms option from its default to reduce the boot time.

Fixes #2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>
2017-08-02 09:52:35 +03:00
Vlad Zolotarov
4b28ea216d utils::loading_cache: cancel the timer after closing the gate
The timer is armed inside the section guarded by the _timer_reads_gate
therefore it has to be canceled after the gate is closed.

Otherwise we may end up with an armed timer after the stop() method has
returned a ready future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
2017-08-01 17:21:44 +01:00
Duarte Nunes
569bbf2edd sstables/sstables: Use per-cpu noop_write_monitor
We employ a thread-per-core architecture, so don't go about sharing
seastar::shared_ptrs across cpus.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170801144153.17354-1-duarte@scylladb.com>
2017-08-01 18:10:49 +03:00
Avi Kivity
db7329b1cb Merge "Ensure correct EOC for PI block cell names" from Duarte
"This series ensures we always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.

Fixes #2333"

* 'pi-cell-name/v1' of github.com:duarten/scylla:
  tests/sstable_mutation_test: Test promoted index blocks are monotonic
  sstables: Consider eoc when flushing pi block
  sstables: Extract out converting bound_kind to eoc
2017-08-01 18:09:07 +03:00
Gleb Natapov
1da4d5c5ee cql transport: run accept loop in the foreground
It was meant to run in the foreground, since it is waited upon during
stop(), but as it is now, from the stop() perspective it completes
after the first connection is accepted.

Fixes #2652

Message-Id: <20170801125558.GS20001@scylladb.com>
2017-08-01 17:04:14 +03:00
Avi Kivity
1e8bb972b6 compaction: fix iteration in leveled compaction droppable tombstones loop
Since get_level_count() is unsigned, it will never be negative, and
the loop may never terminate.

Message-Id: <20170719133502.13316-1-avi@scylladb.com>
2017-08-01 13:40:36 +03:00
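The pitfall fixed in 1e8bb972b6 can be shown with a minimal standalone sketch. The function name and the `levels` vector are hypothetical and are not taken from the Scylla source; only the behaviour of an unsigned loop counter is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: visits levels from the highest index down to 0
// and returns how many were visited.
//
// A loop written as `for (std::size_t i = n - 1; i >= 0; --i)` may never
// end, because an unsigned i is always >= 0 and wraps around after 0.
// The `i-- > 0` form tests before decrementing, so it stops after index 0.
inline std::size_t count_levels_descending(const std::vector<int>& levels) {
    std::size_t visited = 0;
    for (std::size_t i = levels.size(); i-- > 0;) {
        (void)levels[i];  // a real loop would inspect the level here
        ++visited;
    }
    assert(visited == levels.size());
    return visited;
}
```

The same form also handles an empty container, where `n - 1` would wrap to the largest representable value.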
Avi Kivity
ba2e170e4b compaction: fix return in leveled compaction droppable tombstones loop
If the loop ever terminates, we need to return something.

Message-Id: <20170719133508.13374-1-avi@scylladb.com>
2017-08-01 13:33:02 +03:00
Takuya ASADA
a998b7b3eb dist/ami: follow scylla-tools package name change on RedHat variants
Since scylla-tools generates two .rpm packages, we need to copy them to our AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170722090002.9850-1-syuu@scylladb.com>
2017-07-31 18:57:12 +03:00
Avi Kivity
7c8dea088a Merge seastar upstream
* seastar 54e940f...fc937b8 (2):
  > configure.py: Always ensure tmp directory exists
  > coding-style.md: introduce
2017-07-31 18:06:09 +03:00
Duarte Nunes
a85232dd82 Fix compilation errors on GCC 6
GCC 6 inconsistently requires explicitly calling a member function
through "this->" for lambda functions capturing "this".

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170731143755.21970-1-duarte@scylladb.com>
2017-07-31 17:40:44 +03:00
Benoît Canet
b44ba11e4c transport: Count the number of unpaged queries
Queries with a query page size equal to or smaller than
zero are unpaged queries.

Count these queries and expose the count as a metric,
since they can ruin the performance of the system.

Message-Id: <20170731130004.25807-2-benoit@scylladb.com>
2017-07-31 16:01:45 +03:00
Avi Kivity
3fe6731436 Merge "Reduce the effect of the latency metrics" from Amnon
"This series reduces that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
   stop the last bucket."

Fixes #2650.

* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
  database: remove latency from the system table
  estimated histogram: return a smaller histogram
2017-07-31 15:58:30 +03:00
Paweł Dziepak
402799fcc0 mutation_reader: drop move_and_clear()
Since the discovery of std::exchange(x, {}), move_and_clear has become
obsolete. Besides, the name was wrong: it did not clear the vector but
recreated it, meaning that any allocated memory wasn't reused (not that
it mattered in the existing usages).

Message-Id: <20170731123549.10887-1-pdziepak@scylladb.com>
2017-07-31 15:51:19 +03:00
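A minimal sketch of the std::exchange(x, {}) idiom that 402799fcc0 refers to. The function `take_all` and its parameter are hypothetical; only std::exchange itself is a standard library facility (declared in <utility>).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical function: hands the accumulated items to the caller and
// leaves the source holding a freshly value-initialised (empty) vector.
inline std::vector<int> take_all(std::vector<int>& pending) {
    // std::exchange returns the old value of `pending` (moved out) and
    // assigns the new value, here an empty vector built from `{}`.
    std::vector<int> taken = std::exchange(pending, {});
    assert(pending.empty());
    return taken;
}
```

As the commit message notes, this replaces the vector rather than clearing it in place, so it should not be relied on to keep the old allocation around for reuse.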
Gleb Natapov
87bc3f7e7f configure.py: use user provided compiler flags when checking for features
User-provided compiler flags may change the outcome of the test.

Message-Id: <20170724111520.GA18230@scylladb.com>
2017-07-31 15:33:06 +03:00
Avi Kivity
f4b2a1ef4e Merge "Optimise combined_mutation_reader" from Paweł
"These patches optimise combined_mutation_reader for cases where the majority
of mutation_readers is disjoint.

perf_fast_forward:
Results are medians of 3 of fragments/s as reported by perf_fast_forward.

Command:
perf_fast_forward -c1 --enable-cache

small: small-partition-skips (read=1, skip=0)
large: large-partition-skips (read=1, skip=0)

          before        after      diff
small     195753      238196       +22%
large    1244325     1359096        +9%

perf_simple_query:

Results are medians of 10 of reads/s as reported by perf_simple_query.

Command:
perf_simple_query -c1

before   98651.40
after   104554.85
diff          +6%"

* tag 'avoid-merge_mutations/v1' of https://github.com/pdziepak/scylla:
  combined_mutation_reader: avoid unnecessary merge_mutations()
  combined_mutation_reader: do not pop mutation with different key
2017-07-31 15:14:42 +03:00
Avi Kivity
178b54e790 Merge "memtable flush: Fixes and improvements" from Duarte
"This series ensures that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.

To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.

We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."

* 'memtable-flush-additional-fixes/v4' of github.com:duarten/scylla:
  column_family: Re-acquire flush permit in case of error
  column_family: Don't hold sstable read lock when retrying flush
  sstables: Release the flush permit before fsyncing
  sstables: Introduce write_monitor
  database: Extract out dirty_memory_manager
  dirty_memory_manager: Refactor flush permit lifetime management
  dirty_memory_manager: Invert permit acquisition order
  memtable_list: Register different seal functions for each behaviour
2017-07-31 14:57:19 +03:00
Paweł Dziepak
2b53a560c8 combined_mutation_reader: avoid unnecessary merge_mutations()
Merging mutations is quite an expensive operation. The creation of
streamed mutation merger involves several allocations (mostly coming
from various std::vector) and then all mutation_fragments need to go
through a heap.

All this is completely unnecessary if there is only one mutation, so
let's skip a call to merge_mutations() in such cases. This also means
that we can reuse memory allocated by _current vector if merge is not
required.
2017-07-31 12:35:40 +01:00
Paweł Dziepak
f78f2b3c92 combined_mutation_reader: do not pop mutation with different key
Originally, the loop inside combined_mutation_reader::next() worked by
popping mutations from the heap; when it encountered one with a
different decorated key, that one was pushed back and the ones accumulated so
far were merged and emitted. In other words, every time the reader progressed
to the next mutation it did needless pop and push operations on the
heap.

This patch rearranges the code so that the key of the next mutation is
compared before it is popped from the heap.
2017-07-31 12:35:40 +01:00
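The rearrangement described in f78f2b3c92 can be sketched with a standard priority queue. This is an illustration only: plain integer keys stand in for decorated keys, and `min_heap` and `pop_equal_run` are hypothetical names, not the reader's real interface.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Min-heap of integer keys (smallest key on top).
using min_heap = std::priority_queue<int, std::vector<int>, std::greater<int>>;

// Pops and returns every entry equal to the smallest key. The key of the
// next entry is compared via top() *before* popping, so an entry with a
// different key is never popped and pushed back.
inline std::vector<int> pop_equal_run(min_heap& heap) {
    std::vector<int> run;
    if (heap.empty()) {
        return run;
    }
    const int key = heap.top();
    while (!heap.empty() && heap.top() == key) {
        run.push_back(heap.top());
        heap.pop();
    }
    // Whatever remains on top belongs to a later key.
    assert(heap.empty() || heap.top() != key);
    return run;
}
```

If the returned run holds a single entry, a caller can forward it directly and skip any merge step, which is the case 2b53a560c8 optimises.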
Duarte Nunes
c81431ad16 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
9162e016da column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
1a33cc6847 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
784a078e72 sstables: Introduce write_monitor
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
d2b0a5a0a6 database: Extract out dirty_memory_manager
Needed so the flush_permit can be propagated to the sstables layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
a2b732c156 dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
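A minimal sketch of what a RAII-managed permit, as mentioned in a2b732c156, might look like. Everything here is an assumption made for illustration: `permit_pool` is a single-threaded counter standing in for a semaphore, and `scoped_permit` is a hypothetical class, not the flush permit type from the Scylla source.

```cpp
#include <cassert>
#include <utility>

// Hypothetical single-threaded counter standing in for a semaphore.
struct permit_pool {
    int available;
};

// Hypothetical RAII permit: takes one unit on construction and gives it
// back on destruction, unless it was moved from or released early.
class scoped_permit {
    permit_pool* _pool;
public:
    explicit scoped_permit(permit_pool& pool) : _pool(&pool) {
        assert(pool.available > 0);
        --pool.available;
    }
    // Moving transfers ownership; the source no longer returns the unit.
    scoped_permit(scoped_permit&& other) noexcept
        : _pool(std::exchange(other._pool, nullptr)) {}
    scoped_permit(const scoped_permit&) = delete;
    scoped_permit& operator=(const scoped_permit&) = delete;
    scoped_permit& operator=(scoped_permit&&) = delete;
    // Early release, e.g. before a slow step such as an fsync.
    void release() noexcept {
        if (_pool) {
            ++_pool->available;
            _pool = nullptr;
        }
    }
    ~scoped_permit() { release(); }
};
```

Because the destructor returns the unit, an error path that unwinds the stack cannot leak the permit, and an explicit release() covers the case where the permit should be given up before the owning scope ends.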
Duarte Nunes
f647f5b14a dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
e371accac8 memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Paweł Dziepak
e970630272 tests/serialized_action: add missing forced defers
serialized_action_tests depends on the fact that the first part of the
serialized_action is executed at certain points (in which it reads a
global variable that is later updated by the main thread).
This worked well in the release mode before ready continuations were
inlined and run immediately, but not in the debug mode since inlining
was not happening and the main seastar::thread was missing some yield
points.
Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>
2017-07-31 11:35:24 +01:00
Duarte Nunes
4e3232fc29 utils/log_histogram: Fix typo when calculating number of buckets
We weren't correctly calculating the number of buckets due to
returning the wrong variable.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170731094733.7746-1-duarte@scylladb.com>
2017-07-31 12:49:11 +03:00
Avi Kivity
e855a28fae Revert "Merge "memtable flush: Fixes and improvements" from Duarte"
This reverts commit 733a64a1df, reversing
changes made to e11e66723a.

Breaks sstable_test and perf_fast_forward.
2017-07-31 12:44:28 +03:00
Avi Kivity
85056f3611 log_histogram: fix constexpr-ness of log_histogram_options
1. assert() is not constexpr.
2. can't use static_assert(), because the constructor may be called in a non-constexpr
   environment; moved to log_histogram
3. pow2_rank() uses count_leading_zeros() which is not constexpr; split
   into constexpr and non-constexpr versions
4. duplicated number_of_buckets() because bucket_of() can't be constexpr due to pow2_rank
Message-Id: <20170726105444.32698-1-avi@scylladb.com>
2017-07-31 09:11:40 +01:00
Avi Kivity
733a64a1df Merge "memtable flush: Fixes and improvements" from Duarte
"This series ensures that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.

To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.

We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."

* 'memtable-flush-additional-fixes/v3' of github.com:duarten/scylla:
  column_family: Re-acquire flush permit in case of error
  column_family: Don't hold sstable read lock when retrying flush
  sstables: Release the flush permit before fsyncing
  sstables: Introduce write_monitor
  database: Extract out dirty_memory_manager
  dirty_memory_manager: Refactor flush permit lifetime management
  dirty_memory_manager: Invert permit acquisition order
  memtable_list: Register different seal functions for each behaviour
  main: Don't catch polymorphic exceptions by value
2017-07-31 10:32:26 +03:00
Duarte Nunes
e11e66723a main: Don't catch polymorphic exceptions by value
GCC trunk complains due to exception slicing.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170727163021.8000-1-duarte@scylladb.com>
2017-07-31 10:12:13 +03:00
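The slicing problem that e11e66723a avoids can be illustrated in isolation. The exception types and function names below are hypothetical. The explicit copy in `what_after_slicing` performs the same conversion a by-value `catch (base_error e)` handler would, without writing such a handler.

```cpp
#include <cassert>
#include <exception>
#include <string>

// Hypothetical exception hierarchy used only for this illustration.
struct base_error : std::exception {
    const char* what() const noexcept override { return "base"; }
};
struct derived_error : base_error {
    const char* what() const noexcept override { return "derived"; }
};

// Copying into a base_error object keeps only the base part, so the
// derived override of what() is lost (slicing).
inline std::string what_after_slicing(const base_error& e) {
    base_error copy = e;
    return copy.what();
}

// Catching by const reference keeps the dynamic type intact.
inline std::string caught_by_reference() {
    std::string seen;
    try {
        throw derived_error();
    } catch (const base_error& e) {
        seen = e.what();
    }
    return seen;
}
```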
Avi Kivity
fc683c3f3e Merge seastar upstream
* seastar a14d667...54e940f (8):
  > Merge "Prometheus to use output stream" from Amnon
  > http_test: Fix an http output stream test
  > build: harden try_compile_and_link output temporary file
  > configure: disable exception scalability hack on debug build
  > build: don't perform test compiles to /dev/null
  > Provide workaround for non scaleable c++ exception runtime
  > Merge "Add output stream to http message reply" from Amnon
  > configure.py: use user provided compiler flags when checking for features
2017-07-31 10:09:48 +03:00
Avi Kivity
c1718dd5e3 Update scylla-ami submodule
* dist/ami/files/scylla-ami 2bd1481...b41e5eb (1):
  > Fix incorrect scylla-server sysconfig file edit for i3 memflush controller
2017-07-31 09:41:24 +03:00
Takuya ASADA
714540cd4c dist/debian: refuse upgrade if current scylla < 1.7.3 && commitlog remains
Commitlog replay fails when upgrading from <1.7.3 to 2.0, so we need to refuse
the package upgrade if current scylla < 1.7.3 && commitlog remains.

Note: The problem is in the scylla-server package, but to prevent the
scylla-conf package upgrade, %pretrans should be defined on scylla-conf.

Fixes #2551

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501187555-4629-1-git-send-email-syuu@scylladb.com>
2017-07-31 09:08:40 +03:00
Avi Kivity
e5d2e28df9 Merge "Backport exception scalability fix from gcc-7" from Gleb
"This patch series backports scalability fix for _Unwind_Find_FDE and modifies
our CentOS package to use our libgcc and libstdc++, which are needed to make
use of the fix, instead of the locally installed ones."

Ref #2646 (fixes on RHEL 7 and related only)

* 'gleb/exception-gcc-fix-v2' of github.com:cloudius-systems/seastar-dev:
  dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one
  dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc
2017-07-30 19:31:03 +03:00
Gleb Natapov
8fe875cc79 dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one 2017-07-30 16:03:25 +03:00
Gleb Natapov
1cf7e72c68 dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc 2017-07-30 16:03:10 +03:00
Paweł Dziepak
e62403190b Merge "Introduce perf_cache_eviction test" from Tomasz
Runs appending writes to a single partition, at full speed, and a reader
which selects the head of the partition, with 100ms delay between reads.
Prints latency percentiles and some stats.

Intended to test performance at the transition from non-evicting to
evicting modes.

Currently we can see that after the transition, whole partition gets
evicted and reads constantly miss.

Sample output:

    rd/s: 10, wr/s: 135947, ev/s: 0, pmerge/s: 1, miss/s: 0, cache: 708/778 [MB], LSA: 820/910 [MB], std free: 82 [MB]

    reads : min: 149   , 50%: 179   , 90%: 1331  , 99%: 1331  , 99.9%: 1331  , max: 6866   [us]
    writes: min: 3     , 50%: 4     , 90%: 4     , 99%: 5     , 99.9%: 258   , max: 51012  [us]

    rd/s: 7, wr/s: 93354, ev/s: 9, pmerge/s: 1, miss/s: 3, cache: 0/0 [MB], LSA: 107/128 [MB], std free: 82 [MB]

    reads : min: 179   , 50%: 179   , 90%: 73457 , 99%: 73457 , 99.9%: 73457 , max: 105778 [us]
    writes: min: 3     , 50%: 4     , 90%: 4     , 99%: 5     , 99.9%: 258   , max: 105778 [us]

* tag 'tgrabiec/row-eviction-perf-test' of github.com:scylladb/seastar-dev:
  tests: Introduce perf_cache_eviction
  tests: simple_schema: Add getter for DDL statement
  estimated_histogram: Implement percentile()
  utils: estimated_histogram: Make printable
2017-07-28 09:49:22 +01:00
Duarte Nunes
0f1bd81523 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
2f4cffc7f6 column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
5e64839e85 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
a737577881 sstables: Introduce write_monitor
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
121f967b30 database: Extract out dirty_memory_manager
Needed so the flush_permit can be propagated to the sstables layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
ef1275e9dd dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
cfc8fae33f dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
7e68e4677d memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
7502401652 main: Don't catch polymorphic exceptions by value
GCC trunk complains due to exception slicing.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
143f4fd861 Merge "Prevent pull requests from accumulating" from Tomasz
If schema merging completes at a lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.

In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:

  concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test

Allowing only one active and one queued pull request per remote
endpoint is enough.

* tag 'tgrabiec/dont-accumulate-schema-pulls-v2' of github.com:scylladb/seastar-dev:
  migration_manager: Log schema pulls
  migration_manager: Prevent pull requests from accumulating
  utils: Introduce serialized_action
2017-07-27 21:01:38 +02:00
Tomasz Grabiec
e09220dbff migration_manager: Log schema pulls 2017-07-27 20:08:25 +02:00
Tomasz Grabiec
350d98d4e1 migration_manager: Prevent pull requests from accumulating
If schema merging completes at a lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.

In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:

  concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test

Allowing only one active and one queued pull request per remote
endpoint is enough.
2017-07-27 20:08:25 +02:00
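The "one active and one queued" policy from 350d98d4e1 can be sketched as a small synchronous state machine. This is an assumption-laden illustration: `pull_coalescer` and its methods are hypothetical names, one instance would be kept per remote endpoint, and the real code is asynchronous.

```cpp
#include <cassert>

// Hypothetical per-endpoint tracker: at most one pull runs and at most
// one more waits; further requests are covered by the waiting one.
class pull_coalescer {
    bool _active = false;
    bool _queued = false;
public:
    enum class outcome { started, queued, merged };

    outcome request() {
        if (!_active) {
            _active = true;
            return outcome::started;
        }
        if (!_queued) {
            _queued = true;
            return outcome::queued;
        }
        // A pull is already waiting and will observe the newest schema.
        return outcome::merged;
    }

    // Called when the active pull finishes. Returns true if the queued
    // pull becomes the active one.
    bool complete() {
        assert(_active);
        _active = _queued;
        _queued = false;
        return _active;
    }
};
```

However many requests arrive while a pull is running, at most two pulls are ever outstanding per endpoint, so pending work cannot accumulate without bound.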
Tomasz Grabiec
6a3703944b utils: Introduce serialized_action 2017-07-27 20:08:21 +02:00
Duarte Nunes
dbbb9e93da tests/sstable_mutation_test: Test promoted index blocks are monotonic
Reproduces #2333

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
Duarte Nunes
06728bdfe9 sstables: Consider eoc when flushing pi block
When flushing a promoted index block using a range tombstone cell name
as a bound, use the right eoc value instead of always writing
composite::eoc::none.

Fixes #2333

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
Duarte Nunes
718517ed91 sstables: Extract out converting bound_kind to eoc
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
Paweł Dziepak
f02bef7917 streamed_mutation: do not call fill_buffer() ahead of time
consume_mutation_fragments_until() allows consuming mutation fragments
until a specified condition happens. This patch reorganises its
implementation so that we avoid situations when fill_buffer() is called
with stop condition being true.
Message-Id: <20170727122218.7703-1-pdziepak@scylladb.com>
2017-07-27 17:47:57 +02:00
Tomasz Grabiec
ac7e6ef1bc tests: Introduce perf_cache_eviction 2017-07-27 17:19:07 +02:00
Tomasz Grabiec
2d2e7ef6fb tests: simple_schema: Add getter for DDL statement 2017-07-27 17:19:07 +02:00
Tomasz Grabiec
5602be72fa estimated_histogram: Implement percentile() 2017-07-27 17:19:07 +02:00
Tomasz Grabiec
1bc305ed7b utils: estimated_histogram: Make printable 2017-07-27 17:19:03 +02:00
Takuya ASADA
91a75f141b dist/redhat: limit metapackage dependencies to specific version of scylla packages
When we install the scylla metapackage with a version (e.g. scylla-1.7.1),
it always installs the newest scylla-server/-jmx/-tools in the repo,
instead of installing the specified version of the packages.

To install packages of the same version as the metapackage, limit the dependencies to
the current package version.

Fixes #2642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170726193321.7399-1-syuu@scylladb.com>
2017-07-27 14:21:35 +03:00
Takuya ASADA
11870e47ec dist/redhat: refuse upgrade if current scylla < 1.7.3 && commitlog remains
Commitlog replay fails when upgrading from <1.7.3 to 2.0, so we need to refuse
the package upgrade if current scylla < 1.7.3 && commitlog remains.

Note: The problem is in the scylla-server package, but to prevent the
scylla-conf package upgrade, %pretrans should be defined on scylla-conf.

Fixes #2551

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170727110730.613-1-syuu@scylladb.com>
2017-07-27 14:09:17 +03:00
Tomasz Grabiec
22948238b6 row_cache: Fix potential timeout or deadlock due to sstable read concurrency limit
database::make_sstable_reader() creates a reader which will need to
obtain a semaphore permit when invoked. Therefore, each read may
create at most one such reader in order to be guaranteed to make
progress. If the reader tries to create another reader, that may
deadlock (or, for non-system tables, time out) if enough such
readers try to do the same thing at the same time.

Avoid the problem by dropping previous reader before creating a new
one.

Refs #2644.

Message-Id: <1501152454-4866-1-git-send-email-tgrabiec@scylladb.com>
2017-07-27 13:58:20 +03:00
Vlad Zolotarov
e98adb13d5 service::storage_service: initialize auth and tracing after we joined the ring
Initialize the system_auth and system_traces keyspaces and their tables after
the node joins the token ring, because system_auth initialization
issues SELECT and possibly INSERT CQL statements.

This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.

Fixes #2273

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
2017-07-27 10:54:36 +02:00
Amnon Heiman
a71b9e498a database: remove latency from the system table
This patch removes the latency histograms from the system table. It also
extends the already existing exclusion to all system keyspaces.

It also uses the new get_histogram API to set a minimal bucket size to
100 microseconds.
2017-07-27 11:41:15 +03:00
Amnon Heiman
1b05f23d12 estimated histogram: return a smaller histogram
The current histogram contains 91 buckets, which is a very high
resolution with a high upper limit.

To reduce the traffic passed between scylla and prometheus, this patch
generates a smaller histogram.

It limits the number of buckets (16 by default), sets a lower limit for the
lowest bucket, and uses 2 as the bucket coefficient.

Highest empty buckets will not be reported.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

estimated histogram
2017-07-27 11:41:10 +03:00
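A minimal sketch of the bucket reduction described in 1b05f23d12, not the code from the patch. The names fine_bucket, coarse_histogram and shrink() are hypothetical. It assumes bucket i covers values up to lower_limit * 2^i, that the last bucket absorbs anything larger, and that trailing empty buckets are trimmed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fine-grained input bucket: `count` samples, all <= upper_bound.
struct fine_bucket {
    std::uint64_t upper_bound;
    std::uint64_t count;
};

// Hypothetical reduced histogram: counts[i] holds samples <= upper_bounds[i].
struct coarse_histogram {
    std::vector<std::uint64_t> upper_bounds;
    std::vector<std::uint64_t> counts;
};

coarse_histogram shrink(const std::vector<fine_bucket>& fine,
                        std::uint64_t lower_limit,
                        std::size_t max_buckets = 16) {
    coarse_histogram out;
    if (max_buckets == 0) {
        return out;
    }
    // Build bounds with a coefficient of 2, starting at the lower limit.
    std::uint64_t bound = lower_limit;
    for (std::size_t i = 0; i < max_buckets; ++i) {
        out.upper_bounds.push_back(bound);
        out.counts.push_back(0);
        bound *= 2;
    }
    // Fold every fine bucket into the first coarse bucket that can hold it;
    // the last coarse bucket takes everything that is larger.
    for (const fine_bucket& b : fine) {
        std::size_t i = 0;
        while (i + 1 < max_buckets && b.upper_bound > out.upper_bounds[i]) {
            ++i;
        }
        out.counts[i] += b.count;
    }
    // Do not report the highest empty buckets.
    while (!out.counts.empty() && out.counts.back() == 0) {
        out.counts.pop_back();
        out.upper_bounds.pop_back();
    }
    return out;
}
```

With a lower limit of 100 and the default of 16 buckets, the largest reported bound would be 100 * 2^15, so the number of series exported per histogram stays small regardless of the internal resolution.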
Tomasz Grabiec
e9fc0b0491 Merge "Some fixes for performance regressions in perf_fast_forward" from Paweł
These patches contain some minor fixes for performance regression reported
by perf_fast_forward after partial cache was merged. The solution is still
far from perfect, there is one case that still has 30% degradation, but
there is some improvement so there is no reason to hold these changes back.

Refs #2582.

Some numbers:
before - before cache changes were merged
(555621b537)

cache - at the commit that introduced the partial cache
(9b21a9bfb6)

after - recent master + this series
(based on e988121dbb)

Differences are shown relative to "before".

Testing effectiveness of caching of large partition, single-key slicing reads:
Large partitions, range [0, 500000], populating cache
  before      cache      after
 1636840    1013688    1234606
              -38%        -25%

Large partitions, range [0, 500000], reading from cache
  before      cache      after
 2012615    3076812    3035423
               +53%       +51%

Testing scanning small partitions with skips.
reading small partitions (skip 0)
 before      cache      after
 227060     165261     200639
              -27%       -11%

skipping small partitions (skip 1)
 before      cache      after
  29813      27312      38210
               -8%       +28%

Testing slicing small partitions:
slicing small partitions (offset 0, read 4096)
 before      cache      after
 195282     149695     180497
              -23%        -8%

* https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3:
  sstables: make sure that fill_buffer() actually fills buffer
  mutation_merger: improve handling of non-deferring fill_buffer()s
  partition_snapshot_row_cursor: avoid apply() in single-version cases
  sstables: introduce decorated_key_view
  ring_position_comparator: accept sstables::decorated_key_view
  sstable: keep a pre-computed token in summary_entry
  sstables: cache token in index entries
  index_reader: advance_and_check_if_present() use index_comparator
  ring_position_comparator: drop unused overloads
  cache_streamed_mutation: avoid moving clustering_row
  streamed_mutation: introduce consume_mutation_fragments_until()
  cache_streamed_mutation: use consumer based read_context reader
  rows_entry: make position() inlineable
  mutation_fragment: make destructor always_inline
  keys: introduce compound_wrapper::from_exploded_view()
  sstables: avoid copying key components
  compound_compat: explode: reserve some elements in a vector
  cache: short-circuit static row logic if there are no static columns
  cache: use equality comparators instead of tri_compare
  sstables: avoid indirect calls to abstract_type::is_multi_cell()
2017-07-27 10:14:35 +02:00
Pekka Enberg
b80504188a docs/docker-hub: Mark '--experimental' as 2.0 feature
The '--experimental' flag appears in 2.0 so mark it as such in the user
documentation on Docker Hub.

Message-Id: <1501137703-29706-1-git-send-email-penberg@scylladb.com>
2017-07-27 10:28:25 +03:00
Duarte Nunes
85e85ec72e Don't catch polymorphic exceptions by value
It makes gcc a very sad compiler.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726172053.5639-2-duarte@scylladb.com>
2017-07-27 09:39:58 +03:00
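Illustration for 85e85ec72e, using hypothetical exception types rather than anything from the tree. A handler written as `catch (base_error e)` copies the thrown object into a base_error, which slices off the derived part. The sketch reproduces that copy explicitly and contrasts it with catching by const reference.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical polymorphic exception hierarchy.
struct base_error : std::runtime_error {
    explicit base_error(const std::string& msg) : std::runtime_error(msg) {}
    virtual std::string kind() const { return "base"; }
};

struct derived_error : base_error {
    explicit derived_error(const std::string& msg) : base_error(msg) {}
    std::string kind() const override { return "derived"; }
};

// Same copy that a by-value handler performs: the result is a plain
// base_error, so the override in derived_error is no longer reachable.
std::string kind_after_copy(const base_error& e) {
    base_error copy = e;
    return copy.kind();
}

// Catching by const reference keeps the dynamic type intact.
std::string kind_caught_by_reference() {
    std::string kind = "none";
    try {
        throw derived_error("boom");
    } catch (const base_error& e) {
        kind = e.kind();
    }
    return kind;
}
```

Here kind_caught_by_reference() sees the derived override, while kind_after_copy() on a derived_error only sees the base behaviour.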
Duarte Nunes
7536659cb5 CqlParser: Don't catch polymorphic exceptions by value
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726172053.5639-1-duarte@scylladb.com>
2017-07-27 09:39:57 +03:00
Tzach Livyatan
ea97b87205 Adding Scylla restart instructions
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170725064719.31109-1-tzach@scylladb.com>
2017-07-27 09:38:49 +03:00
Vlad Zolotarov
9adabd1bc4 utils::loading_cache: add stop() method
loading_cache invokes a timer that may issue asynchronous operations
(queries) that would end with writing into the internal fields.

We have to ensure that these operations are over before we can destroy
the loading_cache object.

Fixes #2624

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501096256-10949-1-git-send-email-vladz@scylladb.com>
2017-07-26 21:28:49 +02:00
Duarte Nunes
50ad0003c6 db/schema_tables: Drop dropped columns when dropping tables
Fixes #2633

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-2-duarte@scylladb.com>
2017-07-26 18:41:28 +02:00
Duarte Nunes
3425403126 db/schema_tables: Store column_name in text form
As does Cassandra.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-1-duarte@scylladb.com>
2017-07-26 18:41:12 +02:00
Duarte Nunes
787308a96c cql3/tuples: Don't catch polymorphic exception by value
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726155740.3275-1-duarte@scylladb.com>
2017-07-26 19:28:35 +03:00
Asias He
515a744303 gossip: Fix nr_live_nodes calculation
We need to consider the _live_endpoints size. The nr_live_nodes should
not be larger than _live_endpoints size, otherwise the loop to collect
the live node can run forever.

It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).

Fixes #2637

Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
2017-07-26 16:48:30 +03:00
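A sketch of the clamp described in 515a744303, under the assumption that the collection loop keeps going until it has gathered nr_live_nodes endpoints. pick_live_nodes() is a hypothetical stand-in that walks the set in order; the real gossiper selects nodes differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: collect up to `wanted` endpoints from the live set.
std::vector<std::string> pick_live_nodes(const std::set<std::string>& live_endpoints,
                                         std::size_t wanted) {
    // Never ask for more nodes than are actually live; without this clamp
    // the loop below could never reach its exit condition.
    const std::size_t nr_live_nodes = std::min(wanted, live_endpoints.size());

    std::vector<std::string> picked;
    auto it = live_endpoints.begin();
    while (picked.size() < nr_live_nodes) {
        picked.push_back(*it);
        ++it;
    }
    return picked;
}
```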
Paweł Dziepak
7b0f75c0d1 sstables: avoid indirect calls to abstract_type::is_multi_cell() 2017-07-26 14:38:27 +01:00
Paweł Dziepak
b4d1dea4a9 cache: use equality comparators instead of tri_compare
Equality comparator may be much cheaper than the fully fledged
trichotomic comparator, especially if the component types are byte order
equal but not byte order comparable.
2017-07-26 14:38:27 +01:00
Paweł Dziepak
2780555968 cache: short-circuit static row logic if there are no static columns 2017-07-26 14:38:27 +01:00
Paweł Dziepak
4a0385e908 compound_compat: explode: reserve some elements in a vector
When we are exploding a compound key we know already that there is more
than one component, but we have no easy way of determining how many of
them are going to be there. Let's reserve space for a few elements so
that we avoid an excessive number of reallocations in case of
medium-sized keys.
2017-07-26 14:38:27 +01:00
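The idea in 4a0385e908 can be sketched as follows. This is not the compound_compat code: explode() here splits on a separator character purely as a stand-in for the real key encoding, and the reservation of 8 elements is an assumed value.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical explode(): splits "a:b:c" into its components.
std::vector<std::string> explode(const std::string& compound, char sep = ':') {
    std::vector<std::string> parts;
    // The component count is unknown up front, so reserve room for a few
    // elements to avoid repeated reallocation for medium-sized keys.
    parts.reserve(8);

    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = compound.find(sep, start);
        if (pos == std::string::npos) {
            parts.push_back(compound.substr(start));
            break;
        }
        parts.push_back(compound.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```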
Paweł Dziepak
28c105e4a7 sstables: avoid copying key components 2017-07-26 14:38:27 +01:00
Paweł Dziepak
6031b7e587 keys: introduce compound_wrapper::from_exploded_view() 2017-07-26 14:38:27 +01:00
Paweł Dziepak
c9ccd813ab mutation_fragment: make destructor always_inline
mutation_fragment destructor was already made inline-friendly by moving
most of the logic to a separate function. However, the compiler still is
quite reluctant to inline it in certain cases, so let's give it a
stronger hint.
2017-07-26 14:38:27 +01:00
Paweł Dziepak
43cce6c2f4 rows_entry: make position() inlineable 2017-07-26 14:38:27 +01:00
Paweł Dziepak
c2ec43f70b cache_streamed_mutation: use consumer based read_context reader 2017-07-26 14:38:21 +01:00
Paweł Dziepak
2066354de3 streamed_mutation: introduce consume_mutation_fragments_until()
consume_mutation_fragments_until() is a consumer based interface that
avoids indirect calls and continuation overhead present in the naive
streamed_mutation::operator() approach.
2017-07-26 14:37:20 +01:00
Paweł Dziepak
9bc6038ff3 cache_streamed_mutation: avoid moving clustering_row
clustering_row can store quite a lot of data internally which makes its
move constructor not exactly cheap.
If possible it is better to move mutation_fragment around as it keeps
everything externally. This also avoids some cases when clustering row
would be extracted from mutation_fragment only to be made to create
another mutation_fragment later.
2017-07-26 14:36:37 +01:00
Paweł Dziepak
68e57a742f ring_position_comparator: drop unused overloads 2017-07-26 14:36:37 +01:00
Paweł Dziepak
960a140880 index_reader: advance_and_check_if_present() use index_comparator 2017-07-26 14:36:37 +01:00
Paweł Dziepak
dc7bad9a50 sstables: cache token in index entries
When a sstable reader is fast forwarded some index entries may be read
(and compared) multiple times. This patch makes sure that once a token
is computed we keep it around and reuse if the entry is accessed again.
2017-07-26 14:36:37 +01:00
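A minimal sketch of the caching idea in dc7bad9a50. index_entry_sketch is hypothetical, and std::hash stands in for the partitioner's token computation. The computation counter exists only to make the effect observable.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical index entry that computes its token at most once.
class index_entry_sketch {
    std::string _key;
    mutable bool _has_token = false;
    mutable std::uint64_t _token = 0;
    mutable unsigned _computations = 0;
public:
    explicit index_entry_sketch(std::string key) : _key(std::move(key)) {}

    std::uint64_t token() const {
        if (!_has_token) {
            // Stand-in for the real (more expensive) token calculation.
            _token = std::hash<std::string>()(_key);
            _has_token = true;
            ++_computations;
        }
        return _token;
    }

    unsigned computations() const { return _computations; }
};
```

Repeated comparisons against the same entry then pay for the token only on first access.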
Paweł Dziepak
bfb7b56c74 sstable: keep a pre-computed token in summary_entry
Each sstable index lookup involves a binary search in the summary and
each time a partition key of summary entry is compared with anything its
token needs to be calculated.
Since we keep summary in the memory all the time it is better to also
keep the tokens around.
2017-07-26 14:36:36 +01:00
Paweł Dziepak
fe7eba7f06 ring_position_comparator: accept sstables::decorated_key_view
ring_position_comparator has overloads for comparing ring_positions as
well as sstables::key_view. In the case of the latter it needs to
compute the token of the key. However, the sstable layer could cache
some tokens so let's allow the comparator callers to provide it
directly.
2017-07-26 14:36:36 +01:00
Paweł Dziepak
31d7cfdefb sstables: introduce decorated_key_view 2017-07-26 14:36:36 +01:00
Paweł Dziepak
722c56f3f2 partition_snapshot_row_cursor: avoid apply() in single-version cases 2017-07-26 14:36:36 +01:00
Paweł Dziepak
e145ee6bb8 mutation_merger: improve handling of non-deferring fill_buffer()s
It is possible that a call to fill_buffer() will return an immediately
ready future. This patch avoids uncontrolled recursion in the case when none of the
merged streamed mutations defer in fill_buffer() and also
optimises for the non-deferring case by avoiding some of the logic.
2017-07-26 14:36:36 +01:00
Paweł Dziepak
e0a04cb7fe sstables: make sure that fill_buffer() actually fills buffer
streamed_mutation::impl::fill_buffer() is supposed to either push
mutation fragments to the buffer or set EOS flag. However, it was
possible that mp_row_consumer would return proceed::no if a skip was
needed without satisfying any of these conditions.
2017-07-26 14:36:36 +01:00
Pekka Enberg
e66635a885 Merge "Developer documentation improvements" from Jesse
"This patch series addresses some feedback from the preliminary
 HACKING.md, adds some new content, and updates the README file with
 some quick-start information."

* 'jhk/better_hacking/v3' of github.com:hakuch/scylla:
  README.md: Add quick-start section and defer to `HACKING.md`
  HACKING.md: `CMakeLists.txt` for analysis works for other IDEs too
  HACKING.md: Add details and examples for unit tests
  HACKING.md: Add section for project dependencies
  HACKING.md: Describe releases and tags
  HACKING.md: Re-work "building" section, including memory needs
  HACKING.md: Update ccache recommendations
  HACKING.md: Update "Contributing" URL
2017-07-26 16:25:58 +03:00
Duarte Nunes
e988121dbb schema_builder: Replace type when re-dropping column
Fixes #2634

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725183933.5311-1-duarte@scylladb.com>
2017-07-26 13:26:29 +02:00
Duarte Nunes
64fcf0c642 alter_table_statement: Allow collection columns to replace normal ones
Fixes #2632

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725183811.5155-1-duarte@scylladb.com>
2017-07-26 13:24:03 +02:00
Duarte Nunes
1622847c1d perf/perf_fast_forward: Don't pass non-pod to varargs function
Passing a Non-POD object to variadic functions is unsupported.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726094756.22867-1-duarte@scylladb.com>
2017-07-26 11:48:22 +01:00
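To illustrate 1622847c1d: passing a std::string through a C-style variadic argument list is only conditionally supported, so the usual fix is to pass a plain pointer. format_name() is a hypothetical example and not the call that was changed in perf_fast_forward.

```cpp
#include <cstdio>
#include <string>

// Hypothetical formatter built on a variadic C function.
std::string format_name(const std::string& name, int value) {
    char buf[64];
    // Not supported: std::snprintf(buf, sizeof(buf), "%s=%d", name, value);
    // Pass the underlying C string instead of the non-POD object.
    std::snprintf(buf, sizeof(buf), "%s=%d", name.c_str(), value);
    return buf;
}
```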
Duarte Nunes
9c831b4e97 schema: Remove unnecessary print
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725174000.71061-1-duarte@scylladb.com>
2017-07-26 12:01:51 +02:00
Duarte Nunes
472f32fb06 tests/schema_change_test: Add test case for add+drop notification
Reproduces #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-2-duarte@scylladb.com>
2017-07-26 11:59:48 +02:00
Duarte Nunes
33e18a1779 db/schema_tables: Consider differing dropped columns
If a node is notified of a schema change where the schema's dropped
columns have changed, that node will miss the changes to the dropped
columns. A scenario where this can happen is where a column c is
dropped, then added with a different type, and then dropped again, with
a node n having seen the first drop and being notified of the
subsequent add and drop.

Fixes #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-1-duarte@scylladb.com>
2017-07-26 11:59:34 +02:00
Jesse Haber-Kucharsky
d6c0138576 README.md: Add quick-start section and defer to HACKING.md 2017-07-25 17:58:00 -04:00
Jesse Haber-Kucharsky
d06bccf857 HACKING.md: CMakeLists.txt for analysis works for other IDEs too 2017-07-25 17:57:55 -04:00
Jesse Haber-Kucharsky
9c2390e1a4 HACKING.md: Add details and examples for unit tests 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
488839dd15 HACKING.md: Add section for project dependencies 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
14d03d7548 HACKING.md: Describe releases and tags 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
6e8bfdbb3f HACKING.md: Re-work "building" section, including memory needs 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
64acb41305 HACKING.md: Update ccache recommendations 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
4fe767de31 HACKING.md: Update "Contributing" URL
The old page results in a 404 error.
2017-07-25 17:46:45 -04:00
Paweł Dziepak
295689d16f db: include counter writes on leader in metrics
Counters write path on leader is completely different than on any other
replica (non-leaders share write path between counters and regular
columns). This patch makes sure that counter writes performed on leader
are added to appropriate metrics.
Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>
2017-07-25 18:31:43 +02:00
Tomasz Grabiec
18be42f71a Merge fixes related to row cache from Raphael
* git@github.com:raphaelsc/scylla.git row_cache_fixes:
  db: atomically synchronize cache with changes to the snapshot
  db: refresh row cache's underlying data source after compaction
2017-07-25 15:34:32 +02:00
Paweł Dziepak
79a1ad7a37 tests/row_cache: test queries with no clustering ranges
Reproducer for #2604.
Message-Id: <20170725131220.17467-3-pdziepak@scylladb.com>
2017-07-25 15:29:17 +02:00
Paweł Dziepak
1ea507d6ae tests: do not overload the meaning of empty clustering range
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>
2017-07-25 15:28:12 +02:00
Paweł Dziepak
6572f38450 cache: fix aborts if no clustering range is specified
cache_streamed_mutation assumed that at least one clustering range was
specified. That was wrong since the readers are allowed to query just
for a static row (e.g. counter update that modifies only static
columns).

Fixes #2604.
Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>
2017-07-25 15:27:48 +02:00
Amnon Heiman
1f5a9ecc40 scylla-housekeeping: support patches releases
To support both version and patch releases, the version server now returns
a patchversion parameter that includes the latest minor version's patch
release.

The housekeeping should return a separate message if the current
minor version is not at the latest patch release, and a message if the version has
changed.

For example, a user on version 1.6.1 should get a warning
that they need to update if 1.6.2 is available and, in addition, a warning that they
should upgrade if version 1.7 is out.

Examples:
$ scylla-housekeeping version --version 1.6.2
Your current Scylla release is 1.6.2, while the latest patch release is 1.6.4, and the latest minor release is 1.7.2 (recommended)

$ scylla-housekeeping version --version 1.7.1
You current Scylla release is  1.7.1 while the latest patch release is 1.7.2 is available, update for the latest bug fixes

$ scylla-housekeeping version --version 1.7.1
You current Scylla release is 1.7.1 while the latest patch release is 1.7.2, update for the latest bug fixes and improvements

Fixes #1972

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170725095455.6450-1-amnon@scylladb.com>
2017-07-25 13:12:18 +03:00
Raphael S. Carvalho
637f3bfa50 db: refresh row cache's underlying data source after compaction
The underlying data source in the row cache holds a reference to the sstable set
from before compaction, which isn't released until a memtable flush. This
means file descriptors of deleted sstables remain open, wasting
disk space.
The fix is to refresh the underlying data source in the row cache.

Fixes #2570.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:49:11 -03:00
Raphael S. Carvalho
e3ad676433 db: atomically synchronize cache with changes to the snapshot
Updates to the cache and the snapshot (i.e. sstable set) aren't synchronized, so
a cache update for a memtable flush may use the wrong snapshot
version, which violates the cache invariant that each partition entry
reflects only one snapshot.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:45:05 -03:00
Avi Kivity
c21bb5ae05 tests: fix sstable_datafile_test build with boost 1.55
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (earlier and later versions do
support it). Use old-style iteration instead.

Message-Id: <20170724080128.8824-1-avi@scylladb.com>
2017-07-24 11:20:12 +03:00
Avi Kivity
f75b578607 Update ami submodule
* dist/ami/files/scylla-ami 5dfe42f...2bd1481 (1):
  > Enable support for experimental CPU controller in i3 instances
2017-07-24 10:26:52 +03:00
Tomasz Grabiec
60678f0e8a ring_position: Optimize construction from r-value references of decorated_key
Message-Id: <1500650171-26291-1-git-send-email-tgrabiec@scylladb.com>
2017-07-24 10:25:14 +03:00
Tomasz Grabiec
136d205855 mutation_partition: Always mark static row as continuous when no static columns
To avoid unnecessary cache misses after static columns are added.

Message-Id: <1500650057-26036-1-git-send-email-tgrabiec@scylladb.com>
2017-07-24 10:23:35 +03:00
Tomasz Grabiec
714d609605 database: Fix reversed order of keyspace and table names in a log message
Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 17:10:17 +02:00
Tomasz Grabiec
059779eea6 gdb: Fix 'scylla ptr' reporting large object pages as free
The 'free' attribute is not updated for all pages belonging to a large
object, so we can't use it to determine if the page is allocated or
not. A more reliable way is to check if it belongs to any free span.
Message-Id: <1500648094-20039-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:41 +02:00
Tomasz Grabiec
29a82f5554 schema_registry: Keep unused entries around for 1 second
This is in order to avoid frequent misses which have a relatively high
cost. A miss means we need to fetch schema definition from another
node and in case of writes do a schema merge.

If the schema is kept alive only by the incoming request, then it
will be forgotten immediately when the request is done, and the next
request using the same schema version will miss again.

Refs #2608.
Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:37 +02:00
Tomasz Grabiec
ecc85988dd legacy_schema_migrator: Don't snapshot empty legacy tables
Otherwise we will create a new (empty) snapshot each time we boot.
Message-Id: <1500573920-31478-2-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:31 +02:00
Tomasz Grabiec
408cea66cd database: Allow disabling auto snapshots during drop/truncate
Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:29 +02:00
Duarte Nunes
937fe80a1a Merge 'Fix possible inconsistency of table schema version' from Tomasz
"Fixes issues uncovered in longevity test (#2608).

Main problem is that due to time drift scylla_tables.version column
may not get deleted on all nodes doing the schema merge, which will
make some nodes come up with different table schema version than others.

The inconsistency will not heal because scylla_tables doesn't
take part in the schema sync. This is fixed by the last patch.

This will cause nodes to constantly try to sync the schema, which under
some conditions triggers #2617."

* tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev:
  schema_tables: Add scylla_tables to ALL
  schema: Make schema_mutations equality consistent with digest
  schema_tables: Extract compact_for_schema_digest()
  schema_tables: Always drop scylla_tables::version
2017-07-21 16:55:23 +02:00
Tomasz Grabiec
65c64614aa schema_registry: Ensure schema_ptr is always synced on the other core
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.

The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.

Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:54:47 +02:00
Duarte Nunes
7eecda3a61 schema: Support compaction enabled attribute
Fixes #2547

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170721132206.3037-1-duarte@scylladb.com>
2017-07-21 15:38:45 +02:00
Amnon Heiman
e345d05ebe system_keyspace: Use paging for get compaction history
There could be a lot of compactions when querying for compaction
history.

This patch changes the query to use paging. It collects all results
before returning them to the caller.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-07-20 18:17:49 +03:00
Vlad Zolotarov
9086c643a6 service::storage_proxy: add a trace points pair in the SELECT replica flow
Add two trace points: at the beginning and at the end of the replica flow on the
replica shard.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499961542-16263-1-git-send-email-vladz@scylladb.com>
2017-07-20 16:44:25 +02:00
Amnon Heiman
08c81427b9 Add paging for internal queries
Usually, internal queries are short. Sometimes though,
as in the case of get compaction history, there could be a large
number of results. Without paging this would overload the system.

This patch adds the ability to use paging internally.

Paging is used explicitly: all the relevant information
is stored in an internal_query_state, which holds both the
paging state and the query, so consecutive calls can be made.

To use paging, use the query method with a function.

Besides a statement and its parameters, the method takes a function that
will be called for each of the returned rows.

For example if qp is a query_processor:

qp.query("SELECT * from system.compaction_history", [] (const cql3::untyped_result_set::row& row) {
  ....
  // do something with row
  ...
  return stop_iteration::no; // keep on reading
});

Will run the function on each of the compaction history table rows.

To stop the iteration, the function can return stop_iteration::yes.
2017-07-20 17:43:51 +03:00
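The control flow described in 08c81427b9 can be sketched without the query processor. Everything below is a hypothetical stand-in: for_each_row_paged() walks an in-memory vector instead of a CQL result, and the local stop_iteration enum only mimics the yes/no value used in the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Local stand-in for the stop_iteration value used by the callback.
enum class stop_iteration { no, yes };

// Feed `rows` to `fn` one page at a time and return how many pages were
// fetched. Fetching stops as soon as `fn` asks for it, so later pages
// are never touched.
std::size_t for_each_row_paged(const std::vector<int>& rows,
                               std::size_t page_size,
                               const std::function<stop_iteration(int)>& fn) {
    std::size_t pages = 0;
    if (page_size == 0) {
        return pages;
    }
    for (std::size_t start = 0; start < rows.size(); start += page_size) {
        ++pages;  // one "fetch" per page
        const std::size_t end = std::min(start + page_size, rows.size());
        for (std::size_t i = start; i < end; ++i) {
            if (fn(rows[i]) == stop_iteration::yes) {
                return pages;
            }
        }
    }
    return pages;
}
```

The point of the shape is that only one page of rows needs to be held at a time, and a callback that returns stop_iteration::yes avoids fetching the remaining pages.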
Tomasz Grabiec
2bc549f426 Merge perf_fast_forward enhancements from Paweł
* https://github.com/pdziepak/scylla.git perf_fast_forward_improvements/v1:
  perf_fast_forward: move global state to global scope
  perf_fast_forward: move tests groups to separate functions
  perf_fast_forward: allow running only selected test groups
  perf_fast_forward: use consumer interface for reading
    streamed_mutation
2017-07-20 16:41:29 +02:00
Tomasz Grabiec
ed2388da2c schema_tables: Add scylla_tables to ALL
So that scylla_tables takes part in the digest and in mutations sent
as part of schema sync. Otherwise inconsistencies in scylla_tables
will not heal.

Refs #2608.
2017-07-20 15:47:10 +02:00
Tomasz Grabiec
78ff728795 schema: Make schema_mutations equality consistent with digest
Digest only looks at live values, ignoring deletion
information. Equality should be consistent with that, so that schemas
considered equal do not trigger the alter path unnecessarily.
2017-07-20 15:47:10 +02:00
Tomasz Grabiec
6adbe61e2f schema_tables: Extract compact_for_schema_digest() 2017-07-20 15:47:10 +02:00
Tomasz Grabiec
1b85c316bf schema_tables: Always drop scylla_tables::version
It can happen that, due to time drift between nodes, the incoming
"version" cell has a higher timestamp than api::new_timestamp().
In such a case the column would not be dropped and would cause a version
mismatch between nodes.

Ensure it's always covered by using max of current time and cell's
timestamp.

Refs #2608.
2017-07-20 15:47:10 +02:00
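The timestamp choice in 1b85c316bf, sketched with hypothetical helpers. It assumes that a deletion covers a cell when the deletion timestamp is greater than or equal to the cell's timestamp.

```cpp
#include <algorithm>
#include <cstdint>

using timestamp_type = std::int64_t;

// Pick a deletion timestamp that is never older than the incoming cell,
// even if the local clock lags behind the node that wrote it.
timestamp_type deletion_timestamp(timestamp_type now, timestamp_type cell_timestamp) {
    return std::max(now, cell_timestamp);
}

// Assumed covering rule: ties go to the deletion.
bool covers(timestamp_type deletion_ts, timestamp_type cell_ts) {
    return deletion_ts >= cell_ts;
}
```

With a local time of 100 and an incoming cell at 150, a deletion at plain "now" would leave the cell live, while the max-based timestamp covers it.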
Takuya ASADA
2bf16c6e8a dist/debian: add --no-clean option to skip building pbuilder .tgz image
By default build_deb.sh destroys all previous build images to make sure we don't
have environment-dependent issues, but it takes time to build the distribution root
image (.tgz in pbuilder) from scratch.

The --no-clean option skips the .tgz creation stage and uses the previously built image
to shorten build time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500542094-12946-1-git-send-email-syuu@scylladb.com>
2017-07-20 15:37:00 +03:00
Calle Wilund
91f314e54c duration.cc: Fix static assert
static_assert(cond) is C++17 only
Message-Id: <1500373227-12025-1-git-send-email-calle@scylladb.com>
2017-07-20 13:14:51 +02:00
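For 91f314e54c: the single-argument form of static_assert only exists from C++17 on, so code built as C++14 has to supply a message. The function below is a hypothetical example and is not taken from duration.cc.

```cpp
#include <cstdint>

// Hypothetical compile-time constant.
constexpr std::int64_t nanos_per_milli() {
    return 1000 * 1000;
}

// C++17 only: static_assert(nanos_per_milli() == 1000000);
// Portable to C++11/14: always pass the message argument.
static_assert(nanos_per_milli() == 1000000, "unexpected nanoseconds per millisecond");
```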
Paweł Dziepak
823fb5e9d8 perf_fast_forward: use consumer interface for reading streamed_mutation
Using streamed_mutation::operator() is undesirable as it introduces an
indirect call and a continuation overhead for each emitted mutation
fragment. Consumer interface is the preferred method of reading streamed
mutations.
2017-07-20 11:02:53 +01:00
Paweł Dziepak
d184508d7b perf_fast_forward: allow running only selected test groups 2017-07-20 11:02:31 +01:00
botond
884928c511 install-dependencies.sh: Fix ubuntu dependencies
Remove dependencies section from README.md, point to the
install-dependencies.sh script instead.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7f51f17a743a82d68b7d4a279b066ffe55fe0379.1500540523.git.bdenes@scylladb.com>
2017-07-20 12:00:20 +03:00
Paweł Dziepak
a18a36c94b perf_fast_forward: move tests groups to separate functions 2017-07-20 09:26:42 +01:00
Paweł Dziepak
3fd4f9c1c7 perf_fast_forward: move global state to global scope
All perf_fast_forward test cases currently live in the main
function. This patch moves the state they rely on to a global scope
so that it will be easier to extract these tests to individual
functions.
2017-07-20 09:26:42 +01:00
Avi Kivity
c5ee62a6a4 Merge "restrict background writers with scheduling groups" from Glauber
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.

The recommended maximum is 50 %. It is disabled by default,
but can be adjusted through a configuration option until we
are able to auto-tune this.

The second patch in this series provides a preview of what such auto-tuning
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rate at which bytes are flushed.

Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."

* 'memtable-controller-v5' of https://github.com/glommer/scylla:
  simple controller for memtable/streaming writer shares.
  restrict background writers to 50 % of CPU.
2017-07-20 10:58:53 +03:00
Takuya ASADA
c441b1604a dist/redhat: use EPEL's ragel for CentOS
Since ragel was added to EPEL, drop the self-built package and use the EPEL one.

See #2441

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170719170500.18515-1-syuu@scylladb.com>
2017-07-20 10:36:57 +03:00
Calle Wilund
7a583585a2 system_keyspace: Make sure "system" is written to keyspaces (visible)
Fixes #2514

Bug in schema version 3 update: We failed to write "system" to the
schema tables. Only visible on an empty instance of course.

Message-Id: <1500469809-23546-2-git-send-email-calle@scylladb.com>
2017-07-19 16:18:56 +03:00
Calle Wilund
247c36e048 system_schema: Fix remaining places not handling two system keyspaces
Some places remained where code looked directly at
system_keyspace::NAME to determine iff a ks is
considered special/system/protected. Including
schema digest calculation.

Export "is_system_keyspace" and use accordingly.

Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
2017-07-19 16:18:45 +03:00
Duarte Nunes
1daf1bc4bb Merge 'Revert back to 1.7 schema layout in memory' from Tomasz
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."

* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
  schema: Revert back to the 1.7 layout of static compact tables in memory
  schema: Use v3 column layout when converting to/from schema mutations
  schema: Encapsulate column layout translations in the v3_columns class
2017-07-19 12:52:52 +02:00
Avi Kivity
d5aba779d4 Merge "streaming error handling improvement" from Asias
"This series improves the streaming error handling so that when one side of the
streaming fails, it propagates the error to the other side and the peer
closes the failed session accordingly. This removes the unnecessary wait and
timeout for the peer to discover the failed session and eventually fail.

Fix it by:

- Use the complete message to notify the peer node that the local session has failed
- Listen on the shutdown gossip callback so that we can detect that the peer has shut down
  and close the session with the peer

Fixes #1743"

* tag 'asias/streaming/error_handling_v2' of github.com:cloudius-systems/seastar-dev:
  streaming: Listen on shutdown gossip callback
  gms: Add is_shutdown helper for endpoint_state class
  streaming: Send complete message with failed flag when session is failed
  streaming: Handle failed flag in complete message
  streaming: Do not fail the session when failed to send complete message
  streaming: Introduce send_failed_complete_message
  streaming: Do not send complete message when session is successful
  streaming: Introduce the failed parameter for complete message
  streaming: Remove unused session_failed function
  streaming: Less verbose in logging
  streaming: Better stats
2017-07-19 11:18:09 +03:00
Amos Kong
2bdcad5bc3 scylla_raid_setup: fix syntax error
/usr/lib/scylla/scylla_raid_setup: line 132: syntax error
near unexpected token `fi'

Fixes #2610

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <af3a5bc77c5ba2b49a8f48a5aaa19afffb787886.1500430021.git.amos@scylladb.com>
2017-07-19 11:10:29 +03:00
Duarte Nunes
ab72132cb1 view_schema_test: Retry failed queries
Due to the asynchronous nature of view update propagation, results
might still be absent from views when we query them. To be able to
deterministically assert on view rows, this patch retries a query a
bounded number of times until it succeeds.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718212646.2958-1-duarte@scylladb.com>
2017-07-19 09:59:44 +02:00
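The bounded retry described in ab72132cb1 can be sketched as follows. This is a minimal illustration, not the helper used in view_schema_test: the name `eventually` and its `attempts` and `delay` parameters are assumptions made for this example.

```python
import time


def eventually(check, attempts=10, delay=0.0):
    """Run check() until it stops raising AssertionError.

    Hypothetical helper: retries at most `attempts` times, sleeping
    `delay` seconds between tries, and re-raises the last failure
    once the budget is exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return check()
        except AssertionError:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the last failure
            time.sleep(delay)
```

Bounding the number of attempts keeps the test deterministic in the failure case: a view row that never appears still fails the test instead of hanging it.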
Duarte Nunes
115ff1095e db/view: Use view schema for view pk operations
Instead of base schema.

Fixes #2504

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718190703.12972-1-duarte@scylladb.com>
2017-07-19 09:59:34 +02:00
Tomasz Grabiec
a9237c1666 schema: Revert back to the 1.7 layout of static compact tables in memory
We are using C* 3.x compatible layout in schema tables but want to
keep using the 1.7 layout in memory for compatibility during rolling
upgrade. This patch switches the schema and schema_builder classes
back to the old layout. Translation of layout happens when converting
to/from schema mutations.

Notable changes:

 1) Includes a revert of commit 6260f31e08
    "thrift: Update CQL mapping of static CFs".

 2) Brings back the "default_validation_class" schema attribute. In v3
    it can be derived from column definitions, but in v2 it can't, so
    we have to store it.

 3) legacy_schema_migrator and schema_builder don't have to do
    conversions to v3, this is now handled by the v3_columns
    class. schema_builder works with the same layout as schema, that
    is v2.

 4) Includes a revert of commit 66991a7ccb
    "v3 schema test fixes"

Fixes #2555.
2017-07-19 09:52:15 +02:00
Tomasz Grabiec
dc2dc056a4 schema: Use v3 column layout when converting to/from schema mutations 2017-07-19 09:52:15 +02:00
Tomasz Grabiec
dc463ef644 schema: Encapsulate column layout translations in the v3_columns class 2017-07-19 09:52:15 +02:00
Avi Kivity
bfae5c7bac Merge "Time window compaction strategy support" from Raphael
"Time window strategy was introduced to address several limitations of
date tiered strategy. In addition, its options are much easier to reason
about, basically just window size and window unit.
TWCS will work to keep only one sstable in each window. So the only real
optimization needed is to align partition key to the window.
Size tiered strategy is used to reduce write amplification when compacting
the incoming window.

For more details: https://issues.apache.org/jira/browse/CASSANDRA-9666

Fixes #1432."

* 'twcs_v2' of github.com:raphaelsc/scylla:
  tests: add tests for time window compaction strategy
  compaction: wire up time window compaction strategy
  compaction/twcs: override default values with options in schema
  sstables: implement time window compaction strategy
  sstables: import TimeWindowCompactionStrategy.java
2017-07-19 10:22:53 +03:00
Duarte Nunes
3bfcf47cc6 types: Implement hash() for collections
This patch provides a rather trivial implementation of hash() for
collection types.

It is needed for view building, where we hold mutations in a map
indexed by partition keys (and frozen collection types can be part of
the key).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718192107.13746-1-duarte@scylladb.com>
2017-07-19 09:52:56 +03:00
Raphael S. Carvalho
c55c63f213 tests: add tests for time window compaction strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
7ecedac222 compaction: wire up time window compaction strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
01886c23a8 compaction/twcs: override default values with options in schema
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
206d30c52a sstables: implement time window compaction strategy
For more details, https://issues.apache.org/jira/browse/CASSANDRA-9666

Fixes #1432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:35 -03:00
Glauber Costa
c9a529ebee simple controller for memtable/streaming writer shares.
This patch introduces a simple controller that will adjust memtables CPU
shares, trying to keep it around the soft limit: if we start going below
it means we're too fast (unless we are idle) and shares are adjusted
downwards. If we start going above it means we're too slow and shares
are adjusted upwards.

I have tested this extensively in a single-CPU setup with various
CPU-bound workloads while tracking virtual dirty and the results are
good, with virtual dirty fluctuating only slightly, somewhere within the
desired range.

Exceptions to this are:
1) when the load is very light - the idle system goes faster, and that's
   ok
2) when the load is very high - as foreground requests dominate we can't
   flush fast enough and hit the hard limit. However, in such scenarios
   the memtable shares do hit its maximum, and the results are no worse
   than they are right now and this will only be fixed by CPU-limiting the
   actual requests.

This feature can be disabled with a config option - that is scheduled to
go away as we acquire more confidence in this. When the feature is
disabled, all background writers (streaming, compaction, memtables) will
share the same scheduling group, with static quotas.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:47 -04:00
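The adjustment rule from c9a529ebee can be sketched as a single step function. This is only an illustration of the direction of each adjustment: the function name, the fixed `step`, and the share bounds are assumptions, and the idle case mentioned in the commit message is not modeled.

```python
def adjust_shares(shares, virtual_dirty, soft_limit,
                  step=10, min_shares=10, max_shares=1000):
    """Return the new memtable CPU shares after one controller step.

    Hypothetical sketch: all names and constants are illustrative.
    Above the soft limit flushing is too slow, so shares go up;
    below it flushing is faster than needed, so shares go down.
    """
    if virtual_dirty > soft_limit:
        shares += step
    elif virtual_dirty < soft_limit:
        shares -= step
    # Clamp so the writer is never starved and never exceeds its maximum.
    return max(min_shares, min(max_shares, shares))
```

The clamp at `max_shares` corresponds to the high-load case in the commit message, where the memtable shares hit their maximum and the hard limit can still be reached.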
Glauber Costa
4f01ec0910 restrict background writers to 50 % of CPU.
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background processes, which are less so.

The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.

In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances at this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.

The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.

While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, letting background processes run wild is not
adaptive either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitioning in the meantime.

As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:33 -04:00
Asias He
d6cebd1341 streaming: Listen on shutdown gossip callback
When a node shuts itself down, it sends a shutdown status to its peer
nodes. When the peer nodes receive the shutdown status update, they are
supposed to close all sessions with that node: because the node is
shut down, there is no need to wait for a timeout before failing the session.

This change can speed up the closing of sessions.
2017-07-19 10:11:06 +08:00
Asias He
ed7e6974d5 gms: Add is_shutdown helper for endpoint_state class
It will be used by streaming manager to check if a node is in shutdown
status.
2017-07-19 10:11:05 +08:00
Asias He
aa87429e67 streaming: Send complete message with failed flag when session is failed
To notify the peer node that the session has failed.
2017-07-19 10:11:05 +08:00
Asias He
03b838705c streaming: Handle failed flag in complete message
Fail the current session if the failed flag is on in the complete
message handler.
2017-07-19 10:11:05 +08:00
Asias He
12d18cfab4 streaming: Do not fail the session when failed to send complete message
Since the complete message is not mandatory, there is no point in failing the
session if sending the complete message fails.
2017-07-19 10:11:04 +08:00
Asias He
ca5248cd58 streaming: Introduce send_failed_complete_message
Currently, send_complete_message is not used. We will use it shortly in
case the local session is failed. Send a complete message with failed
flag to notify peer node that the session is failed so that peer can
close the session. This can speed up the closing of failed session.

Also rename it to send_failed_complete_message.
2017-07-19 10:11:04 +08:00
Raphael S. Carvalho
2686e84792 sstables: import TimeWindowCompactionStrategy.java
it will be later converted to C++. Imported from latest scylla-
tools-java repository. Checked that it doesn't lack anything.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-18 18:26:17 -03:00
Takuya ASADA
49b01e764a dist/common/scripts/scylla_prepare: stop running hugeadm when it's posix mode
A user reported that scylla-server.service is not able to run on their cloud instance because of hugeadm.
(hugeadm says the kernel does not support huge pages.)
We don't need it for posix mode, so run it only in dpdk mode.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500367219-8728-1-git-send-email-syuu@scylladb.com>
2017-07-18 16:39:16 +03:00
Tomasz Grabiec
63caa58b70 Merge "Drop mutations that raced with truncate" from Duarte
Instead of retrying, just drop mutations that raced with a truncate.

* git@github.com:duarten/scylla.git truncate-reorder/v1:
  database: Rename replay_position_reordered_exception
  database: Drop mutations that raced with truncate
2017-07-18 12:53:36 +02:00
Asias He
f21cb75cdb streaming: Do not send complete message when session is successful
The complete_message is not needed and the handler of this rpc message
does nothing but return a ready future. The patch to remove it did not
make it into the Scylla 1.0 release, so it was left there.
2017-07-18 15:29:42 +08:00
Duarte Nunes
d9fa3bf322 thrift: Fail when mixed CFs are detected
Fixes #2588

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170717222612.7429-1-duarte@scylladb.com>
2017-07-18 10:21:33 +03:00
Asias He
0ba4e73068 streaming: Introduce the failed parameter for complete message
Use this flag to notify the peer that the session is failed so that the
peer can close the failed session more quickly.

The flag is sent as an rpc::optional so it is compatible with the old
version of the verb.
2017-07-18 11:24:31 +08:00
Asias He
7599c1524d streaming: Remove unused session_failed function
It is never used. Get rid of it.
2017-07-18 11:22:09 +08:00
Asias He
caad7ced23 streaming: Less verbose in logging
We will now have a large number of small streaming operations. Demote the
less important log messages to debug level.
2017-07-18 11:17:09 +08:00
Asias He
d0dffd7346 streaming: Better stats
Log the number of bytes streamed and streaming bandwidth summary in the same line with session
complete message.
2017-07-18 11:17:09 +08:00
Avi Kivity
64ef7aa5e4 Merge seastar upstream
* seastar 867b7c7...a14d667 (1):
  > tls: remove unneeded lambda captures
2017-07-17 19:30:59 +03:00
Duarte Nunes
6b464da67d schema: Get rid of regular_columns_by_name
They are unused.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170717103635.6473-2-duarte@scylladb.com>
2017-07-17 12:52:41 +02:00
Asias He
adc5f0bd21 gossip: Implement the missing fd_max_interval_ms and fd_initial_value_ms option
It is useful for larger clusters with higher gossip message latency. By
default, fd_max_interval_ms is 2 seconds, which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger clusters, the gossip message update
interval can be larger than 2 seconds.

Fixes #2603.

Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
2017-07-17 13:29:16 +03:00
Duarte Nunes
13caccf1cf Merge 'Fixes around migration to v3 schema tables' from Tomasz
branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev:
  schema: Use proper name comparator
  legacy_schema_migrator: Properly migrate non-UTF8 named columns
  schema_tables: Store column_name in text form
  legacy_schema_migrator: Migrate columns like Cassandra
  schema_builder: Add factory method for default_names
  legacy_schema_migrator: Simplify logic
  thrift: Don't set regular_column_name_type
  schema: Use proper column name type for static columns
  schema: Fix column_name_type() for static compact tables
  schema: Introduce clustering_column_at()
  thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type
  partition_slice_builder: Use proper column's type instead of regular_column_name_type()
2017-07-17 11:16:52 +02:00
Tomasz Grabiec
34dae0588c schema: Use proper name comparator
This replaces column_definition::name_comparator, which incorrectly
assumes that names are always utf8, with name_compare moved from
schema::rebuild() and unifies usages.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
7e54290d38 legacy_schema_migrator: Properly migrate non-UTF8 named columns
Currently the migrator assumes all columns are utf8-named, which
doesn't have to be the case for static compact tables.

Refs #2597.

Due to #2573, we can assume that Scylla wasn't used with non-utf8
column names, and that old names are always in textual form.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
60a76efd37 schema_tables: Store column_name in text form
That's how it is stored by Cassandra.

Refs #2597.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
61229a7536 legacy_schema_migrator: Migrate columns like Cassandra
This fixes generation of synthetic columns for static compact tables.
Current code always generates synthetic clustering column with utf8
type and synthetic regular column with bytes type (in schema_builder).
That's fine when creating a new CQL table, but not when migrating
existing tables created via thrift API.

Fixes #2584.

This also migrates empty compact value columns like Cassandra
does. Such columns are present in compact tables without regular
columns, e.g.:

  create table test (k int, ck int, primary key (k, ck)) with compact storage;

They should be migrated to a synthetic regular column with
empty_type type and a non-empty name.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
49e21b3b8e schema_builder: Add factory method for default_names 2017-07-17 09:40:06 +02:00
Tomasz Grabiec
6dc299c27a legacy_schema_migrator: Simplify logic
The expression "is_dense.value_or(true)" is always true inside the if,
so drop it.

This allows us to drop temporary calulated_is_dense.

We can also get rid of one of the if branches by extracting
builder.set_is_dense() outside.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
3987e9be31 thrift: Don't set regular_column_name_type
Regular columns are always utf8 after f5dae826ce.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
b919c50d21 schema: Use proper column name type for static columns
After f5dae826ce, static columns do not
always have utf8 column names. For static compact tables it's
determined by the cell name comparator type, which is equal to the
type of the synthetic clustering column.

Caused various errors with static thrift tables with non-utf8
comparator.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
f685f7f8a1 schema: Fix column_name_type() for static compact tables
Introduced in f5dae826ce.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
84536a4a75 schema: Introduce clustering_column_at() 2017-07-17 09:40:06 +02:00
Tomasz Grabiec
9ed958a1eb thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type 2017-07-17 09:40:06 +02:00
Tomasz Grabiec
9768036d61 partition_slice_builder: Use proper column's type instead of regular_column_name_type() 2017-07-17 09:40:06 +02:00
Avi Kivity
c51001b598 Merge seastar upstream
* seastar b812cee...867b7c7 (1):
  > rpc: start server's send loop only after protocol negotiation

Fixes #2600.
2017-07-16 19:36:31 +03:00
Avi Kivity
a5bd854019 Merge seastar upstream
* seastar 844bcfb...b812cee (1):
  > Update dpdk submodule

Fix #2595 (again).
2017-07-16 17:00:48 +03:00
Avi Kivity
d9c64ef737 tests: move tmpdir to /tmp
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>
2017-07-16 11:55:08 +02:00
Avi Kivity
9116dd91cb tests: copy the sstable with an unknown component to the data directory
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.

Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>
2017-07-16 11:55:00 +02:00
Duarte Nunes
2c711922cc database: Drop mutations that raced with truncate
Mutations that race with a truncate can just be dropped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Duarte Nunes
0825c9c805 database: Rename replay_position_reordered_exception
Rename replay_position_reordered_exception to
mutation_reordered_with_truncate_exception for more precision, since
this is the only situation where this exception can be thrown.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Avi Kivity
e87ab54bfc Merge seastar upstream
* seastar ff34c42...844bcfb (1):
  > Update dpdk submodule

Fixes #2595.
2017-07-15 19:17:05 +03:00
Tomasz Grabiec
caa62f7f05 Merge "Fixes for memtable flushing and replay positions" from Duarte
We don't ensure mutations are applied in memory following the order of their
replay positions. A memtable can thus be flushed with replay position rp,
with the new one being at replay position rp', where rp' < rp. This breaks
an intrinsic assumption in the code, which this series addresses.

Fixes #2074

branch memtable-flush/v3 of git@github.com:duarten/scylla.git:
  commitlog: Always flush latest memtable
  column_family: More precise count of switched memtables
  column_family: Fix typo in pending_tasks metric name
  column_family: More precise count of pending flushes
  dirty_memory_manager: Remove unnecessary check from flush_one()
  column_family: Don't rely on flush_queue to guarantee flushes finished
  column_family: Don't bother closing the flush_queue on stop()
  column_family: Stop using flush_queue
  column_family: Remove outdated comment about the flush_queue
  memtable: Stop tracking the highest flushed rp
2017-07-14 11:39:37 +02:00
Avi Kivity
162d9aa85d tests: fix view_schema_test with clang
Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}.
No doubt it is correct, but sheesh.

Make the data_value explicit to humor it.
Message-Id: <20170713074315.9857-1-avi@scylladb.com>
2017-07-14 12:24:27 +03:00
Duarte Nunes
b8235f2e88 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
2017-07-14 12:11:22 +03:00
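The layout that b8235f2e88 enforces can be sketched as below. This is a simplified model, not storage_proxy code: replies are plain dicts, `None` stands in for an empty mutation, and the name `arrange_by_replica` is an assumption made for the example.

```python
def arrange_by_replica(responses):
    """Build the row-per-partition-key, column-per-replica table.

    `responses` is a list with one dict per replica, mapping
    partition key -> mutation. In the result, column i of every row
    holds what replica i sent for that key, or None if it sent nothing.
    """
    keys = sorted({key for reply in responses for key in reply})
    return {key: [reply.get(key) for reply in responses] for key in keys}
```

Because every row is built by iterating over the replicas in the same order, reconciliation code can read one column to see everything a given replica sent.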
Duarte Nunes
5f24e9a4a5 memtable: Stop tracking the highest flushed rp
Since we no longer enforce that mutations are applied in memory
ordered by their replay_positions, the way the highest_flush_rp is
being tracked is no longer correct.

The invariant it was used to maintain no longer exists, so we can get
rid of it together with the assertion on the highest_flush_rp on
flush().

Fixes #2074

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:06 +02:00
Duarte Nunes
22a53a52a1 column_family: Remove outdated comment about the flush_queue
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:05 +02:00
Duarte Nunes
003941cd95 column_family: Stop using flush_queue
Since commitlog ordering requirements have been relaxed, we now keep
the set of replay_positions seen by a memtable in a set, which we then
use to clean up relevant segments in the commitlog. This means that
the guarantees provided by the flush_queue are no longer necessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:00 +02:00
Duarte Nunes
7e6fe5895e column_family: Don't bother closing the flush_queue on stop()
When stopping a column family we issue a flush(), for which we wait.
Since writes are supposed to have stopped coming in, and also new
flush requests, there's no need to call and wait for the flush_queue
to be closed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
a1f4536ffb column_family: Don't rely on flush_queue to guarantee flushes finished
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. So, use a phased_barrier() to
ensure that calling flush() returns a future that completes when all
flushes up to that point have finished.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
1b320496e2 dirty_memory_manager: Remove unnecessary check from flush_one()
We don't need to check whether a memtable is empty in flush_one(), as
that must be checked later, during the actual sealing.

The condition itself is rare and is checked already after the potentially
contended semaphore has been acquired.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:57 +02:00
Duarte Nunes
59bdaed02b column_family: More precise count of pending flushes
This patch ensures we update the count of pending flushes in the same
place as we update the stats across column families, which is more
correct since it only accounts for actual flushes and not those of
empty memtables or that have been coalesced together.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
3e27c335a9 column_family: Fix typo in pending_tasks metric name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
a11724c6e1 column_family: More precise count of switched memtables
The memtable_switch_count metric is supposed to count the number of
times a flush has resulted in the memtable being switched out, but we
were incrementing the count regardless of whether we tried to flush an
empty memtable or two or more flushes were coalesced into one. This
patch fixes this by moving the metric to where the memtable is
actually switched.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
bca1b19ce9 commitlog: Always flush latest memtable
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. When flushing commit log segments,
ensure we flush the latest memtable.

Refs #2074

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Paweł Dziepak
ec689b2fe1 Merge "utils: minor fixes in the loading_cache class" from Vlad
"This series aims to fix the "serving invalid (old) values" issue in the
loading_cache (issue #2590) by arming the timer with a period that equals
min(expire, refresh).

We are still trying to optimize the main case where 'expire' is
significantly longer than 'refresh' period.

We don't want to add any additional logic in the fast path and this
series gives the immediate solution for the issue above while not adding
any additional CPU cycle to the fast path."

* 'loading_cache_short_expired-v2' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
  utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source
2017-07-13 16:58:53 +01:00
Vlad Zolotarov
45e23d8090 db::config: fix the permissions cache related parameters description
Make the descriptions of permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries more readable and more related to what they really
do.

Mention the non-zero value requirement for the permissions_update_interval_in_ms and
the permissions_cache_max_entries when the permissions cache is enabled.

Adjust the parameters description in the scylla.yaml too.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499957053-31792-1-git-send-email-vladz@scylladb.com>
2017-07-13 16:00:40 +01:00
Vlad Zolotarov
76ea74f3fd utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
Arm the timer with a period that is not greater than either the permissions_validity_in_ms
or the permissions_update_interval_in_ms in order to ensure that we are not stuck with
the values older than permissions_validity_in_ms.

Fixes #2590

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-07-13 10:48:59 -04:00
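The effect of the timer period chosen in 76ea74f3fd can be illustrated with a small calculation. This is a sketch under stated assumptions: it supposes that stale entries are only dropped when the timer fires and that the timer fires at whole multiples of its period; the function names are hypothetical.

```python
def timer_period_ms(expire_ms, refresh_ms):
    """Period the timer is armed with: never longer than either setting."""
    return min(expire_ms, refresh_ms)


def first_tick_at_or_after(deadline_ms, period_ms):
    """First timer tick (a multiple of period_ms) that is >= deadline_ms."""
    ticks = -(-deadline_ms // period_ms)  # integer ceiling division
    return ticks * period_ms
```

With a validity of 1000 ms and an update interval of 5000 ms, a timer armed with the update interval alone would first notice the expired entry at 5000 ms, while a timer armed with the minimum of the two notices it at 1000 ms.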
Vlad Zolotarov
121e3c7b8f utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-07-13 10:42:12 -04:00
Tomasz Grabiec
30ec4af949 legacy_schema_migrator: Fix calculation of is_dense
Current algorithm was marking tables with regular columns not named
"value" as not dense, which doesn't have to be the case. It can be
either way.

It should be enough to look at clustering components. If there is a
clustering key, then table is dense if and only if all comparator
components belong to the clustering key.

If there is no clustering key, then if there are any regular columns
we're sure it's not dense.

Fixes #2587.

Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>
2017-07-13 17:28:09 +03:00
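The rule described in 30ec4af949 can be written out as a small decision function. This is an illustrative model only: columns and comparator components are represented as lists of names, the function name is hypothetical, and the case the commit message does not cover (no clustering key and no regular columns) is returned as `None` rather than guessed.

```python
def is_dense(comparator_components, clustering_key, regular_columns):
    """Decide density from clustering components, per the commit message.

    Returns True or False when the rule applies, or None when the
    commit message does not say what the answer should be.
    """
    if clustering_key:
        # Dense if and only if every comparator component is part of
        # the clustering key.
        return all(c in clustering_key for c in comparator_components)
    if regular_columns:
        # No clustering key but regular columns exist: not dense.
        return False
    return None  # not covered by the commit message
```

Note that the names of the regular columns play no role, which is the point of the fix: a regular column not named "value" no longer forces the table to be treated as not dense.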
Jesse Haber-Kucharsky
8fa47b74e8 cql: Add definition of underlying type for durations
Cassandra 3.10 added the `duration` type [1], intended to manipulate date-time
values with offsets (for example, `now() - 2y3h`).

The full implementation of the `duration` type in Scylla requires support
for version 5 of the binary protocol, which is not yet available.

In the meantime, this patch adds the implementation of the underlying type
for the eventual `duration` type. Included is also the ported test suite from
the reference implementation and additional tests.

Related to #2240.

[1] https://issues.apache.org/jira/browse/CASSANDRA-11873

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <b1e481da103efee82106bf31f261c5a1f4f8d9ca.1499885803.git.jhaberku@scylladb.com>
2017-07-13 17:26:00 +03:00
Tomasz Grabiec
54953c8d27 gdb: Fix "scylla columnfamilies" command
Broken in 0e4d5bc2f3.

Message-Id: <1499951956-26206-1-git-send-email-tgrabiec@scylladb.com>
2017-07-13 16:33:32 +03:00
Amnon Heiman
45b3e8cd11 query_options: Allows creating query_options from query_options
query_options object cannot be changed after it was created. For
internal uses, like internal query paging, it is needed to create a new
object based on some of the data from an existing one with a new paging
state.

This patch adds a constructor from a unique_ptr and paging state.

Using a unique_ptr makes this behave like a move constructor that also modifies the paging state.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-07-13 14:02:11 +03:00
Duarte Nunes
3df6777b9b database: Load views after loading tables
Since base tables no longer look for their views, we need to parse
base tables first so that when we add a view we can fetch and connect
it to its base table.

When announcing view table mutations to other nodes we always include
the base table mutations, so there's no need to expect a view being
added before its base table.

Found out while testing view building.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170712172115.2960-1-duarte@scylladb.com>
2017-07-13 11:14:02 +02:00
Avi Kivity
4704a78332 tests: remove bad constexpr in sstable_datafile_test
std::ceil() is not constexpr.

Found by clang.
2017-07-12 17:14:13 +03:00
Avi Kivity
67a5e10218 Merge seastar upstream
* seastar a2be7a4...ff34c42 (3):
  > tls: Wrap all IO in semaphore (Fixes #2575)
  > tests/lowres_clock_test.cc: Declare helper static
  > tests/lowres_clock_test.cc: fix compilation error for older GCC
2017-07-12 10:19:55 +03:00
Avi Kivity
a397889c81 Merge "Preserve table schema digest on schema tables migration" from Tomasz
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with a different table_schema_version than the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.

To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.

This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.

Fixes #2549."

* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
  tests: Add test for concurrent column addition
  legacy_schema_migrator: Set digest to one compatible with the old nodes
  schema_tables: Persist table_schema_version
  schema_tables: Introduce system_schema.scylla_tables
  schema_tables: Simplify read_table_mutations()
  schema_tables: Resurrect v2 read_table_mutations()
  system_keyspace: Forward-declare legacy schemas
  legacy_schema_migrator: Take storage_proxy as dependency
2017-07-11 17:22:42 +03:00
Raphael S. Carvalho
7dbfebb7dc lcs: remove conditional limit for partial sort
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170711140241.11023-2-raphaelsc@scylladb.com>
2017-07-11 17:18:32 +03:00
Raphael S. Carvalho
ebb5dafef0 lcs: remove useless filter for demotion procedure
there's no way a sstable from a level higher than N+1 will be in
set of candidates that can be either level N or level N + 1.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170711140241.11023-1-raphaelsc@scylladb.com>
2017-07-11 17:18:31 +03:00
Botond Dénes
33bc62a9cf Fix crash in the out-of order restrictions error msg composition
Use the name of the existing preceding column with a restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This will yield an error message that is different from that of
Cassandra, albeit still a correct one.

Fixes #2421

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
2017-07-11 17:15:33 +03:00
Gleb Natapov
f88723e739 storage_proxy: pass pending_endpoints by reference instead of by value
This makes the lifetime of the dead_endpoints object clearer, and move() also has its price.

Message-Id: <20170710084549.GX2324@scylladb.com>
2017-07-11 16:52:21 +03:00
Gleb Natapov
739dd878e3 consistency_level: report less live endpoints in Unavailable exception if there are pending nodes
DowngradingConsistencyRetryPolicy uses live replicas count from
Unavailable exception to adjust CL for retry, but when there are pending
nodes CL is increased internally by a coordinator and that may prevent
retried query from succeeding. Adjust live replica count in case of
pending node presence so that retried query will be able to proceed.

Fixes #2535

Message-Id: <20170710085238.GY2324@scylladb.com>
2017-07-11 16:51:56 +03:00
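The arithmetic behind the commit above can be sketched as follows. This is a minimal illustration, not Scylla's code: the helper names are hypothetical, and the "subtract one per pending node" adjustment is an assumption based on the commit title.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; these names are not Scylla's API.

// Acknowledgements the coordinator waits for: the replicas demanded by the
// consistency level plus one per pending (joining) node.
constexpr std::size_t required_acks(std::size_t cl_replicas, std::size_t pending) {
    return cl_replicas + pending;
}

// A request can proceed only if enough replicas are live.
constexpr bool is_available(std::size_t live, std::size_t cl_replicas, std::size_t pending) {
    return live >= required_acks(cl_replicas, pending);
}

// Assumed adjustment: report fewer live replicas when nodes are pending, so a
// client that downgrades to the reported count picks a CL that can succeed.
constexpr std::size_t reported_live(std::size_t live, std::size_t pending) {
    return live > pending ? live - pending : 0;
}

// RF=3 at QUORUM (2 replicas), one pending node, two live replicas.
static_assert(!is_available(2, 2, 1), "QUORUM needs 3 acks, only 2 replicas are live");
static_assert(reported_live(2, 1) == 1, "the client is told 1 replica is alive");
static_assert(is_available(2, reported_live(2, 1), 1), "a retry at CL=ONE needs 2 acks");
```

Without the adjustment, a downgrading client would retry with two replicas, which again requires three acknowledgements and fails the same way.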
Avi Kivity
3b7fde18cf Merge "improvements for leveled strategy manifest" from Raphael
"most of changes are to improve maintainability of the strategy but
for the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"

* 'lcs_improvements_2' of github.com:raphaelsc/scylla:
  lcs: only demote sstable from level higher than target one
  lcs: improve indentation for get_overlapping_starved_sstables
  lcs: improve indentation for get_compaction_candidates
  lcs: partially sort candidates that will be trimmed
  lcs: remove quadratic behavior from L0 compaction
  lcs: introduce private interface
  lcs: make some member functions static
  lcs: make some functions const qualified
  lcs: remove add method
  lcs: extract code for higher levels compaction from get_candidates_for
  lcs: simplify code to get candidates for higher levels
  lcs: extract round-robin heuristic for even distribution of keys into function
  lcs: update outdated comments for level 0 compaction
  lcs: improve worth_promoting_L0_candidates interface
  lcs: do not check if level 0 can be promoted twice
  lcs: extract code for level 0 compaction from get_candidates_for
2017-07-11 16:38:50 +03:00
Paweł Dziepak
5aa523aaf9 transport: send correct type id for counter columns
CQL reply may contain metadata that describes columns present in the
response including the information about their type.

However, Scylla incorrectly reports counter types as bigint. The
serialised format of counters and bigint is exactly the same, which
could explain why the problem hasn't been noticed earlier but it is a
bug nevertheless.

Fixes #2569.
Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>
2017-07-11 16:21:49 +03:00
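To make the distinction in the commit above concrete: the CQL binary protocol assigns separate type option ids to bigint (0x0002) and counter (0x0005), even though both serialise as an 8-byte integer. The sketch below uses a hypothetical `column_kind` enum and `cql_type_id` function; it is not the code in Scylla's transport layer.

```cpp
#include <cstdint>

// Hypothetical enum for illustration; not a Scylla type.
enum class column_kind { bigint, counter };

// Type option ids as listed in the CQL binary protocol specification.
constexpr std::uint16_t cql_type_id(column_kind kind) {
    switch (kind) {
    case column_kind::bigint:
        return 0x0002;
    case column_kind::counter:
        return 0x0005;
    }
    return 0x0000; // "custom" type id, unreachable for the two kinds above
}

static_assert(cql_type_id(column_kind::counter) != cql_type_id(column_kind::bigint),
              "counter columns must not be advertised as bigint");
```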
Tomasz Grabiec
6d53cb7ab5 tests: Add test for concurrent column addition 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
f5909ec515 legacy_schema_migrator: Set digest to one compatible with the old nodes
Calculate and set digest using v2 mutations so that digests are the
same before and after migration. This is needed so that no schema
definition exchange is required during rolling upgrade.

Fixes #2549.
2017-07-11 14:52:23 +02:00
Tomasz Grabiec
5b69d99bf8 schema_tables: Persist table_schema_version
When migrating schema tables from v2 to v3, mutations underlying
table schema will change, and so will their digest. However, we want
the digest to be the same on new nodes as on the old nodes, because
schema exchange is not possible between the two nodes, so they
must not need to request schema definitions from each other.

The solution is to make the digest persistable, so that it sticks to
given table schema, surviving both migration and node restarts. On
migration from v2, the digest will be calculated from v2 mutations, so
it will be the same on new and old nodes.
2017-07-11 14:52:23 +02:00
Tomasz Grabiec
cdf5b67522 schema_tables: Introduce system_schema.scylla_tables
It will be used to store Scylla-specific table metadata. We cannot
store it in the standard "tables" table for compatibility reasons -
Cassandra will fail to read the schema if it encounters columns it is not
expecting.
2017-07-11 14:52:23 +02:00
Tomasz Grabiec
cdcdf4772f schema_tables: Simplify read_table_mutations() 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
6e62bc77f1 schema_tables: Resurrect v2 read_table_mutations() 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
4b5818a404 system_keyspace: Forward-declare legacy schemas 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
8624edc0fa legacy_schema_migrator: Take storage_proxy as dependency
Will be needed to query for mutations.
2017-07-11 14:52:23 +02:00
Raphael S. Carvalho
6aa2e5be17 lcs: only demote sstable from level higher than target one
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:42 -03:00
Raphael S. Carvalho
53b72b473e lcs: improve indentation for get_overlapping_starved_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:40 -03:00
Raphael S. Carvalho
3639b48d7b lcs: improve indentation for get_compaction_candidates
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:38 -03:00
Raphael S. Carvalho
5a8b8a6ccb lcs: partially sort candidates that will be trimmed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:37 -03:00
Raphael S. Carvalho
8334086441 lcs: remove quadratic behavior from L0 compaction
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low compared to the max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.

Fixes #2432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:35 -03:00
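The cost difference described in the commit above can be estimated with a toy model. It is only a sketch under simplifying assumptions: every flush produces an sstable of one unit, the size-tiered variant merges exactly `threshold` equally sized sstables at a time, and both function names are hypothetical rather than part of the leveled strategy code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Units rewritten when every new sstable is compacted with all of L0.
std::uint64_t rewritten_merge_all(std::uint64_t flushes) {
    std::uint64_t total = 0;
    std::uint64_t l0_size = 0;
    for (std::uint64_t i = 0; i < flushes; ++i) {
        l0_size += 1;              // a new one-unit sstable arrives
        if (l0_size > 1) {
            total += l0_size;      // the whole of L0 is rewritten
        }
    }
    return total;
}

// Units rewritten when L0 is compacted size-tiered: `threshold` sstables of
// the same size are merged into one sstable that is `threshold` times larger.
std::uint64_t rewritten_size_tiered(std::uint64_t flushes, std::uint64_t threshold = 4) {
    std::vector<std::uint64_t> tier_count; // tier_count[t]: sstables of size threshold^t
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < flushes; ++i) {
        if (tier_count.empty()) {
            tier_count.push_back(0);
        }
        tier_count[0] += 1;
        std::uint64_t merged_size = 1;
        for (std::size_t t = 0; t < tier_count.size() && tier_count[t] == threshold; ++t) {
            tier_count[t] = 0;
            merged_size *= threshold;   // size of the sstable produced by this merge
            total += merged_size;
            if (t + 1 == tier_count.size()) {
                tier_count.push_back(0);
            }
            tier_count[t + 1] += 1;
        }
    }
    return total;
}
```

In this model, 16 flushes rewrite 135 units when everything is merged each time and 32 units with the size-tiered scheme; the first grows quadratically with the number of flushes, the second roughly as n log n.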
Raphael S. Carvalho
80f1dca328 lcs: introduce private interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:33 -03:00
Raphael S. Carvalho
bc71f97116 lcs: make some member functions static
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:32 -03:00
Raphael S. Carvalho
f4b733efe4 lcs: make some functions const qualified
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:28 -03:00
Raphael S. Carvalho
ede0ee16b2 lcs: remove add method
Its code can be inlined because no one besides create() calls it

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:26 -03:00
Raphael S. Carvalho
00ef528e5b lcs: extract code for higher levels compaction from get_candidates_for
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:25 -03:00
Raphael S. Carvalho
a46b73c401 lcs: simplify code to get candidates for higher levels
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:19 -03:00
Raphael S. Carvalho
e954af0f0f lcs: extract round-robin heuristic for even distribution of keys into function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:15 -03:00
Raphael S. Carvalho
3c0028d921 lcs: update outdated comments for level 0 compaction
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:07 -03:00
Raphael S. Carvalho
62607ba36a lcs: improve worth_promoting_L0_candidates interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:00 -03:00
Raphael S. Carvalho
c1e42f6528 lcs: do not check if level 0 can be promoted twice
can_promote flag will be used to carry info about whether or not
level 0 can be promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:34:49 -03:00
Raphael S. Carvalho
887aab4ae7 lcs: extract code for level 0 compaction from get_candidates_for
I will split code for higher levels compaction into functions first
before putting it into its own function too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:34:41 -03:00
Pekka Enberg
ed3c62704e transport/server: Kill unused functions
Message-Id: <1499773755-27920-1-git-send-email-penberg@scylladb.com>
2017-07-11 14:57:54 +03:00
Glauber Costa
780a6e4d2e change task quota's default
The default of 2ms is somewhat arbitrary. Now that we have a lot more
mileage deploying Scylla applications in production it does sound not
only arbitrary, but high.

In particular, it is really hard to achieve 1ms latencies in the face of
CPU-heavy workloads with it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>
2017-07-11 13:50:39 +03:00
Avi Kivity
7147808797 Merge seastar upstream
* seastar 89cc97c...a2be7a4 (3):
  > configure.py: verifies boost version
  > pkg-config: Eliminate spaces in include path arguments
  > allow applications to override task-quota-ms
2017-07-11 13:50:06 +03:00
Botond Dénes
f18f724f1c Generate an error when CONTAINS is used on a non-collection column
Fixes #2255

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <517bb6268ac213aed9a1def231614c2e88f77c9f.1499764183.git.bdenes@scylladb.com>
2017-07-11 11:30:49 +02:00
Tomasz Grabiec
310d2a54d2 legacy_schema_migrator: Use separate joinpoint instance for each table
Otherwise we may deadlock, as explained in commit 5e8f0efc8:

Table drop starts with creating a snapshot on all shards. All shards
must use the same snapshot timestamp which, among other things, is
part of the snapshot name. The timestamp is generated using supplied
timestamp generating function (joinpoint object). The joinpoint object
will wait for all shards to arrive and then generate and return the
timestamp.

However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>
2017-07-11 11:21:45 +02:00
Avi Kivity
7b4412c3ce Revert "Merge "improvements for leveled strategy manifest" from Raphael"
This reverts commit 43a3e718e6, reversing
changes made to 3813e94b0a. It contains some
unrelated commits.
2017-07-11 11:12:53 +03:00
Avi Kivity
43a3e718e6 Merge "improvements for leveled strategy manifest" from Raphael
"most of changes are to improve maintainability of the strategy but
for the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"

* 'lcs_improvements' of github.com:raphaelsc/scylla: (21 commits)
  lcs: only demote sstable from level higher than target one
  lcs: improve indentation for get_overlapping_starved_sstables
  lcs: improve indentation for get_compaction_candidates
  lcs: partially sort candidates that will be trimmed
  lcs: remove quadratic behavior from L0 compaction
  lcs: introduce private interface
  lcs: make some member functions static
  lcs: make some functions const qualified
  lcs: remove add method
  lcs: extract code for higher levels compaction from get_candidates_for
  lcs: simplify code to get candidates for higher levels
  lcs: extract round-robin heuristic for even distribution of keys into function
  lcs: update outdated comments for level 0 compaction
  lcs: improve worth_promoting_L0_candidates interface
  lcs: do not check if level 0 can be promoted twice
  lcs: extract code for level 0 compaction from get_candidates_for
  dist/offline_installer: add --skip-setup option to offline installer
  dist/offline_installer/debian: install python-minimal package before installing scylla deps
  migration_manager: Give empty response to schema pulls from incompatible nodes
  migration_manager: Don't pull schema from incompatible nodes
  ...
2017-07-11 11:08:12 +03:00
Botond Dénes
3813e94b0a Add Cql.tokens and KDevelop project files to .gitignore
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ae4935d2ac0c92287022f677c3e66757c0861e13.1499753032.git.bdenes@scylladb.com>
2017-07-11 10:21:00 +03:00
Botond Dénes
61c5c2a175 transport: Fix accept typo in debug log message
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d2f9269f25ace6579a6fbe6b99f4da60a05beac8.1499753306.git.bdenes@scylladb.com>
2017-07-11 09:16:35 +03:00
Raphael S. Carvalho
8b9686e621 lcs: only demote sstable from level higher than target one
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 16:10:42 -03:00
Raphael S. Carvalho
0d0699e06e lcs: improve indentation for get_overlapping_starved_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 16:01:31 -03:00
Raphael S. Carvalho
cda2b18f83 lcs: improve indentation for get_compaction_candidates
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:55:43 -03:00
Raphael S. Carvalho
ca1c6fd9ca lcs: partially sort candidates that will be trimmed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:45:26 -03:00
Raphael S. Carvalho
28ebe1807f lcs: remove quadratic behavior from L0 compaction
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low compared to the max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.

Fixes #2432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:42:28 -03:00
Raphael S. Carvalho
0392dc5d23 lcs: introduce private interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:13:56 -03:00
Raphael S. Carvalho
dd9c9341be lcs: make some member functions static
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:13:55 -03:00
Raphael S. Carvalho
408a7f902a lcs: make some functions const qualified
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
7cba6548e2 lcs: remove add method
Its code can be inlined because no one besides create() calls it

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
0a9fcc6202 lcs: extract code for higher levels compaction from get_candidates_for
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
8709365d84 lcs: simplify code to get candidates for higher levels
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
1a9fc835a0 lcs: extract round-robin heuristic for even distribution of keys into function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:40 -03:00
Raphael S. Carvalho
258ed0afbd lcs: update outdated comments for level 0 compaction
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Raphael S. Carvalho
97b5cf94d8 lcs: improve worth_promoting_L0_candidates interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Raphael S. Carvalho
8f418e9864 lcs: do not check if level 0 can be promoted twice
can_promote flag will be used to carry info about whether or not
level 0 can be promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Raphael S. Carvalho
6785d83c02 lcs: extract code for level 0 compaction from get_candidates_for
I will split code for higher levels compaction into functions first
before putting it into its own function too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Takuya ASADA
abd2b6bd6f dist/offline_installer: add --skip-setup option to offline installer
To use the offline installer in a non-interactive way, add an option to skip scylla_setup (which runs in interactive mode).

Fixes #2533

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387981-11814-1-git-send-email-syuu@scylladb.com>
2017-07-10 15:10:30 -03:00
Takuya ASADA
e1a2be28d2 dist/offline_installer/debian: install python-minimal package before installing scylla deps
To prevent dependency error, we need to install python-minimal manually.

Fixes #2553

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387587-9032-1-git-send-email-syuu@scylladb.com>
2017-07-10 15:10:30 -03:00
Tomasz Grabiec
fa17c2a59b migration_manager: Give empty response to schema pulls from incompatible nodes
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
2017-07-10 15:10:30 -03:00
Tomasz Grabiec
8e8a26ef1b migration_manager: Don't pull schema from incompatible nodes
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. In the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
2017-07-10 15:10:30 -03:00
Tomasz Grabiec
b2f52454b9 service: Advertise schema tables format version through gossip
Will be needed to inhibit schema exchange on per-peer basis.
2017-07-10 15:10:30 -03:00
Takuya ASADA
bf49dd8aa1 dist/offline_installer: add --skip-setup option to offline installer
To use the offline installer in a non-interactive way, add an option to skip scylla_setup (which runs in interactive mode).

Fixes #2533

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387981-11814-1-git-send-email-syuu@scylladb.com>
2017-07-10 16:02:11 +03:00
Takuya ASADA
1f97e5b3f4 dist/offline_installer/debian: install python-minimal package before installing scylla deps
To prevent dependency error, we need to install python-minimal manually.

Fixes #2553

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387587-9032-1-git-send-email-syuu@scylladb.com>
2017-07-10 16:01:26 +03:00
Avi Kivity
91221e020b Merge "Silence schema pull errors during upgrade from 1.7 to 2.0" from Tomasz
"Old and new nodes will advertise different schema version because
of different format of schema tables. This will result in attempts
to sync the schema by each of the nodes.

Currently this will result in scary error messages in logs about
sync failing due to not being able to find schema of given version.
It's benign, but may scare users. In the future incompatibilities
could result in more subtle errors. Better to inhibit it completely."

* 'tgrabiec/fix-schema-pull-errors-during-upgrade' of github.com:cloudius-systems/seastar-dev:
  migration_manager: Give empty response to schema pulls from incompatible nodes
  migration_manager: Don't pull schema from incompatible nodes
  service: Advertise schema tables format version through gossip
2017-07-10 14:04:04 +03:00
Pekka Enberg
8112d7c5c0 idl: Fix frozen_schema version numbers
The IDL changes will appear in 2.0 so fix up the version numbers.

Message-Id: <1499680669-6757-1-git-send-email-penberg@scylladb.com>
2017-07-10 14:02:20 +03:00
Avi Kivity
06b7ec6901 install-dependencies.sh: add snappy 2017-07-10 13:25:57 +03:00
Avi Kivity
7ddd322bce Add install-dependencies.sh
Easier to get started when a script installs all the build dependencies.
Message-Id: <20170710101657.12574-1-avi@scylladb.com>
2017-07-10 12:21:02 +02:00
Botond Dénes
e0d0f9f30c Make the CMakeLists.txt's IDE marker generic
To allow some other IDEs (e.g. KDevelop, QtCreator) to use the cmake
file in a convenient manner. Keep the existing CLION_IDE marker to
not break existing workflows.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <5ecf8c0e8a242cc8ebb0d803547bead4dadc38e2.1499667807.git.bdenes@scylladb.com>
2017-07-10 12:21:02 +02:00
Botond Dénes
66cbc45321 Add text(sstring) version of count, max and min functions
Fixes #2459

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6abb97f21c0caea8e36c7590b92a12d148195db.1499666251.git.bdenes@scylladb.com>
2017-07-10 09:06:15 +03:00
Tomasz Grabiec
72e01b7fe8 tests: commitlog: Check there are no segments left on disk after clean shutdown
Reproduces #2550.

Message-Id: <1499358825-17855-2-git-send-email-tgrabiec@scylladb.com>
2017-07-09 19:25:27 +03:00
Tomasz Grabiec
6555a2f50b commitlog: Discard active but unused segments on shutdown
So that they are not left on disk even though we did a clean shutdown.

First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose positions are cleared can be
removed.

Fixes #2550.

Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
2017-07-09 19:25:22 +03:00
Tomasz Grabiec
d33d29ad95 legacy_schema_migrator: Drop tables instead of truncate()+remove()
It achieves similar effect, but is safer than non-standard remove()
path. The latter was missing unregistration from compaction manager.

Fixes #2554.

Message-Id: <1499447165-30253-1-git-send-email-tgrabiec@scylladb.com>
2017-07-09 18:36:44 +03:00
Duarte Nunes
136accdbf6 database: Fix typos in metric descriptions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170709145522.19534-1-duarte@scylladb.com>
2017-07-09 18:35:17 +03:00
Raphael S. Carvalho
7f7758fb6f tests/sstable: make sstable_expired_data_ratio more robust
this change will stress the histogram's ability to return a good estimation
after merging keys such that it doesn't grow beyond size limit.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170708205713.5958-1-raphaelsc@scylladb.com>
2017-07-09 10:33:10 +03:00
Botond Dénes
4f6b2a1ff0 transport: Move "accept failed" message to the debug log
Fixes #2518

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <492ea8a916bb3b2427f6cc16a4f6eadadaa30b10.1499418234.git.bdenes@scylladb.com>
2017-07-08 10:59:03 +03:00
Takuya ASADA
09aeb2aabe dist/debian/pbuilderrc: merge Debian releases
Merge duplicated lines to simplify pbuilderrc.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499454242-3716-1-git-send-email-syuu@scylladb.com>
2017-07-08 10:56:54 +03:00
Tomasz Grabiec
07ed512060 migration_manager: Give empty response to schema pulls from incompatible nodes
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
2017-07-07 19:09:57 +02:00
Tomasz Grabiec
5f613d0527 migration_manager: Don't pull schema from incompatible nodes
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. In the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
2017-07-07 19:08:59 +02:00
Tomasz Grabiec
18a9e1762c service: Advertise schema tables format version through gossip
Will be needed to inhibit schema exchange on per-peer basis.
2017-07-07 19:07:59 +02:00
Piotr Jastrzebski
a4b6cfe8f0 row_cache: use continuity info in single partition queries
If a query requests a single partition that is inside
a range that has already been queried, use the continuity info
and don't go to disk when it's not needed.

Fixes #2244.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
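The idea in the commit above can be sketched with an ordered map. The sketch assumes that a continuity flag on an entry means "no partitions exist between the previous cached entry and this one"; the `cache_entry`, `lookup_result` and `lookup` names are hypothetical and much simpler than the real row_cache.

```cpp
#include <map>

// Hypothetical cache entry; `continuous` is assumed to mean that the key
// range between the previous cached entry and this one is fully populated.
struct cache_entry {
    bool continuous;
};

enum class lookup_result { hit, known_absent, must_read_disk };

// Single-partition lookup that consults continuity before going to disk.
lookup_result lookup(const std::map<int, cache_entry>& cache, int key) {
    auto it = cache.lower_bound(key);       // first entry with a key >= `key`
    if (it != cache.end() && it->first == key) {
        return lookup_result::hit;
    }
    if (it != cache.end() && it->second.continuous) {
        // The gap that contains `key` is known to be empty.
        return lookup_result::known_absent;
    }
    return lookup_result::must_read_disk;
}
```

A key that falls inside a range already scanned by an earlier range query is answered from the cache, even though no entry exists for it.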
Piotr Jastrzebski
b950c59bbb row_cache: Fix wrong comment on continuity flag
This comment stated the exact opposite of the
truth, which is very misleading.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <79062a061e22ef4c4add24cbdf723cbfb5cda060.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
Piotr Jastrzebski
70f4b23876 row_cache_test: Add test to reproduce issue 2544
This test checks that the cache uses continuity information
for single partition queries inside a range that has already been
queried.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2ebd03ff5366e554d520f86da8054e0b9eff4178.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
Avi Kivity
ecda97edeb Merge seastar upstream
* seastar c848486...89cc97c (2):
  > future-utils: fix do_for_each exception reporting
  > core/thread: Fix unwind information for seastar threads
2017-07-06 17:28:29 +03:00
Jesse Haber-Kucharsky
4f838a82e2 Add guide for getting started with development ("hacking")
This change adds the start of what will hopefully be a continually evolving and
improving document for helping developers and contributors to get started with
Scylla development.

The first part of the document is general advice and information that is broadly
applicable.

The second part is an opinionated example of a particular work-flow and set of
tools. This is intended to serve as a starting point and inspire contributors to
develop their own work-flow.

The section on branching is marked "TODO" for now, and will be addressed by a
subsequent change.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <470a542a92aff20d6205fb94b3fb26168735ae6f.1499319310.git.jhaberku@scylladb.com>
2017-07-06 15:59:16 +03:00
Duarte Nunes
3dd0397700 wrapping_range: Fix lvalue transform()
Instead of copying and moving the bound, pass it by reference so the
transformer can decide whether it wants to copy or not. The only
caller so far doesn't want a copy and takes the value by reference,
which would be capturing a temporary value. Caught by the
view_schema_test with gcc7.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705210255.29669-1-duarte@scylladb.com>
2017-07-06 15:47:49 +03:00
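A minimal sketch of the pattern described in the commit above, assuming a simplified range type: `bound_pair` is a hypothetical stand-in for wrapping_range, and its lvalue `transform()` hands each bound to the transformer by const reference, so a transformer that returns a reference refers to the stored bound rather than to a temporary copy.

```cpp
#include <functional>
#include <optional>
#include <utility>

// Hypothetical stand-in for a range with two optional bounds.
template <typename T>
struct bound_pair {
    std::optional<T> start;
    std::optional<T> end;

    // lvalue overload: the transformer receives `const T&` and decides for
    // itself whether to copy the bound or keep a reference to it.
    template <typename Fn>
    auto transform(Fn&& fn) const & {
        using U = decltype(fn(std::declval<const T&>()));
        bound_pair<U> out;
        if (start) {
            out.start = fn(*start);
        }
        if (end) {
            out.end = fn(*end);
        }
        return out;
    }
};
```

With a transformer such as `[](const int& v) { return std::cref(v); }`, the returned references point into the original range and stay valid for as long as that range lives. Had the bound been copied before the call, the same transformer would capture a temporary.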
Raphael S. Carvalho
ff50b57761 dist: fix spelling mistakes in dev-mode.conf
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170705202054.4614-1-raphaelsc@scylladb.com>
2017-07-06 15:08:17 +03:00
Botond Dénes
b1082641f9 Make sure keyspace strategy class is stored in qualified form
Even when it's provided in unqualified (short) form.
Fixes #767

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4379f8864843e64c097d432fd06129ce4025f100.1499322476.git.bdenes@scylladb.com>
2017-07-06 14:50:00 +03:00
Botond Dénes
c4277d6774 cql3: Add K_FROZEN and K_TUPLE to basic_unreserved_keyword
To allow the non-reserved keywords "frozen" and "tuple" to be used as
column names without double-quotes.

Fixes #2507

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9ae17390662aca90c14ae695c9b4a39531c6cde6.1499329781.git.bdenes@scylladb.com>
2017-07-06 12:25:38 +03:00
Avi Kivity
a6d9cf09a7 build: fix excessive stack usage in CqlParser in debug mode
The state machines generated by antlr allocate many local variables per function.
In release mode, the stack space occupied by the variables is reused, but in debug
build, it is not, due to Address Sanitizer setting -fstack-reuse=none. This causes
a single function to take above 100k of stack space.

Fix by hacking the generated code to use just one variable.

Fixes #2546
Message-Id: <20170704135824.13225-1-avi@scylladb.com>
2017-07-05 23:05:26 +02:00
Duarte Nunes
d583ef6860 thrift/handler: Remove leftover debug artifacts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705161156.2307-1-duarte@scylladb.com>
2017-07-05 19:57:07 +03:00
Takuya ASADA
6d0bd01e0f dist/offline_installer/redhat: enable EPEL repo before try to install makeself
To prevent a yum install error, we need to enable the EPEL repo before
installing makeself.

Fixes #2508

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499196715-19710-1-git-send-email-syuu@scylladb.com>
2017-07-05 09:50:03 +03:00
Takuya ASADA
71624d7919 dist/common/scripts/scylla_raid_setup: prevent renaming MDRAID device after reboot
On Debian variants, mdadm.conf should be placed in /etc/mdadm instead of /etc.
Also, it seems we need update-initramfs to fix the renaming issue.

Fixes #2502

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499179912-14125-1-git-send-email-syuu@scylladb.com>
2017-07-04 18:07:20 +03:00
Avi Kivity
b1a0e37fcb Merge "Adjust row cache metrics for row granularity" from Tomasz
* tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev:
  row_cache: Switch _stats.hits/misses to row granularity
  row_cache: Rename num_entries() to partitions() for clarity
  row_cache: Track mispopulations also at row level
  row_cache: Track row insertions
  row_cache: Track row hits and misses
  row_cache: Make mispopulation counter also apply for continuity information
  row_cache: Add partition_ prefix to current counters
  misc_services: Switch to using reads_with[_no]_misses counters
  row_cache: Add metrics for operations on underlying reader
  row_cache: Add reader-related metrics
  row_cache: Remove dead code
2017-07-04 15:20:25 +03:00
Tomasz Grabiec
37d2b6b3c6 row_cache: Switch _stats.hits/misses to row granularity
Those are exported by the RESTful APIs called
"get_row_hits/get_row_misses" and reported by nodetool.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
62c76abf71 row_cache: Rename num_entries() to partitions() for clarity 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
60c2a86192 row_cache: Track mispopulations also at row level 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
94547db620 row_cache: Track row insertions 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a58f2c8640 row_cache: Track row hits and misses 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
77b2a92ece row_cache: Make mispopulation counter also apply for continuity information 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a5fdff2ac2 row_cache: Add partition_ prefix to current counters
In preparation for adding per-row counters.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
ae4b24db06 misc_services: Switch to using reads_with[_no]_misses counters
They better approximate the intended meaning than hits/misses, which
according to Gleb is whether a read did any I/O or not.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
6a22cbceaf row_cache: Add metrics for operations on underlying reader 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
5c7b6fc164 row_cache: Add reader-related metrics 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
be2e89d596 row_cache: Remove dead code 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
e720b317c9 row_cache: Restore update of concurrent_misses_same_key
It was lost in action in 6f6575f456.

Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 14:51:05 +03:00
Avi Kivity
66e56511d6 Merge "Use selective_token_range_sharder in repair" from Asias
"This series introduces selective_token_range_sharder and uses it in repair to
generate the dht::token_range values that belong to a specific shard."

* tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev:
  repair: Use selective_token_range_sharder
  tests: Add test_selective_token_range_sharder
  dht: Add selective_token_range_sharder
2017-07-04 14:14:33 +03:00
Asias He
b10e961a64 repair: Use selective_token_range_sharder
With this change, we ask all the shards to handle the ranges provided by
the user, and we use selective_token_range_sharder to split the ranges and
ignore the ranges that do not belong to the current shard.
2017-07-04 18:46:19 +08:00
Asias He
2a794db61b tests: Add test_selective_token_range_sharder 2017-07-04 18:46:19 +08:00
Asias He
d835cf2748 dht: Add selective_token_range_sharder
It is like ring_position_range_sharder but it works with
dht::token_range. This sharder will return the ranges that belong to a
selected shard.
2017-07-04 18:46:19 +08:00
Tomasz Grabiec
1d6fec0755 row_cache: Drop not very useful prefixes from metric names
This drops "total_operations_" and "objects_" prefixes. There is no
convention of adding them in other parts of the system, and they don't
add much value.

Fixes scylladb/scylla-grafana-monitoring#169.

Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 13:37:12 +03:00
Nadav Har'El
d95f908586 Fix test to use non-wrapping range
The test put a wrapping range into a non-wrapping range variable.
This was harmless at the time this test was written, but newer code
may not be as forgiving so better use a non-wrapping range as intended.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170704103128.29689-1-nyh@scylladb.com>
2017-07-04 13:36:29 +03:00
Avi Kivity
07b8adce0e sstables: fix use-after-free in read_simple()
`r` is moved-from, and later captured in a different lambda. The compiler may
choose to move and perform the other capture later, resulting in a use-after-free.

Fix by copying `r` instead of moving it.

Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>
2017-07-04 10:24:07 +02:00
Raphael S. Carvalho
7b777fe2e3 sstables/lcs: choose sstable with highest droppable tombstone ratio
Currently, for tombstone compaction, lcs chooses the sstable with
the lowest ratio among those whose ratio is above the threshold
(0.2 by default).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170703185633.6644-1-raphaelsc@scylladb.com>
2017-07-04 10:25:10 +03:00
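The selection rule named in this commit's title can be sketched as follows. This is a language-agnostic illustration in Python, not Scylla's C++ code; the function name and the `ratios` mapping are hypothetical, and only the 0.2 default threshold comes from the message above.

```python
def pick_tombstone_compaction_candidate(ratios, threshold=0.2):
    """Return the sstable name with the highest droppable tombstone ratio.

    `ratios` maps a (hypothetical) sstable name to its estimated droppable
    tombstone ratio. Only sstables whose ratio is above `threshold` are
    eligible; None is returned when nothing qualifies.
    """
    eligible = {name: r for name, r in ratios.items() if r > threshold}
    if not eligible:
        return None
    # Highest ratio first: that sstable frees the most space when compacted.
    return max(eligible, key=eligible.get)
```

Picking the minimum of the eligible set instead of the maximum would reproduce the old behaviour that the message describes.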
Avi Kivity
bcf7867ac9 Merge "small fixes and cleanup for leveled strategy (part 2)" from Raphael
* 'lcs_improvements_part_2' of github.com:raphaelsc/scylla:
  lcs: Match estimated tasks arithmetic to score in LCS
  lcs: prevent leveled_compaction_strategy.hh from being included more than once
  lcs: use vector instead for storing a level of sstables
  compaction: keep only one variant of size_tiered_most_interesting_bucket
  lcs: get rid of unused code in leveled_manifest
2017-07-04 10:10:53 +03:00
Raphael S. Carvalho
7606ffd744 lcs: Match estimated tasks arithmetic to score in LCS
Contains fix for CASSANDRA-8904.

Added TARGET_SCORE to get rid of magic number for target score which
is now used more than once.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:35:02 -03:00
Raphael S. Carvalho
dfb5463478 lcs: prevent leveled_compaction_strategy.hh from being included more than once
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:35:00 -03:00
Raphael S. Carvalho
db98ab6aaf lcs: use vector instead for storing a level of sstables
list is no longer needed because lcs no longer moves a sstable breaking
invariant at its level to level 0. Now lcs incrementally restores invariant
by compacting together first set of overlapping tables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:34:57 -03:00
Raphael S. Carvalho
b350352e6c compaction: keep only one variant of size_tiered_most_interesting_bucket
two variants of size_tiered_most_interesting_bucket existed to avoid copy,
but subsequent work will make lcs use vector for each level of sstables,
so let's only keep one variant.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:34:51 -03:00
Raphael S. Carvalho
5921600b95 lcs: get rid of unused code in leveled_manifest
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:34:34 -03:00
Nadav Har'El
d177ec05cb repair: further limit parallelism of checksum calculation
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.

But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.

So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparison (sending the checksum requests to other
nodes and waiting for them to return), but only 2 will ever be in the
stage of reading from disk and checksumming.

The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.

Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
2017-07-03 18:14:57 +03:00
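The two-level limit described in this commit can be sketched as follows. This is a language-agnostic illustration using Python's asyncio, not Scylla's Seastar code: `run_repair`, `compare_checksums`, `MAX_COMPARISONS` and `MAX_CALCULATIONS` are hypothetical names, and the `sleep(0)` calls merely stand in for disk and network work. Only the limits 100 and 2 are taken from the message above.

```python
import asyncio

MAX_COMPARISONS = 100  # ongoing checksum comparisons per shard
MAX_CALCULATIONS = 2   # comparisons allowed in the read-and-checksum stage

async def run_repair(num_ranges):
    """Run `num_ranges` comparisons and report the peak concurrency seen."""
    comparisons = asyncio.Semaphore(MAX_COMPARISONS)
    calculations = asyncio.Semaphore(MAX_CALCULATIONS)
    stats = {"calc": 0, "max_calc": 0, "cmp": 0, "max_cmp": 0}

    async def compare_checksums(range_id):
        async with comparisons:              # outer limit: whole comparison
            stats["cmp"] += 1
            stats["max_cmp"] = max(stats["max_cmp"], stats["cmp"])
            async with calculations:         # inner limit: memory-hungry stage
                stats["calc"] += 1
                stats["max_calc"] = max(stats["max_calc"], stats["calc"])
                await asyncio.sleep(0)       # stands in for read + checksum
                stats["calc"] -= 1
            await asyncio.sleep(0)           # stands in for the remote round trip
            stats["cmp"] -= 1

    await asyncio.gather(*(compare_checksums(i) for i in range(num_ranges)))
    return stats["max_calc"], stats["max_cmp"]
```

Holding the inner semaphore only around the read-and-checksum stage is what lets the outer limit stay high without multiplying buffer memory.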
Piotr Jastrzebski
80f08921c4 Make table_helper independent from trace_keyspace_helper
table_helper is a generic helper that can easily be used in other places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <11e46dbc1c90d0273a41c8144e6f6013e21efcdb.1499077818.git.piotr@scylladb.com>
2017-07-03 15:55:00 +03:00
Raphael S. Carvalho
972a0237ef database: restore indentation for cleanup_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-2-raphaelsc@scylladb.com>
2017-07-03 12:48:54 +03:00
Raphael S. Carvalho
b9d0645199 database: fix potential use-after-free in sstable cleanup
when do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, sstable object will be used after
freed because it was taken by ref and the container it lives in was
destroyed prematurely.

Let's fix it with a do_with, also making code nicer.

Fixes #2537.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
2017-07-03 12:48:53 +03:00
Avi Kivity
5883e85da3 Merge "improve maintainability of compaction strategies" from Raphael
"compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to improve maintainability of the strategies by
putting each of them into its own header.

NOTE: No semantic change is introduced here."

* 'improve_compaction_strategy_maintainability' of github.com:raphaelsc/scylla:
  compaction_strategy: move dtcs to its existing header
  compaction_strategy: move lcs implementation to its own header
  compaction_strategy: move stcs implementation to its own header
  compaction_strategy: move compaction_strategy_impl to its own header
2017-07-03 11:39:30 +03:00
Takuya ASADA
0c81974bc4 dist/common/systemd: move scylla-server.service to be after network-online.target instead of network.target
To make sure Scylla starts after the network is up, we need to move from
network.target to network-online.target.

Fixes #2337

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493661832-9545-1-git-send-email-syuu@scylladb.com>
2017-07-03 10:01:21 +03:00
Asias He
b2a2fbcf73 repair: Do not store the failed ranges
The number of failed ranges can be large so it can consume a lot of memory.
We already logged the failed ranges in the log. No need to store them
in memory.

Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>
2017-07-03 10:00:25 +03:00
Takuya ASADA
1c35549932 dist/common/scripts/scylla_cpuscaling_setup: skip configuration when cpufreq driver isn't loaded
Configuring the cpufreq service on VMs/IaaS causes an error because cpufreq isn't supported there.
To prevent the error, skip the whole configuration when the driver is not loaded.

Fixes #2051

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>
2017-07-03 09:59:56 +03:00
Takuya ASADA
e645b0fb13 dist/common/scripts: move EC2 configuration verification to 'scylla_ec2_check'
Currently we only have EC2 configuration verification on AMI, so move it to
/usr/lib/scylla and run it from scylla_setup, to make it usable for
non-AMI users.

Fixes #1997

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498811107-29135-1-git-send-email-syuu@scylladb.com>
2017-07-03 09:59:28 +03:00
Avi Kivity
6895f6e603 sstable_datafile_test: fix sstable_expired_data_ratio failure
A comment states that we want the file to be old enough, but sets
a timestamp of max(), which is in the future. This may have passed
because the conversion from numeric_limits<time_t>::max() to
db_clock::time_point is not well defined (their dynamic range is
different), so truncation may have converted the large number to a
low one.
Message-Id: <20170702082903.20879-1-avi@scylladb.com>
2017-07-02 20:22:51 +02:00
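One way such a conversion could turn a huge value into a small one is fixed-width wraparound. The sketch below is only an illustration in Python: it assumes a 64-bit `time_t` counted in seconds and a millisecond-resolution target clock, neither of which is stated in the message, and it models two's-complement wraparound, which C++ does not guarantee for signed overflow. The helper names are hypothetical.

```python
INT64_MAX = 2**63 - 1  # assumed value of numeric_limits<time_t>::max()

def wrap_int64(value):
    """Reduce an arbitrary Python int to a signed 64-bit value (two's complement)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value

def seconds_to_millis_wrapped(seconds):
    """Convert seconds to milliseconds the way a 64-bit counter that wraps would."""
    return wrap_int64(seconds * 1000)
```

Under these assumptions the "far future" timestamp wraps to a value just before the epoch, which would explain why the test used to see an old file.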
Avi Kivity
51b6066212 cql3: operation: correctly format error messages
Error messages incorrectly used the debug representation of the receiver,
rather than the text representation of the operation itself.

Fixes #113.
Message-Id: <20170701101325.3163-1-avi@scylladb.com>
2017-07-02 20:06:50 +02:00
Duarte Nunes
d157e4558a utils/log_histogram: Remove largest() function
It should never have existed in the first place, as there are no
legitimate callers and it can be misused.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170630095939.2429-1-duarte@scylladb.com>
2017-07-02 14:29:17 +03:00
Gleb Natapov
d23111312f main: wait for wait_for_gossip_to_settle() to complete during boot
Boot should not continue until a future returned by
wait_for_gossip_to_settle() is resolved.  Commit 991ec4a16 mistakenly
broke that, so restore it back. Also fix calls for supervisor::notify()
to be in the right places.

Message-Id: <20170702082355.GQ14563@scylladb.com>
2017-07-02 11:32:36 +03:00
Avi Kivity
5bc13e4454 Revert "Make table_helper independent from trace_keyspace_helper"
This reverts commit db5bf363d0. Causes
errors of the sort

    Exiting on unhandled exception: exceptions::invalid_request_exception
    (Keyspace 'system_traces' does not exist)
2017-07-02 11:30:51 +03:00
Avi Kivity
7c809917b6 compaction_manager: fix debug mode build (periodic_compaction_submission_interval)
Turn static constexpr variable into a function.
2017-07-01 19:34:46 +03:00
Avi Kivity
c2c69e003f compaction: fix build on debug mode (DEFAULT_TOMBSTONE_COMPACTION_INTERVAL)
Debug mode wants to allocate storage for a constexpr variable for some
reason. Turn it into a function.
2017-07-01 19:26:22 +03:00
Avi Kivity
59f649e2bc Revert "cql_server::do_accepts: modernize loop"
This reverts commit 37af493f6e. Connections
are not accepted and ^C does not work anymore.
2017-07-01 12:54:23 +03:00
Jesse Haber-Kucharsky
1100bb8a5b cql: Eagerly throw lexing and parsing exceptions
Previously, lexing and parsing errors were aggregated while CQL queries were
evaluated. Afterwards, the first collected error (if present) would be thrown as
an exception.

The problem was that when parsing and lexing errors were aggregated this way,
the parser would continue even in spite of errors like "no viable alternative".
Semantic actions attached to grammar rules would still execute, though with
variables that had not yet been initialized. This would crash Scylla.

This change modifies the error-handling strategy of CQL parsing. Rather than
aggregate errors, we throw an exception on the first error we encounter. This
ensures that grammar actions never execute unless there is a precise match.

One possible issue with this approach is that the generated C++ code from the
ANTLR grammar may not be exception-safe. I compiled Scylla in debug-mode with
ASan support and executed several erroneous CQL queries with `cqlsh`. No memory
leaks were reported.

Fixes #2466.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <db1f650a2bbb615b506d9015486eece45375a440.1498836703.git.jhaberku@scylladb.com>
2017-07-01 12:13:44 +03:00
Raphael S. Carvalho
69a9ad468c compaction_strategy: move dtcs to its existing header
Goal is to improve maintainability.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:09 -03:00
Raphael S. Carvalho
4d387475fe compaction_strategy: move lcs implementation to its own header
Goal is to improve maintainability.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:07 -03:00
Raphael S. Carvalho
4b46d286fd compaction_strategy: move stcs implementation to its own header
Goal is to improve maintainability.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:06 -03:00
Raphael S. Carvalho
0d9bb0da39 compaction_strategy: move compaction_strategy_impl to its own header
compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to eventually improve maintainability of the
strategies by putting each of them into its own header.
This is the first step towards that goal.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:04 -03:00
Raphael S. Carvalho
9fa855e105 compaction_strategy: use duration type for default tombstone compaction interval
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630041838.20604-1-raphaelsc@scylladb.com>
2017-06-30 08:56:22 +03:00
Piotr Jastrzebski
db5bf363d0 Make table_helper independent from trace_keyspace_helper
table_helper is a generic helper that can easily be used in other places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <3e360a963d4a53de6d758ba8bada78fc572f001a.1498745600.git.piotr@scylladb.com>
2017-06-29 17:20:07 +03:00
Tomasz Grabiec
97005825bf row_cache: Fix compilation errors with gcc 5
Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>
2017-06-29 16:34:46 +03:00
Avi Kivity
6da9b6eb81 cql3: error_listener: add virtual destructor
Found by Eclipse.
Message-Id: <20170629063324.31309-1-avi@scylladb.com>
2017-06-29 10:51:20 +02:00
Avi Kivity
9298fea27b Merge seastar upstream
* seastar 0ab7ae5...c848486 (2):
  > build: export full cflags in pkgconfig file (Fixes #2439)
  > configure: Avoid putting tmp file on /tmp
2017-06-29 11:35:24 +03:00
Avi Kivity
fc966c0c4c Merge "tombstone removal compaction" from Raphael
"This feature is intended to make compaction more efficient at getting rid of
droppable tombstone and expired data wasting disk space. So far, people have
been dealing with it manually through major compaction.

With strategies other than date tiered, large sstables will be left untouched
for a long time even though it's all expired. Date tiered suffers from it when
mixing data with different TTL because it only includes for compaction sstable
that is fully expired.

sstables keep as metadata a histogram which allows us to easily estimate
droppable data ratio from gc_before. sstables whose droppable data ratio is
above 20% (default value for tombstone_threshold option) will be considered
candidates for the operation.

Like in C*, we will only do tombstone removal compaction when there's nothing
to compact in standard way. It would be interesting to trigger it too when
disk usage is above a given threshold, but I decided to leave this for later.

Fixes #2306."

* 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla:
  tests: more testing for tombstone compaction options
  tests: basic tombstone compaction test for date tiered
  compaction/dtcs: add support for tombstone compaction
  tests: basic test of tombstone compaction with lcs
  compaction/lcs: add support for tombstone compaction
  tests: basic tombstone compaction test for size tiered
  compaction/stcs: add support for tombstone compaction
  tests: add test for estimation of droppable tombstone ratio
  sstables: introduce function to estimate droppable tombstone ratio
  compaction_manager: periodically submit cfs for compaction
  streaming_histogram: fix coding style
  tests: add streaming_histogram_test
  streaming_histogram: implement sum
  tests: add test for sstable with bad tombstone histogram
  sstables: discard bad streaming histogram for future use
  tests: add sstable tombstone histogram test
  streaming_histogram: fix update
  streaming_histogram: move it to utils
  streaming_histogram: do not limit it to be used by sstables
  sstables: update tombstone_histogram for cells with expiration time
2017-06-29 10:19:59 +03:00
Avi Kivity
1317c4a03e Update ami submodule
* dist/ami/files/scylla-ami f10db69...5dfe42f (1):
  > don't fetch perf from amazon repo
2017-06-29 09:38:48 +03:00
Raphael S. Carvalho
ab335c8085 tests: more testing for tombstone compaction options
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
ce4dc15a20 tests: basic tombstone compaction test for date tiered
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
f76ece5349 compaction/dtcs: add support for tombstone compaction
Unlike other strategies, dtcs has tombstone compaction disabled by
default due to:
- deletion shouldn't be used with DTCS; rather data is deleted through TTL.
- with time series workloads, it's usually better to wait for whole sstable
to be expired rather than compacting a single sstable when it's more than
20% (default value) expired.
See CASSANDRA-9234 for more details.

For tombstone compaction, unworthy sstables are filtered out and the oldest
one is chosen because it's the one less likely to shadow data and it's also
relatively big.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
c400bf97b9 tests: basic test of tombstone compaction with lcs
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
70e54cfe6e compaction/lcs: add support for tombstone compaction
LCS will choose its candidate by starting from highest level and
getting sstable which has highest droppable tombstone ratio.
Unlike STCS which needs to choose oldest sstable from biggest tier,
LCS can choose the one with highest d__t__r because sstables in
a given level don't overlap.
Sstable picked up for tombstone removal compaction won't be demoted
or promoted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
138fda468f tests: basic tombstone compaction test for size tiered
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
8fd80ac22c compaction/stcs: add support for tombstone compaction
It is hard to find peers for larger sstables, so they are
left uncompacted for a long time. Expired data and tombstones which
can be purged will waste disk space meanwhile.

sstable tracks droppable tombstones, from which a ratio can be calculated.
If the ratio is greater than the threshold (0.2 by default), the sstable will
be eligible for compaction. Oldest sstables from biggest tiers are
preferable because droppable data in them is more likely to satisfy
the conditions for purge, like not shadowing data in another sstable.

Subsequent patches will add support in leveled and date tiered
strategies.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
ad24470972 tests: add test for estimation of droppable tombstone ratio
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
eb6d17b748 sstables: introduce function to estimate droppable tombstone ratio
Function used to estimate ratio of droppable tombstone.
A tombstone is considered droppable for cells expired before
gc_before and regular tombstones older than gc_before.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
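The estimate described in this commit can be sketched as follows. This is a simplified, language-agnostic illustration in Python, not the function added to Scylla: it counts whole histogram bins instead of interpolating, and the `total_cells` denominator and both function names are assumptions made for the example. The 0.2 default is the threshold quoted elsewhere in this series.

```python
def estimate_droppable_ratio(histogram, gc_before, total_cells):
    """Estimate the fraction of an sstable's cells that could be dropped.

    `histogram` maps a deletion or expiration time to the number of
    tombstones/expiring cells recorded at that time. Entries strictly older
    than `gc_before` are counted as droppable. `total_cells` is a
    hypothetical estimate of all cells in the sstable.
    """
    if total_cells <= 0:
        return 0.0
    droppable = sum(count for t, count in histogram.items() if t < gc_before)
    return droppable / total_cells

def worth_tombstone_compaction(histogram, gc_before, total_cells, threshold=0.2):
    """True when the estimated ratio exceeds the tombstone threshold."""
    return estimate_droppable_ratio(histogram, gc_before, total_cells) > threshold
```

For example, with 10 of 40 cells older than `gc_before` the ratio is 0.25, so the sstable would be a candidate under the default threshold.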
Raphael S. Carvalho
0d21129cc7 compaction_manager: periodically submit cfs for compaction
This is useful for a column family which isn't generating new content
and will have lots of expired data later on that can be purged.
Compaction submission is NO-OP if there's nothing to do, so I think
it's reasonable to do it at an interval of 1 hour.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:03 -03:00
Raphael S. Carvalho
719dbf547d streaming_histogram: fix coding style
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
6fb26d9f0c tests: add streaming_histogram_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
a65b9eb8b4 streaming_histogram: implement sum
This function is used to estimate number of points in interval
[-inf,b]. It will be useful for estimating droppable tombstone
ratio in a given sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
c01c659594 tests: add test for sstable with bad tombstone histogram
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
06fabf9810 sstables: discard bad streaming histogram for future use
Find bad histogram which had incorrect elements merged due to use of
unordered map. The keys will be unordered. A histogram whose size is
less than the max allowed will be correct because no entries needed to be
merged, so we can avoid discarding those.

This is important because histogram for tombstone will be used to
estimate droppable tombstone ratio. If it's incorrectly high for many
of existing sstables, we will needlessly compact lots of them.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:10 -03:00
Raphael S. Carvalho
7b532867ce tests: add sstable tombstone histogram test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 01:17:28 -03:00
Raphael S. Carvalho
f35bd66da4 streaming_histogram: fix update
This bug was introduced when converting java code. Return value
of map::erase() was used as if it were the value of the removed
entry, but it's actually the number of removed entries.
update() also relies on ordered keys, so map is used instead
by histogram.
In addition, histograms will be written in sorted order (like C*
does) such that we can detect bad histograms, using disk_array.
disk_array is also used from now on to read histograms.
The conversion from array to map is fine because histograms for
sstables are limited to 100 elements.

Coming patch will detect bad histograms (generated only by us)
and discard them, because we can't rely on their information.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 01:17:26 -03:00
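Why update() needs ordered keys can be seen in a small sketch. This is an illustration in Python, not the Scylla code: it assumes the usual streaming-histogram scheme in which, once the bin limit is exceeded, the two closest adjacent bins are merged into their weighted average. The message above does not spell that scheme out, and the function name is hypothetical.

```python
def histogram_update(bins, point, max_bins, count=1):
    """Add `point` to `bins` (a dict of key -> count), merging if over the limit."""
    bins[point] = bins.get(point, 0) + count
    if len(bins) <= max_bins:
        return bins
    # Adjacent bins can only be found by walking the keys in sorted order.
    keys = sorted(bins)
    i = min(range(len(keys) - 1), key=lambda j: keys[j + 1] - keys[j])
    k1, k2 = keys[i], keys[i + 1]
    # dict.pop() returns the removed value; C++ map::erase(key) returns the
    # number of removed entries instead, which is the mix-up this commit fixes.
    c1, c2 = bins.pop(k1), bins.pop(k2)
    merged = (k1 * c1 + k2 * c2) / (c1 + c2)
    bins[merged] = bins.get(merged, 0) + c1 + c2
    return bins
```

With keys in arbitrary order the "closest pair" would be picked among non-neighbours, producing the incorrectly merged bins that the later patch in this series detects and discards.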
Amnon Heiman
644868d816 api: remove reply creation
As a preparation for the http stream support, creation of an empty reply
should be avoided.

This patch removes a line that cannot be reached but causes the compiler
to complain.

It has no effect aside of removing the reply creation.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170628130202.8132-1-amnon@scylladb.com>
2017-06-28 16:30:58 +03:00
Tomasz Grabiec
786e75dbf7 row_cache: Use continuity information to decide whether to populate
If cache is missing given key, but the range is marked as continuous,
it means sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.

Another positive effect is that we don't have to clear continuity now.

Fixes #1999.

Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 13:32:48 +03:00
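The decision described in this commit can be sketched as follows. This is a language-agnostic illustration in Python, not the row_cache code; the function name and the `presence_checker` callable (returning True when sstables may contain the key) are hypothetical.

```python
def should_insert_on_update(range_is_continuous, presence_checker, key):
    """Decide whether a key missing from the cache may be inserted on update.

    If the surrounding range is marked continuous, the cache already knows
    that sstables hold no such entry, so the insert is safe and the more
    expensive (bloom filter based) presence check is skipped entirely.
    """
    if range_is_continuous:
        return True
    # Fall back to the presence checker; it can report false positives,
    # in which case the entry is conservatively not inserted.
    return not presence_checker(key)
```

The short-circuit on continuity is where both gains mentioned above come from: fewer checker calls and no inserts lost to false positives.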
Tomasz Grabiec
3489c68a68 lsa: Fix performance regression in eviction and compact_on_idle
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).

This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.

Introduced in 11b5076b3c.

Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 12:32:43 +03:00
Raphael S. Carvalho
a3a73899bc database: remove outdated FIXME comments
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170621002253.29660-1-raphaelsc@scylladb.com>
2017-06-28 11:06:02 +02:00
Etienne Kruger
37af493f6e cql_server::do_accepts: modernize loop
Replace recursion in cql_server::do_accepts with more modern repeat()
from future-util.hh.

Fixes #2467.

Signed-off-by: Etienne Kruger <el@loadavg.io>
Message-Id: <20170628033130.19824-1-el@loadavg.io>
2017-06-28 10:25:22 +03:00
Raphael S. Carvalho
d90f46000d streaming_histogram: move it to utils
It's not specific to sstables. May be needed somewhere else in
the future.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-28 01:07:13 -03:00
Glauber Costa
f3742d1e38 disable defragment-memory-on-idle-by-default
It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634, and also recently
I have investigated continuous performance degradation that was also
linked to defrag on idle activity.

Until we can figure out how to reduce its impact, we should disable it.

Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
2017-06-28 00:21:11 +03:00
Raphael S. Carvalho
fb9bc609c6 streaming_histogram: do not limit it to be used by sstables
streaming histogram will later be placed in /utils, so we want
it to use std::unordered_map<> instead of disk_hash<>.
That also requires implementing serialization/deserialization
functions for it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-27 16:51:52 -03:00
Raphael S. Carvalho
e224653d70 sstables: update tombstone_histogram for cells with expiration time
That tombstone_histogram is used to determine droppable data ratio
for a sstable, and unlike C*, we were only updating it for
tombstones. We need to update it with expiration time of cells too,
if any. Creation time (expiration - ttl) cannot be used because if
ttl > gc_grace_seconds, the resulting sstable could be considered
worth dropping by tombstone compaction before any data is actually
expired.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-27 16:50:38 -03:00
Avi Kivity
08488a75e0 dist: tolerate sysctl failures
sysctl may fail in a container environment if /proc is not virtualized
properly.

Fixes #1990
Message-Id: <20170625145930.31619-1-avi@scylladb.com>
2017-06-27 16:11:48 +02:00
Avi Kivity
ff7be8241f Merge "Fix compilation issues in older environments" from Tomasz
* 'tgrabiec/fix-compilation-issues' of github.com:cloudius-systems/seastar-dev:
  tests: streamed_mutation_test: Avoid using boost::size() on row ranges
  tests: row_cache: Remove unused method
2017-06-27 16:30:54 +03:00
Tomasz Grabiec
eb844a10e9 tests: streamed_mutation_test: Avoid using boost::size() on row ranges
Fails to compile with libboost 1.55.
2017-06-27 15:27:13 +02:00
Tomasz Grabiec
e68925595c tests: row_cache: Remove unused method 2017-06-27 14:10:37 +02:00
Vlad Zolotarov
6839a50677 db::commitlog: entry_writer add a virtual destructor
Add a virtual destructor for a base class commitlog::entry_writer.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1498511180-18391-1-git-send-email-vladz@scylladb.com>
2017-06-27 10:17:10 +03:00
642 changed files with 51646 additions and 24658 deletions

9
.gitignore vendored

@@ -9,3 +9,12 @@ dist/ami/files/*.rpm
dist/ami/variables.json
dist/ami/scylla_deploy.sh
*.pyc
Cql.tokens
.kdev4
*.kdev4
CMakeLists.txt.user
.cache
.tox
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit

3
.gitmodules vendored

@@ -9,3 +9,6 @@
[submodule "dist/ami/files/scylla-ami"]
path = dist/ami/files/scylla-ami
url = ../scylla-ami
[submodule "xxHash"]
path = xxHash
url = ../xxHash


@@ -5,8 +5,8 @@
cmake_minimum_required(VERSION 3.7)
project(scylla)
-if (NOT DEFINED ENV{CLION_IDE})
-    message(FATAL_ERROR "This CMakeLists.txt file is only valid for use in CLion")
+if (NOT DEFINED FOR_IDE AND NOT DEFINED ENV{FOR_IDE} AND NOT DEFINED ENV{CLION_IDE})
+    message(FATAL_ERROR "This CMakeLists.txt file is only valid for use in IDEs, please define FOR_IDE to acknowledge this.")
endif()
# Default value. A more accurate list is populated through `pkg-config` below if `seastar.pc` is available.
@@ -125,7 +125,7 @@ list(REMOVE_ITEM SEASTAR_CFLAGS "-DHAVE_GCC6_CONCEPTS")
#
# For ease of browsing the source code, we always pretend that DPDK is enabled.
target_compile_options(scylla PUBLIC
-    -std=gnu++14
+    -std=gnu++1z
-DHAVE_DPDK
-DHAVE_HWLOC
"${SEASTAR_CFLAGS}")
@@ -137,4 +137,5 @@ target_include_directories(scylla PUBLIC
${SEASTAR_DPDK_INCLUDE_DIRS}
${SEASTAR_INCLUDE_DIRS}
${Boost_INCLUDE_DIRS}
xxhash
build/release/gen)

279
HACKING.md Normal file

@@ -0,0 +1,279 @@
# Guidelines for developing Scylla
This document is intended to help developers and contributors to Scylla get started. The first part consists of general guidelines that make no assumptions about a development environment or tooling. The second part describes a particular environment and work-flow for exemplary purposes.
## Overview
This section covers some high-level information about the Scylla source code and work-flow.
### Getting the source code
Scylla uses [Git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules) to manage its dependency on Seastar and other tools. Be sure that all submodules are correctly initialized when cloning the project:
```bash
$ git clone https://github.com/scylladb/scylla
$ cd scylla
$ git submodule update --init --recursive
```
### Dependencies
Scylla depends on the system package manager for its development dependencies.
Running `./install-dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native thread, and up to 3 GB per native thread while linking.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.
To build for the first time:
```bash
$ ./configure.py
$ ninja-build
```
Afterwards, it is sufficient to just execute Ninja.
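The memory note above suggests capping the number of parallel jobs on machines with little RAM. A minimal sketch of that arithmetic follows. The `max_jobs` helper is hypothetical (it is not part of the repository), and it assumes the conservative figure of 3 GB per job quoted for linking:

```bash
# Hypothetical helper, not part of the Scylla tree.
# Usage: max_jobs <total memory in GB> <number of native threads>
# Prints a job count that stays within roughly 3 GB per job.
max_jobs() {
    local mem_gb=$1 threads=$2
    local by_mem=$(( mem_gb / 3 ))   # jobs that fit in memory
    local jobs=$(( by_mem < threads ? by_mem : threads ))
    if (( jobs < 1 )); then
        jobs=1                       # always allow at least one job
    fi
    echo "$jobs"
}

# Example: 16 GB of memory and 8 threads.
# ninja-build -j"$(max_jobs 16 8)"
```

The result is only a starting point; actual memory use depends on the build mode and on which targets are being compiled.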
The full suite of options for project configuration is available via
```bash
$ ./configure.py --help
```
The most important options are:
- `--mode={release,debug,all}`: Debug mode enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) and allows for debugging with tools like GDB. Debugging builds are generally slower and generate much larger object files than release builds.
- `--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
Source files and build targets are tracked manually in `configure.py`, so the script needs to be updated when new files or targets are added or removed.
To save time -- for instance, to avoid compiling all unit tests -- you can also specify specific targets to Ninja. For example,
```bash
$ ninja-build build/release/tests/schema_change_test
```
### Unit testing
Unit tests live in the `/tests` directory. Like with application source files, test sources and executables are specified manually in `configure.py` and need to be updated when changes are made.
A test target can be any executable. A non-zero return code indicates test failure.
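The exit-code convention can be illustrated with a small wrapper. This is only a sketch of the convention, with a hypothetical `run_test` function; it is not how `test.py` is implemented:

```bash
# Hypothetical wrapper: run one test executable (plus its arguments)
# and report the verdict based solely on the exit code.
run_test() {
    if "$@" >/dev/null 2>&1; then
        echo "PASS"   # exit code 0
    else
        echo "FAIL"   # any non-zero exit code
    fi
}

# Example:
# run_test build/release/tests/row_cache_test -- -c1 -m1G
```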
Most tests in the Scylla repository are built using the [Boost.Test](http://www.boost.org/doc/libs/1_64_0/libs/test/doc/html/index.html) library. Utilities for writing tests with Seastar futures are also included.
Run all tests through the test execution wrapper with
```bash
$ ./test.py --mode={debug,release}
```
The `--name` argument can be specified to run a particular test.
Alternatively, you can execute the test executable directly. For example,
```bash
$ build/release/tests/row_cache_test -- -c1 -m1G
```
The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread and 1 GB of memory.
### Preparing patches
All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/). There are also some guidelines that can help you make the patch review process smoother:
1. Before generating patches, make sure your Git configuration points to `.gitorderfile`. You can do it by running
```bash
$ git config diff.orderfile .gitorderfile
```
2. If you are sending more than a single patch, push your changes into a new branch of your fork of Scylla on GitHub and add a URL pointing to this branch to your cover letter.
3. If you are sending a new revision of an earlier patchset, add a brief summary of changes in this version, for example:
```
In v3:
- declared move constructor and move assignment operator as noexcept
- used std::variant instead of a union
...
```
4. Add information about the tests run with this fix. It can look like
```
"Tests: unit ({mode}), dtest ({smp})"
```
The usual is "Tests: unit (release)", although running debug tests is encouraged.
5. When answering review comments, prefer inline quotes as they make it easier to track the conversation across multiple e-mails.
### Finding a person to review and merge your patches
You can use the `scripts/find-maintainer` script to find a subsystem maintainer and/or reviewer for your patches. The script accepts a filename in the git source tree as an argument and outputs a list of subsystems the file belongs to and their respective maintainers and reviewers. For example, if you changed the `cql3/statements/create_view_statement.hh` file, run the script as follows:
```bash
$ ./scripts/find-maintainer cql3/statements/create_view_statement.hh
```
and you will get output like this:
```
CQL QUERY LANGUAGE
Tomasz Grabiec <tgrabiec@scylladb.com> [maintainer]
Pekka Enberg <penberg@scylladb.com> [maintainer]
MATERIALIZED VIEWS
Pekka Enberg <penberg@scylladb.com> [maintainer]
Duarte Nunes <duarte@scylladb.com> [maintainer]
Nadav Har'El <nyh@scylladb.com> [reviewer]
Duarte Nunes <duarte@scylladb.com> [reviewer]
```
### Running Scylla
Once Scylla has been compiled, executing the (`debug` or `release`) target will start a running instance in the foreground:
```bash
$ build/release/scylla
```
The `scylla` executable requires a configuration file, `scylla.yaml`. By default, this is read from `$SCYLLA_HOME/conf/scylla.yaml`. A good starting point for development is located in the repository at `/conf/scylla.yaml`.
For development, a directory at `$HOME/scylla` can be used for all Scylla-related files:
```bash
$ mkdir -p $HOME/scylla $HOME/scylla/conf
$ cp conf/scylla.yaml $HOME/scylla/conf/scylla.yaml
$ # Edit configuration options as appropriate
$ SCYLLA_HOME=$HOME/scylla build/release/scylla
```
The `scylla.yaml` file in the repository by default writes all database data to `/var/lib/scylla`, which likely requires root access. Change the `data_file_directories` and `commitlog_directory` fields as appropriate.
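One way to redirect those paths in the development copy is a simple text substitution. The sketch below assumes that the `/var/lib/scylla` prefix appears literally and uncommented in the file, and that GNU `sed` is available; the `relocate_scylla_paths` name is hypothetical. If the fields are commented out in your copy, edit them by hand instead.

```bash
# Hypothetical helper: rewrite every /var/lib/scylla prefix in a
# configuration file to a new root directory (edits the file in place).
# The new root must not contain '|' or '&', which are special to sed here.
relocate_scylla_paths() {
    local config=$1 new_root=$2
    sed -i "s|/var/lib/scylla|${new_root}|g" "$config"
}

# Example:
# relocate_scylla_paths "$HOME/scylla/conf/scylla.yaml" "$HOME/scylla"
```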
Scylla has a number of requirements for the file-system and operating system to operate ideally and at peak performance. However, during development, these requirements can be relaxed with the `--developer-mode` flag.
Additionally, when running on under-powered platforms like portable laptops, the `--overprovisioned` flag is useful.
On a development machine, one might run Scylla as
```bash
$ SCYLLA_HOME=$HOME/scylla build/release/scylla --overprovisioned --developer-mode=yes
```
### Branches and tags
Multiple release branches are maintained on the Git repository at https://github.com/scylladb/scylla. Release 1.5, for instance, is tracked on the `branch-1.5` branch.
Similarly, tags are used to pin-point precise release versions, including hot-fix versions like 1.5.4. These are named `scylla-1.5.4`, for example.
Most development happens on the `master` branch. Release branches are cut from `master` based on time and/or features. When a patch against `master` fixes a serious issue like a node crash or data loss, it is backported to a particular release branch with `git cherry-pick` by the project maintainers.
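The naming scheme for release branches and tags can be written down as two small functions. These helpers are hypothetical and assume a three-component version string such as `1.5.4`:

```bash
# Hypothetical helpers illustrating the naming convention described above.

# Release branch for a version: drop the last component, prefix "branch-".
release_branch() {
    local version=$1
    echo "branch-${version%.*}"
}

# Tag for a precise release version: prefix "scylla-".
release_tag() {
    local version=$1
    echo "scylla-${version}"
}

# Example:
# git checkout "$(release_tag 1.5.4)"
```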
## Example: development on Fedora 25
This section describes one possible work-flow for developing Scylla on a Fedora 25 system. It is presented as an example to help you to develop a work-flow and tools that you are comfortable with.
### Preface
This guide will be written from the perspective of a fictitious developer, Taylor Smith.
### Git work-flow
Having two Git remotes is useful:
- A public clone of Seastar (`"public"`)
- A private clone of Seastar (`"private"`) for in-progress work or work that is not yet ready to share
The first step to contributing a change to Scylla is to create a local branch dedicated to it. For example, a feature that fixes a bug in the CQL statement for creating tables could be called `ts/cql_create_table_error/v1`. The branch name is prefaced by the developer's initials and has a suffix indicating that this is the first version. The version suffix is useful when branches are shared publicly and changes are requested on the mailing list. Having a branch for each version of the patch (or patch set) shared publicly makes it easier to reference and compare the history of a change.
Setting the upstream branch of your development branch to `master` is a useful way to track your changes. You can do this with
```bash
$ git branch -u master ts/cql_create_table_error/v1
```
As a patch set is developed, you can periodically push the branch to the private remote to back-up work.
Once the patch set is ready to be reviewed, push the branch to the public remote and prepare an email to the `scylladb-dev` mailing list. Including a link to the branch on your public remote allows for reviewers to quickly test and explore your changes.
### Development environment and source code navigation
Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt`, for use only with development environments (not for building) so that they can properly analyze the source code.
[CLion](https://www.jetbrains.com/clion/) is a commercial IDE that offers reasonably good source code navigation and advice for code hygiene, though its C++ parser sometimes makes errors and flags false issues.
Other good options that directly parse CMake files are [KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).
To use the `CMakeLists.txt` file with these programs, define the `FOR_IDE` CMake variable or shell environmental variable.
[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects, and its C++ parser has many of the same issues as CLion's.
### Distributed compilation: `distcc` and `ccache`
Scylla's compilation times can be long. Two tools help somewhat:
- [ccache](https://ccache.samba.org/) caches compiled object files on disk and re-uses them when possible
- [distcc](https://github.com/distcc/distcc) distributes compilation jobs to remote machines
A reasonably-powered laptop acts as the coordinator for compilation. A second, more powerful, machine acts as a passive compilation server.
Having a direct wired connection between the machines ensures that object files can be transmitted quickly and limits the overhead of remote compilation.
The coordinator has been assigned the static IP address `10.0.0.1` and the passive compilation machine has been assigned `10.0.0.2`.
On Fedora, installing the `ccache` package places symbolic links for `gcc` and `g++` in the `PATH`. This allows normal compilation to transparently invoke `ccache` for compilation and cache object files on the local file-system.
Next, set `CCACHE_PREFIX` so that `ccache` is responsible for invoking `distcc` as necessary:
```bash
export CCACHE_PREFIX="distcc"
```
On each host, edit `/etc/sysconfig/distccd` to include the allowed coordinators and the total number of jobs that the machine should accept.
This example is for the laptop, which has 2 physical cores (4 logical cores with hyper-threading):
```
OPTIONS="--allow 10.0.0.2 --allow 127.0.0.1 --jobs 4"
```
`10.0.0.2` has 8 physical cores (16 logical cores) and 64 GB of memory.
As a rule of thumb, the number of jobs a machine is configured to accept should equal the number of its native threads.
Restart the `distccd` service on all machines.
On the coordinator machine, edit `$HOME/.distcc/hosts` with the available hosts for compilation. Order of the hosts indicates preference.
```
10.0.0.2/16 localhost/2
```
In this example, `10.0.0.2` will be sent up to 16 jobs and the local machine will be sent up to 2. Allowing for two extra threads on the host machine for coordination, we run compilation with `16 + 2 + 2 = 20` jobs in total: `ninja-build -j20`.
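The job count above can be derived from the hosts line itself. The following sketch uses a hypothetical `distcc_jobs` function and assumes every entry has the plain `host/limit` form shown in this example (real hosts files may carry extra per-host options, which this does not handle):

```bash
# Hypothetical helper: sum the per-host job limits from a distcc hosts
# line and add a few extra jobs for coordination (default: 2).
distcc_jobs() {
    local hosts=$1 extra=${2:-2} total=0 entry
    for entry in $hosts; do
        # Keep only the text after the last '/', i.e. the job limit.
        total=$(( total + ${entry##*/} ))
    done
    echo $(( total + extra ))
}

# Example:
# ninja-build -j"$(distcc_jobs "$(cat "$HOME/.distcc/hosts")")"
```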
When a compilation is in progress, the status of jobs on all remote machines can be visualized in the terminal with `distccmon-text` or graphically as a GTK application with `distccmon-gnome`.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next section for a way to speed up this process.
### Using the `gold` linker
Linking Scylla can be slow. The gold linker can replace GNU ld and often speeds the linking process. On Fedora, you can switch the system linker using
```bash
$ sudo alternatives --config ld
```
### Testing changes in Seastar with Scylla
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
One way to do this is to create a local remote for the Seastar submodule in the Scylla repository:
```bash
$ cd $HOME/src/scylla
$ cd seastar
$ git remote add local /home/tsmith/src/seastar
$ git remote update
$ git checkout -t local/my_local_seastar_branch
```

131
MAINTAINERS Normal file

@@ -0,0 +1,131 @@
M: Maintainer with commit access
R: Reviewer with subsystem expertise
F: Filename, directory, or pattern for the subsystem
---
AUTH
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Calle Wilund <calle@scylladb.com>
R: Vlad Zolotarov <vladz@scylladb.com>
R: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
F: auth/*
CACHE
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
R: Piotr Jastrzebski <piotr@scylladb.com>
F: row_cache*
F: *mutation*
F: tests/mvcc*
COMMITLOG / BATCHLOG
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Calle Wilund <calle@scylladb.com>
F: db/commitlog/*
F: db/batch*
COORDINATOR
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Gleb Natapov <gleb@scylladb.com>
F: service/storage_proxy*
COMPACTION
R: Raphael S. Carvalho <raphaelsc@scylladb.com>
R: Glauber Costa <glauber@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: sstables/compaction*
CQL TRANSPORT LAYER
M: Pekka Enberg <penberg@scylladb.com>
F: transport/*
CQL QUERY LANGUAGE
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
F: cql3/*
COUNTERS
M: Paweł Dziepak <pdziepak@scylladb.com>
F: counters*
F: tests/counter_test*
GOSSIP
M: Duarte Nunes <duarte@scylladb.com>
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Asias He <asias@scylladb.com>
F: gms/*
DOCKER
M: Pekka Enberg <penberg@scylladb.com>
F: dist/docker/*
LSA
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
F: utils/logalloc*
MATERIALIZED VIEWS
M: Duarte Nunes <duarte@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
R: Duarte Nunes <duarte@scylladb.com>
F: db/view/*
F: cql3/statements/*view*
PACKAGING
R: Takuya ASADA <syuu@scylladb.com>
F: dist/*
REPAIR
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Asias He <asias@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: repair/*
SCHEMA MANAGEMENT
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
F: db/schema_tables*
F: db/legacy_schema_migrator*
F: service/migration*
F: schema*
SECONDARY INDEXES
M: Pekka Enberg <penberg@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
R: Pekka Enberg <penberg@scylladb.com>
F: db/index/*
F: cql3/statements/*index*
SSTABLES
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Raphael S. Carvalho <raphaelsc@scylladb.com>
R: Glauber Costa <glauber@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: sstables/*
STREAMING
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
R: Asias He <asias@scylladb.com>
F: streaming/*
F: service/storage_service.*
THRIFT TRANSPORT LAYER
M: Duarte Nunes <duarte@scylladb.com>
F: thrift/*
THE REST
M: Avi Kivity <avi@scylladb.com>
M: Paweł Dziepak <pdziepak@scylladb.com>
M: Duarte Nunes <duarte@scylladb.com>
M: Tomasz Grabiec <tgrabiec@scylladb.com>
F: *

View File

@@ -1,2 +1,5 @@
This project includes code developed by the Apache Software Foundation (http://www.apache.org/),
especially Apache Cassandra.
It also includes files from https://github.com/antonblanchard/crc32-vpmsum (author Anton Blanchard <anton@au.ibm.com>, IBM).
These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.

View File

@@ -1,29 +1,19 @@
# Scylla
## Building Scylla
## Quick-start
In addition to required packages by Seastar, the following packages are required by Scylla.
### Submodules
Scylla uses submodules, so make sure you pull the submodules first by doing:
```
git submodule init
git submodule update --init --recursive
```bash
$ git submodule update --init --recursive
$ sudo ./install-dependencies.sh
$ ./configure.py --mode=release
$ ninja-build -j4 # Assuming 4 system threads.
$ ./build/release/scylla
$ # Rejoice!
```
### Building and Running Scylla on Fedora
* Installing required packages:
Please see [HACKING.md](HACKING.md) for detailed information on building and developing Scylla.
```
sudo dnf install yaml-cpp-devel lz4-devel zlib-devel snappy-devel jsoncpp-devel thrift-devel antlr3-tool antlr3-C++-devel libasan libubsan gcc-c++ gnutls-devel ninja-build ragel libaio-devel cryptopp-devel xfsprogs-devel numactl-devel hwloc-devel libpciaccess-devel libxml2-devel python3-pyparsing lksctp-tools-devel protobuf-devel protobuf-compiler systemd-devel libunwind-devel
```
* Build Scylla
```
./configure.py --mode=release --with=scylla --disable-xen
ninja-build build/release/scylla -j2 # you can use more cpus if you have tons of RAM
```
## Running Scylla
* Run Scylla
```

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=2.0.5
VERSION=2.2.2
if test -f version
then

View File

@@ -792,6 +792,24 @@
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
{
"method":"GET",
"summary":"Return an array with the ids of the currently active repairs",
"type":"array",
"items":{
"type":"int"
},
"nickname":"get_active_repair_async",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/repair_async/{keyspace}",
"operations":[
@@ -952,6 +970,22 @@
}
]
},
{
"path":"/storage_service/force_terminate_repair",
"operations":[
{
"method":"POST",
"summary":"Force terminate all repair sessions",
"type":"void",
"nickname":"force_terminate_all_repair_sessions_new",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_service/decommission",
"operations":[
@@ -2159,11 +2193,11 @@
"description":"The column family"
},
"total":{
"type":"int",
"type":"long",
"description":"The total snapshot size"
},
"live":{
"type":"int",
"type":"long",
"description":"The live snapshot size"
}
}

View File

@@ -0,0 +1,29 @@
{
"swagger": "2.0",
"info": {
"version": "1.0.0",
"title": "Scylla API",
"description": "The scylla API version 2.0",
"termsOfService": "http://www.scylladb.com/tos/",
"contact": {
"name": "Scylla Team",
"email": "info@scylladb.com",
"url": "http://scylladb.com"
},
"license": {
"name": "AGPL",
"url": "https://github.com/scylladb/scylla/blob/master/LICENSE.AGPL"
}
},
"host": "{{Host}}",
"basePath": "/v2",
"schemes": [
"http"
],
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"paths": {

View File

@@ -49,19 +49,22 @@ static std::unique_ptr<reply> exception_reply(std::exception_ptr eptr) {
throw bad_param_exception(ex.what());
}
// We never going to get here
return std::make_unique<reply>();
throw std::runtime_error("exception_reply");
}
future<> set_server_init(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");
return ctx.http_server.set_routes([rb, &ctx](routes& r) {
return ctx.http_server.set_routes([rb, &ctx, rb02](routes& r) {
r.register_exeption_handler(exception_reply);
r.put(GET, "/ui", new httpd::file_handler(ctx.api_dir + "/index.html",
new content_replace("html")));
r.add(GET, url("/ui").remainder("path"), new httpd::directory_handler(ctx.api_dir,
new content_replace("html")));
rb->set_api_doc(r);
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
rb->register_function(r, "system",
"The system related API");
set_system(ctx, r);
@@ -112,6 +115,11 @@ future<> set_server_stream_manager(http_context& ctx) {
"The stream manager API", set_stream_manager);
}
future<> set_server_cache(http_context& ctx) {
return register_api(ctx, "cache_service",
"The cache service API", set_cache_service);
}
future<> set_server_gossip_settle(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
@@ -119,9 +127,6 @@ future<> set_server_gossip_settle(http_context& ctx) {
rb->register_function(r, "failure_detector",
"The failure detector API");
set_failure_detector(ctx,r);
rb->register_function(r, "cache_service",
"The cache service API");
set_cache_service(ctx,r);
});
}

View File

@@ -46,7 +46,7 @@ future<> set_server_messaging_service(http_context& ctx);
future<> set_server_storage_proxy(http_context& ctx);
future<> set_server_stream_manager(http_context& ctx);
future<> set_server_gossip_settle(http_context& ctx);
future<> set_server_cache(http_context& ctx);
future<> set_server_done(http_context& ctx);
}

View File

@@ -20,6 +20,7 @@
*/
#include "compaction_manager.hh"
#include "sstables/compaction_manager.hh"
#include "api/api-doc/compaction_manager.json.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"

View File

@@ -397,7 +397,7 @@ void set_storage_proxy(http_context& ctx, routes& r) {
});
sp::get_range_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timer_stats(ctx.sp, &proxy::stats::read);
return sum_timer_stats(ctx.sp, &proxy::stats::range);
});
sp::get_range_latency.set(r, [&ctx](std::unique_ptr<request> req) {

View File

@@ -34,6 +34,7 @@
#include "column_family.hh"
#include "log.hh"
#include "release.hh"
#include "sstables/compaction_manager.hh"
namespace api {
@@ -92,10 +93,13 @@ void set_storage_service(http_context& ctx, routes& r) {
return ctx.db.local().commitlog()->active_config().commit_log_location;
});
ss::get_token_endpoint.set(r, [] (const_req req) {
auto token_to_ep = service::get_local_storage_service().get_token_to_endpoint_map();
std::vector<storage_service_json::mapper> res;
return map_to_key_value(token_to_ep, res);
ss::get_token_endpoint.set(r, [] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_to_endpoint_map(), [](const auto& i) {
storage_service_json::mapper val;
val.key = boost::lexical_cast<std::string>(i.first);
val.value = boost::lexical_cast<std::string>(i.second);
return val;
}));
});
ss::get_leaving_nodes.set(r, [](const_req req) {
@@ -354,6 +358,12 @@ void set_storage_service(http_context& ctx, routes& r) {
});
});
ss::get_active_repair_async.set(r, [&ctx](std::unique_ptr<request> req) {
return get_active_repairs(ctx.db).then([] (std::vector<int> res){
return make_ready_future<json::json_return_type>(res);
});
});
ss::repair_async_status.set(r, [&ctx](std::unique_ptr<request> req) {
return repair_get_status(ctx.db, boost::lexical_cast<int>( req->get_query_param("id")))
.then_wrapped([] (future<repair_status>&& fut) {
@@ -361,16 +371,22 @@ void set_storage_service(http_context& ctx, routes& r) {
try {
res = fut.get0();
} catch(std::runtime_error& e) {
return make_ready_future<json::json_return_type>(json_exception(httpd::bad_param_exception(e.what())));
throw httpd::bad_param_exception(e.what());
}
return make_ready_future<json::json_return_type>(json::json_return_type(res));
});
});
ss::force_terminate_all_repair_sessions.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(json_void());
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::force_terminate_all_repair_sessions_new.set(r, [](std::unique_ptr<request> req) {
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::decommission.set(r, [](std::unique_ptr<request> req) {

View File

@@ -57,7 +57,6 @@ class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t REVERT_FLAG = 0x04; // transient flag used to efficiently implement ReversiblyMergeable for atomic cells.
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr int8_t COUNTER_IN_PLACE_REVERT = 0x10;
static constexpr unsigned flags_size = 1;
@@ -74,17 +73,10 @@ private:
static bool is_counter_update(bytes_view cell) {
return cell[0] & COUNTER_UPDATE_FLAG;
}
static bool is_revert_set(bytes_view cell) {
return cell[0] & REVERT_FLAG;
}
static bool is_counter_in_place_revert_set(bytes_view cell) {
return cell[0] & COUNTER_IN_PLACE_REVERT;
}
template<typename BytesContainer>
static void set_revert(BytesContainer& cell, bool revert) {
cell[0] = (cell[0] & ~REVERT_FLAG) | (revert * REVERT_FLAG);
}
template<typename BytesContainer>
static void set_counter_in_place_revert(BytesContainer& cell, bool flag) {
cell[0] = (cell[0] & ~COUNTER_IN_PLACE_REVERT) | (flag * COUNTER_IN_PLACE_REVERT);
}
@@ -216,9 +208,6 @@ public:
bool is_counter_update() const {
return atomic_cell_type::is_counter_update(_data);
}
bool is_revert_set() const {
return atomic_cell_type::is_revert_set(_data);
}
bool is_counter_in_place_revert_set() const {
return atomic_cell_type::is_counter_in_place_revert_set(_data);
}
@@ -269,14 +258,11 @@ public:
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() < now;
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _data;
}
void set_revert(bool revert) {
atomic_cell_type::set_revert(_data, revert);
}
void set_counter_in_place_revert(bool flag) {
atomic_cell_type::set_counter_in_place_revert(_data, flag);
}

View File

@@ -25,6 +25,7 @@
#include "types.hh"
#include "atomic_cell.hh"
#include "atomic_cell_or_collection.hh"
#include "hashing.hh"
#include "counters.hh"
@@ -78,3 +79,15 @@ struct appending_hash<collection_mutation> {
feed_hash(h, static_cast<collection_mutation_view>(cm), cdef);
}
};
template<>
struct appending_hash<atomic_cell_or_collection> {
template<typename Hasher>
void operator()(Hasher& h, const atomic_cell_or_collection& c, const column_definition& cdef) const {
if (cdef.is_atomic()) {
feed_hash(h, c.as_atomic_cell(), cdef);
} else {
feed_hash(h, c.as_collection_mutation(), cdef);
}
}
};

View File

@@ -59,14 +59,6 @@ public:
bool operator==(const atomic_cell_or_collection& other) const {
return _data == other._data;
}
template<typename Hasher>
void feed_hash(Hasher& h, const column_definition& def) const {
if (def.is_atomic()) {
::feed_hash(h, as_atomic_cell(), def);
} else {
::feed_hash(h, as_collection_mutation(), def);
}
}
size_t external_memory_usage() const {
return _data.external_memory_usage();
}

View File

@@ -0,0 +1,41 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
namespace auth {
const sstring& allow_all_authenticator_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthenticator";
return name;
}
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
allow_all_authenticator,
cql3::query_processor&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -0,0 +1,101 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <stdexcept>
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/common.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class migration_manager;
}
namespace auth {
const sstring& allow_all_authenticator_name();
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::migration_manager&) {
}
virtual future<> start() override {
return make_ready_future<>();
}
virtual future<> stop() override {
return make_ready_future<>();
}
virtual const sstring& qualified_java_name() const override {
return allow_all_authenticator_name();
}
virtual bool require_authentication() const override {
return false;
}
virtual authentication_option_set supported_options() const override {
return authentication_option_set();
}
virtual authentication_option_set alterable_options() const override {
return authentication_option_set();
}
future<authenticated_user> authenticate(const credentials_map& credentials) const override {
return make_ready_future<authenticated_user>(anonymous_user());
}
virtual future<> create(stdx::string_view, const authentication_options& options) const override {
return make_ready_future();
}
virtual future<> alter(stdx::string_view, const authentication_options& options) const override {
return make_ready_future();
}
virtual future<> drop(stdx::string_view) const override {
return make_ready_future();
}
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override {
return make_ready_future<custom_options>();
}
virtual const resource_set& protected_resources() const override {
static const resource_set resources;
return resources;
}
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
throw std::runtime_error("Should not reach");
}
};
}

View File

@@ -0,0 +1,41 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "utils/class_registrator.hh"
namespace auth {
const sstring& allow_all_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthorizer";
return name;
}
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authorizer,
allow_all_authorizer,
cql3::query_processor&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthorizer");
}


@@ -0,0 +1,93 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "auth/authorizer.hh"
#include "exceptions/exceptions.hh"
#include "stdx.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class migration_manager;
}
namespace auth {
const sstring& allow_all_authorizer_name();
class allow_all_authorizer final : public authorizer {
public:
allow_all_authorizer(cql3::query_processor&, ::service::migration_manager&) {
}
virtual future<> start() override {
return make_ready_future<>();
}
virtual future<> stop() override {
return make_ready_future<>();
}
virtual const sstring& qualified_java_name() const override {
return allow_all_authorizer_name();
}
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
virtual future<> grant(stdx::string_view, permission_set, const resource&) const override {
return make_exception_future<>(
unsupported_authorization_operation("GRANT operation is not supported by AllowAllAuthorizer"));
}
virtual future<> revoke(stdx::string_view, permission_set, const resource&) const override {
return make_exception_future<>(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
virtual future<std::vector<permission_details>> list_all() const override {
return make_exception_future<std::vector<permission_details>>(
unsupported_authorization_operation(
"LIST PERMISSIONS operation is not supported by AllowAllAuthorizer"));
}
virtual future<> revoke_all(stdx::string_view) const override {
return make_exception_future(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
virtual future<> revoke_all(const resource&) const override {
return make_exception_future(
unsupported_authorization_operation("REVOKE operation is not supported by AllowAllAuthorizer"));
}
virtual const resource_set& protected_resources() const override {
static const resource_set resources;
return resources;
}
};
}
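
The header above shows a common "null object" shape for an authorizer: every authorization request succeeds, while operations that would mutate permissions are rejected as unsupported. A minimal stand-alone sketch of that shape follows. It is not the Scylla code: the types `permissions_sketch`, `unsupported_operation_sketch`, `authorizer_sketch` and `allow_all_sketch` are hypothetical stand-ins, and the interface is synchronous (throwing) rather than returning Seastar futures.

```cpp
#include <cassert>
#include <stdexcept>
#include <string_view>

// Hypothetical stand-in for the real permission_set.
enum class permissions_sketch { none, all };

// Hypothetical stand-in for unsupported_authorization_operation.
struct unsupported_operation_sketch : std::invalid_argument {
    using std::invalid_argument::invalid_argument;
};

// Hypothetical synchronous interface; the real authorizer returns futures.
struct authorizer_sketch {
    virtual ~authorizer_sketch() = default;
    virtual permissions_sketch authorize(std::string_view role, std::string_view resource) const = 0;
    virtual void grant(std::string_view role, std::string_view resource) const = 0;
};

// Grants everything on authorize(), refuses to record any grant.
struct allow_all_sketch final : authorizer_sketch {
    permissions_sketch authorize(std::string_view, std::string_view) const override {
        return permissions_sketch::all;
    }
    void grant(std::string_view, std::string_view) const override {
        throw unsupported_operation_sketch("GRANT operation is not supported by AllowAllAuthorizer");
    }
};

// Returns true when grant() is rejected with the "unsupported" exception.
inline bool grant_is_rejected(const authorizer_sketch& a) {
    try {
        a.grant("alice", "data/ks");
    } catch (const unsupported_operation_sketch&) {
        return true;
    }
    return false;
}

// Small usage example.
inline void allow_all_example() {
    allow_all_sketch a;
    assert(a.authorize("anyone", "data/ks") == permissions_sketch::all);
    assert(grant_is_rejected(a));
}
```

Reporting the failure through the return channel (an exceptional future in the real header, an exception here) lets callers treat "unsupported" the same way as any other failed statement.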


@@ -1,384 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <seastar/core/sleep.hh>
#include <seastar/core/distributed.hh>
#include "auth.hh"
#include "authenticator.hh"
#include "authorizer.hh"
#include "database.hh"
#include "cql3/query_processor.hh"
#include "cql3/statements/raw/cf_statement.hh"
#include "cql3/statements/create_table_statement.hh"
#include "db/config.hh"
#include "service/migration_manager.hh"
#include "utils/loading_cache.hh"
#include "utils/hash.hh"
const sstring auth::auth::DEFAULT_SUPERUSER_NAME("cassandra");
const sstring auth::auth::AUTH_KS("system_auth");
const sstring auth::auth::USERS_CF("users");
static const sstring USER_NAME("name");
static const sstring SUPER("super");
static logging::logger alogger("auth");
// TODO: configurable
using namespace std::chrono_literals;
const std::chrono::milliseconds auth::auth::SUPERUSER_SETUP_DELAY = 10000ms;
class auth_migration_listener : public service::migration_listener {
void on_create_keyspace(const sstring& ks_name) override {}
void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}
void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_create_function(const sstring& ks_name, const sstring& function_name) override {}
void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_create_view(const sstring& ks_name, const sstring& view_name) override {}
void on_update_keyspace(const sstring& ks_name) override {}
void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}
void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
auth::authorizer::get().revoke_all(auth::data_resource(ks_name));
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
auth::authorizer::get().revoke_all(auth::data_resource(ks_name, cf_name));
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
static auth_migration_listener auth_migration;
namespace std {
template <>
struct hash<auth::data_resource> {
size_t operator()(const auth::data_resource & v) const {
return v.hash_value();
}
};
template <>
struct hash<auth::authenticated_user> {
size_t operator()(const auth::authenticated_user & v) const {
return utils::tuple_hash()(v.name(), v.is_anonymous());
}
};
}
class auth::auth::permissions_cache {
public:
typedef utils::loading_cache<std::pair<authenticated_user, data_resource>, permission_set, utils::loading_cache_reload_enabled::yes, utils::simple_entry_size<permission_set>, utils::tuple_hash> cache_type;
typedef typename cache_type::key_type key_type;
permissions_cache()
: permissions_cache(
cql3::get_local_query_processor().db().local().get_config()) {
}
permissions_cache(const db::config& cfg)
: _cache(cfg.permissions_cache_max_entries(), std::chrono::milliseconds(cfg.permissions_validity_in_ms()), std::chrono::milliseconds(cfg.permissions_update_interval_in_ms()), alogger,
[] (const key_type& k) {
alogger.debug("Refreshing permissions for {}", k.first.name());
return authorizer::get().authorize(::make_shared<authenticated_user>(k.first), k.second);
}) {}
future<> stop() {
return _cache.stop();
}
future<permission_set> get(::shared_ptr<authenticated_user> user, data_resource resource) {
return _cache.get(key_type(*user, std::move(resource)));
}
private:
cache_type _cache;
};
namespace std { // for ADL, yuch
std::ostream& operator<<(std::ostream& os, const std::pair<auth::authenticated_user, auth::data_resource>& p) {
os << "{user: " << p.first.name() << ", data_resource: " << p.second << "}";
return os;
}
}
static distributed<auth::auth::permissions_cache> perm_cache;
/**
* Poor man's job scheduler. For a maximum of 2 jobs. Sic.
* Still does nothing more clever than waiting 10 seconds
* like origin, then runs the submitted tasks.
*
* Only difference compared to sleep (from which this
* borrows _heavily_) is that if tasks have not run by the time
* we exit (and do static clean up) we delete the promise + cont
*
* Should be abstracted to some sort of global server function
* probably.
*/
struct waiter {
promise<> done;
timer<> tmr;
waiter() : tmr([this] {done.set_value();})
{
tmr.arm(auth::auth::SUPERUSER_SETUP_DELAY);
}
~waiter() {
if (tmr.armed()) {
tmr.cancel();
done.set_exception(std::runtime_error("shutting down"));
}
alogger.trace("Deleting scheduled task");
}
void kill() {
}
};
typedef std::unique_ptr<waiter> waiter_ptr;
static std::vector<waiter_ptr> & thread_waiters() {
static thread_local std::vector<waiter_ptr> the_waiters;
return the_waiters;
}
void auth::auth::schedule_when_up(scheduled_func f) {
alogger.trace("Adding scheduled task");
auto & waiters = thread_waiters();
waiters.emplace_back(std::make_unique<waiter>());
auto* w = waiters.back().get();
w->done.get_future().finally([w] {
auto & waiters = thread_waiters();
auto i = std::find_if(waiters.begin(), waiters.end(), [w](const waiter_ptr& p) {
return p.get() == w;
});
if (i != waiters.end()) {
waiters.erase(i);
}
}).then([f = std::move(f)] {
alogger.trace("Running scheduled task");
return f();
}).handle_exception([](auto ep) {
return make_ready_future();
});
}
bool auth::auth::is_class_type(const sstring& type, const sstring& classname) {
if (type == classname) {
return true;
}
auto i = classname.find_last_of('.');
return classname.compare(i + 1, sstring::npos, type) == 0;
}
future<> auth::auth::setup() {
auto& db = cql3::get_local_query_processor().db().local();
auto& cfg = db.get_config();
future<> f = perm_cache.start();
if (is_class_type(cfg.authenticator(),
authenticator::ALLOW_ALL_AUTHENTICATOR_NAME)
&& is_class_type(cfg.authorizer(),
authorizer::ALLOW_ALL_AUTHORIZER_NAME)
) {
// just create the objects
return f.then([&cfg] {
return authenticator::setup(cfg.authenticator());
}).then([&cfg] {
return authorizer::setup(cfg.authorizer());
});
}
if (!db.has_keyspace(AUTH_KS)) {
std::map<sstring, sstring> opts;
opts["replication_factor"] = "1";
auto ksm = keyspace_metadata::new_keyspace(AUTH_KS, "org.apache.cassandra.locator.SimpleStrategy", opts, true);
// We use min_timestamp so that the default keyspace metadata will lose to any manual adjustments. See issue #2129.
f = service::get_local_migration_manager().announce_new_keyspace(ksm, api::min_timestamp, false);
}
return f.then([] {
return setup_table(USERS_CF, sprint("CREATE TABLE %s.%s (%s text, %s boolean, PRIMARY KEY(%s)) WITH gc_grace_seconds=%d",
AUTH_KS, USERS_CF, USER_NAME, SUPER, USER_NAME,
90 * 24 * 60 * 60)); // 3 months.
}).then([&cfg] {
return authenticator::setup(cfg.authenticator());
}).then([&cfg] {
return authorizer::setup(cfg.authorizer());
}).then([] {
service::get_local_migration_manager().register_listener(&auth_migration); // again, only one shard...
// instead of once-timer, just schedule this later
schedule_when_up([] {
// setup default super user
return has_existing_users(USERS_CF, DEFAULT_SUPERUSER_NAME, USER_NAME).then([](bool exists) {
if (!exists) {
auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
AUTH_KS, USERS_CF, USER_NAME, SUPER);
cql3::get_local_query_processor().process(query, db::consistency_level::ONE, {DEFAULT_SUPERUSER_NAME, true}).then([](auto) {
alogger.info("Created default superuser '{}'", DEFAULT_SUPERUSER_NAME);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception&) {
alogger.warn("Skipped default superuser setup: some nodes were not ready");
}
});
}
});
});
});
}
future<> auth::auth::shutdown() {
// just make sure we don't have pending tasks.
// this is mostly relevant for test cases where
// db-env-shutdown != process shutdown
return smp::invoke_on_all([] {
thread_waiters().clear();
}).then([] {
return perm_cache.stop();
});
}
future<auth::permission_set> auth::auth::get_permissions(::shared_ptr<authenticated_user> user, data_resource resource) {
return perm_cache.local().get(std::move(user), std::move(resource));
}
static db::consistency_level consistency_for_user(const sstring& username) {
if (username == auth::auth::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
static future<::shared_ptr<cql3::untyped_result_set>> select_user(const sstring& username) {
// There used to be an explicit thread-local cache of the prepared statement here. In normal execution this is
// fine, but since tests set up and tear down the system over and over, we would start using
// obsolete prepared statements pretty quickly.
// Rely on the query processor caching statements instead, and let's assume
// that a string->statement map lookup will not cost us much.
return cql3::get_local_query_processor().process(
sprint("SELECT * FROM %s.%s WHERE %s = ?",
auth::auth::AUTH_KS, auth::auth::USERS_CF,
USER_NAME), consistency_for_user(username),
{ username }, true);
}
future<bool> auth::auth::is_existing_user(const sstring& username) {
return select_user(username).then(
[](::shared_ptr<cql3::untyped_result_set> res) {
return make_ready_future<bool>(!res->empty());
});
}
future<bool> auth::auth::is_super_user(const sstring& username) {
return select_user(username).then(
[](::shared_ptr<cql3::untyped_result_set> res) {
return make_ready_future<bool>(!res->empty() && res->one().get_as<bool>(SUPER));
});
}
future<> auth::auth::insert_user(const sstring& username, bool is_super) {
return cql3::get_local_query_processor().process(sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
AUTH_KS, USERS_CF, USER_NAME, SUPER),
consistency_for_user(username), { username, is_super }).discard_result();
}
future<> auth::auth::delete_user(const sstring& username) {
return cql3::get_local_query_processor().process(sprint("DELETE FROM %s.%s WHERE %s = ?",
AUTH_KS, USERS_CF, USER_NAME),
consistency_for_user(username), { username }).discard_result();
}
future<> auth::auth::setup_table(const sstring& name, const sstring& cql) {
auto& qp = cql3::get_local_query_processor();
auto& db = qp.db().local();
if (db.has_schema(AUTH_KS, name)) {
return make_ready_future();
}
::shared_ptr<cql3::statements::raw::cf_statement> parsed = static_pointer_cast<
cql3::statements::raw::cf_statement>(cql3::query_processor::parse_statement(cql));
parsed->prepare_keyspace(AUTH_KS);
::shared_ptr<cql3::statements::create_table_statement> statement =
static_pointer_cast<cql3::statements::create_table_statement>(
parsed->prepare(db, qp.get_cql_stats())->statement);
auto schema = statement->get_cf_meta_data();
auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
schema_builder b(schema);
b.set_uuid(uuid);
return service::get_local_migration_manager().announce_new_column_family(b.build(), false);
}
future<bool> auth::auth::has_existing_users(const sstring& cfname, const sstring& def_user_name, const sstring& name_column) {
auto default_user_query = sprint("SELECT * FROM %s.%s WHERE %s = ?", AUTH_KS, cfname, name_column);
auto all_users_query = sprint("SELECT * FROM %s.%s LIMIT 1", AUTH_KS, cfname);
return cql3::get_local_query_processor().process(default_user_query, db::consistency_level::ONE, { def_user_name }).then([=](::shared_ptr<cql3::untyped_result_set> res) {
if (!res->empty()) {
return make_ready_future<bool>(true);
}
return cql3::get_local_query_processor().process(default_user_query, db::consistency_level::QUORUM, { def_user_name }).then([all_users_query](::shared_ptr<cql3::untyped_result_set> res) {
if (!res->empty()) {
return make_ready_future<bool>(true);
}
return cql3::get_local_query_processor().process(all_users_query, db::consistency_level::QUORUM).then([](::shared_ptr<cql3::untyped_result_set> res) {
return make_ready_future<bool>(!res->empty());
});
});
});
}
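
The `is_class_type` helper in the removed file above accepts either a fully qualified class name or just its last component, which is how a short name in the configuration can select an implementation. A stand-alone sketch of the same matching rule, using `std::string` instead of `sstring` and handling a name without a `.` explicitly (the explicit branch is an addition of this sketch, not part of the original):

```cpp
#include <cassert>
#include <string>

// Sketch of the matching rule: `type` matches if it equals `classname`
// or equals the part of `classname` after the last '.'.
inline bool is_class_type(const std::string& type, const std::string& classname) {
    if (type == classname) {
        return true;
    }
    const auto dot = classname.find_last_of('.');
    // With no '.' in classname, compare against the whole name.
    const auto start = (dot == std::string::npos) ? 0 : dot + 1;
    return classname.compare(start, std::string::npos, type) == 0;
}

// Small usage example.
inline void is_class_type_example() {
    const std::string full = "org.apache.cassandra.auth.AllowAllAuthenticator";
    assert(is_class_type("AllowAllAuthenticator", full));
    assert(is_class_type(full, full));
    assert(!is_class_type("PasswordAuthenticator", full));
}
```

Note that only the final component is accepted as a short form: a partial package path such as `auth.AllowAllAuthenticator` does not match.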


@@ -1,125 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <chrono>
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include "exceptions/exceptions.hh"
#include "permission.hh"
#include "data_resource.hh"
#include "authenticated_user.hh"
namespace auth {
class auth {
public:
class permissions_cache;
static const sstring DEFAULT_SUPERUSER_NAME;
static const sstring AUTH_KS;
static const sstring USERS_CF;
static const std::chrono::milliseconds SUPERUSER_SETUP_DELAY;
static bool is_class_type(const sstring& type, const sstring& classname);
static future<permission_set> get_permissions(::shared_ptr<authenticated_user>, data_resource);
/**
* Checks if the username is stored in AUTH_KS.USERS_CF.
*
* @param username Username to query.
* @return whether or not Cassandra knows about the user.
*/
static future<bool> is_existing_user(const sstring& username);
/**
* Checks if the user is a known superuser.
*
* @param username Username to query.
* @return true if the user is a superuser, false if they aren't or don't exist at all.
*/
static future<bool> is_super_user(const sstring& username);
/**
* Inserts the user into AUTH_KS.USERS_CF (or overwrites their superuser status as a result of an ALTER USER query).
*
* @param username Username to insert.
* @param isSuper User's new status.
* @throws RequestExecutionException
*/
static future<> insert_user(const sstring& username, bool is_super);
/**
* Deletes the user from AUTH_KS.USERS_CF.
*
* @param username Username to delete.
* @throws RequestExecutionException
*/
static future<> delete_user(const sstring& username);
/**
* Sets up Authenticator and Authorizer.
*/
static future<> setup();
static future<> shutdown();
/**
* Set up table from given CREATE TABLE statement under system_auth keyspace, if not already done so.
*
* @param name name of the table
* @param cql CREATE TABLE statement
*/
static future<> setup_table(const sstring& name, const sstring& cql);
static future<bool> has_existing_users(const sstring& cfname, const sstring& def_user_name, const sstring& name_column_name);
// For internal use. Run function "when system is up".
typedef std::function<future<>()> scheduled_func;
static void schedule_when_up(scheduled_func);
};
}
std::ostream& operator<<(std::ostream& os, const std::pair<auth::authenticated_user, auth::data_resource>& p);


@@ -39,34 +39,30 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/authenticated_user.hh"
#include "authenticated_user.hh"
#include "auth.hh"
#include <iostream>
const sstring auth::authenticated_user::ANONYMOUS_USERNAME("anonymous");
namespace auth {
auth::authenticated_user::authenticated_user()
: _anon(true)
{}
auth::authenticated_user::authenticated_user(sstring name)
: _name(name), _anon(false)
{}
auth::authenticated_user::authenticated_user(authenticated_user&&) = default;
auth::authenticated_user::authenticated_user(const authenticated_user&) = default;
const sstring& auth::authenticated_user::name() const {
return _anon ? ANONYMOUS_USERNAME : _name;
authenticated_user::authenticated_user(stdx::string_view name)
: name(sstring(name)) {
}
future<bool> auth::authenticated_user::is_super() const {
if (is_anonymous()) {
return make_ready_future<bool>(false);
std::ostream& operator<<(std::ostream& os, const authenticated_user& u) {
if (!u.name) {
os << "anonymous";
} else {
os << *u.name;
}
return auth::auth::is_super_user(_name);
return os;
}
static const authenticated_user the_anonymous_user{};
const authenticated_user& anonymous_user() noexcept {
return the_anonymous_user;
}
bool auth::authenticated_user::operator==(const authenticated_user& v) const {
return _anon ? v._anon : _name == v._name;
}


@@ -41,43 +41,63 @@
#pragma once
#include <experimental/string_view>
#include <functional>
#include <iosfwd>
#include <optional>
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
class authenticated_user {
///
/// A type-safe wrapper for the name of a logged-in user, or a nameless (anonymous) user.
///
class authenticated_user final {
public:
static const sstring ANONYMOUS_USERNAME;
///
/// An anonymous user has no name.
///
std::optional<sstring> name{};
authenticated_user();
authenticated_user(sstring name);
authenticated_user(authenticated_user&&);
authenticated_user(const authenticated_user&);
const sstring& name() const;
/**
* Checks the user's superuser status.
* Only a superuser is allowed to perform CREATE USER and DROP USER queries.
* In most cases, though not necessarily, a superuser will have Permission.ALL on every resource
* (depends on IAuthorizer implementation).
*/
future<bool> is_super() const;
/**
* If IAuthenticator doesn't require authentication, this method may return true.
*/
bool is_anonymous() const {
return _anon;
}
bool operator==(const authenticated_user&) const;
private:
sstring _name;
bool _anon;
///
/// An anonymous user.
///
authenticated_user() = default;
explicit authenticated_user(stdx::string_view name);
};
///
/// The user name, or "anonymous".
///
std::ostream& operator<<(std::ostream&, const authenticated_user&);
inline bool operator==(const authenticated_user& u1, const authenticated_user& u2) noexcept {
return u1.name == u2.name;
}
inline bool operator!=(const authenticated_user& u1, const authenticated_user& u2) noexcept {
return !(u1 == u2);
}
const authenticated_user& anonymous_user() noexcept;
inline bool is_anonymous(const authenticated_user& u) noexcept {
return u == anonymous_user();
}
}
namespace std {
template <>
struct hash<auth::authenticated_user> final {
size_t operator()(const auth::authenticated_user &u) const {
return std::hash<std::optional<sstring>>()(u.name);
}
};
}
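
The new header models an anonymous user as "no name" through `std::optional`, instead of a separate flag plus a reserved user name. A stand-alone sketch of that idea, with `std::string` in place of `sstring` and a hypothetical type name `user_sketch`; unlike the header, `is_anonymous` here checks the optional directly rather than comparing against a shared anonymous instance:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical stand-in for auth::authenticated_user.
struct user_sketch final {
    std::optional<std::string> name{};  // empty optional means anonymous

    user_sketch() = default;
    explicit user_sketch(std::string_view n) : name(std::string(n)) {}
};

inline bool operator==(const user_sketch& a, const user_sketch& b) noexcept {
    return a.name == b.name;
}

inline bool operator!=(const user_sketch& a, const user_sketch& b) noexcept {
    return !(a == b);
}

inline bool is_anonymous(const user_sketch& u) noexcept {
    return !u.name.has_value();
}

namespace std {
// Hash through the optional so that users can key unordered containers.
template <>
struct hash<user_sketch> final {
    size_t operator()(const user_sketch& u) const {
        return std::hash<std::optional<std::string>>()(u.name);
    }
};
}

// Small usage example.
inline void user_sketch_example() {
    user_sketch anon;
    user_sketch alice("alice");
    assert(is_anonymous(anon));
    assert(!is_anonymous(alice));
    // A real user who happens to be called "anonymous" is still a named user.
    assert(user_sketch("anonymous") != anon);
}
```

One consequence of this representation is that equality and hashing follow from the single `name` member, so there is no second field to keep consistent with it.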


@@ -1,5 +1,5 @@
/*
* Copyright (C) 2015 ScyllaDB
* Copyright (C) 2018 ScyllaDB
*/
/*
@@ -19,8 +19,19 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
// Used to ensure that all .hh files build, as well as a place to put
// out-of-line implementations.
#include "auth/authentication_options.hh"
#include "io/i_versioned_serializer.hh"
#include "io/i_serializer.hh"
#include <iostream>
namespace auth {
std::ostream& operator<<(std::ostream& os, authentication_option a) {
switch (a) {
case authentication_option::password: os << "PASSWORD"; break;
case authentication_option::options: os << "OPTIONS"; break;
}
return os;
}
}


@@ -0,0 +1,64 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <iosfwd>
#include <optional>
#include <stdexcept>
#include <unordered_map>
#include <unordered_set>
#include <seastar/core/print.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth {
enum class authentication_option {
password,
options
};
std::ostream& operator<<(std::ostream&, authentication_option);
using authentication_option_set = std::unordered_set<authentication_option>;
using custom_options = std::unordered_map<sstring, sstring>;
struct authentication_options final {
std::optional<sstring> password;
std::optional<custom_options> options;
};
inline bool any_authentication_options(const authentication_options& aos) noexcept {
return aos.password || aos.options;
}
class unsupported_authentication_option : public std::invalid_argument {
public:
explicit unsupported_authentication_option(authentication_option k)
: std::invalid_argument(sprint("The %s option is not supported.", k)) {
}
};
}
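
The header above replaces the old `boost::any`-based option map with a plain struct of optionals, one per supported option. A stand-alone sketch of how such a struct can be queried and how an unsupported option can be reported follows. It uses `std::string` in place of `sstring`, and the `to_string` helper is a hypothetical substitute for the stream operator and `sprint` call used in the real code.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

enum class authentication_option { password, options };

// Hypothetical helper; the real code streams the enum with operator<<.
inline const char* to_string(authentication_option a) {
    switch (a) {
    case authentication_option::password: return "PASSWORD";
    case authentication_option::options: return "OPTIONS";
    }
    return "UNKNOWN";
}

using custom_options = std::unordered_map<std::string, std::string>;

// Each option is present only if the statement specified it.
struct authentication_options final {
    std::optional<std::string> password;
    std::optional<custom_options> options;
};

inline bool any_authentication_options(const authentication_options& aos) noexcept {
    return aos.password || aos.options;
}

class unsupported_authentication_option : public std::invalid_argument {
public:
    explicit unsupported_authentication_option(authentication_option k)
        : std::invalid_argument(std::string("The ") + to_string(k) + " option is not supported.") {
    }
};

// Small usage example.
inline void authentication_options_example() {
    authentication_options none;
    assert(!any_authentication_options(none));

    authentication_options with_password;
    with_password.password = "secret";
    assert(any_authentication_options(with_password));
}
```

An authenticator that supports no options could, under this sketch, check `any_authentication_options` and throw `unsupported_authentication_option` for whichever member is set.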


@@ -39,89 +39,14 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "authenticator.hh"
#include "authenticated_user.hh"
#include "password_authenticator.hh"
#include "auth.hh"
#include "auth/authenticator.hh"
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/password_authenticator.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
const sstring auth::authenticator::USERNAME_KEY("username");
const sstring auth::authenticator::PASSWORD_KEY("password");
const sstring auth::authenticator::ALLOW_ALL_AUTHENTICATOR_NAME("org.apache.cassandra.auth.AllowAllAuthenticator");
auth::authenticator::option auth::authenticator::string_to_option(const sstring& name) {
if (strcasecmp(name.c_str(), "password") == 0) {
return option::PASSWORD;
}
throw std::invalid_argument(name);
}
sstring auth::authenticator::option_to_string(option opt) {
switch (opt) {
case option::PASSWORD:
return "PASSWORD";
default:
throw std::invalid_argument(sprint("Unknown option {}", opt));
}
}
/**
* Authenticator is assumed to be a fully state-less immutable object (note all the const).
* We thus store a single instance globally, since it should be safe/ok.
*/
static std::unique_ptr<auth::authenticator> global_authenticator;
future<>
auth::authenticator::setup(const sstring& type) {
if (auth::auth::is_class_type(type, ALLOW_ALL_AUTHENTICATOR_NAME)) {
class allow_all_authenticator : public authenticator {
public:
const sstring& class_name() const override {
return ALLOW_ALL_AUTHENTICATOR_NAME;
}
bool require_authentication() const override {
return false;
}
option_set supported_options() const override {
return option_set();
}
option_set alterable_options() const override {
return option_set();
}
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
}
future<> create(sstring username, const option_map& options) override {
return make_ready_future();
}
future<> alter(sstring username, const option_map& options) override {
return make_ready_future();
}
future<> drop(sstring username) override {
return make_ready_future();
}
const resource_ids& protected_resources() const override {
static const resource_ids ids;
return ids;
}
::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
throw std::runtime_error("Should not reach");
}
};
global_authenticator = std::make_unique<allow_all_authenticator>();
} else if (auth::auth::is_class_type(type, password_authenticator::PASSWORD_AUTHENTICATOR_NAME)) {
auto pwa = std::make_unique<password_authenticator>();
auto f = pwa->init();
return f.then([pwa = std::move(pwa)]() mutable {
global_authenticator = std::move(pwa);
});
} else {
throw exceptions::configuration_exception("Invalid authenticator type: " + type);
}
return make_ready_future();
}
auth::authenticator& auth::authenticator::get() {
assert(global_authenticator);
return *global_authenticator;
}


@@ -41,21 +41,24 @@
#pragma once
#include <experimental/string_view>
#include <memory>
#include <unordered_map>
#include <set>
#include <stdexcept>
#include <unordered_map>
#include <boost/any.hpp>
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/enum.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/shared_ptr.hh>
#include "auth/authentication_options.hh"
#include "auth/resource.hh"
#include "bytes.hh"
#include "data_resource.hh"
#include "enum_set.hh"
#include "exceptions/exceptions.hh"
#include "stdx.hh"
namespace db {
class config;
@@ -65,136 +68,104 @@ namespace auth {
class authenticated_user;
///
/// Abstract client for authenticating role identity.
///
/// All state necessary to authorize a role is stored externally to the client instance.
///
class authenticator {
public:
///
/// The name of the key to be used for the user-name part of password authentication with \ref authenticate.
///
static const sstring USERNAME_KEY;
///
/// The name of the key to be used for the password part of password authentication with \ref authenticate.
///
static const sstring PASSWORD_KEY;
static const sstring ALLOW_ALL_AUTHENTICATOR_NAME;
/**
* Supported CREATE USER/ALTER USER options.
* Currently only PASSWORD is available.
*/
enum class option {
PASSWORD
};
static option string_to_option(const sstring&);
static sstring option_to_string(option);
using option_set = enum_set<super_enum<option, option::PASSWORD>>;
using option_map = std::unordered_map<option, boost::any, enum_hash<option>>;
using credentials_map = std::unordered_map<sstring, sstring>;
/**
* Setup is called once upon system startup to initialize the IAuthenticator.
*
* For example, use this method to create any required keyspaces/column families.
* Note: Only call from main thread.
*/
static future<> setup(const sstring& type);
virtual ~authenticator() = default;
/**
* Returns the system authenticator. Must have called setup before calling this.
*/
static authenticator& get();
virtual future<> start() = 0;
virtual ~authenticator()
{}
virtual future<> stop() = 0;
virtual const sstring& class_name() const = 0;
///
/// A fully-qualified (class with package) Java-like name for this implementation.
///
virtual const sstring& qualified_java_name() const = 0;
/**
* Whether or not the authenticator requires explicit login.
* If false will instantiate user with AuthenticatedUser.ANONYMOUS_USER.
*/
virtual bool require_authentication() const = 0;
/**
* Set of options supported by CREATE USER and ALTER USER queries.
* Should never return null - always return an empty set instead.
*/
virtual option_set supported_options() const = 0;
virtual authentication_option_set supported_options() const = 0;
/**
* Subset of supportedOptions that users are allowed to alter when performing ALTER USER [themselves].
* Should never return null - always return an empty set instead.
*/
virtual option_set alterable_options() const = 0;
///
/// A subset of `supported_options()` that users are permitted to alter for themselves.
///
virtual authentication_option_set alterable_options() const = 0;
/**
* Authenticates a user given a Map<String, String> of credentials.
* Should never return null - always throw AuthenticationException instead.
* Returning AuthenticatedUser.ANONYMOUS_USER is an option as well if authentication is not required.
*
* @throws authentication_exception if credentials don't match any known user.
*/
virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const = 0;
///
/// Authenticate a user given implementation-specific credentials.
///
/// If this implementation does not require authentication (\ref require_authentication), an anonymous user may
/// result.
///
/// \returns an exceptional future with \ref exceptions::authentication_exception if given invalid credentials.
///
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const = 0;
/**
* Called during execution of CREATE USER query (also may be called on startup, see seedSuperuserOptions method).
* If authenticator is static then the body of the method should be left blank, but don't throw an exception.
* options are guaranteed to be a subset of supportedOptions().
*
* @param username Username of the user to create.
* @param options Options the user will be created with.
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> create(sstring username, const option_map& options) = 0;
///
/// Create an authentication record for a new user. This is required before the user can log in.
///
/// The options provided must be a subset of `supported_options()`.
///
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const = 0;
/**
* Called during execution of ALTER USER query.
* options are always guaranteed to be a subset of supportedOptions(). Furthermore, if the user performing the query
* is not a superuser and is altering himself, then options are guaranteed to be a subset of alterableOptions().
* Keep the body of the method blank if your implementation doesn't support any options.
*
* @param username Username of the user that will be altered.
* @param options Options to alter.
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> alter(sstring username, const option_map& options) = 0;
///
/// Alter the authentication record of an existing user.
///
/// The options provided must be a subset of `supported_options()`.
///
/// Callers must ensure that the specification of `alterable_options()` is adhered to.
///
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const = 0;
///
/// Delete the authentication record for a user. This will disallow the user from logging in.
///
virtual future<> drop(stdx::string_view role_name) const = 0;
/**
* Called during execution of DROP USER query.
*
* @param username Username of the user that will be dropped.
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> drop(sstring username) = 0;
///
/// Query for custom options (those corresponding to \ref authentication_options::options).
///
/// If no options are set the result is an empty container.
///
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const = 0;
/**
* Set of resources that should be made inaccessible to users and only accessible internally.
*
* @return Keyspaces, column families that will be unmodifiable by users; other resources.
* @see resource_ids
*/
virtual const resource_ids& protected_resources() const = 0;
///
/// System resources used internally as part of the implementation. These are made inaccessible to users.
///
virtual const resource_set& protected_resources() const = 0;
///
/// A stateful SASL challenge which supports many authentication schemes (depending on the implementation).
///
class sasl_challenge {
public:
virtual ~sasl_challenge() {}
virtual ~sasl_challenge() = default;
virtual bytes evaluate_response(bytes_view client_response) = 0;
virtual bool is_complete() const = 0;
virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const = 0;
virtual future<authenticated_user> get_authenticated_user() const = 0;
};
/**
* Provide a sasl_challenge to be used by the CQL binary protocol server. If
* the configured authenticator requires authentication but does not implement this
* interface we refuse to start the binary protocol server as it will have no way
* of authenticating clients.
* @return sasl_challenge implementation
*/
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const = 0;
};
inline std::ostream& operator<<(std::ostream& os, authenticator::option opt) {
return os << authenticator::option_to_string(opt);
}
}


@@ -1,104 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "authorizer.hh"
#include "authenticated_user.hh"
#include "default_authorizer.hh"
#include "auth.hh"
#include "db/config.hh"
const sstring auth::authorizer::ALLOW_ALL_AUTHORIZER_NAME("org.apache.cassandra.auth.AllowAllAuthorizer");
/**
* Authorizer is assumed to be a fully state-less immutable object (note all the const).
* We thus store a single instance globally, since it should be safe/ok.
*/
static std::unique_ptr<auth::authorizer> global_authorizer;
future<>
auth::authorizer::setup(const sstring& type) {
if (auth::auth::is_class_type(type, ALLOW_ALL_AUTHORIZER_NAME)) {
class allow_all_authorizer : public authorizer {
public:
future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("GRANT operation is not supported by AllowAllAuthorizer");
}
future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("REVOKE operation is not supported by AllowAllAuthorizer");
}
future<std::vector<permission_details>> list(::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const override {
throw exceptions::invalid_request_exception("LIST PERMISSIONS operation is not supported by AllowAllAuthorizer");
}
future<> revoke_all(sstring dropped_user) override {
return make_ready_future();
}
future<> revoke_all(data_resource) override {
return make_ready_future();
}
const resource_ids& protected_resources() override {
static const resource_ids ids;
return ids;
}
future<> validate_configuration() const override {
return make_ready_future();
}
};
global_authorizer = std::make_unique<allow_all_authorizer>();
} else if (auth::auth::is_class_type(type, default_authorizer::DEFAULT_AUTHORIZER_NAME)) {
auto da = std::make_unique<default_authorizer>();
auto f = da->init();
return f.then([da = std::move(da)]() mutable {
global_authorizer = std::move(da);
});
} else {
throw exceptions::configuration_exception("Invalid authorizer type: " + type);
}
return make_ready_future();
}
auth::authorizer& auth::authorizer::get() {
assert(global_authorizer);
return *global_authorizer;
}


@@ -41,133 +41,116 @@
#pragma once
#include <vector>
#include <experimental/string_view>
#include <functional>
#include <optional>
#include <stdexcept>
#include <tuple>
#include <vector>
#include <experimental/optional>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include "permission.hh"
#include "data_resource.hh"
#include "auth/permission.hh"
#include "auth/resource.hh"
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
class authenticated_user;
class role_or_anonymous;
struct permission_details {
sstring user;
data_resource resource;
sstring role_name;
::auth::resource resource;
permission_set permissions;
bool operator<(const permission_details& v) const {
return std::tie(user, resource, permissions) < std::tie(v.user, v.resource, v.permissions);
}
};
using std::experimental::optional;
inline bool operator==(const permission_details& pd1, const permission_details& pd2) {
return std::forward_as_tuple(pd1.role_name, pd1.resource, pd1.permissions.mask())
== std::forward_as_tuple(pd2.role_name, pd2.resource, pd2.permissions.mask());
}
inline bool operator!=(const permission_details& pd1, const permission_details& pd2) {
return !(pd1 == pd2);
}
inline bool operator<(const permission_details& pd1, const permission_details& pd2) {
return std::forward_as_tuple(pd1.role_name, pd1.resource, pd1.permissions)
< std::forward_as_tuple(pd2.role_name, pd2.resource, pd2.permissions);
}
class unsupported_authorization_operation : public std::invalid_argument {
public:
using std::invalid_argument::invalid_argument;
};
///
/// Abstract client for authorizing roles to access resources.
///
/// All state necessary to authorize a role is stored externally to the client instance.
///
class authorizer {
public:
static const sstring ALLOW_ALL_AUTHORIZER_NAME;
virtual ~authorizer() = default;
virtual ~authorizer() {}
virtual future<> start() = 0;
/**
* The primary Authorizer method. Returns a set of permissions of a user on a resource.
*
* @param user Authenticated user requesting authorization.
* @param resource Resource for which the authorization is being requested. @see DataResource.
* @return Set of permissions of the user on the resource. Should never return empty. Use permission.NONE instead.
*/
virtual future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const = 0;
virtual future<> stop() = 0;
/**
* Grants a set of permissions on a resource to a user.
* The opposite of revoke().
*
* @param performer User who grants the permissions.
* @param permissions Set of permissions to grant.
* @param to Grantee of the permissions.
* @param resource Resource on which to grant the permissions.
*
* @throws RequestValidationException
* @throws RequestExecutionException
*/
virtual future<> grant(::shared_ptr<authenticated_user> performer, permission_set, data_resource, sstring to) = 0;
///
/// A fully-qualified (class with package) Java-like name for this implementation.
///
virtual const sstring& qualified_java_name() const = 0;
/**
* Revokes a set of permissions on a resource from a user.
* The opposite of grant().
*
* @param performer User who revokes the permissions.
* @param permissions Set of permissions to revoke.
* @param from Revokee of the permissions.
* @param resource Resource on which to revoke the permissions.
*
* @throws RequestValidationException
* @throws RequestExecutionException
*/
virtual future<> revoke(::shared_ptr<authenticated_user> performer, permission_set, data_resource, sstring from) = 0;
///
/// Query for the permissions granted directly to a role for a particular \ref resource (and not any of its
/// parents).
///
/// The optional role name is empty when an anonymous user is authorized. Some implementations may still wish to
/// grant default permissions in this case.
///
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const = 0;
/**
* Returns a list of permissions on a resource of a user.
*
* @param performer User who wants to see the permissions.
* @param permissions Set of Permission values the user is interested in. The result should only include the matching ones.
* @param resource The resource on which permissions are requested. Can be null, in which case permissions on all resources
* should be returned.
* @param of The user whose permissions are requested. Can be null, in which case permissions of every user should be returned.
*
* @return All of the matching permissions that the requesting user is authorized to know about.
*
* @throws RequestValidationException
* @throws RequestExecutionException
*/
virtual future<std::vector<permission_details>> list(::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const = 0;
///
/// Grant a set of permissions to a role for a particular \ref resource.
///
/// \throws \ref unsupported_authorization_operation if granting permissions is not supported.
///
virtual future<> grant(stdx::string_view role_name, permission_set, const resource&) const = 0;
/**
* This method is called before deleting a user with DROP USER query so that a new user with the same
* name wouldn't inherit permissions of the deleted user in the future.
*
* @param droppedUser The user to revoke all permissions from.
*/
virtual future<> revoke_all(sstring dropped_user) = 0;
///
/// Revoke a set of permissions from a role for a particular \ref resource.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke(stdx::string_view role_name, permission_set, const resource&) const = 0;
/**
* This method is called after a resource is removed (i.e. keyspace or a table is dropped).
*
* @param droppedResource The resource to revoke all permissions on.
*/
virtual future<> revoke_all(data_resource) = 0;
///
/// Query for all directly granted permissions.
///
/// \throws \ref unsupported_authorization_operation if listing permissions is not supported.
///
virtual future<std::vector<permission_details>> list_all() const = 0;
/**
* Set of resources that should be made inaccessible to users and only accessible internally.
*
* @return Keyspaces, column families that will be unmodifiable by users; other resources.
*/
virtual const resource_ids& protected_resources() = 0;
///
/// Revoke all permissions granted directly to a particular role.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke_all(stdx::string_view role_name) const = 0;
/**
* Validates configuration of IAuthorizer implementation (if configurable).
*
* @throws ConfigurationException when there is a configuration error.
*/
virtual future<> validate_configuration() const = 0;
///
/// Revoke all permissions granted to any role for a particular resource.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
virtual future<> revoke_all(const resource&) const = 0;
/**
* Setup is called once upon system startup to initialize the IAuthorizer.
*
* For example, use this method to create any required keyspaces/column families.
*/
static future<> setup(const sstring& type);
/**
* Returns the system authorizer. Must have called setup before calling this.
*/
static authorizer& get();
///
/// System resources used internally as part of the implementation. These are made inaccessible to users.
///
virtual const resource_set& protected_resources() const = 0;
};
}
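
The comparison operators on permission_details above compare role name, resource and permissions lexicographically through tuples of references. A minimal standalone sketch of the same idiom follows, assuming plain std::string and an unsigned mask as stand-ins for auth::resource and permission_set; the type name permission_details_sketch is hypothetical and not part of the source.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>

// Hypothetical stand-in for auth::permission_details.
// `resource` and `permissions` are simplified to a string and a bit mask.
struct permission_details_sketch {
    std::string role_name;
    std::string resource;
    unsigned permissions;
};

// std::tie builds tuples of references, so no member is copied;
// tuple comparison is lexicographic, member by member.
inline bool operator==(const permission_details_sketch& a, const permission_details_sketch& b) {
    return std::tie(a.role_name, a.resource, a.permissions)
            == std::tie(b.role_name, b.resource, b.permissions);
}

inline bool operator!=(const permission_details_sketch& a, const permission_details_sketch& b) {
    return !(a == b);
}

inline bool operator<(const permission_details_sketch& a, const permission_details_sketch& b) {
    return std::tie(a.role_name, a.resource, a.permissions)
            < std::tie(b.role_name, b.resource, b.permissions);
}

// With operator< defined, the type can be stored in ordered containers.
inline std::size_t count_unique(const permission_details_sketch& a,
                                const permission_details_sketch& b,
                                const permission_details_sketch& c) {
    std::set<permission_details_sketch> s{a, b, c};
    return s.size();
}
```

Defining all three operators in terms of the same member list keeps equality and ordering consistent with each other, which ordered containers rely on.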

auth/common.cc (new file, 97 lines)

@@ -0,0 +1,97 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/common.hh"
#include <seastar/core/shared_ptr.hh>
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
#include "database.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
namespace auth {
namespace meta {
const sstring DEFAULT_SUPERUSER_NAME("cassandra");
const sstring AUTH_KS("system_auth");
const sstring USERS_CF("users");
const sstring AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
}
static logging::logger auth_log("auth");
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func) {
struct empty_state { };
return delay_until_system_ready(as).then([&as, func = std::move(func)] () mutable {
return exponential_backoff_retry::do_until_value(1s, 1min, as, [func = std::move(func)] {
return func().then_wrapped([] (auto&& f) -> stdx::optional<empty_state> {
if (f.failed()) {
auth_log.info("Auth task failed with error, rescheduling: {}", f.get_exception());
return { };
}
return { empty_state() };
});
});
}).discard_result();
}
future<> create_metadata_table_if_missing(
stdx::string_view table_name,
cql3::query_processor& qp,
stdx::string_view cql,
::service::migration_manager& mm) {
auto& db = qp.db().local();
if (db.has_schema(meta::AUTH_KS, sstring(table_name))) {
return make_ready_future<>();
}
auto parsed_statement = static_pointer_cast<cql3::statements::raw::cf_statement>(
cql3::query_processor::parse_statement(cql));
parsed_statement->prepare_keyspace(meta::AUTH_KS);
auto statement = static_pointer_cast<cql3::statements::create_table_statement>(
parsed_statement->prepare(db, qp.get_cql_stats())->statement);
const auto schema = statement->get_cf_meta_data(qp.db().local());
const auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
schema_builder b(schema);
b.set_uuid(uuid);
return mm.announce_new_column_family(b.build(), false);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db) {
static const auto pause = [] { return sleep(std::chrono::milliseconds(500)); };
return do_until([&db] { return db.get_version() != database::empty_version; }, pause).then([&mm] {
return do_until([&mm] { return mm.have_schema_agreement(); }, pause);
});
}
}
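
do_after_system_ready() above retries a task through exponential_backoff_retry::do_until_value(1s, 1min, ...) until it succeeds. The retry policy itself is not shown in this diff, so the following is only a synchronous sketch of the general idea, assuming the delay doubles after each failure and is capped at the maximum. The helper name retry_with_backoff and its wait callback are hypothetical and do not exist in Scylla or Seastar.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: calls `task` until it returns true.
// After each failure it hands the current delay to `wait` (which may sleep),
// then doubles the delay up to `cap`. Returns the delays that were used.
// Like the original, `task` must support being invoked more than once.
inline std::vector<std::chrono::seconds> retry_with_backoff(
        const std::function<bool()>& task,
        std::chrono::seconds base,
        std::chrono::seconds cap,
        const std::function<void(std::chrono::seconds)>& wait) {
    std::vector<std::chrono::seconds> waits;
    std::chrono::seconds delay = base;
    while (!task()) {
        wait(delay);
        waits.push_back(delay);
        // Assumed policy: double the delay, never exceeding the cap.
        delay = std::min<std::chrono::seconds>(delay * 2, cap);
    }
    return waits;
}
```

Passing the wait step in as a callback keeps the sketch testable without real sleeping; the Seastar version instead returns a future and observes an abort_source.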

auth/common.hh (new file, 85 lines)

@@ -0,0 +1,85 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <chrono>
#include <experimental/string_view>
#include <seastar/core/future.hh>
#include <seastar/core/abort_source.hh>
#include <seastar/util/noncopyable_function.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/resource.hh>
#include <seastar/core/sstring.hh>
#include "log.hh"
#include "seastarx.hh"
#include "utils/exponential_backoff_retry.hh"
using namespace std::chrono_literals;
class database;
namespace service {
class migration_manager;
}
namespace cql3 {
class query_processor;
}
namespace auth {
namespace meta {
extern const sstring DEFAULT_SUPERUSER_NAME;
extern const sstring AUTH_KS;
extern const sstring USERS_CF;
extern const sstring AUTH_PACKAGE_NAME;
}
template <class Task>
future<> once_among_shards(Task&& f) {
if (engine().cpu_id() == 0u) {
return f();
}
return make_ready_future<>();
}
inline future<> delay_until_system_ready(seastar::abort_source& as) {
return sleep_abortable(15s, as);
}
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);
future<> create_metadata_table_if_missing(
stdx::string_view table_name,
cql3::query_processor&,
stdx::string_view cql,
::service::migration_manager&);
future<> wait_for_schema_agreement(::service::migration_manager&, const database&);
}


@@ -1,171 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "data_resource.hh"
#include <regex>
#include "service/storage_proxy.hh"
const sstring auth::data_resource::ROOT_NAME("data");
auth::data_resource::data_resource(level l, const sstring& ks, const sstring& cf)
: _level(l), _ks(ks), _cf(cf)
{
}
auth::data_resource::data_resource()
: data_resource(level::ROOT)
{}
auth::data_resource::data_resource(const sstring& ks)
: data_resource(level::KEYSPACE, ks)
{}
auth::data_resource::data_resource(const sstring& ks, const sstring& cf)
: data_resource(level::COLUMN_FAMILY, ks, cf)
{}
auth::data_resource::level auth::data_resource::get_level() const {
return _level;
}
auth::data_resource auth::data_resource::from_name(
const sstring& s) {
static std::regex slash_regex("/");
auto i = std::regex_token_iterator<sstring::const_iterator>(s.begin(),
s.end(), slash_regex, -1);
auto e = std::regex_token_iterator<sstring::const_iterator>();
auto n = std::distance(i, e);
if (n > 3 || ROOT_NAME != sstring(*i++)) {
throw std::invalid_argument(sprint("%s is not a valid data resource name", s));
}
if (n == 1) {
return data_resource();
}
auto ks = *i++;
if (n == 2) {
return data_resource(ks.str());
}
auto cf = *i++;
return data_resource(ks.str(), cf.str());
}
sstring auth::data_resource::name() const {
switch (get_level()) {
case level::ROOT:
return ROOT_NAME;
case level::KEYSPACE:
return sprint("%s/%s", ROOT_NAME, _ks);
case level::COLUMN_FAMILY:
default:
return sprint("%s/%s/%s", ROOT_NAME, _ks, _cf);
}
}
auth::data_resource auth::data_resource::get_parent() const {
switch (get_level()) {
case level::KEYSPACE:
return data_resource();
case level::COLUMN_FAMILY:
return data_resource(_ks);
default:
throw std::invalid_argument("Root-level resource can't have a parent");
}
}
const sstring& auth::data_resource::keyspace() const {
if (is_root_level()) {
throw std::invalid_argument("ROOT data resource has no keyspace");
}
return _ks;
}
const sstring& auth::data_resource::column_family() const {
if (!is_column_family_level()) {
throw std::invalid_argument(sprint("%s data resource has no column family", name()));
}
return _cf;
}
bool auth::data_resource::has_parent() const {
return !is_root_level();
}
bool auth::data_resource::exists() const {
switch (get_level()) {
case level::ROOT:
return true;
case level::KEYSPACE:
return service::get_local_storage_proxy().get_db().local().has_keyspace(_ks);
case level::COLUMN_FAMILY:
default:
return service::get_local_storage_proxy().get_db().local().has_schema(_ks, _cf);
}
}
sstring auth::data_resource::to_string() const {
switch (get_level()) {
case level::ROOT:
return "<all keyspaces>";
case level::KEYSPACE:
return sprint("<keyspace %s>", _ks);
case level::COLUMN_FAMILY:
default:
return sprint("<table %s.%s>", _ks, _cf);
}
}
bool auth::data_resource::operator==(const data_resource& v) const {
return _ks == v._ks && _cf == v._cf;
}
bool auth::data_resource::operator<(const data_resource& v) const {
return _ks < v._ks ? true : (v._ks < _ks ? false : _cf < v._cf);
}
std::ostream& auth::operator<<(std::ostream& os, const data_resource& r) {
return os << r.to_string();
}
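
The removed data_resource::from_name() and name() above convert between a three-level resource (root, keyspace, column family) and a slash-separated name such as data/ks/cf. A simplified standalone sketch of that round trip follows, using std::string and manual splitting instead of sstring and std::regex. The names resource_sketch, parse_resource_name and resource_name are hypothetical, and the handling of empty components may differ from the regex-based original.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for auth::data_resource.
struct resource_sketch {
    enum class level { root, keyspace, column_family };
    level lvl = level::root;
    std::string ks;
    std::string cf;
};

// Parse "data", "data/<ks>" or "data/<ks>/<cf>".
inline resource_sketch parse_resource_name(const std::string& s) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const auto pos = s.find('/', start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));
            break;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    // At most three components, and the first must be the root name.
    if (parts.size() > 3 || parts[0] != "data") {
        throw std::invalid_argument(s + " is not a valid data resource name");
    }
    resource_sketch r;
    if (parts.size() >= 2) {
        r.lvl = resource_sketch::level::keyspace;
        r.ks = parts[1];
    }
    if (parts.size() == 3) {
        r.lvl = resource_sketch::level::column_family;
        r.cf = parts[2];
    }
    return r;
}

// Inverse of parse_resource_name for well-formed resources.
inline std::string resource_name(const resource_sketch& r) {
    switch (r.lvl) {
    case resource_sketch::level::root:
        return "data";
    case resource_sketch::level::keyspace:
        return "data/" + r.ks;
    case resource_sketch::level::column_family:
        break;
    }
    return "data/" + r.ks + "/" + r.cf;
}
```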


@@ -1,159 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "utils/hash.hh"
#include <iosfwd>
#include <set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth {
class data_resource {
private:
enum class level {
ROOT, KEYSPACE, COLUMN_FAMILY
};
static const sstring ROOT_NAME;
level _level;
sstring _ks;
sstring _cf;
data_resource(level, const sstring& ks = {}, const sstring& cf = {});
level get_level() const;
public:
/**
* Creates a DataResource representing the root-level resource.
* @return the root-level resource.
*/
data_resource();
/**
* Creates a DataResource representing a keyspace.
*
* @param keyspace Name of the keyspace.
*/
data_resource(const sstring& ks);
/**
* Creates a DataResource instance representing a column family.
*
* @param keyspace Name of the keyspace.
* @param columnFamily Name of the column family.
*/
data_resource(const sstring& ks, const sstring& cf);
/**
* Parses a data resource name into a DataResource instance.
*
* @param name Name of the data resource.
* @return DataResource instance matching the name.
*/
static data_resource from_name(const sstring&);
/**
* @return Printable name of the resource.
*/
sstring name() const;
/**
* @return Parent of the resource, if any. Throws std::invalid_argument if it's the root-level resource.
*/
data_resource get_parent() const;
bool is_root_level() const {
return get_level() == level::ROOT;
}
bool is_keyspace_level() const {
return get_level() == level::KEYSPACE;
}
bool is_column_family_level() const {
return get_level() == level::COLUMN_FAMILY;
}
/**
* @return keyspace of the resource.
* @throws std::invalid_argument if it's the root-level resource.
*/
const sstring& keyspace() const;
/**
* @return column family of the resource.
* @throws std::invalid_argument if it's not a cf-level resource.
*/
const sstring& column_family() const;
/**
* @return Whether or not the resource has a parent in the hierarchy.
*/
bool has_parent() const;
/**
* @return Whether or not the resource exists in scylla.
*/
bool exists() const;
sstring to_string() const;
bool operator==(const data_resource&) const;
bool operator<(const data_resource&) const;
size_t hash_value() const {
return utils::tuple_hash()(_ks, _cf);
}
};
/**
* Resource id mappings, i.e. keyspace and/or column families.
*/
using resource_ids = std::set<data_resource>;
std::ostream& operator<<(std::ostream&, const data_resource&);
}


@@ -39,181 +39,283 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <unistd.h>
#include <crypt.h>
#include <random>
#include <chrono>
#include "auth/default_authorizer.hh"
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
#include <chrono>
#include <random>
#include <boost/algorithm/string/join.hpp>
#include <boost/range.hpp>
#include <seastar/core/reactor.hh>
#include "auth.hh"
#include "default_authorizer.hh"
#include "authenticated_user.hh"
#include "permission.hh"
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/permission.hh"
#include "auth/role_or_anonymous.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
const sstring auth::default_authorizer::DEFAULT_AUTHORIZER_NAME(
"org.apache.cassandra.auth.CassandraAuthorizer");
namespace auth {
static const sstring USER_NAME = "username";
const sstring& default_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "CassandraAuthorizer";
return name;
}
static const sstring ROLE_NAME = "role";
static const sstring RESOURCE_NAME = "resource";
static const sstring PERMISSIONS_NAME = "permissions";
static const sstring PERMISSIONS_CF = "permissions";
static const sstring PERMISSIONS_CF = "role_permissions";
static logging::logger alogger("default_authorizer");
auth::default_authorizer::default_authorizer() {
}
auth::default_authorizer::~default_authorizer() {
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authorizer,
default_authorizer,
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.CassandraAuthorizer");
default_authorizer::default_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm) {
}
future<> auth::default_authorizer::init() {
sstring create_table = sprint("CREATE TABLE %s.%s ("
"%s text,"
"%s text,"
"%s set<text>,"
"PRIMARY KEY(%s, %s)"
") WITH gc_grace_seconds=%d", auth::auth::AUTH_KS,
PERMISSIONS_CF, USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME,
USER_NAME, RESOURCE_NAME, 90 * 24 * 60 * 60); // 3 months.
return auth::setup_table(PERMISSIONS_CF, create_table);
default_authorizer::~default_authorizer() {
}
static const sstring legacy_table_name{"permissions"};
future<auth::permission_set> auth::default_authorizer::authorize(
::shared_ptr<authenticated_user> user, data_resource resource) const {
return user->is_super().then([this, user, resource = std::move(resource)](bool is_super) {
if (is_super) {
return make_ready_future<permission_set>(permissions::ALL);
}
bool default_authorizer::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
/**
* TODO: could create actual data type for permission (translating string<->perm),
* but this seems overkill right now. We still must store strings so...
*/
auto& qp = cql3::get_local_query_processor();
auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? AND %s = ?"
, PERMISSIONS_NAME, auth::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return qp.process(query, db::consistency_level::LOCAL_ONE, {user->name(), resource.name() })
.then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
future<bool> default_authorizer::any_granted() const {
static const sstring query = sprint("SELECT * FROM %s.%s LIMIT 1", meta::AUTH_KS, PERMISSIONS_CF);
if (res->empty() || !res->one().has(PERMISSIONS_NAME)) {
return make_ready_future<permission_set>(permissions::NONE);
}
return make_ready_future<permission_set>(permissions::from_strings(res->one().get_set<sstring>(PERMISSIONS_NAME)));
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to authorize {} for {}", user->name(), resource);
return make_ready_future<permission_set>(permissions::NONE);
}
});
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
});
}
#include <boost/range.hpp>
future<> default_authorizer::migrate_legacy_metadata() const {
alogger.info("Starting migration of legacy permissions metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
future<> auth::default_authorizer::modify(
::shared_ptr<authenticated_user> performer, permission_set set,
data_resource resource, sstring user, sstring op) {
// TODO: why does this not check super user?
auto& qp = cql3::get_local_query_processor();
auto query = sprint("UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?",
auth::AUTH_KS, PERMISSIONS_CF, PERMISSIONS_NAME,
PERMISSIONS_NAME, op, USER_NAME, RESOURCE_NAME);
return qp.process(query, db::consistency_level::ONE, {
permissions::to_strings(set), user, resource.name() }).discard_result();
}
future<> auth::default_authorizer::grant(
::shared_ptr<authenticated_user> performer, permission_set set,
data_resource resource, sstring to) {
return modify(std::move(performer), std::move(set), std::move(resource), std::move(to), "+");
}
future<> auth::default_authorizer::revoke(
::shared_ptr<authenticated_user> performer, permission_set set,
data_resource resource, sstring from) {
return modify(std::move(performer), std::move(set), std::move(resource), std::move(from), "-");
}
future<std::vector<auth::permission_details>> auth::default_authorizer::list(
::shared_ptr<authenticated_user> performer, permission_set set,
optional<data_resource> resource, optional<sstring> user) const {
return performer->is_super().then([this, performer, set = std::move(set), resource = std::move(resource), user = std::move(user)](bool is_super) {
if (!is_super && (!user || performer->name() != *user)) {
throw exceptions::unauthorized_exception(sprint("You are not authorized to view %s's permissions", user ? *user : "everyone"));
}
auto query = sprint("SELECT %s, %s, %s FROM %s.%s", USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME, auth::AUTH_KS, PERMISSIONS_CF);
auto& qp = cql3::get_local_query_processor();
// Oh, look, it is a case where it does not pay off to have
// parameters to process in an initializer list.
future<::shared_ptr<cql3::untyped_result_set>> f = make_ready_future<::shared_ptr<cql3::untyped_result_set>>();
if (resource && user) {
query += sprint(" WHERE %s = ? AND %s = ?", USER_NAME, RESOURCE_NAME);
f = qp.process(query, db::consistency_level::ONE, {*user, resource->name()});
} else if (resource) {
query += sprint(" WHERE %s = ? ALLOW FILTERING", RESOURCE_NAME);
f = qp.process(query, db::consistency_level::ONE, {resource->name()});
} else if (user) {
query += sprint(" WHERE %s = ?", USER_NAME);
f = qp.process(query, db::consistency_level::ONE, {*user});
} else {
f = qp.process(query, db::consistency_level::ONE, {});
}
return f.then([set](::shared_ptr<cql3::untyped_result_set> res) {
std::vector<permission_details> result;
for (auto& row : *res) {
if (row.has(PERMISSIONS_NAME)) {
auto username = row.get_as<sstring>(USER_NAME);
auto resource = data_resource::from_name(row.get_as<sstring>(RESOURCE_NAME));
auto ps = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
ps = permission_set::from_mask(ps.mask() & set.mask());
result.emplace_back(permission_details {username, resource, ps});
}
}
return make_ready_future<std::vector<permission_details>>(std::move(result));
});
return _qp.process(
query,
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
parse_resource(row.get_as<sstring>(RESOURCE_NAME)),
[this, &row](const auto& username, const auto& r) {
const permission_set perms = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
return grant(username, perms, r);
});
}).finally([results] {});
}).then([] {
alogger.info("Finished migrating legacy permissions metadata.");
}).handle_exception([](std::exception_ptr ep) {
alogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> auth::default_authorizer::revoke_all(sstring dropped_user) {
auto& qp = cql3::get_local_query_processor();
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?", auth::AUTH_KS,
PERMISSIONS_CF, USER_NAME);
return qp.process(query, db::consistency_level::ONE, { dropped_user }).discard_result().handle_exception(
[dropped_user](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", dropped_user, e);
future<> default_authorizer::start() {
static const sstring create_table = sprint(
"CREATE TABLE %s.%s ("
"%s text,"
"%s text,"
"%s set<text>,"
"PRIMARY KEY(%s, %s)"
") WITH gc_grace_seconds=%d",
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
ROLE_NAME,
RESOURCE_NAME,
90 * 24 * 60 * 60); // 3 months.
return once_among_shards([this] {
return create_metadata_table_if_missing(
PERMISSIONS_CF,
_qp,
create_table,
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
if (legacy_metadata_exists()) {
if (!any_granted().get0()) {
migrate_legacy_metadata().get0();
return;
}
});
alogger.warn("Ignoring legacy permissions metadata since role permissions exist.");
}
});
});
});
});
}
future<> auth::default_authorizer::revoke_all(data_resource resource) {
auto& qp = cql3::get_local_query_processor();
auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? ALLOW FILTERING",
USER_NAME, auth::AUTH_KS, PERMISSIONS_CF, RESOURCE_NAME);
return qp.process(query, db::consistency_level::LOCAL_ONE, { resource.name() })
.then_wrapped([resource, &qp](future<::shared_ptr<cql3::untyped_result_set>> f) {
future<> default_authorizer::stop() {
_as.request_abort();
return _finished.handle_exception_type([](const sleep_aborted&) {});
}
future<permission_set>
default_authorizer::authorize(const role_or_anonymous& maybe_role, const resource& r) const {
if (is_anonymous(maybe_role)) {
return make_ready_future<permission_set>(permissions::NONE);
}
static const sstring query = sprint(
"SELECT %s FROM %s.%s WHERE %s = ? AND %s = ?",
PERMISSIONS_NAME,
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
}
return permissions::from_strings(results->one().get_set<sstring>(PERMISSIONS_NAME));
});
}
future<>
default_authorizer::modify(
stdx::string_view role_name,
permission_set set,
const resource& resource,
stdx::string_view op) const {
return do_with(
sprint(
"UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?",
meta::AUTH_KS,
PERMISSIONS_CF,
PERMISSIONS_NAME,
PERMISSIONS_NAME,
op,
ROLE_NAME,
RESOURCE_NAME),
[this, &role_name, set, &resource](const auto& query) {
return _qp.process(
query,
db::consistency_level::ONE,
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
future<> default_authorizer::grant(stdx::string_view role_name, permission_set set, const resource& resource) const {
return modify(role_name, std::move(set), resource, "+");
}
future<> default_authorizer::revoke(stdx::string_view role_name, permission_set set, const resource& resource) const {
return modify(role_name, std::move(set), resource, "-");
}
future<std::vector<permission_details>> default_authorizer::list_all() const {
static const sstring query = sprint(
"SELECT %s, %s, %s FROM %s.%s",
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
meta::AUTH_KS,
PERMISSIONS_CF);
return _qp.process(
query,
db::consistency_level::ONE,
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
for (const auto& row : *results) {
if (row.has(PERMISSIONS_NAME)) {
auto role_name = row.get_as<sstring>(ROLE_NAME);
auto resource = parse_resource(row.get_as<sstring>(RESOURCE_NAME));
auto perms = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
all_details.push_back(permission_details{std::move(role_name), std::move(resource), std::move(perms)});
}
}
return all_details;
});
}
future<> default_authorizer::revoke_all(stdx::string_view role_name) const {
static const sstring query = sprint(
"DELETE FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME);
return _qp.process(
query,
db::consistency_level::ONE,
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
}
});
}
future<> default_authorizer::revoke_all(const resource& resource) const {
static const sstring query = sprint(
"SELECT %s FROM %s.%s WHERE %s = ? ALLOW FILTERING",
ROLE_NAME,
meta::AUTH_KS,
PERMISSIONS_CF,
RESOURCE_NAME);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
return parallel_for_each(res->begin(), res->end(), [&qp, res, resource](const cql3::untyped_result_set::row& r) {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ? AND %s = ?"
, auth::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return qp.process(query, db::consistency_level::LOCAL_ONE, { r.get_as<sstring>(USER_NAME), resource.name() })
.discard_result().handle_exception([resource](auto ep) {
return parallel_for_each(
res->begin(),
res->end(),
[this, res, resource](const cql3::untyped_result_set::row& r) {
static const sstring query = sprint(
"DELETE FROM %s.%s WHERE %s = ? AND %s = ?",
meta::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
@@ -229,12 +331,9 @@ future<> auth::default_authorizer::revoke_all(data_resource resource) {
});
}
const auth::resource_ids& auth::default_authorizer::protected_resources() {
static const resource_ids ids({ data_resource(auth::AUTH_KS, PERMISSIONS_CF) });
return ids;
const resource_set& default_authorizer::protected_resources() const {
static const resource_set resources({ make_data_resource(meta::AUTH_KS, PERMISSIONS_CF) });
return resources;
}
future<> auth::default_authorizer::validate_configuration() const {
return make_ready_future();
}
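
The startup path in default_authorizer::start() above comes down to a small decision: the legacy "permissions" table is migrated only when it exists and nothing has been granted in "role_permissions" yet. The following is a minimal standalone sketch of that decision, not the Scylla implementation. The names migration_action and decide_migration are hypothetical and introduced only for illustration. The two booleans stand in for the results of legacy_metadata_exists() and any_granted().

```cpp
#include <cassert>

// Hypothetical names, for illustration only; not part of the Scylla source.
enum class migration_action { none, migrate_legacy, ignore_legacy };

// Mirrors the branch structure of the lambda that start() hands to
// do_after_system_ready(): no legacy table means nothing to do; a legacy
// table is migrated only if no role permissions have been granted so far,
// otherwise it is ignored (the real code logs a warning in that case).
constexpr migration_action decide_migration(bool legacy_table_exists, bool any_granted) {
    if (!legacy_table_exists) {
        return migration_action::none;
    }
    return any_granted ? migration_action::ignore_legacy
                       : migration_action::migrate_legacy;
}

static_assert(decide_migration(true, false) == migration_action::migrate_legacy,
              "legacy data with no grants should be migrated");
```

Keeping the decision separate from the futures and the abort source makes the ordering of the checks easy to test in isolation.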

@@ -41,37 +41,62 @@
#pragma once
#include "authorizer.hh"
#include <functional>
#include <seastar/core/abort_source.hh>
#include "auth/authorizer.hh"
#include "cql3/query_processor.hh"
#include "service/migration_manager.hh"
namespace auth {
class default_authorizer : public authorizer {
public:
static const sstring DEFAULT_AUTHORIZER_NAME;
const sstring& default_authorizer_name();
class default_authorizer : public authorizer {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
abort_source _as{};
future<> _finished{make_ready_future<>()};
public:
default_authorizer(cql3::query_processor&, ::service::migration_manager&);
default_authorizer();
~default_authorizer();
future<> init();
virtual future<> start() override;
future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const override;
virtual future<> stop() override;
future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;
virtual const sstring& qualified_java_name() const override {
return default_authorizer_name();
}
future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override;
future<std::vector<permission_details>> list(::shared_ptr<authenticated_user>, permission_set, optional<data_resource>, optional<sstring>) const override;
virtual future<> grant(stdx::string_view, permission_set, const resource&) const override;
future<> revoke_all(sstring) override;
virtual future<> revoke( stdx::string_view, permission_set, const resource&) const override;
future<> revoke_all(data_resource) override;
virtual future<std::vector<permission_details>> list_all() const override;
const resource_ids& protected_resources() override;
virtual future<> revoke_all(stdx::string_view) const override;
future<> validate_configuration() const override;
virtual future<> revoke_all(const resource&) const override;
virtual const resource_set& protected_resources() const override;
private:
future<> modify(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring, sstring);
bool legacy_metadata_exists() const;
future<bool> any_granted() const;
future<> migrate_legacy_metadata() const;
future<> modify(stdx::string_view, permission_set, const resource&, stdx::string_view) const;
};
} /* namespace auth */

@@ -39,35 +39,57 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <unistd.h>
#include <crypt.h>
#include <random>
#include <chrono>
#include "auth/password_authenticator.hh"
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
#include <algorithm>
#include <chrono>
#include <random>
#include <boost/algorithm/cxx11/all_of.hpp>
#include <seastar/core/reactor.hh>
#include "auth.hh"
#include "password_authenticator.hh"
#include "authenticated_user.hh"
#include "cql3/query_processor.hh"
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/roles-metadata.hh"
#include "cql3/untyped_result_set.hh"
#include "log.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
const sstring auth::password_authenticator::PASSWORD_AUTHENTICATOR_NAME("org.apache.cassandra.auth.PasswordAuthenticator");
namespace auth {
const sstring& password_authenticator_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "PasswordAuthenticator";
return name;
}
// name of the hash column.
static const sstring SALTED_HASH = "salted_hash";
static const sstring USER_NAME = "username";
static const sstring DEFAULT_USER_NAME = auth::auth::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = auth::auth::DEFAULT_SUPERUSER_NAME;
static const sstring CREDENTIALS_CF = "credentials";
static const sstring DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = meta::DEFAULT_SUPERUSER_NAME;
static logging::logger plogger("password_authenticator");
auth::password_authenticator::~password_authenticator()
{}
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
password_authenticator,
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
auth::password_authenticator::password_authenticator()
{}
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm)
, _stopped(make_ready_future<>()) {
}
// TODO: blowfish
// Origin uses Java bcrypt library, i.e. blowfish salt
@@ -88,12 +110,10 @@ auth::password_authenticator::password_authenticator()
// and some old-fashioned random salt generation.
static constexpr size_t rand_bytes = 16;
static thread_local crypt_data tlcrypt = { 0, };
static sstring hashpw(const sstring& pass, const sstring& salt) {
// crypt_data is huge. should this be a thread_local static?
auto tmp = std::make_unique<crypt_data>();
tmp->initialized = 0;
auto res = crypt_r(pass.c_str(), salt.c_str(), tmp.get());
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (res == nullptr) {
throw std::system_error(errno, std::system_category());
}
@@ -122,17 +142,16 @@ static sstring gensalt() {
sstring salt;
if (!prefix.empty()) {
return prefix + salt;
return prefix + input;
}
auto tmp = std::make_unique<crypt_data>();
tmp->initialized = 0;
// Try in order:
// blowfish 2011 fix, blowfish, sha512, sha256, md5
for (sstring pfx : { "$2y$", "$2a$", "$6$", "$5$", "$1$" }) {
salt = pfx + input;
if (crypt_r("fisk", salt.c_str(), tmp.get())) {
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
prefix = pfx;
return salt;
}
@@ -144,63 +163,125 @@ static sstring hashpw(const sstring& pass) {
return hashpw(pass, gensalt());
}
future<> auth::password_authenticator::init() {
gensalt(); // do this once to determine usable hashing
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
return utf8_type->deserialize(row.get_blob(SALTED_HASH)) != data_value::make_null(utf8_type);
}
sstring create_table = sprint(
"CREATE TABLE %s.%s ("
"%s text,"
"%s text," // salt + hash + number of rounds
"options map<text,text>,"// for future extensions
"PRIMARY KEY(%s)"
") WITH gc_grace_seconds=%d",
auth::auth::AUTH_KS,
CREDENTIALS_CF, USER_NAME, SALTED_HASH, USER_NAME,
90 * 24 * 60 * 60); // 3 months.
static const sstring update_row_query = sprint(
"UPDATE %s SET %s = ? WHERE %s = ?",
meta::roles_table::qualified_name(),
SALTED_HASH,
meta::roles_table::role_col_name);
return auth::setup_table(CREDENTIALS_CF, create_table).then([this] {
// instead of once-timer, just schedule this later
auth::schedule_when_up([] {
return auth::has_existing_users(CREDENTIALS_CF, DEFAULT_USER_NAME, USER_NAME).then([](bool exists) {
if (!exists) {
cql3::get_local_query_processor().process(sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
auth::AUTH_KS,
CREDENTIALS_CF,
USER_NAME, SALTED_HASH
),
db::consistency_level::ONE, {DEFAULT_USER_NAME, hashpw(DEFAULT_USER_PASSWORD)}).then([](auto) {
plogger.info("Created default user '{}'", DEFAULT_USER_NAME);
});
}
});
});
static const sstring legacy_table_name{"credentials"};
bool password_authenticator::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
future<> password_authenticator::migrate_legacy_metadata() const {
plogger.info("Starting migration of legacy authentication metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::QUORUM).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
return _qp.process(
update_row_query,
consistency_for_user(username),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
plogger.info("Finished migrating legacy authentication metadata.");
}).handle_exception([](std::exception_ptr ep) {
plogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
db::consistency_level auth::password_authenticator::consistency_for_user(const sstring& username) {
if (username == DEFAULT_USER_NAME) {
future<> password_authenticator::create_default_if_missing() const {
return default_role_row_satisfies(_qp, &has_salted_hash).then([this](bool exists) {
if (!exists) {
return _qp.process(
update_row_query,
db::consistency_level::QUORUM,
{hashpw(DEFAULT_USER_PASSWORD), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
}
return make_ready_future<>();
});
}
future<> password_authenticator::start() {
return once_among_shards([this] {
gensalt(); // do this once to determine usable hashing
auto f = create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager);
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash).get0()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
}
return;
}
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get0();
return;
}
create_default_if_missing().get0();
});
});
return f;
});
}
future<> password_authenticator::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
}
db::consistency_level password_authenticator::consistency_for_user(stdx::string_view role_name) {
if (role_name == DEFAULT_USER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
const sstring& auth::password_authenticator::class_name() const {
return PASSWORD_AUTHENTICATOR_NAME;
const sstring& password_authenticator::qualified_java_name() const {
return password_authenticator_name();
}
bool auth::password_authenticator::require_authentication() const {
bool password_authenticator::require_authentication() const {
return true;
}
auth::authenticator::option_set auth::password_authenticator::supported_options() const {
return option_set::of<option::PASSWORD>();
authentication_option_set password_authenticator::supported_options() const {
return authentication_option_set{authentication_option::password};
}
auth::authenticator::option_set auth::password_authenticator::alterable_options() const {
return option_set::of<option::PASSWORD>();
authentication_option_set password_authenticator::alterable_options() const {
return authentication_option_set{authentication_option::password};
}
future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::authenticate(
future<authenticated_user> password_authenticator::authenticate(
const credentials_map& credentials) const {
if (!credentials.count(USERNAME_KEY)) {
throw exceptions::authentication_exception(sprint("Required key '%s' is missing", USERNAME_KEY));
@@ -218,17 +299,24 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
// Rely on query processing caching statements instead, and let's assume
// that a map lookup string->statement is not gonna kill us much.
return futurize_apply([this, username, password] {
auto& qp = cql3::get_local_query_processor();
return qp.process(sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), {username}, true);
static const sstring query = sprint(
"SELECT %s FROM %s WHERE %s = ?",
SALTED_HASH,
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_user(username),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>(username));
return make_ready_future<authenticated_user>(username);
} catch (std::system_error &) {
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
@@ -239,54 +327,60 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
});
}
future<> auth::password_authenticator::create(sstring username,
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME, SALTED_HASH);
auto& qp = cql3::get_local_query_processor();
return qp.process(query, consistency_for_user(username), { username, hashpw(password) }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
future<> password_authenticator::create(stdx::string_view role_name, const authentication_options& options) const {
if (!options.password) {
return make_ready_future<>();
}
return _qp.process(
update_row_query,
consistency_for_user(role_name),
{hashpw(*options.password), sstring(role_name)}).discard_result();
}
future<> auth::password_authenticator::alter(sstring username,
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("UPDATE %s.%s SET %s = ? WHERE %s = ?",
auth::AUTH_KS, CREDENTIALS_CF, SALTED_HASH, USER_NAME);
auto& qp = cql3::get_local_query_processor();
return qp.process(query, consistency_for_user(username), { hashpw(password), username }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
future<> password_authenticator::alter(stdx::string_view role_name, const authentication_options& options) const {
if (!options.password) {
return make_ready_future<>();
}
static const sstring query = sprint(
"UPDATE %s SET %s = ? WHERE %s = ?",
meta::roles_table::qualified_name(),
SALTED_HASH,
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_user(role_name),
{hashpw(*options.password), sstring(role_name)}).discard_result();
}
future<> auth::password_authenticator::drop(sstring username) {
try {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?",
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME);
auto& qp = cql3::get_local_query_processor();
return qp.process(query, consistency_for_user(username), { username }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
}
future<> password_authenticator::drop(stdx::string_view name) const {
static const sstring query = sprint(
"DELETE %s FROM %s WHERE %s = ?",
SALTED_HASH,
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(query, consistency_for_user(name), {sstring(name)}).discard_result();
}
const auth::resource_ids& auth::password_authenticator::protected_resources() const {
static const resource_ids ids({ data_resource(auth::AUTH_KS, CREDENTIALS_CF) });
return ids;
future<custom_options> password_authenticator::query_custom_options(stdx::string_view role_name) const {
return make_ready_future<custom_options>();
}
::shared_ptr<auth::authenticator::sasl_challenge> auth::password_authenticator::new_sasl_challenge() const {
class plain_text_password_challenge: public sasl_challenge {
const resource_set& password_authenticator::protected_resources() const {
static const resource_set resources({make_data_resource(meta::AUTH_KS, meta::roles_table::name)});
return resources;
}
::shared_ptr<authenticator::sasl_challenge> password_authenticator::new_sasl_challenge() const {
class plain_text_password_challenge : public sasl_challenge {
const password_authenticator& _self;
public:
plain_text_password_challenge(const password_authenticator& a)
: _authenticator(a)
{}
plain_text_password_challenge(const password_authenticator& self) : _self(self) {
}
/**
* SASL PLAIN mechanism specifies that credentials are encoded in a
@@ -336,16 +430,19 @@ const auth::resource_ids& auth::password_authenticator::protected_resources() co
_complete = true;
return {};
}
bool is_complete() const override {
return _complete;
}
future<::shared_ptr<authenticated_user>> get_authenticated_user() const override {
return _authenticator.authenticate(_credentials);
future<authenticated_user> get_authenticated_user() const override {
return _self.authenticate(_credentials);
}
private:
const password_authenticator& _authenticator;
credentials_map _credentials;
bool _complete = false;
};
return ::make_shared<plain_text_password_challenge>(*this);
}
}
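
The gensalt() hunk above tightens the scheme probe: a prefix is accepted only when crypt_r() returns a non-null result that does not begin with '*'. Some libcrypt implementations report an unsupported scheme with a failure token that starts with '*' rather than with a null pointer, which the old null-only check would have accepted. The sketch below shows that acceptance rule in isolation, under stated assumptions: pick_prefix and probe_fn are hypothetical names, and the probe callback stands in for crypt_r("fisk", salt, &data) so the example runs without linking libcrypt.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical helper types, for illustration only. `probe` plays the role
// of crypt_r(): it returns nullptr or a string beginning with '*' when the
// hashing scheme named by the salt prefix is not usable.
using probe_fn = std::function<const char*(const std::string& salt)>;

std::optional<std::string> pick_prefix(const std::string& input, const probe_fn& probe) {
    // Same order as in the diff: blowfish 2011 fix, blowfish, sha512, sha256, md5.
    for (const char* pfx : {"$2y$", "$2a$", "$6$", "$5$", "$1$"}) {
        const std::string salt = std::string(pfx) + input;
        const char* e = probe(salt);
        // Accept only a real hash: non-null and not a '*' failure token.
        if (e && e[0] != '*') {
            return std::string(pfx);
        }
    }
    return std::nullopt; // no usable scheme found
}
```

With a probe that only understands "$6$" salts and answers "*0" for everything else, pick_prefix skips both blowfish variants and settles on "$6$".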

@@ -41,32 +41,64 @@
#pragma once
#include "authenticator.hh"
#include <seastar/core/abort_source.hh>
#include "auth/authenticator.hh"
#include "cql3/query_processor.hh"
namespace service {
class migration_manager;
}
namespace auth {
class password_authenticator : public authenticator {
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
const sstring& password_authenticator_name();
class password_authenticator : public authenticator {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
future<> _stopped;
seastar::abort_source _as;
public:
static db::consistency_level consistency_for_user(stdx::string_view role_name);
password_authenticator(cql3::query_processor&, ::service::migration_manager&);
password_authenticator();
~password_authenticator();
future<> init();
virtual future<> start() override;
const sstring& class_name() const override;
bool require_authentication() const override;
option_set supported_options() const override;
option_set alterable_options() const override;
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override;
future<> create(sstring username, const option_map& options) override;
future<> alter(sstring username, const option_map& options) override;
future<> drop(sstring username) override;
const resource_ids& protected_resources() const override;
::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
virtual future<> stop() override;
virtual const sstring& qualified_java_name() const override;
static db::consistency_level consistency_for_user(const sstring& username);
virtual bool require_authentication() const override;
virtual authentication_option_set supported_options() const override;
virtual authentication_option_set alterable_options() const override;
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override;
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const override;
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const override;
virtual future<> drop(stdx::string_view role_name) const override;
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override;
virtual const resource_set& protected_resources() const override;
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
private:
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata() const;
future<> create_default_if_missing() const;
};
}


@@ -39,32 +39,33 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <unordered_map>
#include <boost/algorithm/string.hpp>
#include "permission.hh"
#include "auth/permission.hh"
#include <boost/algorithm/string.hpp>
#include <unordered_map>
const auth::permission_set auth::permissions::ALL = auth::permission_set::of<
auth::permission::CREATE,
auth::permission::ALTER,
auth::permission::DROP,
auth::permission::SELECT,
auth::permission::MODIFY,
auth::permission::AUTHORIZE,
auth::permission::DESCRIBE>();
const auth::permission_set auth::permissions::ALL_DATA =
auth::permission_set::of<auth::permission::CREATE,
auth::permission::ALTER, auth::permission::DROP,
auth::permission::SELECT,
auth::permission::MODIFY,
auth::permission::AUTHORIZE>();
const auth::permission_set auth::permissions::ALL = auth::permissions::ALL_DATA;
const auth::permission_set auth::permissions::NONE;
const auth::permission_set auth::permissions::ALTERATIONS =
auth::permission_set::of<auth::permission::CREATE,
auth::permission::ALTER, auth::permission::DROP>();
static const std::unordered_map<sstring, auth::permission> permission_names({
{ "READ", auth::permission::READ },
{ "WRITE", auth::permission::WRITE },
{ "CREATE", auth::permission::CREATE },
{ "ALTER", auth::permission::ALTER },
{ "DROP", auth::permission::DROP },
{ "SELECT", auth::permission::SELECT },
{ "MODIFY", auth::permission::MODIFY },
{ "AUTHORIZE", auth::permission::AUTHORIZE },
});
{"READ", auth::permission::READ},
{"WRITE", auth::permission::WRITE},
{"CREATE", auth::permission::CREATE},
{"ALTER", auth::permission::ALTER},
{"DROP", auth::permission::DROP},
{"SELECT", auth::permission::SELECT},
{"MODIFY", auth::permission::MODIFY},
{"AUTHORIZE", auth::permission::AUTHORIZE},
{"DESCRIBE", auth::permission::DESCRIBE}});
const sstring& auth::permissions::to_string(permission p) {
for (auto& v : permission_names) {


@@ -42,10 +42,11 @@
#pragma once
#include <unordered_set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include "enum_set.hh"
#include "seastarx.hh"
namespace auth {
@@ -66,9 +67,13 @@ enum class permission {
// permission management
AUTHORIZE, // required for GRANT and REVOKE.
DESCRIBE, // required on the root-level role resource to list all roles.
};
typedef enum_set<super_enum<permission,
typedef enum_set<
super_enum<
permission,
permission::READ,
permission::WRITE,
permission::CREATE,
@@ -76,16 +81,15 @@ typedef enum_set<super_enum<permission,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>> permission_set;
permission::AUTHORIZE,
permission::DESCRIBE>> permission_set;
bool operator<(const permission_set&, const permission_set&);
namespace permissions {
extern const permission_set ALL_DATA;
extern const permission_set ALL;
extern const permission_set NONE;
extern const permission_set ALTERATIONS;
const sstring& to_string(permission);
permission from_string(const sstring&);
@@ -93,7 +97,6 @@ permission from_string(const sstring&);
std::unordered_set<sstring> to_strings(const permission_set&);
permission_set from_strings(const std::unordered_set<sstring>&);
}
}

53
auth/permissions_cache.cc Normal file

@@ -0,0 +1,53 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/permissions_cache.hh"
#include "auth/authorizer.hh"
#include "auth/common.hh"
#include "auth/service.hh"
#include "db/config.hh"
namespace auth {
permissions_cache_config permissions_cache_config::from_db_config(const db::config& dc) {
permissions_cache_config c;
c.max_entries = dc.permissions_cache_max_entries();
c.validity_period = std::chrono::milliseconds(dc.permissions_validity_in_ms());
c.update_period = std::chrono::milliseconds(dc.permissions_update_interval_in_ms());
return c;
}
permissions_cache::permissions_cache(const permissions_cache_config& c, service& ser, logging::logger& log)
: _cache(c.max_entries, c.validity_period, c.update_period, log, [&ser, &log](const key_type& k) {
log.debug("Refreshing permissions for {}", k.first);
return ser.get_uncached_permissions(k.first, k.second);
}) {
}
future<permission_set> permissions_cache::get(const role_or_anonymous& maybe_role, const resource& r) {
return do_with(key_type(maybe_role, r), [this](const auto& k) {
return _cache.get(k);
});
}
}

91
auth/permissions_cache.hh Normal file

@@ -0,0 +1,91 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <chrono>
#include <experimental/string_view>
#include <functional>
#include <iostream>
#include <optional>
#include <utility>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "auth/authenticated_user.hh"
#include "auth/permission.hh"
#include "auth/resource.hh"
#include "auth/role_or_anonymous.hh"
#include "log.hh"
#include "stdx.hh"
#include "utils/hash.hh"
#include "utils/loading_cache.hh"
namespace std {
inline std::ostream& operator<<(std::ostream& os, const pair<auth::role_or_anonymous, auth::resource>& p) {
os << "{role: " << p.first << ", resource: " << p.second << "}";
return os;
}
}
namespace db {
class config;
}
namespace auth {
class service;
struct permissions_cache_config final {
static permissions_cache_config from_db_config(const db::config&);
std::size_t max_entries;
std::chrono::milliseconds validity_period;
std::chrono::milliseconds update_period;
};
class permissions_cache final {
using cache_type = utils::loading_cache<
std::pair<role_or_anonymous, resource>,
permission_set,
utils::loading_cache_reload_enabled::yes,
utils::simple_entry_size<permission_set>,
utils::tuple_hash>;
using key_type = typename cache_type::key_type;
cache_type _cache;
public:
explicit permissions_cache(const permissions_cache_config&, service&, logging::logger&);
future<> stop() {
return _cache.stop();
}
future<permission_set> get(const role_or_anonymous&, const resource&);
};
}

296
auth/resource.cc Normal file

@@ -0,0 +1,296 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/resource.hh"
#include <algorithm>
#include <iterator>
#include <unordered_map>
#include <boost/algorithm/string/join.hpp>
#include <boost/algorithm/string/split.hpp>
#include "service/storage_proxy.hh"
namespace auth {
std::ostream& operator<<(std::ostream& os, resource_kind kind) {
switch (kind) {
case resource_kind::data: os << "data"; break;
case resource_kind::role: os << "role"; break;
}
return os;
}
static const std::unordered_map<resource_kind, stdx::string_view> roots{
{resource_kind::data, "data"},
{resource_kind::role, "roles"}};
static const std::unordered_map<resource_kind, std::size_t> max_parts{
{resource_kind::data, 2},
{resource_kind::role, 1}};
static permission_set applicable_permissions(const data_resource_view& dv) {
if (dv.table()) {
return permission_set::of<
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>();
}
return permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>();
}
static permission_set applicable_permissions(const role_resource_view& rv) {
if (rv.role()) {
return permission_set::of<permission::ALTER, permission::DROP, permission::AUTHORIZE>();
}
return permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::AUTHORIZE,
permission::DESCRIBE>();
}
resource::resource(resource_kind kind) : _kind(kind), _parts{sstring(roots.at(kind))} {
}
resource::resource(resource_kind kind, std::vector<sstring> parts) : resource(kind) {
_parts.reserve(parts.size() + 1);
_parts.insert(_parts.end(), std::make_move_iterator(parts.begin()), std::make_move_iterator(parts.end()));
}
resource::resource(data_resource_t, stdx::string_view keyspace)
: resource(resource_kind::data, std::vector<sstring>{sstring(keyspace)}) {
}
resource::resource(data_resource_t, stdx::string_view keyspace, stdx::string_view table)
: resource(resource_kind::data, std::vector<sstring>{sstring(keyspace), sstring(table)}) {
}
resource::resource(role_resource_t, stdx::string_view role)
: resource(resource_kind::role, std::vector<sstring>{sstring(role)}) {
}
sstring resource::name() const {
return boost::algorithm::join(_parts, "/");
}
std::optional<resource> resource::parent() const {
if (_parts.size() == 1) {
return {};
}
resource copy = *this;
copy._parts.pop_back();
return copy;
}
permission_set resource::applicable_permissions() const {
permission_set ps;
switch (_kind) {
case resource_kind::data: ps = ::auth::applicable_permissions(data_resource_view(*this)); break;
case resource_kind::role: ps = ::auth::applicable_permissions(role_resource_view(*this)); break;
}
return ps;
}
bool operator<(const resource& r1, const resource& r2) {
if (r1._kind != r2._kind) {
return r1._kind < r2._kind;
}
return std::lexicographical_compare(
r1._parts.cbegin() + 1,
r1._parts.cend(),
r2._parts.cbegin() + 1,
r2._parts.cend());
}
std::ostream& operator<<(std::ostream& os, const resource& r) {
switch (r.kind()) {
case resource_kind::data: return os << data_resource_view(r);
case resource_kind::role: return os << role_resource_view(r);
}
return os;
}
data_resource_view::data_resource_view(const resource& r) : _resource(r) {
if (r._kind != resource_kind::data) {
throw resource_kind_mismatch(resource_kind::data, r._kind);
}
}
std::optional<stdx::string_view> data_resource_view::keyspace() const {
if (_resource._parts.size() == 1) {
return {};
}
return _resource._parts[1];
}
std::optional<stdx::string_view> data_resource_view::table() const {
if (_resource._parts.size() <= 2) {
return {};
}
return _resource._parts[2];
}
std::ostream& operator<<(std::ostream& os, const data_resource_view& v) {
const auto keyspace = v.keyspace();
const auto table = v.table();
if (!keyspace) {
os << "<all keyspaces>";
} else if (!table) {
os << "<keyspace " << *keyspace << '>';
} else {
os << "<table " << *keyspace << '.' << *table << '>';
}
return os;
}
role_resource_view::role_resource_view(const resource& r) : _resource(r) {
if (r._kind != resource_kind::role) {
throw resource_kind_mismatch(resource_kind::role, r._kind);
}
}
std::optional<stdx::string_view> role_resource_view::role() const {
if (_resource._parts.size() == 1) {
return {};
}
return _resource._parts[1];
}
std::ostream& operator<<(std::ostream& os, const role_resource_view& v) {
const auto role = v.role();
if (!role) {
os << "<all roles>";
} else {
os << "<role " << *role << '>';
}
return os;
}
resource parse_resource(stdx::string_view name) {
static const std::unordered_map<stdx::string_view, resource_kind> reverse_roots = [] {
std::unordered_map<stdx::string_view, resource_kind> result;
for (const auto& pair : roots) {
result.emplace(pair.second, pair.first);
}
return result;
}();
std::vector<sstring> parts;
boost::split(parts, name, [](char ch) { return ch == '/'; });
if (parts.empty()) {
throw invalid_resource_name(name);
}
const auto iter = reverse_roots.find(parts[0]);
if (iter == reverse_roots.end()) {
throw invalid_resource_name(name);
}
const auto kind = iter->second;
parts.erase(parts.begin());
if (parts.size() > max_parts.at(kind)) {
throw invalid_resource_name(name);
}
return resource(kind, std::move(parts));
}
static const resource the_root_data_resource{resource_kind::data};
const resource& root_data_resource() {
return the_root_data_resource;
}
static const resource the_root_role_resource{resource_kind::role};
const resource& root_role_resource() {
return the_root_role_resource;
}
resource_set expand_resource_family(const resource& rr) {
resource r = rr;
resource_set rs;
while (true) {
const auto pr = r.parent();
rs.insert(std::move(r));
if (!pr) {
break;
}
r = std::move(*pr);
}
return rs;
}
}

254
auth/resource.hh Normal file

@@ -0,0 +1,254 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <tuple>
#include <vector>
#include <unordered_set>
#include <seastar/core/print.hh>
#include <seastar/core/sstring.hh>
#include "auth/permission.hh"
#include "seastarx.hh"
#include "stdx.hh"
#include "utils/hash.hh"
namespace auth {
class invalid_resource_name : public std::invalid_argument {
public:
explicit invalid_resource_name(stdx::string_view name)
: std::invalid_argument(sprint("The resource name '%s' is invalid.", name)) {
}
};
enum class resource_kind {
data, role
};
std::ostream& operator<<(std::ostream&, resource_kind);
///
/// Type tag for constructing data resources.
///
struct data_resource_t final {};
///
/// Type tag for constructing role resources.
///
struct role_resource_t final {};
///
/// Resources are entities that users can be granted permissions on.
///
/// There are data (keyspaces and tables) and role resources. There may be other kinds of resources in the future.
///
/// When they are stored as system metadata, resources have the form `root/part_0/part_1/.../part_n`. Each kind of
/// resource has a specific root prefix, followed by a maximum of `n` parts (where `n` is distinct for each kind of
/// resource as well). In this code, this form is called the "name".
///
/// Since all resources have this same structure, all the different kinds are stored in instances of the same class:
/// \ref resource. When we wish to query a resource for kind-specific data (like the table of a "data" resource), we
/// create a kind-specific "view" of the resource.
///
class resource final {
resource_kind _kind;
std::vector<sstring> _parts;
public:
///
/// A root resource of a particular kind.
///
explicit resource(resource_kind);
resource(data_resource_t, stdx::string_view keyspace);
resource(data_resource_t, stdx::string_view keyspace, stdx::string_view table);
resource(role_resource_t, stdx::string_view role);
resource_kind kind() const noexcept {
return _kind;
}
///
/// A machine-friendly identifier unique to each resource.
///
sstring name() const;
std::optional<resource> parent() const;
permission_set applicable_permissions() const;
private:
resource(resource_kind, std::vector<sstring> parts);
friend class std::hash<resource>;
friend class data_resource_view;
friend class role_resource_view;
friend bool operator<(const resource&, const resource&);
friend bool operator==(const resource&, const resource&);
friend resource parse_resource(stdx::string_view);
};
bool operator<(const resource&, const resource&);
inline bool operator==(const resource& r1, const resource& r2) {
return (r1._kind == r2._kind) && (r1._parts == r2._parts);
}
inline bool operator!=(const resource& r1, const resource& r2) {
return !(r1 == r2);
}
std::ostream& operator<<(std::ostream&, const resource&);
class resource_kind_mismatch : public std::invalid_argument {
public:
explicit resource_kind_mismatch(resource_kind expected, resource_kind actual)
: std::invalid_argument(
sprint("This resource has kind '%s', but was expected to have kind '%s'.", actual, expected)) {
}
};
/// A "data" view of \ref resource.
///
/// If neither `keyspace` nor `table` is present, this is the root resource.
class data_resource_view final {
const resource& _resource;
public:
///
/// \throws \ref resource_kind_mismatch if the argument is not a "data" resource.
///
explicit data_resource_view(const resource& r);
std::optional<stdx::string_view> keyspace() const;
std::optional<stdx::string_view> table() const;
};
std::ostream& operator<<(std::ostream&, const data_resource_view&);
///
/// A "role" view of \ref resource.
///
/// If `role` is not present, this is the root resource.
///
class role_resource_view final {
const resource& _resource;
public:
///
/// \throws \ref resource_kind_mismatch if the argument is not a "role" resource.
///
explicit role_resource_view(const resource&);
std::optional<stdx::string_view> role() const;
};
std::ostream& operator<<(std::ostream&, const role_resource_view&);
///
/// Parse a resource from its name.
///
/// \throws \ref invalid_resource_name when the name is malformed.
///
resource parse_resource(stdx::string_view name);
const resource& root_data_resource();
inline resource make_data_resource(stdx::string_view keyspace) {
return resource(data_resource_t{}, keyspace);
}
inline resource make_data_resource(stdx::string_view keyspace, stdx::string_view table) {
return resource(data_resource_t{}, keyspace, table);
}
const resource& root_role_resource();
inline resource make_role_resource(stdx::string_view role) {
return resource(role_resource_t{}, role);
}
}
namespace std {
template <>
struct hash<auth::resource> {
static size_t hash_data(const auth::data_resource_view& dv) {
return utils::tuple_hash()(std::make_tuple(auth::resource_kind::data, dv.keyspace(), dv.table()));
}
static size_t hash_role(const auth::role_resource_view& rv) {
return utils::tuple_hash()(std::make_tuple(auth::resource_kind::role, rv.role()));
}
size_t operator()(const auth::resource& r) const {
std::size_t value;
switch (r._kind) {
case auth::resource_kind::data: value = hash_data(auth::data_resource_view(r)); break;
case auth::resource_kind::role: value = hash_role(auth::role_resource_view(r)); break;
}
return value;
}
};
}
namespace auth {
using resource_set = std::unordered_set<resource>;
//
// A resource and all of its parents.
//
resource_set expand_resource_family(const resource&);
}
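
The `root/part_0/.../part_n` naming scheme described in the comments of auth/resource.hh can be illustrated with a small standalone sketch. This is not the Scylla code: `std::string` stands in for `sstring`, every name ending in `_sketch` is hypothetical, and `family_names_sketch` returns an ordered vector of names where `expand_resource_family` returns an unordered set of resources. The part limits (2 for `data`, 1 for `roles`) are taken from the `max_parts` table in auth/resource.cc.

```cpp
// Standalone sketch of the "root/part_0/.../part_n" resource naming scheme.
// Assumption: std::string replaces sstring; all *_sketch names are hypothetical.
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class kind_sketch { data, role };

struct resource_sketch {
    kind_sketch kind;
    std::vector<std::string> parts; // parts[0] is always the root prefix

    // Join the parts with '/' to get the stored name.
    std::string name() const {
        std::string out;
        for (std::size_t i = 0; i < parts.size(); ++i) {
            if (i != 0) {
                out += '/';
            }
            out += parts[i];
        }
        return out;
    }

    // The parent drops the last part; a root resource has no parent.
    std::optional<resource_sketch> parent() const {
        assert(!parts.empty());
        if (parts.size() == 1) {
            return std::nullopt;
        }
        resource_sketch copy = *this;
        copy.parts.pop_back();
        return copy;
    }
};

// Split on '/', map the root prefix to a kind, and enforce the per-kind limit
// on the number of parts after the root.
inline resource_sketch parse_resource_sketch(const std::string& name) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t slash = name.find('/', start);
        if (slash == std::string::npos) {
            parts.push_back(name.substr(start));
            break;
        }
        parts.push_back(name.substr(start, slash - start));
        start = slash + 1;
    }

    kind_sketch kind;
    std::size_t max_parts;
    if (parts[0] == "data") {
        kind = kind_sketch::data;
        max_parts = 2; // keyspace and table
    } else if (parts[0] == "roles") {
        kind = kind_sketch::role;
        max_parts = 1; // role name
    } else {
        throw std::invalid_argument("The resource name '" + name + "' is invalid.");
    }
    if (parts.size() - 1 > max_parts) {
        throw std::invalid_argument("The resource name '" + name + "' is invalid.");
    }
    return resource_sketch{kind, std::move(parts)};
}

// Names of a resource and all of its parents, most specific first.
inline std::vector<std::string> family_names_sketch(const resource_sketch& r) {
    std::vector<std::string> names;
    std::optional<resource_sketch> cur = r;
    while (cur) {
        names.push_back(cur->name());
        cur = cur->parent();
    }
    return names;
}
```

With this sketch, parsing `data/ks/tbl` and expanding its family visits the table, then the keyspace, then the root data resource, which is the chain a permission lookup would need to consult if grants on a parent are meant to apply to its children.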

169
auth/role_manager.hh Normal file

@@ -0,0 +1,169 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <memory>
#include <optional>
#include <stdexcept>
#include <unordered_set>
#include <seastar/core/future.hh>
#include <seastar/core/print.hh>
#include <seastar/core/sstring.hh>
#include "auth/resource.hh"
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
struct role_config final {
bool is_superuser{false};
bool can_login{false};
};
///
/// Differential update for altering existing roles.
///
struct role_config_update final {
std::optional<bool> is_superuser{};
std::optional<bool> can_login{};
};
///
/// A logical argument error for a role-management operation.
///
class roles_argument_exception : public std::invalid_argument {
public:
using std::invalid_argument::invalid_argument;
};
class role_already_exists : public roles_argument_exception {
public:
explicit role_already_exists(stdx::string_view role_name)
: roles_argument_exception(sprint("Role %s already exists.", role_name)) {
}
};
class nonexistant_role : public roles_argument_exception {
public:
explicit nonexistant_role(stdx::string_view role_name)
: roles_argument_exception(sprint("Role %s doesn't exist.", role_name)) {
}
};
class role_already_included : public roles_argument_exception {
public:
role_already_included(stdx::string_view grantee_name, stdx::string_view role_name)
: roles_argument_exception(
sprint("%s already includes role %s.", grantee_name, role_name)) {
}
};
class revoke_ungranted_role : public roles_argument_exception {
public:
revoke_ungranted_role(stdx::string_view revokee_name, stdx::string_view role_name)
: roles_argument_exception(
sprint("%s was not granted role %s, so it cannot be revoked.", revokee_name, role_name)) {
}
};
using role_set = std::unordered_set<sstring>;
enum class recursive_role_query { yes, no };
///
/// Abstract client for managing roles.
///
/// All state necessary for managing roles is stored externally to the client instance.
///
/// All implementations should throw role-related exceptions as documented. Authorization is not addressed here, and
/// access-control should never be enforced in implementations.
///
class role_manager {
public:
virtual ~role_manager() = default;
virtual stdx::string_view qualified_java_name() const noexcept = 0;
virtual const resource_set& protected_resources() const = 0;
virtual future<> start() = 0;
virtual future<> stop() = 0;
///
/// \returns an exceptional future with \ref role_already_exists for a role that has previously been created.
///
virtual future<> create(stdx::string_view role_name, const role_config&) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<> drop(stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<> alter(stdx::string_view role_name, const role_config_update&) const = 0;
///
/// Grant `role_name` to `grantee_name`.
///
/// \returns an exceptional future with \ref nonexistant_role if either the role or the grantee does not exist.
///
/// \returns an exceptional future with \ref role_already_included if granting the role would be redundant, or
/// create a cycle.
///
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) const = 0;
///
/// Revoke `role_name` from `revokee_name`.
///
/// \returns an exceptional future with \ref nonexistant_role if either the role or the revokee does not exist.
///
/// \returns an exceptional future with \ref revoke_ungranted_role if the role was not granted.
///
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<role_set> query_granted(stdx::string_view grantee, recursive_role_query) const = 0;
virtual future<role_set> query_all() const = 0;
virtual future<bool> exists(stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<bool> is_superuser(stdx::string_view role_name) const = 0;
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
virtual future<bool> can_login(stdx::string_view role_name) const = 0;
};
}
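
The `grant` contract documented in auth/role_manager.hh (fail if either role is missing, fail if the grant would be redundant or create a cycle) can be sketched with a synchronous, in-memory stand-in. This is only an illustration under stated assumptions: the real interface returns futures and keeps its state outside the client instance, and the names `role_graph_sketch`, `nonexistent_role_sketch` and `role_already_included_sketch` are hypothetical. Treating a transitively included role as redundant is also an assumption of this sketch, not something the header states.

```cpp
// In-memory sketch of the documented grant() checks.
// Assumption: synchronous calls and std::string replace futures and sstring.
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

struct nonexistent_role_sketch : std::invalid_argument {
    using std::invalid_argument::invalid_argument;
};

struct role_already_included_sketch : std::invalid_argument {
    using std::invalid_argument::invalid_argument;
};

class role_graph_sketch {
    // Maps each role to the roles granted directly to it.
    std::map<std::string, std::set<std::string>> _member_of;

    // Collect every role reachable from `role` through grants.
    void collect(const std::string& role, std::set<std::string>& out) const {
        const auto it = _member_of.find(role);
        if (it == _member_of.end()) {
            return;
        }
        for (const auto& granted : it->second) {
            if (out.insert(granted).second) {
                collect(granted, out);
            }
        }
    }

    void require_exists(const std::string& role) const {
        if (_member_of.count(role) == 0) {
            throw nonexistent_role_sketch("Role " + role + " doesn't exist.");
        }
    }

public:
    // Creating an existing role is a no-op in this sketch.
    void create(const std::string& role) {
        _member_of.emplace(role, std::set<std::string>{});
    }

    std::set<std::string> query_granted(const std::string& grantee, bool recursive) const {
        require_exists(grantee);
        if (!recursive) {
            return _member_of.at(grantee);
        }
        std::set<std::string> out;
        collect(grantee, out);
        return out;
    }

    void grant(const std::string& grantee, const std::string& role) {
        require_exists(grantee);
        require_exists(role);
        // Redundant: the grantee already includes the role, directly or transitively.
        const bool redundant = query_granted(grantee, true).count(role) != 0;
        // Cycle: the role is the grantee itself, or already includes the grantee.
        const bool cycle = (grantee == role) || query_granted(role, true).count(grantee) != 0;
        if (redundant || cycle) {
            throw role_already_included_sketch(grantee + " already includes role " + role + ".");
        }
        _member_of[grantee].insert(role);
        assert(_member_of.at(grantee).count(role) == 1);
    }
};
```

In this sketch, after granting `b` to `a` and `c` to `b`, a recursive query on `a` returns both `b` and `c`, and an attempt to grant `a` to `c` is rejected because it would close a cycle.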

41
auth/role_or_anonymous.cc Normal file

@@ -0,0 +1,41 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/role_or_anonymous.hh"
#include <iostream>
namespace auth {
std::ostream& operator<<(std::ostream& os, const role_or_anonymous& mr) {
os << mr.name.value_or("<anonymous>");
return os;
}
bool operator==(const role_or_anonymous& mr1, const role_or_anonymous& mr2) noexcept {
return mr1.name == mr2.name;
}
bool is_anonymous(const role_or_anonymous& mr) noexcept {
return !mr.name.has_value();
}
}

66
auth/role_or_anonymous.hh Normal file

@@ -0,0 +1,66 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <functional>
#include <iosfwd>
#include <optional>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include "stdx.hh"
namespace auth {
class role_or_anonymous final {
public:
std::optional<sstring> name{};
role_or_anonymous() = default;
role_or_anonymous(stdx::string_view name) : name(name) {
}
};
std::ostream& operator<<(std::ostream&, const role_or_anonymous&);
bool operator==(const role_or_anonymous&, const role_or_anonymous&) noexcept;
inline bool operator!=(const role_or_anonymous& mr1, const role_or_anonymous& mr2) noexcept {
return !(mr1 == mr2);
}
bool is_anonymous(const role_or_anonymous&) noexcept;
}
namespace std {
template <>
struct hash<auth::role_or_anonymous> {
size_t operator()(const auth::role_or_anonymous& mr) const {
return hash<std::optional<sstring>>()(mr.name);
}
};
}

119
auth/roles-metadata.cc Normal file

@@ -0,0 +1,119 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/roles-metadata.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
#include <seastar/core/print.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
namespace auth {
namespace meta {
namespace roles_table {
stdx::string_view creation_query() {
static const sstring instance = sprint(
"CREATE TABLE %s ("
" %s text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
qualified_name(),
role_col_name);
return instance;
}
stdx::string_view qualified_name() noexcept {
static const sstring instance = AUTH_KS + "." + sstring(name);
return instance;
}
}
}
future<bool> default_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
static const sstring query = sprint(
"SELECT * FROM %s WHERE %s = ?",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
query,
db::consistency_level::ONE,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.process(
query,
db::consistency_level::QUORUM,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return make_ready_future<bool>(false);
}
return make_ready_future<bool>(p(results->one()));
});
}
return make_ready_future<bool>(p(results->one()));
});
});
}
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
static const sstring query = sprint("SELECT * FROM %s", meta::roles_table::qualified_name());
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
query,
db::consistency_level::QUORUM).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}
static const sstring col_name = sstring(meta::roles_table::role_col_name);
return boost::algorithm::any_of(*results, [&p](const cql3::untyped_result_set_row& row) {
const bool is_nondefault = row.get_as<sstring>(col_name) != meta::DEFAULT_SUPERUSER_NAME;
return is_nondefault && p(row);
});
});
});
}
}
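
`default_role_row_satisfies` first reads at consistency level `ONE` and only retries at `QUORUM` when the cheap read finds nothing. The sketch below shows that control flow in plain synchronous C++ under stated assumptions: `lookup_fn`, `row`, and `demo_fallback_calls` are hypothetical stand-ins, and the real code chains Seastar futures instead of returning values directly.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

namespace sketch {

enum class consistency { one, quorum };

// Hypothetical, simplified row of the roles table.
struct row {
    std::string role;
    bool is_superuser;
};

// Hypothetical synchronous lookup of the default role's row at a given consistency level.
using lookup_fn = std::function<std::optional<row>(consistency)>;

// Try the cheap read first; fall back to a quorum read only if it finds no row.
// Returns false when the row is absent at both levels.
inline bool default_role_row_satisfies(
        const lookup_fn& lookup,
        const std::function<bool(const row&)>& p) {
    std::optional<row> r = lookup(consistency::one);
    if (!r) {
        r = lookup(consistency::quorum);
    }
    return r ? p(*r) : false;
}

// Simulates a replica that misses the row at ONE but finds it at QUORUM,
// and reports how many lookups were issued.
inline int demo_fallback_calls() {
    int calls = 0;
    const lookup_fn lookup = [&calls](consistency cl) -> std::optional<row> {
        ++calls;
        if (cl == consistency::one) {
            return std::nullopt;
        }
        return row{"cassandra", true};
    };
    const bool ok = default_role_row_satisfies(lookup, [](const row& r) { return r.is_superuser; });
    assert(ok);
    return calls;
}

} // namespace sketch
```

The second read is only paid for when the first one comes back empty, which is the common-case saving this pattern aims for.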

auth/roles-metadata.hh

@@ -0,0 +1,69 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <functional>
#include <seastar/core/future.hh>
#include "seastarx.hh"
#include "stdx.hh"
namespace cql3 {
class query_processor;
class untyped_result_set_row;
}
namespace auth {
namespace meta {
namespace roles_table {
stdx::string_view creation_query();
constexpr stdx::string_view name{"roles", 5};
stdx::string_view qualified_name() noexcept;
constexpr stdx::string_view role_col_name{"role", 4};
}
}
///
/// Check that the default role satisfies a predicate. `false` if the default role does not exist.
///
future<bool> default_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>);
///
/// Check that any nondefault role satisfies a predicate. `false` if no nondefault roles exist.
///
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>);
}

auth/service.cc

@@ -0,0 +1,580 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/service.hh"
#include <algorithm>
#include <map>
#include <seastar/core/future-util.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include "auth/allow_all_authenticator.hh"
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "auth/password_authenticator.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/standard_role_manager.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/config.hh"
#include "db/consistency_level.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "service/migration_listener.hh"
#include "utils/class_registrator.hh"
namespace auth {
namespace meta {
static const sstring user_name_col_name("name");
static const sstring superuser_col_name("super");
}
static logging::logger log("auth_service");
class auth_migration_listener final : public ::service::migration_listener {
authorizer& _authorizer;
public:
explicit auth_migration_listener(authorizer& a) : _authorizer(a) {
}
private:
void on_create_keyspace(const sstring& ks_name) override {}
void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}
void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_create_function(const sstring& ks_name, const sstring& function_name) override {}
void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_create_view(const sstring& ks_name, const sstring& view_name) override {}
void on_update_keyspace(const sstring& ks_name) override {}
void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}
void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
_authorizer.revoke_all(
auth::make_data_resource(ks_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
_authorizer.revoke_all(
auth::make_data_resource(
ks_name, cf_name)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
static future<> validate_role_exists(const service& ser, stdx::string_view role_name) {
return ser.underlying_role_manager().exists(role_name).then([role_name](bool exists) {
if (!exists) {
throw nonexistant_role(role_name);
}
});
}
service_config service_config::from_db_config(const db::config& dc) {
const qualified_name qualified_authorizer_name(meta::AUTH_PACKAGE_NAME, dc.authorizer());
const qualified_name qualified_authenticator_name(meta::AUTH_PACKAGE_NAME, dc.authenticator());
const qualified_name qualified_role_manager_name(meta::AUTH_PACKAGE_NAME, dc.role_manager());
service_config c;
c.authorizer_java_name = qualified_authorizer_name;
c.authenticator_java_name = qualified_authenticator_name;
c.role_manager_java_name = qualified_role_manager_name;
return c;
}
service::service(
permissions_cache_config c,
cql3::query_processor& qp,
::service::migration_manager& mm,
std::unique_ptr<authorizer> z,
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r)
: _permissions_cache_config(std::move(c))
, _permissions_cache(nullptr)
, _qp(qp)
, _migration_manager(mm)
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer)) {
// The password authenticator requires that the `standard_role_manager` is running so that the roles metadata table
// it manages is created and updated. This cross-module dependency is rather gross, but we have to maintain it for
// the sake of compatibility with Apache Cassandra and its choice of auth. schema.
if ((_authenticator->qualified_java_name() == password_authenticator_name())
&& (_role_manager->qualified_java_name() != standard_role_manager_name())) {
throw incompatible_module_combination(
sprint(
"The %s authenticator must be loaded alongside the %s role-manager.",
password_authenticator_name(),
standard_role_manager_name()));
}
}
service::service(
permissions_cache_config c,
cql3::query_processor& qp,
::service::migration_manager& mm,
const service_config& sc)
: service(
std::move(c),
qp,
mm,
create_object<authorizer>(sc.authorizer_java_name, qp, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, mm),
create_object<role_manager>(sc.role_manager_java_name, qp, mm)) {
}
future<> service::create_keyspace_if_missing() const {
auto& db = _qp.db().local();
if (!db.has_keyspace(meta::AUTH_KS)) {
std::map<sstring, sstring> opts{{"replication_factor", "1"}};
auto ksm = keyspace_metadata::new_keyspace(
meta::AUTH_KS,
"org.apache.cassandra.locator.SimpleStrategy",
opts,
true);
// We use min_timestamp so that default keyspace metadata will lose to any manual adjustments.
// See issue #2129.
return _migration_manager.announce_new_keyspace(ksm, api::min_timestamp, false);
}
return make_ready_future<>();
}
future<> service::start() {
return once_among_shards([this] {
return create_keyspace_if_missing();
}).then([this] {
return when_all_succeed(_role_manager->start(), _authorizer->start(), _authenticator->start());
}).then([this] {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
}).then([this] {
return once_among_shards([this] {
_migration_manager.register_listener(_migration_listener.get());
return make_ready_future<>();
});
});
}
future<> service::stop() {
return _permissions_cache->stop().then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop());
});
}
future<bool> service::has_existing_legacy_users() const {
if (!_qp.db().local().has_schema(meta::AUTH_KS, meta::USERS_CF)) {
return make_ready_future<bool>(false);
}
static const sstring default_user_query = sprint(
"SELECT * FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name);
static const sstring all_users_query = sprint(
"SELECT * FROM %s.%s LIMIT 1",
meta::AUTH_KS,
meta::USERS_CF);
// This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we
// can potentially avoid doing a range query with a high consistency level.
return _qp.process(
default_user_query,
db::consistency_level::ONE,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
default_user_query,
db::consistency_level::QUORUM,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
});
}
future<permission_set>
service::get_uncached_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
if (is_anonymous(maybe_role)) {
return _authorizer->authorize(maybe_role, r);
}
const stdx::string_view role_name = *maybe_role.name;
return has_superuser(role_name).then([this, role_name, &r](bool superuser) {
if (superuser) {
return make_ready_future<permission_set>(r.applicable_permissions());
}
//
// Aggregate the permissions from all granted roles.
//
return do_with(permission_set(), [this, role_name, &r](auto& all_perms) {
return get_roles(role_name).then([this, &r, &all_perms](role_set all_roles) {
return do_with(std::move(all_roles), [this, &r, &all_perms](const auto& all_roles) {
return parallel_for_each(all_roles, [this, &r, &all_perms](stdx::string_view role_name) {
return _authorizer->authorize(role_name, r).then([&all_perms](permission_set perms) {
all_perms = permission_set::from_mask(all_perms.mask() | perms.mask());
});
});
});
}).then([&all_perms] {
return all_perms;
});
});
});
}
future<permission_set> service::get_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
return _permissions_cache->get(maybe_role, r);
}
future<bool> service::has_superuser(stdx::string_view role_name) const {
return this->get_roles(std::move(role_name)).then([this](role_set roles) {
return do_with(std::move(roles), [this](const role_set& roles) {
return do_with(false, roles.begin(), [this, &roles](bool& any_super, auto& iter) {
return do_until(
[&roles, &any_super, &iter] { return any_super || (iter == roles.end()); },
[this, &any_super, &iter] {
return _role_manager->is_superuser(*iter++).then([&any_super](bool super) {
any_super = super;
});
}).then([&any_super] {
return any_super;
});
});
});
});
}
future<role_set> service::get_roles(stdx::string_view role_name) const {
//
// We may wish to cache this information in the future (as Apache Cassandra does).
//
return _role_manager->query_granted(role_name, recursive_role_query::yes);
}
future<bool> service::exists(const resource& r) const {
switch (r.kind()) {
case resource_kind::data: {
const auto& db = _qp.db().local();
data_resource_view v(r);
const auto keyspace = v.keyspace();
const auto table = v.table();
if (table) {
return make_ready_future<bool>(db.has_schema(sstring(*keyspace), sstring(*table)));
}
if (keyspace) {
return make_ready_future<bool>(db.has_keyspace(sstring(*keyspace)));
}
return make_ready_future<bool>(true);
}
case resource_kind::role: {
role_resource_view v(r);
const auto role = v.role();
if (role) {
return _role_manager->exists(*role);
}
return make_ready_future<bool>(true);
}
}
return make_ready_future<bool>(false);
}
//
// Free functions.
//
future<bool> has_superuser(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<bool>(false);
}
return ser.has_superuser(*u.name);
}
future<role_set> get_roles(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<role_set>();
}
return ser.get_roles(*u.name);
}
future<permission_set> get_permissions(const service& ser, const authenticated_user& u, const resource& r) {
return do_with(role_or_anonymous(), [&ser, &u, &r](auto& maybe_role) {
maybe_role.name = u.name;
return ser.get_permissions(maybe_role, r);
});
}
bool is_enforcing(const service& ser) {
const bool enforcing_authorizer = ser.underlying_authorizer().qualified_java_name() != allow_all_authorizer_name();
const bool enforcing_authenticator = ser.underlying_authenticator().qualified_java_name()
!= allow_all_authenticator_name();
return enforcing_authorizer || enforcing_authenticator;
}
bool is_protected(const service& ser, const resource& r) noexcept {
return ser.underlying_role_manager().protected_resources().count(r)
|| ser.underlying_authenticator().protected_resources().count(r)
|| ser.underlying_authorizer().protected_resources().count(r);
}
static void validate_authentication_options_are_supported(
const authentication_options& options,
const authentication_option_set& supported) {
const auto check = [&supported](authentication_option k) {
if (supported.count(k) == 0) {
throw unsupported_authentication_option(k);
}
};
if (options.password) {
check(authentication_option::password);
}
if (options.options) {
check(authentication_option::options);
}
}
future<> create_role(
const service& ser,
stdx::string_view name,
const role_config& config,
const authentication_options& options) {
return ser.underlying_role_manager().create(name, config).then([&ser, name, &options] {
if (!auth::any_authentication_options(options)) {
return make_ready_future<>();
}
return futurize_apply(
&validate_authentication_options_are_supported,
options,
ser.underlying_authenticator().supported_options()).then([&ser, name, &options] {
return ser.underlying_authenticator().create(name, options);
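}).handle_exception([&ser, name](std::exception_ptr ep) {
// Roll-back. `name` is captured by value: a reference to the enclosing lambda's copy could dangle
// once that lambda has returned.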
return ser.underlying_role_manager().drop(name).then([ep = std::move(ep)] {
std::rethrow_exception(ep);
});
});
});
}
future<> alter_role(
const service& ser,
stdx::string_view name,
const role_config_update& config_update,
const authentication_options& options) {
return ser.underlying_role_manager().alter(name, config_update).then([&ser, name, &options] {
if (!any_authentication_options(options)) {
return make_ready_future<>();
}
return futurize_apply(
&validate_authentication_options_are_supported,
options,
ser.underlying_authenticator().supported_options()).then([&ser, name, &options] {
return ser.underlying_authenticator().alter(name, options);
});
});
}
future<> drop_role(const service& ser, stdx::string_view name) {
return do_with(make_role_resource(name), [&ser, name](const resource& r) {
auto& a = ser.underlying_authorizer();
return when_all_succeed(
a.revoke_all(name),
a.revoke_all(r)).handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}).then([&ser, name] {
return ser.underlying_authenticator().drop(name);
}).then([&ser, name] {
return ser.underlying_role_manager().drop(name);
});
}
future<bool> has_role(const service& ser, stdx::string_view grantee, stdx::string_view name) {
return when_all_succeed(
validate_role_exists(ser, name),
ser.get_roles(grantee)).then([name](role_set all_roles) {
return make_ready_future<bool>(all_roles.count(sstring(name)) != 0);
});
}
future<bool> has_role(const service& ser, const authenticated_user& u, stdx::string_view name) {
if (is_anonymous(u)) {
return make_ready_future<bool>(false);
}
return has_role(ser, *u.name, name);
}
future<> grant_permissions(
const service& ser,
stdx::string_view role_name,
permission_set perms,
const resource& r) {
return validate_role_exists(ser, role_name).then([&ser, role_name, perms, &r] {
return ser.underlying_authorizer().grant(role_name, perms, r);
});
}
future<> grant_applicable_permissions(const service& ser, stdx::string_view role_name, const resource& r) {
return grant_permissions(ser, role_name, r.applicable_permissions(), r);
}
future<> grant_applicable_permissions(const service& ser, const authenticated_user& u, const resource& r) {
if (is_anonymous(u)) {
return make_ready_future<>();
}
return grant_applicable_permissions(ser, *u.name, r);
}
future<> revoke_permissions(
const service& ser,
stdx::string_view role_name,
permission_set perms,
const resource& r) {
return validate_role_exists(ser, role_name).then([&ser, role_name, perms, &r] {
return ser.underlying_authorizer().revoke(role_name, perms, r);
});
}
future<std::vector<permission_details>> list_filtered_permissions(
const service& ser,
permission_set perms,
std::optional<stdx::string_view> role_name,
const std::optional<std::pair<resource, recursive_permissions>>& resource_filter) {
return ser.underlying_authorizer().list_all().then([&ser, perms, role_name, &resource_filter](
std::vector<permission_details> all_details) {
if (resource_filter) {
const resource r = resource_filter->first;
const auto resources = resource_filter->second
? auth::expand_resource_family(r)
: auth::resource_set{r};
all_details.erase(
std::remove_if(
all_details.begin(),
all_details.end(),
[&resources](const permission_details& pd) {
return resources.count(pd.resource) == 0;
}),
all_details.end());
}
std::transform(
std::make_move_iterator(all_details.begin()),
std::make_move_iterator(all_details.end()),
all_details.begin(),
[perms](permission_details pd) {
pd.permissions = permission_set::from_mask(pd.permissions.mask() & perms.mask());
return pd;
});
// Eliminate rows with an empty permission set.
all_details.erase(
std::remove_if(all_details.begin(), all_details.end(), [](const permission_details& pd) {
return pd.permissions.mask() == 0;
}),
all_details.end());
if (!role_name) {
return make_ready_future<std::vector<permission_details>>(std::move(all_details));
}
//
// Filter out rows based on whether permissions have been granted to this role (directly or indirectly).
//
return do_with(std::move(all_details), [&ser, role_name](auto& all_details) {
return ser.get_roles(*role_name).then([&all_details](role_set all_roles) {
all_details.erase(
std::remove_if(
all_details.begin(),
all_details.end(),
[&all_roles](const permission_details& pd) {
return all_roles.count(pd.role_name) == 0;
}),
all_details.end());
return make_ready_future<std::vector<permission_details>>(std::move(all_details));
});
});
});
}
}
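
`get_uncached_permissions` grants every applicable permission as soon as any granted role is a superuser; otherwise it ORs together the permission masks of all granted roles. The following standalone sketch illustrates that aggregation only. The permission bits, `role_info`, and the `demo_*` helpers are hypothetical, and the real code resolves roles and permissions asynchronously through the role manager and the authorizer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

namespace sketch {

using mask = std::uint32_t;

// Hypothetical permission bits for illustration.
constexpr mask SELECT = 1u << 0;
constexpr mask MODIFY = 1u << 1;
constexpr mask ALTER = 1u << 2;
constexpr mask ALL = SELECT | MODIFY | ALTER;

struct role_info {
    bool is_superuser;
    mask granted;
};

// Effective permissions of a role, given the set of roles granted to it (including itself).
inline mask effective_permissions(
        const std::map<std::string, role_info>& roles,
        const std::set<std::string>& granted_roles) {
    // Any superuser among the granted roles short-circuits to all permissions.
    for (const auto& name : granted_roles) {
        const auto it = roles.find(name);
        if (it != roles.end() && it->second.is_superuser) {
            return ALL;
        }
    }
    // Otherwise take the union of the permission masks of every granted role.
    mask all = 0;
    for (const auto& name : granted_roles) {
        const auto it = roles.find(name);
        if (it != roles.end()) {
            all |= it->second.granted;
        }
    }
    return all;
}

inline std::map<std::string, role_info> sample_roles() {
    return {
        {"reader", {false, SELECT}},
        {"writer", {false, MODIFY}},
        {"admin", {true, 0}},
    };
}

// A role granted "reader" and "writer" ends up with the union of both masks.
inline mask demo_regular_user() {
    return effective_permissions(sample_roles(), {"reader", "writer"});
}

// A role granted "admin" inherits superuser status and therefore everything.
inline mask demo_superuser() {
    return effective_permissions(sample_roles(), {"reader", "admin"});
}

} // namespace sketch
```

Since the union is commutative, the order in which roles are visited does not matter, which is why the original can run the per-role queries in parallel.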

auth/service.hh

@@ -0,0 +1,296 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <experimental/string_view>
#include <memory>
#include <optional>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/permission.hh"
#include "auth/permissions_cache.hh"
#include "auth/role_manager.hh"
#include "seastarx.hh"
#include "stdx.hh"
namespace cql3 {
class query_processor;
}
namespace db {
class config;
}
namespace service {
class migration_manager;
class migration_listener;
}
namespace auth {
class role_or_anonymous;
struct service_config final {
static service_config from_db_config(const db::config&);
sstring authorizer_java_name;
sstring authenticator_java_name;
sstring role_manager_java_name;
};
///
/// Due to poor (in this author's opinion) decisions of Apache Cassandra, certain choices of one role-manager,
/// authenticator, or authorizer imply restrictions on the rest.
///
/// This exception is thrown when an invalid combination of modules is selected, with a message explaining the
/// incompatibility.
///
class incompatible_module_combination : public std::invalid_argument {
public:
using std::invalid_argument::invalid_argument;
};
///
/// Client for access-control in the system.
///
/// Access control encompasses user/role management, authentication, and authorization. This client provides access to
/// the dynamically-loaded implementations of these modules (through the `underlying_*` member functions), but also
/// builds on their functionality with caching and abstractions for common operations.
///
/// All state associated with access-control is stored externally to any particular instance of this class.
///
class service final {
permissions_cache_config _permissions_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
std::unique_ptr<authorizer> _authorizer;
std::unique_ptr<authenticator> _authenticator;
std::unique_ptr<role_manager> _role_manager;
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
public:
service(
permissions_cache_config,
cql3::query_processor&,
::service::migration_manager&,
std::unique_ptr<authorizer>,
std::unique_ptr<authenticator>,
std::unique_ptr<role_manager>);
///
/// This constructor is intended to be used when the class is sharded via \ref seastar::sharded. In that case, the
/// arguments must be copyable, which is why we delay construction with instance-construction instructions instead
/// of the instances themselves.
///
service(
permissions_cache_config,
cql3::query_processor&,
::service::migration_manager&,
const service_config&);
future<> start();
future<> stop();
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
future<permission_set> get_permissions(const role_or_anonymous&, const resource&) const;
///
/// Like \ref get_permissions, but never returns cached permissions.
///
future<permission_set> get_uncached_permissions(const role_or_anonymous&, const resource&) const;
///
/// Query whether the named role has been granted a role that is a superuser.
///
/// A role is always granted to itself. Therefore, a role that "is" a superuser also "has" superuser.
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
///
future<bool> has_superuser(stdx::string_view role_name) const;
///
/// Return the set of all roles granted to the given role, including itself and roles granted through other roles.
///
/// \returns an exceptional future with \ref nonexistant_role if the role does not exist.
future<role_set> get_roles(stdx::string_view role_name) const;
future<bool> exists(const resource&) const;
const authenticator& underlying_authenticator() const {
return *_authenticator;
}
const authorizer& underlying_authorizer() const {
return *_authorizer;
}
const role_manager& underlying_role_manager() const {
return *_role_manager;
}
private:
future<bool> has_existing_legacy_users() const;
future<> create_keyspace_if_missing() const;
};
future<bool> has_superuser(const service&, const authenticated_user&);
future<role_set> get_roles(const service&, const authenticated_user&);
future<permission_set> get_permissions(const service&, const authenticated_user&, const resource&);
///
/// Access-control is "enforcing" when either the authenticator or the authorizer is not its "allow-all" variant.
///
/// Put differently, when access control is not enforcing, all operations on resources will be allowed and users do not
/// need to authenticate themselves.
///
bool is_enforcing(const service&);
///
/// Protected resources cannot be modified even if the performer has permissions to do so.
///
bool is_protected(const service&, const resource&) noexcept;
///
/// Create a role with optional authentication information.
///
/// \returns an exceptional future with \ref role_already_exists if the user or role exists.
///
/// \returns an exceptional future with \ref unsupported_authentication_option if an unsupported option is included.
///
future<> create_role(
const service&,
stdx::string_view name,
const role_config&,
const authentication_options&);
///
/// Alter an existing role and its authentication information.
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authentication_option if an unsupported option is included.
///
future<> alter_role(
const service&,
stdx::string_view name,
const role_config_update&,
const authentication_options&);
///
/// Drop a role from the system, including all permissions and authentication information.
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
future<> drop_role(const service&, stdx::string_view name);
///
/// Check if `grantee` has been granted the named role.
///
/// \returns an exceptional future with \ref nonexistant_role if `grantee` or `name` does not exist.
///
future<bool> has_role(const service&, stdx::string_view grantee, stdx::string_view name);
///
/// Check if the authenticated user has been granted the named role.
///
/// \returns an exceptional future with \ref nonexistant_role if the user or `name` does not exist.
///
future<bool> has_role(const service&, const authenticated_user&, stdx::string_view name);
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if granting permissions is not
/// supported.
///
future<> grant_permissions(
const service&,
stdx::string_view role_name,
permission_set,
const resource&);
///
/// Like \ref grant_permissions, but grants all applicable permissions on the resource.
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if granting permissions is not
/// supported.
///
future<> grant_applicable_permissions(const service&, stdx::string_view role_name, const resource&);
future<> grant_applicable_permissions(const service&, const authenticated_user&, const resource&);
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if revoking permissions is not
/// supported.
///
future<> revoke_permissions(
const service&,
stdx::string_view role_name,
permission_set,
const resource&);
using recursive_permissions = bool_class<struct recursive_permissions_tag>;
///
/// Query for all granted permissions according to filtering criteria.
///
/// Only permissions included in the provided set are included.
///
/// If a role name is provided, only permissions granted (directly or recursively) to the role are included.
///
/// If a resource filter is provided, only permissions granted on the resource are included. When \ref
/// recursive_permissions is `true`, permissions on a parent resource are included.
///
/// \returns an exceptional future with \ref nonexistant_role if a role name is included which refers to a role that
/// does not exist.
///
/// \returns an exceptional future with \ref unsupported_authorization_operation if listing permissions is not
/// supported.
///
future<std::vector<permission_details>> list_filtered_permissions(
const service&,
permission_set,
std::optional<stdx::string_view> role_name,
const std::optional<std::pair<resource, recursive_permissions>>& resource_filter);
}
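
The filtering rules documented for `list_filtered_permissions` can be illustrated with a small standalone sketch. It is an approximation under stated assumptions: permissions are plain bit masks, resources are plain strings, and the caller passes in the already-expanded sets of acceptable roles and resources (the real function derives them from the role manager and from `expand_resource_family`). The names `filter_permissions`, `sample`, and the `demo_*` helpers are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <vector>

namespace sketch {

struct permission_details {
    std::string role_name;
    std::string resource;
    std::uint32_t permissions;
};

// Keep only the wanted permission bits, then drop rows that are empty
// or that fall outside the optional role and resource filters.
inline std::vector<permission_details> filter_permissions(
        std::vector<permission_details> all,
        std::uint32_t wanted,
        const std::optional<std::set<std::string>>& roles,
        const std::optional<std::set<std::string>>& resources) {
    for (auto& pd : all) {
        pd.permissions &= wanted;
    }
    all.erase(
            std::remove_if(all.begin(), all.end(), [&](const permission_details& pd) {
                if (pd.permissions == 0) {
                    return true;
                }
                if (resources && resources->count(pd.resource) == 0) {
                    return true;
                }
                if (roles && roles->count(pd.role_name) == 0) {
                    return true;
                }
                return false;
            }),
            all.end());
    return all;
}

inline std::vector<permission_details> sample() {
    return {
        {"alice", "data/ks", 0b011},
        {"bob", "data/ks", 0b100},
        {"alice", "data/ks/t", 0b001},
    };
}

// Without role or resource filters, only rows holding a wanted bit survive.
inline std::size_t demo_unfiltered() {
    return filter_permissions(sample(), 0b001, std::nullopt, std::nullopt).size();
}

// With both filters, only alice's grant on "data/ks" remains.
inline std::size_t demo_filtered() {
    const std::set<std::string> roles{"alice"};
    const std::set<std::string> resources{"data/ks"};
    return filter_permissions(sample(), 0b001, roles, resources).size();
}

} // namespace sketch
```

Masking before filtering means a row whose only permissions fall outside the requested set disappears entirely rather than being reported with an empty permission set.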


@@ -0,0 +1,542 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/standard_role_manager.hh"
#include <experimental/optional>
#include <unordered_set>
#include <vector>
#include <boost/algorithm/string/join.hpp>
#include <seastar/core/future-util.hh>
#include <seastar/core/print.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/thread.hh>
#include "auth/common.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "utils/class_registrator.hh"
namespace auth {
namespace meta {
namespace role_members_table {
constexpr stdx::string_view name{"role_members" , 12};
static stdx::string_view qualified_name() noexcept {
static const sstring instance = AUTH_KS + "." + sstring(name);
return instance;
}
}
}
static logging::logger log("standard_role_manager");
static const class_registrator<
role_manager,
standard_role_manager,
cql3::query_processor&,
::service::migration_manager&> registration("org.apache.cassandra.auth.CassandraRoleManager");
struct record final {
sstring name;
bool is_superuser;
bool can_login;
role_set member_of;
};
static db::consistency_level consistency_for_role(stdx::string_view role_name) noexcept {
if (role_name == meta::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
static future<stdx::optional<record>> find_record(cql3::query_processor& qp, stdx::string_view role_name) {
static const sstring query = sprint(
"SELECT * FROM %s WHERE %s = ?",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return qp.process(
query,
consistency_for_role(role_name),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return stdx::optional<record>();
}
const cql3::untyped_result_set_row& row = results->one();
return stdx::make_optional(
record{
row.get_as<sstring>(sstring(meta::roles_table::role_col_name)),
row.get_as<bool>("is_superuser"),
row.get_as<bool>("can_login"),
(row.has("member_of")
? row.get_set<sstring>("member_of")
: role_set())});
});
}
static future<record> require_record(cql3::query_processor& qp, stdx::string_view role_name) {
return find_record(qp, role_name).then([role_name](stdx::optional<record> mr) {
if (!mr) {
throw nonexistant_role(role_name);
}
return make_ready_future<record>(*mr);
});
}
static bool has_can_login(const cql3::untyped_result_set_row& row) {
return row.has("can_login") && !(boolean_type->deserialize(row.get_blob("can_login")).is_null());
}
stdx::string_view standard_role_manager_name() noexcept {
static const sstring instance = meta::AUTH_PACKAGE_NAME + "CassandraRoleManager";
return instance;
}
stdx::string_view standard_role_manager::qualified_java_name() const noexcept {
return standard_role_manager_name();
}
const resource_set& standard_role_manager::protected_resources() const {
static const resource_set resources({
make_data_resource(meta::AUTH_KS, meta::roles_table::name),
make_data_resource(meta::AUTH_KS, meta::role_members_table::name)});
return resources;
}
future<> standard_role_manager::create_metadata_tables_if_missing() const {
static const sstring create_role_members_query = sprint(
"CREATE TABLE %s ("
" role text,"
" member text,"
" PRIMARY KEY (role, member)"
")",
meta::role_members_table::qualified_name());
return when_all_succeed(
create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
meta::roles_table::creation_query(),
_migration_manager),
create_metadata_table_if_missing(
meta::role_members_table::name,
_qp,
create_role_members_query,
_migration_manager));
}
future<> standard_role_manager::create_default_role_if_missing() const {
return default_role_row_satisfies(_qp, &has_can_login).then([this](bool exists) {
if (!exists) {
static const sstring query = sprint(
"INSERT INTO %s (%s, is_superuser, can_login) VALUES (?, true, true)",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
db::consistency_level::QUORUM,
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
});
}
return make_ready_future<>();
}).handle_exception_type([](const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
return make_exception_future<>(e);
});
}
static const sstring legacy_table_name{"users"};
bool standard_role_manager::legacy_metadata_exists() const {
return _qp.db().local().has_schema(meta::AUTH_KS, legacy_table_name);
}
future<> standard_role_manager::migrate_legacy_metadata() const {
log.info("Starting migration of legacy user metadata.");
static const sstring query = sprint("SELECT * FROM %s.%s", meta::AUTH_KS, legacy_table_name);
return _qp.process(
query,
db::consistency_level::QUORUM).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_as<bool>("super");
config.can_login = true;
return do_with(
row.get_as<sstring>("name"),
std::move(config),
[this](const auto& name, const auto& config) {
return this->create_or_replace(name, config);
});
}).finally([results] {});
}).then([] {
log.info("Finished migrating legacy user metadata.");
}).handle_exception([](std::exception_ptr ep) {
log.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> standard_role_manager::start() {
return once_among_shards([this] {
return this->create_metadata_tables_if_missing().then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
if (this->legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
return;
}
if (this->legacy_metadata_exists()) {
this->migrate_legacy_metadata().get0();
return;
}
create_default_role_if_missing().get0();
});
});
});
});
}
future<> standard_role_manager::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
}
future<> standard_role_manager::create_or_replace(stdx::string_view role_name, const role_config& c) const {
static const sstring query = sprint(
"INSERT INTO %s (%s, is_superuser, can_login) VALUES (?, ?, ?)",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_role(role_name),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
future<>
standard_role_manager::create(stdx::string_view role_name, const role_config& c) const {
return this->exists(role_name).then([this, role_name, &c](bool role_exists) {
if (role_exists) {
throw role_already_exists(role_name);
}
return this->create_or_replace(role_name, c);
});
}
future<>
standard_role_manager::alter(stdx::string_view role_name, const role_config_update& u) const {
static const auto build_column_assignments = [](const role_config_update& u) -> sstring {
std::vector<sstring> assignments;
if (u.is_superuser) {
assignments.push_back(sstring("is_superuser = ") + (*u.is_superuser ? "true" : "false"));
}
if (u.can_login) {
assignments.push_back(sstring("can_login = ") + (*u.can_login ? "true" : "false"));
}
return boost::algorithm::join(assignments, ", ");
};
return require_record(_qp, role_name).then([this, role_name, &u](record) {
if (!u.is_superuser && !u.can_login) {
return make_ready_future<>();
}
return _qp.process(
sprint(
"UPDATE %s SET %s WHERE %s = ?",
meta::roles_table::qualified_name(),
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
{sstring(role_name)}).discard_result();
});
}
future<> standard_role_manager::drop(stdx::string_view role_name) const {
return this->exists(role_name).then([this, role_name](bool role_exists) {
if (!role_exists) {
throw nonexistant_role(role_name);
}
// First, revoke this role from all roles that are members of it.
const auto revoke_from_members = [this, role_name] {
static const sstring query = sprint(
"SELECT member FROM %s WHERE role = ?",
meta::role_members_table::qualified_name());
return _qp.process(
query,
consistency_for_role(role_name),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
members->end(),
[this, role_name](const cql3::untyped_result_set_row& member_row) {
const sstring member = member_row.template get_as<sstring>("member");
return this->modify_membership(member, role_name, membership_change::remove);
}).finally([members] {});
});
};
// In parallel, revoke all roles that this role is a member of.
const auto revoke_members_of = [this, grantee = role_name] {
return this->query_granted(
grantee,
recursive_role_query::no).then([this, grantee](role_set granted_roles) {
return do_with(
std::move(granted_roles),
[this, grantee](const role_set& granted_roles) {
return parallel_for_each(
granted_roles.begin(),
granted_roles.end(),
[this, grantee](const sstring& role_name) {
return this->modify_membership(grantee, role_name, membership_change::remove);
});
});
});
};
// Finally, delete the role itself.
auto delete_role = [this, role_name] {
static const sstring query = sprint(
"DELETE FROM %s WHERE %s = ?",
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_role(role_name),
{sstring(role_name)}).discard_result();
};
return when_all_succeed(revoke_from_members(), revoke_members_of()).then([delete_role = std::move(delete_role)] {
return delete_role();
});
});
}
future<>
standard_role_manager::modify_membership(
stdx::string_view grantee_name,
stdx::string_view role_name,
membership_change ch) const {
const auto modify_roles = [this, role_name, grantee_name, ch] {
const auto query = sprint(
"UPDATE %s SET member_of = member_of %s ? WHERE %s = ?",
meta::roles_table::qualified_name(),
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
return _qp.process(
query,
consistency_for_role(grantee_name),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
const auto modify_role_members = [this, role_name, grantee_name, ch] {
switch (ch) {
case membership_change::add:
return _qp.process(
sprint(
"INSERT INTO %s (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
return _qp.process(
sprint(
"DELETE FROM %s WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
return make_ready_future<>();
};
return when_all_succeed(modify_roles(), modify_role_members());
}
future<>
standard_role_manager::grant(stdx::string_view grantee_name, stdx::string_view role_name) const {
const auto check_redundant = [this, role_name, grantee_name] {
return this->query_granted(
grantee_name,
recursive_role_query::yes).then([role_name, grantee_name](role_set roles) {
if (roles.count(sstring(role_name)) != 0) {
throw role_already_included(grantee_name, role_name);
}
return make_ready_future<>();
});
};
const auto check_cycle = [this, role_name, grantee_name] {
return this->query_granted(
role_name,
recursive_role_query::yes).then([role_name, grantee_name](role_set roles) {
if (roles.count(sstring(grantee_name)) != 0) {
throw role_already_included(role_name, grantee_name);
}
return make_ready_future<>();
});
};
return when_all_succeed(check_redundant(), check_cycle()).then([this, role_name, grantee_name] {
return this->modify_membership(grantee_name, role_name, membership_change::add);
});
}
future<>
standard_role_manager::revoke(stdx::string_view revokee_name, stdx::string_view role_name) const {
return this->exists(role_name).then([this, revokee_name, role_name](bool role_exists) {
if (!role_exists) {
throw nonexistant_role(sstring(role_name));
}
}).then([this, revokee_name, role_name] {
return this->query_granted(
revokee_name,
recursive_role_query::no).then([revokee_name, role_name](role_set roles) {
if (roles.count(sstring(role_name)) == 0) {
throw revoke_ungranted_role(revokee_name, role_name);
}
return make_ready_future<>();
}).then([this, revokee_name, role_name] {
return this->modify_membership(revokee_name, role_name, membership_change::remove);
});
});
}
static future<> collect_roles(
cql3::query_processor& qp,
stdx::string_view grantee_name,
bool recurse,
role_set& roles) {
return require_record(qp, grantee_name).then([&qp, &roles, recurse](record r) {
return do_with(std::move(r.member_of), [&qp, &roles, recurse](const role_set& memberships) {
return do_for_each(memberships.begin(), memberships.end(), [&qp, &roles, recurse](const sstring& role_name) {
roles.insert(role_name);
if (recurse) {
return collect_roles(qp, role_name, true, roles);
}
return make_ready_future<>();
});
});
});
}
future<role_set> standard_role_manager::query_granted(stdx::string_view grantee_name, recursive_role_query m) const {
const bool recurse = (m == recursive_role_query::yes);
return do_with(
role_set{sstring(grantee_name)},
[this, grantee_name, recurse](role_set& roles) {
return collect_roles(_qp, grantee_name, recurse, roles).then([&roles] { return roles; });
});
}
future<role_set> standard_role_manager::query_all() const {
static const sstring query = sprint(
"SELECT %s FROM %s",
meta::roles_table::role_col_name,
meta::roles_table::qualified_name());
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
return _qp.process(query, db::consistency_level::QUORUM).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(
results->begin(),
results->end(),
std::inserter(roles, roles.begin()),
[](const cql3::untyped_result_set_row& row) {
return row.get_as<sstring>(role_col_name_string);
});
return roles;
});
}
future<bool> standard_role_manager::exists(stdx::string_view role_name) const {
return find_record(_qp, role_name).then([](stdx::optional<record> mr) {
return static_cast<bool>(mr);
});
}
future<bool> standard_role_manager::is_superuser(stdx::string_view role_name) const {
return require_record(_qp, role_name).then([](record r) {
return r.is_superuser;
});
}
future<bool> standard_role_manager::can_login(stdx::string_view role_name) const {
return require_record(_qp, role_name).then([](record r) {
return r.can_login;
});
}
}
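
The role traversal used by query_granted() and the cycle check in grant() can be illustrated without Seastar or CQL. The following is only a minimal in-memory sketch: membership_map and would_create_cycle are hypothetical names that do not appear in the patch, a missing role is silently skipped instead of raising nonexistant_role, and the "inserted" guard is an addition that keeps the recursion finite even if the data already contains a cycle.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

using role_set = std::unordered_set<std::string>;
// Hypothetical stand-in for the roles table: role name -> roles granted directly to it.
using membership_map = std::unordered_map<std::string, role_set>;

// Adds every role granted to `grantee` to `roles`, following grants
// transitively when `recurse` is true.
inline void collect_roles(const membership_map& m, const std::string& grantee, bool recurse, role_set& roles) {
    const auto it = m.find(grantee);
    if (it == m.end()) {
        return; // Assumption: unknown roles are ignored in this sketch.
    }
    for (const auto& role : it->second) {
        const bool inserted = roles.insert(role).second;
        if (recurse && inserted) {
            collect_roles(m, role, true, roles);
        }
    }
}

// Like the code above, the result always contains the grantee itself.
inline role_set query_granted(const membership_map& m, const std::string& grantee, bool recurse) {
    role_set roles{grantee};
    collect_roles(m, grantee, recurse, roles);
    return roles;
}

// Granting `role` to `grantee` would close a cycle if `grantee` is already
// reachable from `role`.
inline bool would_create_cycle(const membership_map& m, const std::string& grantee, const std::string& role) {
    return query_granted(m, role, true).count(grantee) != 0;
}
```

With alice granted dev and dev granted staff, a recursive query for alice yields all three names, and granting alice to staff is reported as a cycle.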


@@ -0,0 +1,105 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "auth/role_manager.hh"
#include <experimental/string_view>
#include <unordered_set>
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include "stdx.hh"
#include "seastarx.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class migration_manager;
}
namespace auth {
stdx::string_view standard_role_manager_name() noexcept;
class standard_role_manager final : public role_manager {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
future<> _stopped;
seastar::abort_source _as;
public:
standard_role_manager(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm)
, _stopped(make_ready_future<>()) {
}
virtual stdx::string_view qualified_java_name() const noexcept override;
virtual const resource_set& protected_resources() const override;
virtual future<> start() override;
virtual future<> stop() override;
virtual future<> create(stdx::string_view role_name, const role_config&) const override;
virtual future<> drop(stdx::string_view role_name) const override;
virtual future<> alter(stdx::string_view role_name, const role_config_update&) const override;
virtual future<> grant(stdx::string_view grantee_name, stdx::string_view role_name) const override;
virtual future<> revoke(stdx::string_view revokee_name, stdx::string_view role_name) const override;
virtual future<role_set> query_granted(stdx::string_view grantee_name, recursive_role_query) const override;
virtual future<role_set> query_all() const override;
virtual future<bool> exists(stdx::string_view role_name) const override;
virtual future<bool> is_superuser(stdx::string_view role_name) const override;
virtual future<bool> can_login(stdx::string_view role_name) const override;
private:
enum class membership_change { add, remove };
future<> create_metadata_tables_if_missing() const;
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata() const;
future<> create_default_role_if_missing() const;
future<> create_or_replace(stdx::string_view role_name, const role_config&) const;
future<> modify_membership(stdx::string_view grantee_name, stdx::string_view role_name, membership_change) const;
};
}

262
auth/transitional.cc Normal file

@@ -0,0 +1,262 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2017 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/default_authorizer.hh"
#include "auth/password_authenticator.hh"
#include "auth/permission.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
namespace auth {
static const sstring PACKAGE_NAME("com.scylladb.auth.");
static const sstring& transitional_authenticator_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthenticator";
return name;
}
static const sstring& transitional_authorizer_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthorizer";
return name;
}
class transitional_authenticator : public authenticator {
std::unique_ptr<authenticator> _authenticator;
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, mm)) {
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
}
virtual future<> start() override {
return _authenticator->start();
}
virtual future<> stop() override {
return _authenticator->stop();
}
virtual const sstring& qualified_java_name() const override {
return transitional_authenticator_name();
}
virtual bool require_authentication() const override {
return true;
}
virtual authentication_option_set supported_options() const override {
return _authenticator->supported_options();
}
virtual authentication_option_set alterable_options() const override {
return _authenticator->alterable_options();
}
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty())
&& (!credentials.count(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
return make_ready_future().then([this, &credentials] {
return _authenticator->authenticate(credentials);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
}
virtual future<> create(stdx::string_view role_name, const authentication_options& options) const override {
return _authenticator->create(role_name, options);
}
virtual future<> alter(stdx::string_view role_name, const authentication_options& options) const override {
return _authenticator->alter(role_name, options);
}
virtual future<> drop(stdx::string_view role_name) const override {
return _authenticator->drop(role_name);
}
virtual future<custom_options> query_custom_options(stdx::string_view role_name) const override {
return _authenticator->query_custom_options(role_name);
}
virtual const resource_set& protected_resources() const override {
return _authenticator->protected_resources();
}
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
class sasl_wrapper : public sasl_challenge {
public:
sasl_wrapper(::shared_ptr<sasl_challenge> sasl)
: _sasl(std::move(sasl)) {
}
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (exceptions::authentication_exception&) {
_complete = true;
return {};
}
}
virtual bool is_complete() const override {
return _complete || _sasl->is_complete();
}
virtual future<authenticated_user> get_authenticated_user() const override {
return futurize_apply([this] {
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
});
}
private:
::shared_ptr<sasl_challenge> _sasl;
bool _complete = false;
};
return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());
}
};
class transitional_authorizer : public authorizer {
std::unique_ptr<authorizer> _authorizer;
public:
transitional_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)
: transitional_authorizer(std::make_unique<default_authorizer>(qp, mm)) {
}
transitional_authorizer(std::unique_ptr<authorizer> a)
: _authorizer(std::move(a)) {
}
~transitional_authorizer() {
}
virtual future<> start() override {
return _authorizer->start();
}
virtual future<> stop() override {
return _authorizer->stop();
}
virtual const sstring& qualified_java_name() const override {
return transitional_authorizer_name();
}
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
static const permission_set transitional_permissions =
permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY>();
return make_ready_future<permission_set>(transitional_permissions);
}
virtual future<> grant(stdx::string_view s, permission_set ps, const resource& r) const override {
return _authorizer->grant(s, std::move(ps), r);
}
virtual future<> revoke(stdx::string_view s, permission_set ps, const resource& r) const override {
return _authorizer->revoke(s, std::move(ps), r);
}
virtual future<std::vector<permission_details>> list_all() const override {
return _authorizer->list_all();
}
virtual future<> revoke_all(stdx::string_view s) const override {
return _authorizer->revoke_all(s);
}
virtual future<> revoke_all(const resource& r) const override {
return _authorizer->revoke_all(r);
}
virtual const resource_set& protected_resources() const override {
return _authorizer->protected_resources();
}
};
}
//
// To ensure correct initialization order, we unfortunately need to use string literals.
//
static const class_registrator<
auth::authenticator,
auth::transitional_authenticator,
cql3::query_processor&,
::service::migration_manager&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,
auth::transitional_authorizer,
cql3::query_processor&,
::service::migration_manager&> transitional_authorizer_reg(auth::PACKAGE_NAME + "TransitionalAuthorizer");
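
The fallback rule implemented by transitional_authenticator::authenticate() can be shown in isolation. This is a synchronous sketch under stated assumptions: user, authentication_error, authenticate_fn and transitional_authenticate are hypothetical stand-ins introduced for illustration, not types from the patch, and the real code returns futures instead of values.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for exceptions::authentication_exception.
struct authentication_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for authenticated_user; an empty name means anonymous.
struct user {
    std::string name;
    bool is_anonymous() const { return name.empty(); }
};

// The wrapped, strict authenticator: returns a user or throws authentication_error.
using authenticate_fn = std::function<user(const std::string& username, const std::string& password)>;

// Transitional behaviour: empty credentials and rejected credentials both
// degrade to the anonymous user; any other exception still propagates.
inline user transitional_authenticate(
        const authenticate_fn& strict,
        const std::string& username,
        const std::string& password) {
    if (username.empty() && password.empty()) {
        return user{};
    }
    try {
        return strict(username, password);
    } catch (const authentication_error&) {
        return user{};
    }
}
```

Only the authentication failure is swallowed, which mirrors the catch clause above: unrelated errors are not turned into anonymous logins.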

138
backlog_controller.hh Normal file

@@ -0,0 +1,138 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/scheduling.hh>
#include <seastar/core/timer.hh>
#include <seastar/core/gate.hh>
#include <chrono>
// Simple proportional controller to adjust shares for processes for which a backlog can be clearly
// defined.
//
// Goal is to consume the backlog as fast as we can, but not so fast that we steal all the CPU from
// incoming requests, and at the same time minimize user-visible fluctuations in the quota.
//
// What that translates to is we'll try to keep the backlog's first derivative at 0 (IOW, we keep
// backlog constant). As the backlog grows we increase CPU usage, decreasing CPU usage as the
// backlog diminishes.
//
// The exact point at which the controller stops determines the desired CPU usage. As the backlog
// grows and approaches the desired maximum, we need to be more aggressive. We will therefore define
// two thresholds, and increase the constant as we cross them.
//
// Doing that divides the range in three (before the first, between first and second, and after
// second threshold), and we'll be slow to grow in the first region, grow normally in the second
// region, and aggressively in the third region.
//
// The constants q1 and q2 are used to determine the proportional factor at each stage.
class backlog_controller {
public:
future<> shutdown() {
_update_timer.cancel();
return std::move(_inflight_update);
}
protected:
struct control_point {
float input;
float output;
};
seastar::scheduling_group _scheduling_group;
const ::io_priority_class& _io_priority;
std::chrono::milliseconds _interval;
timer<> _update_timer;
std::vector<control_point> _control_points;
std::function<float()> _current_backlog;
// Updating shares for an I/O class may contact another shard, so it returns a future.
future<> _inflight_update;
virtual void update_controller(float quota);
void adjust();
backlog_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval,
std::vector<control_point> control_points, std::function<float()> backlog)
: _scheduling_group(sg)
, _io_priority(iop)
, _interval(interval)
, _update_timer([this] { adjust(); })
, _control_points({{0,0}})
, _current_backlog(std::move(backlog))
, _inflight_update(make_ready_future<>())
{
_control_points.insert(_control_points.end(), control_points.begin(), control_points.end());
_update_timer.arm_periodic(_interval);
}
// Used when the controllers are disabled and a static share is used.
// When that option is deprecated we should remove this.
backlog_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares)
: _scheduling_group(sg)
, _io_priority(iop)
, _inflight_update(make_ready_future<>())
{
update_controller(static_shares);
}
virtual ~backlog_controller() {}
};
// memtable flush CPU controller.
//
// - First threshold is the soft limit line,
// - Maximum is the point at which we'd stop consuming requests,
// - Second threshold is halfway between them.
//
// Below the soft limit, we are in no particular hurry to flush, since it means we're set to
// complete flushing before a new memtable is ready. The quota is dirty * q1, and q1 is set to a
// low number.
//
// The first half of the virtual dirty region is where we expect to be usually, so we have a low
// slope corresponding to a sluggish response between q1 * soft_limit and q2.
//
// In the second half, we're getting close to the hard dirty limit so we increase the slope and
// become more responsive, up to a maximum quota of qmax.
class flush_controller : public backlog_controller {
static constexpr float hard_dirty_limit = 1.0f;
public:
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{soft_limit, 100}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::move(current_dirty)
)
{}
};
class compaction_controller : public backlog_controller {
public:
static constexpr unsigned normalization_factor = 30;
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, std::function<float()> current_backlog)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{0.5, 10}, {1.5, 100} , {normalization_factor, 1000}}),
std::move(current_backlog)
)
{}
};
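
The body of backlog_controller::adjust() is not part of this header, so how the control points turn a backlog into shares is not visible here. A plausible reading of the comments is linear interpolation between neighbouring control points, clamped at the last one. The sketch below assumes exactly that; interpolate_shares is a hypothetical helper, and it expects at least two points sorted by strictly increasing input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct control_point {
    float input;   // backlog level
    float output;  // shares assigned at that backlog level
};

// Hypothetical helper: maps a backlog to shares by linear interpolation
// between the two control points that surround it.
// Assumes `points` has at least two entries sorted by strictly increasing input.
inline float interpolate_shares(const std::vector<control_point>& points, float backlog) {
    if (backlog >= points.back().input) {
        return points.back().output;   // clamp at the maximum
    }
    if (backlog <= points.front().input) {
        return points.front().output;  // clamp at the minimum
    }
    // Find the first point whose input is not below the backlog.
    std::size_t idx = 1;
    while (idx < points.size() - 1 && points[idx].input < backlog) {
        ++idx;
    }
    const control_point& hi = points[idx];
    const control_point& lo = points[idx - 1];
    return lo.output + (backlog - lo.input) * (hi.output - lo.output) / (hi.input - lo.input);
}
```

Using the compaction_controller points with the implicit {0, 0} origin prepended, a backlog of 1.0 falls between {0.5, 10} and {1.5, 100} and maps to 55 shares, while any backlog at or above 30 stays at 1000.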


@@ -0,0 +1,664 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <vector>
#include "row_cache.hh"
#include "mutation_reader.hh"
#include "mutation_fragment.hh"
#include "partition_version.hh"
#include "utils/logalloc.hh"
#include "query-request.hh"
#include "partition_snapshot_reader.hh"
#include "partition_snapshot_row_cursor.hh"
#include "read_context.hh"
#include "flat_mutation_reader.hh"
namespace cache {
extern logging::logger clogger;
class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
enum class state {
before_static_row,
// Invariants:
// - position_range(_lower_bound, _upper_bound) covers all not yet emitted positions from current range
// - if _next_row has valid iterators:
// - _next_row points to the nearest row in cache >= _lower_bound
// - _next_row_in_range = _next_row.position() < _upper_bound
// - if _next_row doesn't have valid iterators, it has no meaning.
reading_from_cache,
// Starts reading from underlying reader.
// The range to read is position_range(_lower_bound, min(_next_row.position(), _upper_bound)).
// Invariants:
// - _next_row_in_range = _next_row.position() < _upper_bound
move_to_underlying,
// Invariants:
// - Upper bound of the read is min(_next_row.position(), _upper_bound)
// - _next_row_in_range = _next_row.position() < _upper_bound
// - _last_row points at a direct predecessor of the next row which is going to be read.
// Used for populating continuity.
// - _population_range_starts_before_all_rows is set accordingly
reading_from_underlying,
end_of_stream
};
lw_shared_ptr<partition_snapshot> _snp;
position_in_partition::tri_compare _position_cmp;
query::clustering_key_filter_ranges _ck_ranges;
query::clustering_row_ranges::const_iterator _ck_ranges_curr;
query::clustering_row_ranges::const_iterator _ck_ranges_end;
lsa_manager _lsa_manager;
partition_snapshot_row_weakref _last_row;
// Holds the lower bound of a position range which hasn't been processed yet.
// Only rows with positions < _lower_bound have been emitted, and only
// range_tombstones with positions <= _lower_bound.
position_in_partition _lower_bound;
position_in_partition_view _upper_bound;
state _state = state::before_static_row;
lw_shared_ptr<read_context> _read_context;
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
// True iff the current population interval, which begins after the previous clustering row,
// starts before all clustered rows.
// We cannot infer this from _lower_bound, because emitting range tombstones advances _lower_bound.
// Since clustering intervals are marked as continuous only when a clustering_row is consumed,
// relying on _lower_bound would prevent us from marking the interval as continuous.
// Valid when _state == reading_from_underlying.
bool _population_range_starts_before_all_rows = false;
future<> do_fill_buffer(db::timeout_clock::time_point);
void copy_from_cache_to_buffer();
future<> process_static_row(db::timeout_clock::time_point);
void move_to_end();
void move_to_next_range();
void move_to_range(query::clustering_row_ranges::const_iterator);
void move_to_next_entry();
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
void add_to_buffer(mutation_fragment&&);
future<> read_from_underlying(db::timeout_clock::time_point);
void start_reading_from_underlying();
bool after_current_range(position_in_partition_view position);
bool can_populate() const;
// Marks the range between _last_row (exclusive) and _next_row (exclusive) as continuous,
// provided that the underlying reader still matches the latest version of the partition.
void maybe_update_continuity();
// Tries to ensure that an entry for the lower bound of the current population range exists.
// Returns false if that fails, in which case the range cannot be populated.
// Assumes can_populate().
bool ensure_population_lower_bound();
void maybe_add_to_cache(const mutation_fragment& mf);
void maybe_add_to_cache(const clustering_row& cr);
void maybe_add_to_cache(const range_tombstone& rt);
void maybe_add_to_cache(const static_row& sr);
void maybe_set_static_row_continuous();
void finish_reader() {
push_mutation_fragment(partition_end());
_end_of_stream = true;
_state = state::end_of_stream;
}
void touch_partition();
public:
cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges&& crr,
lw_shared_ptr<read_context> ctx,
lw_shared_ptr<partition_snapshot> snp,
row_cache& cache)
: flat_mutation_reader::impl(std::move(s))
, _snp(std::move(snp))
, _position_cmp(*_schema)
, _ck_ranges(std::move(crr))
, _ck_ranges_curr(_ck_ranges.begin())
, _ck_ranges_end(_ck_ranges.end())
, _lsa_manager(cache)
, _lower_bound(position_in_partition::before_all_clustered_rows())
, _upper_bound(position_in_partition_view::before_all_clustered_rows())
, _read_context(std::move(ctx))
, _next_row(*_schema, *_snp)
{
clogger.trace("csm {}: table={}.{}", this, _schema->ks_name(), _schema->cf_name());
push_mutation_fragment(partition_start(std::move(dk), _snp->partition_tombstone()));
}
cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override;
virtual ~cache_flat_mutation_reader() {
maybe_merge_versions(_snp, _lsa_manager.region(), _lsa_manager.read_section());
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_end_of_stream = true;
}
}
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point timeout) override {
clear_buffer();
_end_of_stream = true;
return make_ready_future<>();
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
throw std::bad_function_call();
}
};
inline
future<> cache_flat_mutation_reader::process_static_row(db::timeout_clock::time_point timeout) {
if (_snp->static_row_continuous()) {
_read_context->cache().on_row_hit();
static_row sr = _lsa_manager.run_in_read_section([this] {
return _snp->static_row(_read_context->digest_requested());
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(std::move(sr)));
}
return make_ready_future<>();
} else {
_read_context->cache().on_row_miss();
return _read_context->get_next_fragment(timeout).then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
push_mutation_fragment(std::move(*sr));
}
maybe_set_static_row_continuous();
});
}
}
inline
void cache_flat_mutation_reader::touch_partition() {
if (_snp->at_latest_version()) {
rows_entry& last_dummy = *_snp->version()->partition().clustered_rows().rbegin();
_snp->tracker()->touch(last_dummy);
}
}
inline
future<> cache_flat_mutation_reader::fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::before_static_row) {
auto after_static_row = [this, timeout] {
if (_ck_ranges_curr == _ck_ranges_end) {
touch_partition();
finish_reader();
return make_ready_future<>();
}
_state = state::reading_from_cache;
_lsa_manager.run_in_read_section([this] {
move_to_range(_ck_ranges_curr);
});
return fill_buffer(timeout);
};
if (_schema->has_static_columns()) {
return process_static_row(timeout).then(std::move(after_static_row));
} else {
return after_static_row();
}
}
clogger.trace("csm {}: fill_buffer(), range={}, lb={}", this, *_ck_ranges_curr, _lower_bound);
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this, timeout] {
return do_fill_buffer(timeout);
});
}
inline
future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
return read_from_underlying(timeout);
});
}
if (_state == state::reading_from_underlying) {
return read_from_underlying(timeout);
}
// assert(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
auto next_valid = _next_row.iterators_valid();
clogger.trace("csm {}: reading_from_cache, range=[{}, {}), next={}, valid={}", this, _lower_bound,
_upper_bound, _next_row.position(), next_valid);
// We assume that if there was eviction, and thus the range may
// no longer be continuous, the cursor was invalidated.
if (!next_valid) {
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
if (!adjacent && !_next_row.continuous()) {
_last_row = nullptr; // We could insert a dummy here, but this path is unlikely.
start_reading_from_underlying();
return make_ready_future<>();
}
}
_next_row.maybe_refresh();
clogger.trace("csm {}: next={}, cont={}", this, _next_row.position(), _next_row.continuous());
while (!is_buffer_full() && _state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
break;
}
}
return make_ready_future<>();
});
}
inline
future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::time_point timeout) {
return consume_mutation_fragments_until(_read_context->underlying().underlying(),
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();
maybe_add_to_cache(mf);
add_to_buffer(std::move(mf));
},
[this] {
_state = state::reading_from_cache;
_lsa_manager.run_in_update_section([this] {
auto same_pos = _next_row.maybe_refresh();
if (!same_pos) {
_read_context->cache().on_mispopulate(); // FIXME: Insert dummy entry at _upper_bound.
_next_row_in_range = !after_current_range(_next_row.position());
if (!_next_row.continuous()) {
start_reading_from_underlying();
}
return;
}
if (_next_row_in_range) {
maybe_update_continuity();
_last_row = _next_row;
add_to_buffer(_next_row);
try {
move_to_next_entry();
} catch (const std::bad_alloc&) {
// We cannot reenter the section, since we may have moved to the new range, and
// because add_to_buffer() should not be repeated.
_snp->region().allocator().invalidate_references(); // Invalidates _next_row
}
} else {
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else if (can_populate()) {
rows_entry::compare less(*_schema);
auto& rows = _snp->version()->partition().clustered_rows();
if (query::is_single_row(*_schema, *_ck_ranges_curr)) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
auto it = insert_result.first;
if (inserted) {
_snp->tracker()->insert(*e);
e.release();
auto next = std::next(it);
it->set_continuous(next->continuous());
clogger.trace("csm {}: inserted dummy at {}, cont={}", this, it->position(), it->continuous());
}
});
} else if (ensure_population_lower_bound()) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", this, _upper_bound);
_snp->tracker()->insert(*e);
e.release();
} else {
clogger.trace("csm {}: mark {} as continuous", this, insert_result.first->position());
insert_result.first->set_continuous(true);
}
});
}
} else {
_read_context->cache().on_mispopulate();
}
try {
move_to_next_range();
} catch (const std::bad_alloc&) {
// We cannot reenter the section, since we may have moved to the new range
_snp->region().allocator().invalidate_references(); // Invalidates _next_row
}
}
});
return make_ready_future<>();
});
}
inline
bool cache_flat_mutation_reader::ensure_population_lower_bound() {
if (_population_range_starts_before_all_rows) {
return true;
}
if (!_last_row.refresh(*_snp)) {
return false;
}
// The continuity flag we will later set for the upper bound extends to the previous row in the same version,
// so we need to ensure that the lower bound has an entry in the latest version.
if (!_last_row.is_in_latest_version()) {
with_allocator(_snp->region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
rows_entry::compare less(*_schema);
// FIXME: Avoid the copy by inserting an incomplete clustering row
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_last_row));
e->set_continuous(false);
auto insert_result = rows.insert_check(rows.end(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted lower bound dummy at {}", this, e->position());
_snp->tracker()->insert(*e);
e.release();
}
});
}
return true;
}
inline
void cache_flat_mutation_reader::maybe_update_continuity() {
if (can_populate() && ensure_population_lower_bound()) {
with_allocator(_snp->region().allocator(), [&] {
rows_entry& e = _next_row.ensure_entry_in_latest().row;
e.set_continuous(true);
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const mutation_fragment& mf) {
if (mf.is_range_tombstone()) {
maybe_add_to_cache(mf.as_range_tombstone());
} else {
assert(mf.is_clustering_row());
const clustering_row& cr = mf.as_clustering_row();
maybe_add_to_cache(cr);
}
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_last_row = nullptr;
_population_range_starts_before_all_rows = false;
_read_context->cache().on_mispopulate();
return;
}
clogger.trace("csm {}: populate({})", this, cr);
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
mutation_partition& mp = _snp->version()->partition();
rows_entry::compare less(*_schema);
if (_read_context->digest_requested()) {
cr.cells().prepare_hash(*_schema, column_kind::regular_column);
}
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
if (insert_result.second) {
_snp->tracker()->insert(*new_entry);
new_entry.release();
}
it = insert_result.first;
rows_entry& e = *it;
if (ensure_population_lower_bound()) {
clogger.trace("csm {}: set_continuous({})", this, e.position());
e.set_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
with_allocator(standard_allocator(), [&] {
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
});
_population_range_starts_before_all_rows = false;
});
}
inline
bool cache_flat_mutation_reader::after_current_range(position_in_partition_view p) {
return _position_cmp(p, _upper_bound) >= 0;
}
inline
void cache_flat_mutation_reader::start_reading_from_underlying() {
clogger.trace("csm {}: start_reading_from_underlying(), range=[{}, {})", this, _lower_bound, _next_row_in_range ? _next_row.position() : _upper_bound);
_state = state::move_to_underlying;
_next_row.touch();
}
inline
void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
clogger.trace("csm {}: copy_from_cache, next={}, next_row_in_range={}", this, _next_row.position(), _next_row_in_range);
_next_row.touch();
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto &&rts : _snp->range_tombstones(_lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
// This guarantees that rts starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (rts.trim_front(*_schema, _lower_bound)) {
_lower_bound = position_in_partition(rts.position());
if (is_buffer_full()) {
return;
}
push_mutation_fragment(std::move(rts));
}
}
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
if (_next_row_in_range) {
_last_row = _next_row;
add_to_buffer(_next_row);
move_to_next_entry();
} else {
move_to_next_range();
}
}
inline
void cache_flat_mutation_reader::move_to_end() {
finish_reader();
clogger.trace("csm {}: eos", this);
}
inline
void cache_flat_mutation_reader::move_to_next_range() {
auto next_it = std::next(_ck_ranges_curr);
if (next_it == _ck_ranges_end) {
move_to_end();
_ck_ranges_curr = next_it;
} else {
move_to_range(next_it);
}
}
inline
void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::const_iterator next_it) {
auto lb = position_in_partition::for_range_start(*next_it);
auto ub = position_in_partition_view::for_range_end(*next_it);
_last_row = nullptr;
_lower_bound = std::move(lb);
_upper_bound = std::move(ub);
_ck_ranges_curr = next_it;
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: move_to_range(), range={}, lb={}, ub={}, next={}", this, *_ck_ranges_curr, _lower_bound, _upper_bound, _next_row.position());
if (!adjacent && !_next_row.continuous()) {
// FIXME: We don't insert a dummy for a singular range, to avoid allocating 3 entries
// for a hit (before, at and after). If we supported the concept of an incomplete row,
// we could instead insert such a row for the lower bound if it's full, for both singular and
// non-singular ranges.
if (_ck_ranges_curr->start() && !query::is_single_row(*_schema, *_ck_ranges_curr)) {
// Insert dummy for lower bound
if (can_populate()) {
// FIXME: _lower_bound could be adjacent to the previous row, in which case we could skip this
clogger.trace("csm {}: insert dummy at {}", this, _lower_bound);
auto it = with_allocator(_lsa_manager.region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
});
_snp->tracker()->insert(*it);
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
} else {
_read_context->cache().on_mispopulate();
}
}
start_reading_from_underlying();
}
}
// _next_row must be inside the range.
inline
void cache_flat_mutation_reader::move_to_next_entry() {
clogger.trace("csm {}: move_to_next_entry(), curr={}", this, _next_row.position());
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
} else {
if (!_next_row.next()) {
move_to_end();
return;
}
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: next={}, cont={}, in_range={}", this, _next_row.position(), _next_row.continuous(), _next_row_in_range);
if (!_next_row.continuous()) {
start_reading_from_underlying();
}
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_to_buffer({})", this, mf);
if (mf.is_clustering_row()) {
add_clustering_row_to_buffer(std::move(mf));
} else {
assert(mf.is_range_tombstone());
add_to_buffer(std::move(mf).as_range_tombstone());
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_cursor& row) {
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(row.row(_read_context->digest_requested()));
}
}
// Maintains the following invariants, also in case of exception:
// (1) no fragment with position >= _lower_bound was pushed yet
// (2) If _lower_bound > mf.position(), mf was emitted
inline
void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_clustering_row_to_buffer({})", this, mf);
auto& row = mf.as_clustering_row();
auto new_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
_lower_bound = std::move(new_lower_bound);
}
inline
void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
clogger.trace("csm {}: add_to_buffer({})", this, rt);
// This guarantees that rt starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (!rt.trim_front(*_schema, _lower_bound)) {
return;
}
_lower_bound = position_in_partition(rt.position());
push_mutation_fragment(std::move(rt));
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone& rt) {
if (can_populate()) {
clogger.trace("csm {}: maybe_add_to_cache({})", this, rt);
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().row_tombstones().apply_monotonically(*_schema, rt);
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const static_row& sr) {
if (can_populate()) {
clogger.trace("csm {}: populate({})", this, sr);
_read_context->cache().on_static_row_insert();
_lsa_manager.run_in_update_section_with_allocator([&] {
if (_read_context->digest_requested()) {
sr.cells().prepare_hash(*_schema, column_kind::static_column);
}
_snp->version()->partition().static_row().apply(*_schema, column_kind::static_column, sr.cells());
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_flat_mutation_reader::maybe_set_static_row_continuous() {
if (can_populate()) {
clogger.trace("csm {}: set static row continuous", this);
_snp->version()->partition().set_static_row_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
}
inline
bool cache_flat_mutation_reader::can_populate() const {
return _snp->at_latest_version() && _read_context->cache().phase_of(_read_context->key()) == _read_context->phase();
}
} // namespace cache
inline flat_mutation_reader make_cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
lw_shared_ptr<cache::read_context> ctx,
lw_shared_ptr<partition_snapshot> snp)
{
return make_flat_mutation_reader<cache::cache_flat_mutation_reader>(
std::move(s), std::move(dk), std::move(crr), std::move(ctx), std::move(snp), cache);
}


@@ -1,538 +0,0 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <vector>
#include "row_cache.hh"
#include "mutation_reader.hh"
#include "streamed_mutation.hh"
#include "partition_version.hh"
#include "utils/logalloc.hh"
#include "query-request.hh"
#include "partition_snapshot_reader.hh"
#include "partition_snapshot_row_cursor.hh"
#include "read_context.hh"
namespace cache {
class lsa_manager {
row_cache& _cache;
public:
lsa_manager(row_cache& cache) : _cache(cache) { }
template<typename Func>
decltype(auto) run_in_read_section(const Func& func) {
return _cache._read_section(_cache._tracker.region(), [&func] () {
return with_linearized_managed_bytes([&func] () {
return func();
});
});
}
template<typename Func>
decltype(auto) run_in_update_section(const Func& func) {
return _cache._update_section(_cache._tracker.region(), [&func] () {
return with_linearized_managed_bytes([&func] () {
return func();
});
});
}
template<typename Func>
void run_in_update_section_with_allocator(Func&& func) {
return _cache._update_section(_cache._tracker.region(), [this, &func] () {
return with_linearized_managed_bytes([this, &func] () {
return with_allocator(_cache._tracker.region().allocator(), [this, &func] () mutable {
return func();
});
});
});
}
logalloc::region& region() { return _cache._tracker.region(); }
logalloc::allocating_section& read_section() { return _cache._read_section; }
};
class cache_streamed_mutation final : public streamed_mutation::impl {
enum class state {
before_static_row,
// Invariants:
// - position_range(_lower_bound, _upper_bound) covers all not yet emitted positions from current range
// - _next_row points to the nearest row in cache >= _lower_bound
// - _next_row_in_range = _next_row.position() < _upper_bound
reading_from_cache,
// Starts reading from underlying reader.
// The range to read is position_range(_lower_bound, min(_next_row.position(), _upper_bound)).
// Invariants:
// - _next_row_in_range = _next_row.position() < _upper_bound
move_to_underlying,
// Invariants:
// - Upper bound of the read is min(_next_row.position(), _upper_bound)
// - _next_row_in_range = _next_row.position() < _upper_bound
// - _last_row_key contains the key of last emitted clustering_row
reading_from_underlying,
end_of_stream
};
lw_shared_ptr<partition_snapshot> _snp;
position_in_partition::tri_compare _position_cmp;
query::clustering_key_filter_ranges _ck_ranges;
query::clustering_row_ranges::const_iterator _ck_ranges_curr;
query::clustering_row_ranges::const_iterator _ck_ranges_end;
lsa_manager _lsa_manager;
stdx::optional<clustering_key> _last_row_key;
// We must be prepared to receive overlapping and out-of-order
// range tombstones. We must emit fragments with strictly monotonic positions,
// so we can't just trim such tombstones to the position of the last fragment.
// To solve that, range tombstones are first accumulated in a range_tombstone_stream
// and emitted once we have a fragment with a larger position.
range_tombstone_stream _tombstones;
// Holds the lower bound of a position range which hasn't been processed yet.
// Only fragments with positions < _lower_bound have been emitted.
position_in_partition _lower_bound;
position_in_partition_view _upper_bound;
state _state = state::before_static_row;
lw_shared_ptr<read_context> _read_context;
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
future<> do_fill_buffer();
void copy_from_cache_to_buffer();
future<> process_static_row();
void move_to_end();
void move_to_next_range();
void move_to_current_range();
void move_to_next_entry();
// Emits all delayed range tombstones with positions smaller than upper_bound.
void drain_tombstones(position_in_partition_view upper_bound);
// Emits all delayed range tombstones.
void drain_tombstones();
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
void add_to_buffer(mutation_fragment&&);
future<> read_from_underlying();
future<> start_reading_from_underlying();
bool after_current_range(position_in_partition_view position);
bool can_populate() const;
void maybe_update_continuity();
void maybe_add_to_cache(const mutation_fragment& mf);
void maybe_add_to_cache(const clustering_row& cr);
void maybe_add_to_cache(const range_tombstone& rt);
void maybe_add_to_cache(const static_row& sr);
void maybe_set_static_row_continuous();
public:
cache_streamed_mutation(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges&& crr,
lw_shared_ptr<read_context> ctx,
lw_shared_ptr<partition_snapshot> snp,
row_cache& cache)
: streamed_mutation::impl(std::move(s), dk, snp->partition_tombstone())
, _snp(std::move(snp))
, _position_cmp(*_schema)
, _ck_ranges(std::move(crr))
, _ck_ranges_curr(_ck_ranges.begin())
, _ck_ranges_end(_ck_ranges.end())
, _lsa_manager(cache)
, _tombstones(*_schema)
, _lower_bound(position_in_partition::before_all_clustered_rows())
, _upper_bound(position_in_partition_view::before_all_clustered_rows())
, _read_context(std::move(ctx))
, _next_row(*_schema, cache._tracker.region(), *_snp)
{ }
cache_streamed_mutation(const cache_streamed_mutation&) = delete;
cache_streamed_mutation(cache_streamed_mutation&&) = delete;
virtual future<> fill_buffer() override;
virtual ~cache_streamed_mutation() {
maybe_merge_versions(_snp, _lsa_manager.region(), _lsa_manager.read_section());
}
};
inline
future<> cache_streamed_mutation::process_static_row() {
if (_snp->version()->partition().static_row_continuous()) {
_read_context->cache().on_row_hit();
row sr = _lsa_manager.run_in_read_section([this] {
return _snp->static_row();
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(static_row(std::move(sr))));
}
return make_ready_future<>();
} else {
_read_context->cache().on_row_miss();
return _read_context->get_next_fragment().then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
push_mutation_fragment(std::move(*sr));
}
maybe_set_static_row_continuous();
});
}
}
inline
future<> cache_streamed_mutation::fill_buffer() {
if (_state == state::before_static_row) {
auto after_static_row = [this] {
if (_ck_ranges_curr == _ck_ranges_end) {
_end_of_stream = true;
_state = state::end_of_stream;
return make_ready_future<>();
}
_state = state::reading_from_cache;
_lsa_manager.run_in_read_section([this] {
move_to_current_range();
});
return fill_buffer();
};
if (_schema->has_static_columns()) {
return process_static_row().then(std::move(after_static_row));
} else {
return after_static_row();
}
}
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this] {
return do_fill_buffer();
});
}
inline
future<> cache_streamed_mutation::do_fill_buffer() {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}).then([this] {
return read_from_underlying();
});
}
if (_state == state::reading_from_underlying) {
return read_from_underlying();
}
// assert(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
auto same_pos = _next_row.maybe_refresh();
// FIXME: If continuity changed anywhere between _lower_bound and _next_row.position()
// we need to redo the lookup with _lower_bound. There is no eviction yet, so not yet a problem.
assert(same_pos);
while (!is_buffer_full() && _state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
break;
}
}
return make_ready_future<>();
});
}
inline
future<> cache_streamed_mutation::read_from_underlying() {
return consume_mutation_fragments_until(_read_context->get_streamed_mutation(),
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();
maybe_add_to_cache(mf);
add_to_buffer(std::move(mf));
},
[this] {
_state = state::reading_from_cache;
_lsa_manager.run_in_update_section([this] {
auto same_pos = _next_row.maybe_refresh();
assert(same_pos); // FIXME: handle eviction
if (_next_row_in_range) {
maybe_update_continuity();
add_to_buffer(_next_row);
move_to_next_entry();
} else {
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else {
// FIXME: Insert dummy entry at _upper_bound.
_read_context->cache().on_mispopulate();
}
move_to_next_range();
}
});
return make_ready_future<>();
});
}
inline
void cache_streamed_mutation::maybe_update_continuity() {
if (can_populate() && _next_row.is_in_latest_version()) {
if (_last_row_key) {
if (_next_row.previous_row_in_latest_version_has_key(*_last_row_key)) {
_next_row.set_continuous(true);
}
} else if (!_ck_ranges_curr->start()) {
_next_row.set_continuous(true);
}
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const mutation_fragment& mf) {
if (mf.is_range_tombstone()) {
maybe_add_to_cache(mf.as_range_tombstone());
} else {
assert(mf.is_clustering_row());
const clustering_row& cr = mf.as_clustering_row();
maybe_add_to_cache(cr);
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_read_context->cache().on_mispopulate();
return;
}
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
mutation_partition& mp = _snp->version()->partition();
rows_entry::compare less(*_schema);
// FIXME: If _next_row is up to date, but latest version doesn't have iterator in
// current row (could be far away, so we'd do this often), then this will do
// the lookup in mp. This is not necessary, because _next_row has iterators for
// next rows in each version, even if they're not part of the current row.
// They're currently buried in the heap, but you could keep a vector of
// iterators per each version in addition to the heap.
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.has_valid_row_from_latest_version()
? _next_row.get_iterator_in_latest_version() : mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
if (insert_result.second) {
_read_context->cache().on_row_insert();
new_entry.release();
}
it = insert_result.first;
rows_entry& e = *it;
if (_last_row_key) {
if (it == mp.clustered_rows().begin()) {
// FIXME: check whether entry for _last_row_key is in older versions and if so set
// continuity to true.
_read_context->cache().on_mispopulate();
} else {
auto prev_it = it;
--prev_it;
clustering_key_prefix::equality eq(*_schema);
if (eq(*_last_row_key, prev_it->key())) {
e.set_continuous(true);
}
}
} else if (!_ck_ranges_curr->start()) {
e.set_continuous(true);
} else {
// FIXME: Insert dummy entry at _ck_ranges_curr->start()
_read_context->cache().on_mispopulate();
}
});
}
inline
bool cache_streamed_mutation::after_current_range(position_in_partition_view p) {
return _position_cmp(p, _upper_bound) >= 0;
}
inline
future<> cache_streamed_mutation::start_reading_from_underlying() {
_state = state::move_to_underlying;
return make_ready_future<>();
}
inline
void cache_streamed_mutation::copy_from_cache_to_buffer() {
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto&& rts : _snp->range_tombstones(*_schema, _lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
add_to_buffer(std::move(rts));
if (is_buffer_full()) {
return;
}
}
if (_next_row_in_range) {
add_to_buffer(_next_row);
move_to_next_entry();
} else {
move_to_next_range();
}
}
inline
void cache_streamed_mutation::move_to_end() {
drain_tombstones();
_end_of_stream = true;
_state = state::end_of_stream;
}
inline
void cache_streamed_mutation::move_to_next_range() {
++_ck_ranges_curr;
if (_ck_ranges_curr == _ck_ranges_end) {
move_to_end();
} else {
move_to_current_range();
}
}
inline
void cache_streamed_mutation::move_to_current_range() {
_last_row_key = std::experimental::nullopt;
_lower_bound = position_in_partition::for_range_start(*_ck_ranges_curr);
_upper_bound = position_in_partition_view::for_range_end(*_ck_ranges_curr);
auto complete_until_next = _next_row.advance_to(_lower_bound) || _next_row.continuous();
_next_row_in_range = !after_current_range(_next_row.position());
if (!complete_until_next) {
start_reading_from_underlying();
}
}
// _next_row must be inside the range.
inline
void cache_streamed_mutation::move_to_next_entry() {
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
} else {
if (!_next_row.next()) {
move_to_end();
return;
}
_next_row_in_range = !after_current_range(_next_row.position());
if (!_next_row.continuous()) {
start_reading_from_underlying();
}
}
}
inline
void cache_streamed_mutation::drain_tombstones(position_in_partition_view pos) {
while (auto mfo = _tombstones.get_next(pos)) {
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_streamed_mutation::drain_tombstones() {
while (auto mfo = _tombstones.get_next()) {
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_streamed_mutation::add_to_buffer(mutation_fragment&& mf) {
if (mf.is_clustering_row()) {
add_clustering_row_to_buffer(std::move(mf));
} else {
assert(mf.is_range_tombstone());
add_to_buffer(std::move(mf).as_range_tombstone());
}
}
inline
void cache_streamed_mutation::add_to_buffer(const partition_snapshot_row_cursor& row) {
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(row.row());
}
}
inline
void cache_streamed_mutation::add_clustering_row_to_buffer(mutation_fragment&& mf) {
auto& row = mf.as_clustering_row();
drain_tombstones(row.position());
_last_row_key = row.key();
_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
}
inline
void cache_streamed_mutation::add_to_buffer(range_tombstone&& rt) {
// This guarantees that rt starts after any emitted clustering_row
if (!rt.trim_front(*_schema, _lower_bound)) {
return;
}
_lower_bound = position_in_partition(rt.position());
_tombstones.apply(std::move(rt));
drain_tombstones(_lower_bound);
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const range_tombstone& rt) {
if (can_populate()) {
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().row_tombstones().apply_monotonically(*_schema, rt);
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const static_row& sr) {
if (can_populate()) {
_read_context->cache().on_row_insert();
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().static_row().apply(*_schema, column_kind::static_column, sr.cells());
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_set_static_row_continuous() {
if (can_populate()) {
_snp->version()->partition().set_static_row_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
}
inline
bool cache_streamed_mutation::can_populate() const {
return _snp->at_latest_version() && _read_context->cache().phase_of(_read_context->key()) == _read_context->phase();
}
} // namespace cache
inline streamed_mutation make_cache_streamed_mutation(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
lw_shared_ptr<cache::read_context> ctx,
lw_shared_ptr<partition_snapshot> snp)
{
return make_streamed_mutation<cache::cache_streamed_mutation>(
std::move(s), std::move(dk), std::move(crr), std::move(ctx), std::move(snp), cache);
}


@@ -75,7 +75,7 @@ mutation canonical_mutation::to_mutation(schema_ptr s) const {
auto version = mv.schema_version();
auto pk = mv.key();
mutation m(std::move(pk), std::move(s));
mutation m(std::move(s), std::move(pk));
if (version == m.schema()->version()) {
auto partition_view = mutation_partition_view::from_view(mv.partition());


@@ -39,9 +39,11 @@ using small_vector = std::vector<T>;
#endif
#include "fnv1a_hasher.hh"
#include "streamed_mutation.hh"
#include "mutation_fragment.hh"
#include "mutation_partition.hh"
#include "db/timeout_clock.hh"
class cells_range {
using ids_vector_type = small_vector<column_id, 5>;
@@ -142,11 +144,7 @@ struct cell_locker_stats {
};
class cell_locker {
public:
using timeout_clock = lowres_clock;
private:
using semaphore_type = basic_semaphore<default_timeout_exception_factory, timeout_clock>;
class partition_entry;
struct cell_address {
@@ -158,7 +156,7 @@ private:
public enable_lw_shared_from_this<cell_entry> {
partition_entry& _parent;
cell_address _address;
semaphore_type _semaphore { 0 };
db::timeout_semaphore _semaphore { 0 };
friend class cell_locker;
public:
@@ -187,7 +185,7 @@ private:
return _address.position;
}
future<> lock(timeout_clock::time_point _timeout) {
future<> lock(db::timeout_clock::time_point _timeout) {
return _semaphore.wait(_timeout);
}
void unlock() {
@@ -387,7 +385,7 @@ public:
// partition_cells_range is required to be in cell_locker::schema()
future<std::vector<locked_cell>> lock_cells(const dht::decorated_key& dk, partition_cells_range&& range,
timeout_clock::time_point timeout);
db::timeout_clock::time_point timeout);
};
@@ -416,7 +414,7 @@ struct cell_locker::locker {
partition_cells_range::iterator _current_ck;
cells_range::const_iterator _current_cell;
timeout_clock::time_point _timeout;
db::timeout_clock::time_point _timeout;
std::vector<locked_cell> _locks;
cell_locker_stats& _stats;
private:
@@ -430,7 +428,7 @@ private:
bool is_done() const { return _current_ck == _range.end(); }
public:
explicit locker(const ::schema& s, cell_locker_stats& st, partition_entry& pe, partition_cells_range&& range, timeout_clock::time_point timeout)
explicit locker(const ::schema& s, cell_locker_stats& st, partition_entry& pe, partition_cells_range&& range, db::timeout_clock::time_point timeout)
: _hasher(s)
, _eq_cmp(s)
, _partition_entry(pe)
@@ -458,7 +456,7 @@ public:
};
inline
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range, timeout_clock::time_point timeout) {
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range, db::timeout_clock::time_point timeout) {
partition_entry::hasher pe_hash;
partition_entry::equal_compare pe_eq(*_schema);


@@ -130,7 +130,7 @@ inline file make_checked_file(const io_error_handler& error_handler, file f)
future<file>
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags,
file_open_options options)
file_open_options options = {})
{
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags, options).then([&] (file f) {
@@ -139,17 +139,6 @@ inline open_checked_file_dma(const io_error_handler& error_handler,
});
}
future<file>
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags)
{
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags).then([&] (file f) {
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_directory(const io_error_handler& error_handler,
sstring name)
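
The hunk above folds the two `open_checked_file_dma` overloads into one by giving `options` a default argument of `{}`. A minimal sketch of that pattern follows; `open_options` and `describe_open` are hypothetical stand-ins for illustration, not the Seastar or Scylla types.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for file_open_options; not the real Seastar type.
struct open_options {
    bool extent_hint = false;
};

// A single function with a defaulted trailing parameter covers both call
// shapes (with and without options), so the second overload is unnecessary.
inline std::string describe_open(const std::string& name, open_options options = {}) {
    return name + (options.extent_hint ? " [hinted]" : " [default]");
}
```

Callers that previously used the shorter overload keep compiling unchanged, because the omitted argument is value-initialized at the call site.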


@@ -42,17 +42,6 @@ std::ostream& operator<<(std::ostream& out, const bound_kind k);
bound_kind invert_kind(bound_kind k);
int32_t weight(bound_kind k);
static inline bound_kind flip_bound_kind(bound_kind bk)
{
switch (bk) {
case bound_kind::excl_end: return bound_kind::excl_start;
case bound_kind::incl_end: return bound_kind::incl_start;
case bound_kind::excl_start: return bound_kind::excl_end;
case bound_kind::incl_start: return bound_kind::incl_end;
}
abort();
}
class bound_view {
public:
const static thread_local clustering_key empty_prefix;


@@ -25,7 +25,7 @@
#include "schema.hh"
#include "query-request.hh"
#include "streamed_mutation.hh"
#include "mutation_fragment.hh"
// Utility for in-order checking of overlap with position ranges.
class clustering_ranges_walker {
@@ -169,14 +169,14 @@ public:
bool contains_tombstone(position_in_partition_view start, position_in_partition_view end) const {
position_in_partition::less_compare less(_schema);
if (_trim && less(end, *_trim)) {
if (_trim && !less(*_trim, end)) {
return false;
}
auto i = _current;
while (i != _end) {
auto range_start = position_in_partition_view::for_range_start(*i);
if (less(end, range_start)) {
if (!less(range_start, end)) {
return false;
}
auto range_end = position_in_partition_view::for_range_end(*i);
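
The two replaced conditions in this hunk turn a strict comparison into a non-strict one: `less(end, x)` is true only when `end` is strictly before `x`, while `!less(x, end)` is also true when the two positions are equal. A minimal sketch with plain integers standing in for `position_in_partition_view` (the helper names are hypothetical and chosen only for illustration):

```cpp
#include <cassert>
#include <functional>

// Integers stand in for positions and std::less<int> stands in for
// position_in_partition::less_compare. Both helper names are hypothetical.

// Old check: true only when `end` is strictly before `range_start`.
inline bool ends_before_range_old(int end, int range_start) {
    std::less<int> less;
    return less(end, range_start);
}

// New check: true when `end` is before or equal to `range_start`.
inline bool ends_before_range_new(int end, int range_start) {
    std::less<int> less;
    return !less(range_start, end);
}
```

The two versions differ only in the boundary case where `end` equals `range_start`, which the new form treats as "no overlap".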

3
coding-style.md Normal file

@@ -0,0 +1,3 @@
# Scylla Coding Style
Please see the [Seastar style document](https://github.com/scylladb/seastar/blob/master/coding-style.md).


@@ -21,6 +21,10 @@
#pragma once
#include "sstables/shared_sstable.hh"
#include "exceptions/exceptions.hh"
#include "sstables/compaction_backlog_manager.hh"
class column_family;
class schema;
using schema_ptr = lw_shared_ptr<const schema>;
@@ -33,6 +37,7 @@ enum class compaction_strategy_type {
size_tiered,
leveled,
date_tiered,
time_window,
};
class compaction_strategy_impl;
@@ -53,13 +58,13 @@ public:
compaction_strategy& operator=(compaction_strategy&&);
// Return a list of sstables to be compacted after applying the strategy.
compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<lw_shared_ptr<sstable>> candidates);
compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<shared_sstable> candidates);
std::vector<resharding_descriptor> get_resharding_jobs(column_family& cf, std::vector<lw_shared_ptr<sstable>> candidates);
std::vector<resharding_descriptor> get_resharding_jobs(column_family& cf, std::vector<shared_sstable> candidates);
// Some strategies may look at the compacted and resulting sstables to
// get some useful information for subsequent compactions.
void notify_completion(const std::vector<lw_shared_ptr<sstable>>& removed, const std::vector<lw_shared_ptr<sstable>>& added);
void notify_completion(const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added);
// Return if parallel compaction is allowed by strategy.
bool parallel_compaction() const;
@@ -82,6 +87,8 @@ public:
return "LeveledCompactionStrategy";
case compaction_strategy_type::date_tiered:
return "DateTieredCompactionStrategy";
case compaction_strategy_type::time_window:
return "TimeWindowCompactionStrategy";
default:
throw std::runtime_error("Invalid Compaction Strategy");
}
@@ -100,6 +107,8 @@ public:
return compaction_strategy_type::leveled;
} else if (short_name == "DateTieredCompactionStrategy") {
return compaction_strategy_type::date_tiered;
} else if (short_name == "TimeWindowCompactionStrategy") {
return compaction_strategy_type::time_window;
} else {
throw exceptions::configuration_exception(sprint("Unable to find compaction strategy class '%s'", name));
}
@@ -112,6 +121,8 @@ public:
}
sstable_set make_sstable_set(schema_ptr schema) const;
compaction_backlog_tracker& get_backlog_tracker();
};
// Creates a compaction_strategy object from one of the strategies available.


@@ -28,6 +28,7 @@
#include <boost/range/iterator_range.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include "utils/serialization.hh"
#include "util/backtrace.hh"
#include "unimplemented.hh"
enum class allow_prefixes { no, yes };
@@ -144,7 +145,7 @@ public:
}
len = read_simple<size_type>(_v);
if (_v.size() < len) {
throw marshal_exception();
throw_with_backtrace<marshal_exception>(sprint("compound_type iterator - not enough bytes, expected %d, got %d", len, _v.size()));
}
}
_current = bytes_view(_v.begin(), len);


@@ -345,7 +345,7 @@ public:
}
len = read_simple<size_type>(_v);
if (_v.size() < len) {
throw marshal_exception();
throw_with_backtrace<marshal_exception>(sprint("composite iterator - not enough bytes, expected %d, got %d", len, _v.size()));
}
}
auto value = bytes_view(_v.begin(), len);

345
compress.cc Normal file

@@ -0,0 +1,345 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <lz4.h>
#include <zlib.h>
#include <snappy-c.h>
#include "compress.hh"
#include "utils/class_registrator.hh"
const sstring compressor::namespace_prefix = "org.apache.cassandra.io.compress.";
class lz4_processor: public compressor {
public:
using compressor::compressor;
size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress_max_size(size_t input_len) const override;
};
class snappy_processor: public compressor {
public:
using compressor::compressor;
size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress_max_size(size_t input_len) const override;
};
class deflate_processor: public compressor {
public:
using compressor::compressor;
size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const override;
size_t compress_max_size(size_t input_len) const override;
};
compressor::compressor(sstring name)
: _name(std::move(name))
{}
std::set<sstring> compressor::option_names() const {
return {};
}
std::map<sstring, sstring> compressor::options() const {
return {};
}
shared_ptr<compressor> compressor::create(const sstring& name, const opt_getter& opts) {
if (name.empty()) {
return {};
}
qualified_name qn(namespace_prefix, name);
for (auto& c : { lz4, snappy, deflate }) {
if (c->name() == qn) {
return c;
}
}
return compressor_registry::create(qn, opts);
}
shared_ptr<compressor> compressor::create(const std::map<sstring, sstring>& options) {
auto i = options.find(compression_parameters::SSTABLE_COMPRESSION);
if (i != options.end() && !i->second.empty()) {
return create(i->second, [&options](const sstring& key) -> opt_string {
auto i = options.find(key);
if (i == options.end()) {
return std::experimental::nullopt;
}
return { i->second };
});
}
return {};
}
thread_local const shared_ptr<compressor> compressor::lz4 = make_shared<lz4_processor>(namespace_prefix + "LZ4Compressor");
thread_local const shared_ptr<compressor> compressor::snappy = make_shared<snappy_processor>(namespace_prefix + "SnappyCompressor");
thread_local const shared_ptr<compressor> compressor::deflate = make_shared<deflate_processor>(namespace_prefix + "DeflateCompressor");
const sstring compression_parameters::SSTABLE_COMPRESSION = "sstable_compression";
const sstring compression_parameters::CHUNK_LENGTH_KB = "chunk_length_kb";
const sstring compression_parameters::CRC_CHECK_CHANCE = "crc_check_chance";
compression_parameters::compression_parameters()
: compression_parameters(nullptr)
{}
compression_parameters::~compression_parameters()
{}
compression_parameters::compression_parameters(compressor_ptr c)
: _compressor(std::move(c))
{}
compression_parameters::compression_parameters(const std::map<sstring, sstring>& options) {
_compressor = compressor::create(options);
validate_options(options);
auto chunk_length = options.find(CHUNK_LENGTH_KB);
if (chunk_length != options.end()) {
try {
_chunk_length = std::stoi(chunk_length->second) * 1024;
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid integer value ") + chunk_length->second + " for " + CHUNK_LENGTH_KB);
}
}
auto crc_chance = options.find(CRC_CHECK_CHANCE);
if (crc_chance != options.end()) {
try {
_crc_check_chance = std::stod(crc_chance->second);
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid double value ") + crc_chance->second + " for " + CRC_CHECK_CHANCE);
}
}
}
void compression_parameters::validate() {
if (_chunk_length) {
auto chunk_length = _chunk_length.value();
if (chunk_length <= 0) {
throw exceptions::configuration_exception(sstring("Invalid negative or null ") + CHUNK_LENGTH_KB);
}
// _chunk_length must be a power of two
if (chunk_length & (chunk_length - 1)) {
throw exceptions::configuration_exception(sstring(CHUNK_LENGTH_KB) + " must be a power of 2.");
}
}
if (_crc_check_chance && (_crc_check_chance.value() < 0.0 || _crc_check_chance.value() > 1.0)) {
throw exceptions::configuration_exception(sstring(CRC_CHECK_CHANCE) + " must be between 0.0 and 1.0.");
}
}
std::map<sstring, sstring> compression_parameters::get_options() const {
if (!_compressor) {
return std::map<sstring, sstring>();
}
auto opts = _compressor->options();
opts.emplace(compression_parameters::SSTABLE_COMPRESSION, _compressor->name());
if (_chunk_length) {
opts.emplace(sstring(CHUNK_LENGTH_KB), std::to_string(_chunk_length.value() / 1024));
}
if (_crc_check_chance) {
opts.emplace(sstring(CRC_CHECK_CHANCE), std::to_string(_crc_check_chance.value()));
}
return opts;
}
bool compression_parameters::operator==(const compression_parameters& other) const {
return _compressor == other._compressor
&& _chunk_length == other._chunk_length
&& _crc_check_chance == other._crc_check_chance;
}
bool compression_parameters::operator!=(const compression_parameters& other) const {
return !(*this == other);
}
void compression_parameters::validate_options(const std::map<sstring, sstring>& options) {
// currently, there are no options specific to a particular compressor
static std::set<sstring> keywords({
sstring(SSTABLE_COMPRESSION),
sstring(CHUNK_LENGTH_KB),
sstring(CRC_CHECK_CHANCE),
});
std::set<sstring> ckw;
if (_compressor) {
ckw = _compressor->option_names();
}
for (auto&& opt : options) {
if (!keywords.count(opt.first) && !ckw.count(opt.first)) {
throw exceptions::configuration_exception(sprint("Unknown compression option '%s'.", opt.first));
}
}
}
size_t lz4_processor::uncompress(const char* input, size_t input_len,
char* output, size_t output_len) const {
// We use LZ4_decompress_safe(). According to the documentation, the
// function LZ4_decompress_fast() is slightly faster, but maliciously
// crafted compressed data can cause it to overflow the output buffer.
// Theoretically, our compressed data is created by us so is not malicious
// (and accidental corruption is avoided by the compressed-data checksum),
// but let's not take that chance for now, until we've actually measured
// the performance benefit that LZ4_decompress_fast() would bring.
// Cassandra's LZ4Compressor prepends to the chunk its uncompressed length
// in 4 bytes little-endian (!) order. We don't need this information -
// we already know the uncompressed data is at most the given chunk size
// (and usually is exactly that, except in the last chunk). The advance
// knowledge of the uncompressed size could be useful if we used
// LZ4_decompress_fast(), but we prefer LZ4_decompress_safe() anyway...
input += 4;
input_len -= 4;
auto ret = LZ4_decompress_safe(input, output, input_len, output_len);
if (ret < 0) {
throw std::runtime_error("LZ4 uncompression failure");
}
return ret;
}
size_t lz4_processor::compress(const char* input, size_t input_len,
char* output, size_t output_len) const {
if (output_len < LZ4_COMPRESSBOUND(input_len) + 4) {
throw std::runtime_error("LZ4 compression failure: length of output is too small");
}
// Write input_len (32-bit data) to beginning of output in little-endian representation.
output[0] = input_len & 0xFF;
output[1] = (input_len >> 8) & 0xFF;
output[2] = (input_len >> 16) & 0xFF;
output[3] = (input_len >> 24) & 0xFF;
#ifdef HAVE_LZ4_COMPRESS_DEFAULT
auto ret = LZ4_compress_default(input, output + 4, input_len, LZ4_compressBound(input_len));
#else
auto ret = LZ4_compress(input, output + 4, input_len);
#endif
if (ret == 0) {
throw std::runtime_error("LZ4 compression failure: LZ4_compress() failed");
}
return ret + 4;
}
size_t lz4_processor::compress_max_size(size_t input_len) const {
return LZ4_COMPRESSBOUND(input_len) + 4;
}
size_t deflate_processor::uncompress(const char* input,
size_t input_len, char* output, size_t output_len) const {
z_stream zs;
zs.zalloc = Z_NULL;
zs.zfree = Z_NULL;
zs.opaque = Z_NULL;
zs.avail_in = 0;
zs.next_in = Z_NULL;
if (inflateInit(&zs) != Z_OK) {
throw std::runtime_error("deflate uncompression init failure");
}
// yuck, zlib is not const-correct, and also uses unsigned char while we use char :-(
zs.next_in = reinterpret_cast<unsigned char*>(const_cast<char*>(input));
zs.avail_in = input_len;
zs.next_out = reinterpret_cast<unsigned char*>(output);
zs.avail_out = output_len;
auto res = inflate(&zs, Z_FINISH);
inflateEnd(&zs);
if (res == Z_STREAM_END) {
return output_len - zs.avail_out;
} else {
throw std::runtime_error("deflate uncompression failure");
}
}
size_t deflate_processor::compress(const char* input,
size_t input_len, char* output, size_t output_len) const {
z_stream zs;
zs.zalloc = Z_NULL;
zs.zfree = Z_NULL;
zs.opaque = Z_NULL;
zs.avail_in = 0;
zs.next_in = Z_NULL;
if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK) {
throw std::runtime_error("deflate compression init failure");
}
zs.next_in = reinterpret_cast<unsigned char*>(const_cast<char*>(input));
zs.avail_in = input_len;
zs.next_out = reinterpret_cast<unsigned char*>(output);
zs.avail_out = output_len;
auto res = ::deflate(&zs, Z_FINISH);
deflateEnd(&zs);
if (res == Z_STREAM_END) {
return output_len - zs.avail_out;
} else {
throw std::runtime_error("deflate compression failure");
}
}
size_t deflate_processor::compress_max_size(size_t input_len) const {
z_stream zs;
zs.zalloc = Z_NULL;
zs.zfree = Z_NULL;
zs.opaque = Z_NULL;
zs.avail_in = 0;
zs.next_in = Z_NULL;
if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK) {
throw std::runtime_error("deflate compression init failure");
}
auto res = deflateBound(&zs, input_len);
deflateEnd(&zs);
return res;
}
size_t snappy_processor::uncompress(const char* input, size_t input_len,
char* output, size_t output_len) const {
if (snappy_uncompress(input, input_len, output, &output_len)
== SNAPPY_OK) {
return output_len;
} else {
throw std::runtime_error("snappy uncompression failure");
}
}
size_t snappy_processor::compress(const char* input, size_t input_len,
char* output, size_t output_len) const {
auto ret = snappy_compress(input, input_len, output, &output_len);
if (ret != SNAPPY_OK) {
throw std::runtime_error("snappy compression failure: snappy_compress() failed");
}
return output_len;
}
size_t snappy_processor::compress_max_size(size_t input_len) const {
return snappy_max_compressed_length(input_len);
}
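
As the comments in `lz4_processor` describe, each LZ4 chunk is prefixed with its uncompressed length as 4 little-endian bytes, which `compress()` writes and `uncompress()` skips. The byte layout can be sketched in isolation as below; `write_le32` and `read_le32` are hypothetical helper names, and the reader is shown only to illustrate the encoding, since the code above never reads the prefix back.

```cpp
#include <cassert>
#include <cstdint>

// Write a 32-bit length as 4 little-endian bytes (least significant first),
// matching the four output[0..3] assignments in lz4_processor::compress().
// Hypothetical helper, for illustration only.
inline void write_le32(char* out, std::uint32_t len) {
    out[0] = static_cast<char>(len & 0xFF);
    out[1] = static_cast<char>((len >> 8) & 0xFF);
    out[2] = static_cast<char>((len >> 16) & 0xFF);
    out[3] = static_cast<char>((len >> 24) & 0xFF);
}

// Inverse of write_le32. Each byte goes through unsigned char first so that
// a negative `char` value is not sign-extended into the result.
inline std::uint32_t read_le32(const char* in) {
    auto byte = [in](int i) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(in[i]));
    };
    return byte(0) | (byte(1) << 8) | (byte(2) << 16) | (byte(3) << 24);
}
```

This prefix is why `compress_max_size()` adds 4 to the LZ4 bound and why `uncompress()` advances `input` by 4 before decompressing.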


@@ -21,135 +21,103 @@
#pragma once
#include "exceptions/exceptions.hh"
#include <map>
#include <set>
enum class compressor {
none,
lz4,
snappy,
deflate,
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "exceptions/exceptions.hh"
#include "stdx.hh"
class compressor {
sstring _name;
public:
compressor(sstring);
virtual ~compressor() {}
/**
* Unpacks data in "input" to output. If output_len is of insufficient size,
* an exception is thrown, so callers should keep track of the uncompressed size.
*/
virtual size_t uncompress(const char* input, size_t input_len, char* output,
size_t output_len) const = 0;
/**
* Packs data in "input" to output. If output_len is of insufficient size,
* an exception is thrown. The maximum required size is obtained via "compress_max_size".
*/
virtual size_t compress(const char* input, size_t input_len, char* output,
size_t output_len) const = 0;
/**
* Returns the maximum output size for compressing data of "input_len" size.
*/
virtual size_t compress_max_size(size_t input_len) const = 0;
/**
* Returns accepted option names for this compressor
*/
virtual std::set<sstring> option_names() const;
/**
* Returns original options used in instantiating this compressor
*/
virtual std::map<sstring, sstring> options() const;
/**
* Compressor class name.
*/
const sstring& name() const {
return _name;
}
// to cheaply bridge sstable compression options / maps
using opt_string = stdx::optional<sstring>;
using opt_getter = std::function<opt_string(const sstring&)>;
static shared_ptr<compressor> create(const sstring& name, const opt_getter&);
static shared_ptr<compressor> create(const std::map<sstring, sstring>&);
static thread_local const shared_ptr<compressor> lz4;
static thread_local const shared_ptr<compressor> snappy;
static thread_local const shared_ptr<compressor> deflate;
static const sstring namespace_prefix;
};
template<typename BaseType, typename... Args>
class class_registry;
using compressor_ptr = shared_ptr<compressor>;
using compressor_registry = class_registry<compressor_ptr, const typename compressor::opt_getter&>;
class compression_parameters {
public:
static constexpr int32_t DEFAULT_CHUNK_LENGTH = 4 * 1024;
static constexpr double DEFAULT_CRC_CHECK_CHANCE = 1.0;
static constexpr auto SSTABLE_COMPRESSION = "sstable_compression";
static constexpr auto CHUNK_LENGTH_KB = "chunk_length_kb";
static constexpr auto CRC_CHECK_CHANCE = "crc_check_chance";
static const sstring SSTABLE_COMPRESSION;
static const sstring CHUNK_LENGTH_KB;
static const sstring CRC_CHECK_CHANCE;
private:
compressor _compressor;
compressor_ptr _compressor;
std::experimental::optional<int> _chunk_length;
std::experimental::optional<double> _crc_check_chance;
public:
compression_parameters(compressor c = compressor::lz4) : _compressor(c) { }
compression_parameters(const std::map<sstring, sstring>& options) {
validate_options(options);
compression_parameters();
compression_parameters(compressor_ptr);
compression_parameters(const std::map<sstring, sstring>& options);
~compression_parameters();
auto it = options.find(SSTABLE_COMPRESSION);
if (it == options.end() || it->second.empty()) {
_compressor = compressor::none;
return;
}
const auto& compressor_class = it->second;
if (is_compressor_class(compressor_class, "LZ4Compressor")) {
_compressor = compressor::lz4;
} else if (is_compressor_class(compressor_class, "SnappyCompressor")) {
_compressor = compressor::snappy;
} else if (is_compressor_class(compressor_class, "DeflateCompressor")) {
_compressor = compressor::deflate;
} else {
throw exceptions::configuration_exception(sstring("Unsupported compression class '") + compressor_class + "'.");
}
auto chunk_length = options.find(CHUNK_LENGTH_KB);
if (chunk_length != options.end()) {
try {
_chunk_length = std::stoi(chunk_length->second) * 1024;
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid integer value ") + chunk_length->second + " for " + CHUNK_LENGTH_KB);
}
}
auto crc_chance = options.find(CRC_CHECK_CHANCE);
if (crc_chance != options.end()) {
try {
_crc_check_chance = std::stod(crc_chance->second);
} catch (const std::exception& e) {
throw exceptions::syntax_exception(sstring("Invalid double value ") + crc_chance->second + "for " + CRC_CHECK_CHANCE);
}
}
}
compressor get_compressor() const { return _compressor; }
compressor_ptr get_compressor() const { return _compressor; }
int32_t chunk_length() const { return _chunk_length.value_or(int(DEFAULT_CHUNK_LENGTH)); }
double crc_check_chance() const { return _crc_check_chance.value_or(double(DEFAULT_CRC_CHECK_CHANCE)); }
void validate() {
if (_chunk_length) {
auto chunk_length = _chunk_length.value();
if (chunk_length <= 0) {
throw exceptions::configuration_exception(sstring("Invalid negative or null ") + CHUNK_LENGTH_KB);
}
// _chunk_length must be a power of two
if (chunk_length & (chunk_length - 1)) {
throw exceptions::configuration_exception(sstring(CHUNK_LENGTH_KB) + " must be a power of 2.");
}
}
if (_crc_check_chance && (_crc_check_chance.value() < 0.0 || _crc_check_chance.value() > 1.0)) {
throw exceptions::configuration_exception(sstring(CRC_CHECK_CHANCE) + " must be between 0.0 and 1.0.");
}
}
std::map<sstring, sstring> get_options() const {
if (_compressor == compressor::none) {
return std::map<sstring, sstring>();
}
std::map<sstring, sstring> opts;
opts.emplace(sstring(SSTABLE_COMPRESSION), compressor_name());
if (_chunk_length) {
opts.emplace(sstring(CHUNK_LENGTH_KB), std::to_string(_chunk_length.value() / 1024));
}
if (_crc_check_chance) {
opts.emplace(sstring(CRC_CHECK_CHANCE), std::to_string(_crc_check_chance.value()));
}
return opts;
}
bool operator==(const compression_parameters& other) const {
return _compressor == other._compressor
&& _chunk_length == other._chunk_length
&& _crc_check_chance == other._crc_check_chance;
}
bool operator!=(const compression_parameters& other) const {
return !(*this == other);
}
void validate();
std::map<sstring, sstring> get_options() const;
bool operator==(const compression_parameters& other) const;
bool operator!=(const compression_parameters& other) const;
private:
void validate_options(const std::map<sstring, sstring>& options) {
// currently, there are no options specific to a particular compressor
static std::set<sstring> keywords({
sstring(SSTABLE_COMPRESSION),
sstring(CHUNK_LENGTH_KB),
sstring(CRC_CHECK_CHANCE),
});
for (auto&& opt : options) {
if (!keywords.count(opt.first)) {
throw exceptions::configuration_exception(sprint("Unknown compression option '%s'.", opt.first));
}
}
}
bool is_compressor_class(const sstring& value, const sstring& class_name) {
static const sstring namespace_prefix = "org.apache.cassandra.io.compress.";
return value == class_name || value == namespace_prefix + class_name;
}
sstring compressor_name() const {
switch (_compressor) {
case compressor::lz4:
return "org.apache.cassandra.io.compress.LZ4Compressor";
case compressor::snappy:
return "org.apache.cassandra.io.compress.SnappyCompressor";
case compressor::deflate:
return "org.apache.cassandra.io.compress.DeflateCompressor";
default:
abort();
}
}
void validate_options(const std::map<sstring, sstring>&);
};
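
The checks performed by validate() in the hunk above can be restated as a small stand-alone sketch. This is only an illustration in Python, not code from the repository: the function name validate_compression, the use of ValueError, and the message wording are assumptions made for the example. The power-of-two test relies on the same bit trick as the C++ code: a positive integer n is a power of two exactly when n & (n - 1) is zero.

```python
def validate_compression(chunk_length=None, crc_check_chance=None):
    """Hypothetical restatement of the checks in compression_parameters::validate().

    None stands in for a disengaged optional (option not set).
    """
    if chunk_length is not None:
        if chunk_length <= 0:
            raise ValueError("chunk length must be positive")
        # A power of two has exactly one bit set, so clearing the lowest
        # set bit with n & (n - 1) must leave zero.
        if chunk_length & (chunk_length - 1):
            raise ValueError("chunk length must be a power of 2")
    if crc_check_chance is not None and not (0.0 <= crc_check_chance <= 1.0):
        raise ValueError("crc check chance must be between 0.0 and 1.0")


if __name__ == "__main__":
    validate_compression(chunk_length=64 * 1024, crc_check_chance=0.5)
    for bad in (0, -4096, 3000):
        try:
            validate_compression(chunk_length=bad)
        except ValueError as e:
            print(bad, "rejected:", e)
```

Unset options are skipped rather than rejected, which mirrors how the C++ code only inspects _chunk_length and _crc_check_chance when they hold a value.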


@@ -12,7 +12,9 @@
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'Test Cluster'
# It is recommended to change the default value when creating a new cluster.
# You can NOT modify this value for an existing cluster
#cluster_name: 'Test Cluster'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
@@ -85,6 +87,13 @@ listen_address: localhost
# Leaving this blank will set it to the same value as listen_address
# broadcast_address: 1.2.3.4
# When using multiple physical network interfaces, set this to true to listen on broadcast_address
# in addition to the listen_address, allowing nodes to communicate in both interfaces.
# Ignore this property if the network configuration automatically routes between the public and private networks such as EC2.
#
# listen_on_broadcast_address: false
# port for the CQL native transport to listen for clients on
# For security reasons, you should not expose this port to the internet. Firewall it if needed.
native_transport_port: 9042
@@ -98,13 +107,6 @@ native_transport_port: 9042
# keeping native_transport_port unencrypted.
#native_transport_port_ssl: 9142
# Throttles all outbound streaming file transfers on this node to the
# given total throughput in Mbps. This is necessary because Scylla does
# mostly sequential IO when streaming data during bootstrap or repair, which
# can lead to saturating the network connection and degrading rpc performance.
# When unset, the default is 200 Mbps or 25 MB/s.
# stream_throughput_outbound_megabits_per_sec: 200
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000
@@ -238,9 +240,8 @@ batch_size_fail_threshold_in_kb: 50
# Uncomment to enable experimental features
# experimental: true
###################################################
## Not currently supported, reserved for future use
###################################################
# The directory where hints files are stored if hinted handoff is enabled.
# hints_directory: /var/lib/scylla/hints
# See http://wiki.apache.org/cassandra/HintedHandoff
# May either be "true" or "false" to enable globally, or contain a list
@@ -264,23 +265,27 @@ batch_size_fail_threshold_in_kb: 50
# cross-dc handoff tends to be slower
# max_hints_delivery_threads: 2
###################################################
## Not currently supported, reserved for future use
###################################################
# Maximum throttle in KBs per second, total. This will be
# reduced proportionally to the number of nodes in the cluster.
# batchlog_replay_throttle_in_kb: 1024
# Validity period for permissions cache (fetching permissions can be an
# expensive operation depending on the authorizer, CassandraAuthorizer is
# one example). Defaults to 2000, set to 0 to disable.
# one example). Defaults to 10000, set to 0 to disable.
# Will be disabled automatically for AllowAllAuthorizer.
# permissions_validity_in_ms: 2000
# permissions_validity_in_ms: 10000
# Refresh interval for permissions cache (if enabled).
# After this interval, cache entries become eligible for refresh. Upon next
# access, an async reload is scheduled and the old value returned until it
# completes. If permissions_validity_in_ms is non-zero, then this must be
# also.
# Defaults to the same value as permissions_validity_in_ms.
# permissions_update_interval_in_ms: 1000
# completes. If permissions_validity_in_ms is non-zero, then this also must have
# a non-zero value. Defaults to 2000. It's recommended to set this value to
# be at least 3 times smaller than the permissions_validity_in_ms.
# permissions_update_interval_in_ms: 2000
# The partitioner is responsible for distributing groups of rows (by
# partition key) across nodes in the cluster. You should leave this
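
The relationship between the two permissions cache settings described in the comments above can be checked mechanically. The following is a minimal sketch, assuming the values have already been read from the YAML file; the helper name check_permissions_cache and the wording of the messages are hypothetical, and Scylla itself may enforce these rules differently or not at all.

```python
def check_permissions_cache(validity_ms, update_interval_ms):
    """Return a list of problems with the two settings (empty if none).

    Hypothetical helper: it only restates the rules given in the
    scylla.yaml comments and is not part of Scylla.
    """
    problems = []
    if validity_ms == 0:
        # A validity of 0 disables the cache, so the interval is irrelevant.
        return problems
    if update_interval_ms == 0:
        problems.append("permissions_update_interval_in_ms must be non-zero "
                        "when permissions_validity_in_ms is non-zero")
    elif update_interval_ms * 3 > validity_ms:
        problems.append("permissions_update_interval_in_ms is recommended to be "
                        "at least 3 times smaller than permissions_validity_in_ms")
    return problems


if __name__ == "__main__":
    # The new defaults from the diff: 10000 ms validity, 2000 ms refresh.
    print(check_permissions_cache(10000, 2000))
    # The old commented-out values: 2000 ms validity, 1000 ms refresh.
    print(check_permissions_cache(2000, 1000))
```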


@@ -20,9 +20,11 @@
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
#
import os, os.path, textwrap, argparse, sys, shlex, subprocess, tempfile, re
import os, os.path, textwrap, argparse, sys, shlex, subprocess, tempfile, re, platform
from distutils.spawn import find_executable
tempfile.tempdir = "./build/tmp"
configure_args = str.join(' ', [shlex.quote(x) for x in sys.argv[1:]])
for line in open('/etc/os-release'):
@@ -34,7 +36,7 @@ for line in open('/etc/os-release'):
os_ids += value.split(' ')
# distribution "internationalization", converting package names.
# Fedora name is key, values is distro -> package name dict.
# Fedora name is key, values is distro -> package name dict.
i18n_xlat = {
'boost-devel': {
'debian': 'libboost-dev',
@@ -48,7 +50,7 @@ def pkgname(name):
for id in os_ids:
if id in dict:
return dict[id]
return name
return name
def get_flags():
with open('/proc/cpuinfo') as f:
@@ -83,17 +85,33 @@ def pkg_config(option, package):
return output.decode('utf-8').strip()
def try_compile(compiler, source = '', flags = []):
with tempfile.NamedTemporaryFile() as sfile:
sfile.file.write(bytes(source, 'utf-8'))
sfile.file.flush()
return subprocess.call([compiler, '-x', 'c++', '-o', '/dev/null', '-c', sfile.name] + flags,
stdout = subprocess.DEVNULL,
stderr = subprocess.DEVNULL) == 0
return try_compile_and_link(compiler, source, flags = flags + ['-c'])
def warning_supported(warning, compiler):
def ensure_tmp_dir_exists():
if not os.path.exists(tempfile.tempdir):
os.makedirs(tempfile.tempdir)
def try_compile_and_link(compiler, source = '', flags = []):
ensure_tmp_dir_exists()
with tempfile.NamedTemporaryFile() as sfile:
ofile = tempfile.mktemp()
try:
sfile.file.write(bytes(source, 'utf-8'))
sfile.file.flush()
# We can't write to /dev/null, since in some cases (-ftest-coverage) gcc will create an auxiliary
# output file based on the name of the output file, and "/dev/null.gcsa" is not a good name
return subprocess.call([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
stdout = subprocess.DEVNULL,
stderr = subprocess.DEVNULL) == 0
finally:
if os.path.exists(ofile):
os.unlink(ofile)
def flag_supported(flag, compiler):
# gcc ignores -Wno-x even if it is not supported
adjusted = re.sub('^-Wno-', '-W', warning)
return try_compile(flags = ['-Werror', adjusted], compiler = compiler)
adjusted = re.sub('^-Wno-', '-W', flag)
split = adjusted.split(' ')
return try_compile(flags = ['-Werror'] + split, compiler = compiler)
def debug_flag(compiler):
src_with_auto = textwrap.dedent('''\
@@ -108,6 +126,14 @@ def debug_flag(compiler):
print('Note: debug information disabled; upgrade your compiler')
return ''
def gold_supported(compiler):
src_main = 'int main(int argc, char **argv) { return 0; }'
if try_compile_and_link(source = src_main, flags = ['-fuse-ld=gold'], compiler = compiler):
return '-fuse-ld=gold'
else:
print('Note: gold not found; using default system linker')
return ''
def maybe_static(flag, libs):
if flag and not args.static:
libs = '-Wl,-Bstatic {} -Wl,-Bdynamic'.format(libs)
@@ -133,6 +159,13 @@ class Thrift(object):
def endswith(self, end):
return self.source.endswith(end)
def default_target_arch():
mach = platform.machine()
if platform.machine() in ['i386', 'i686', 'x86_64']:
return 'nehalem'
else:
return ''
class Antlr3Grammar(object):
def __init__(self, source):
self.source = source
@@ -154,20 +187,22 @@ modes = {
'debug': {
'sanitize': '-fsanitize=address -fsanitize=leak -fsanitize=undefined',
'sanitize_libs': '-lasan -lubsan',
'opt': '-O0 -DDEBUG -DDEBUG_SHARED_PTR -DDEFAULT_ALLOCATOR',
'opt': '-O0 -DDEBUG -DDEBUG_SHARED_PTR -DDEFAULT_ALLOCATOR -DDEBUG_LSA_SANITIZER',
'libs': '',
},
'release': {
'sanitize': '',
'sanitize_libs': '',
'opt': '-O2',
'opt': '-O3',
'libs': '',
},
}
scylla_tests = [
'tests/mutation_test',
'tests/streamed_mutation_test',
'tests/mvcc_test',
'tests/mutation_fragment_test',
'tests/flat_mutation_reader_test',
'tests/schema_registry_test',
'tests/canonical_mutation_test',
'tests/range_test',
@@ -176,6 +211,7 @@ scylla_tests = [
'tests/partitioner_test',
'tests/frozen_mutation_test',
'tests/serialized_action_test',
'tests/hint_test',
'tests/clustering_ranges_walker_test',
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
@@ -186,7 +222,8 @@ scylla_tests = [
'tests/perf/perf_cql_parser',
'tests/perf/perf_simple_query',
'tests/perf/perf_fast_forward',
'tests/cache_streamed_mutation_test',
'tests/perf/perf_cache_eviction',
'tests/cache_flat_mutation_reader_test',
'tests/row_cache_stress_test',
'tests/memory_footprint',
'tests/perf/perf_sstable',
@@ -212,6 +249,7 @@ scylla_tests = [
'tests/config_test',
'tests/gossiping_property_file_snitch_test',
'tests/ec2_snitch_test',
'tests/gce_snitch_test',
'tests/snitch_reset_test',
'tests/network_topology_strategy_test',
'tests/query_processor_test',
@@ -221,7 +259,7 @@ scylla_tests = [
'tests/murmur_hash_test',
'tests/allocation_strategy_test',
'tests/logalloc_test',
'tests/log_histogram_test',
'tests/log_heap_test',
'tests/managed_vector_test',
'tests/crc_test',
'tests/flush_queue_test',
@@ -233,19 +271,40 @@ scylla_tests = [
'tests/database_test',
'tests/nonwrapping_range_test',
'tests/input_stream_test',
'tests/sstable_atomic_deletion_test',
'tests/virtual_reader_test',
'tests/view_schema_test',
'tests/counter_test',
'tests/cell_locker_test',
'tests/row_locker_test',
'tests/streaming_histogram_test',
'tests/duration_test',
'tests/vint_serialization_test',
'tests/compress_test',
'tests/chunked_vector_test',
'tests/loading_cache_test',
'tests/castas_fcts_test',
'tests/big_decimal_test',
'tests/aggregate_fcts_test',
'tests/role_manager_test',
'tests/caching_options_test',
'tests/auth_resource_test',
'tests/cql_auth_query_test',
'tests/enum_set_test',
'tests/extensions_test',
'tests/cql_auth_syntax_test',
'tests/querier_cache',
'tests/querier_cache_resource_based_eviction',
]
perf_tests = [
'tests/perf/perf_mutation_readers'
]
apps = [
'scylla',
]
tests = scylla_tests
tests = scylla_tests + perf_tests
other = [
'iotune',
@@ -267,6 +326,8 @@ arg_parser.add_argument('--cflags', action = 'store', dest = 'user_cflags', defa
help = 'Extra flags for the C++ compiler')
arg_parser.add_argument('--ldflags', action = 'store', dest = 'user_ldflags', default = '',
help = 'Extra flags for the linker')
arg_parser.add_argument('--target', action = 'store', dest = 'target', default = default_target_arch(),
help = 'Target architecture (-march)')
arg_parser.add_argument('--compiler', action = 'store', dest = 'cxx', default = 'g++',
help = 'C++ compiler path')
arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='gcc',
@@ -285,6 +346,8 @@ arg_parser.add_argument('--static-thrift', dest = 'staticthrift', action = 'stor
help = 'Link libthrift statically')
arg_parser.add_argument('--static-boost', dest = 'staticboost', action = 'store_true',
help = 'Link boost statically')
arg_parser.add_argument('--static-yaml-cpp', dest = 'staticyamlcpp', action = 'store_true',
help = 'Link libyaml-cpp statically')
arg_parser.add_argument('--tests-debuginfo', action = 'store', dest = 'tests_debuginfo', type = int, default = 0,
help = 'Enable(1)/disable(0) compiler debug information generation for tests')
arg_parser.add_argument('--python', action = 'store', dest = 'python', default = 'python3',
@@ -309,7 +372,7 @@ scylla_core = (['database.cc',
'schema_registry.cc',
'bytes.cc',
'mutation.cc',
'streamed_mutation.cc',
'mutation_fragment.cc',
'partition_version.cc',
'row_cache.cc',
'canonical_mutation.cc',
@@ -320,22 +383,25 @@ scylla_core = (['database.cc',
'supervisor.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'utils/buffer_input_stream.cc',
'mutation_partition.cc',
'mutation_partition_view.cc',
'mutation_partition_serializer.cc',
'mutation_reader.cc',
'flat_mutation_reader.cc',
'mutation_query.cc',
'keys.cc',
'counters.cc',
'counters.cc',
'compress.cc',
'sstables/sstables.cc',
'sstables/compress.cc',
'sstables/row.cc',
'sstables/partition.cc',
'sstables/filter.cc',
'sstables/compaction.cc',
'sstables/compaction_strategy.cc',
'sstables/compaction_manager.cc',
'sstables/atomic_deletion.cc',
'sstables/integrity_checked_file_impl.cc',
'sstables/prepended_input_stream.cc',
'transport/event.cc',
'transport/event_notifier.cc',
'transport/server.cc',
@@ -350,6 +416,7 @@ scylla_core = (['database.cc',
'cql3/sets.cc',
'cql3/maps.cc',
'cql3/functions/functions.cc',
'cql3/functions/castas_fcts.cc',
'cql3/statements/cf_prop_defs.cc',
'cql3/statements/cf_statement.cc',
'cql3/statements/authentication_statement.cc',
@@ -357,7 +424,6 @@ scylla_core = (['database.cc',
'cql3/statements/create_table_statement.cc',
'cql3/statements/create_view_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/create_user_statement.cc',
'cql3/statements/drop_index_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
'cql3/statements/drop_table_statement.cc',
@@ -379,8 +445,6 @@ scylla_core = (['database.cc',
'cql3/statements/truncate_statement.cc',
'cql3/statements/alter_table_statement.cc',
'cql3/statements/alter_view_statement.cc',
'cql3/statements/alter_user_statement.cc',
'cql3/statements/drop_user_statement.cc',
'cql3/statements/list_users_statement.cc',
'cql3/statements/authorization_statement.cc',
'cql3/statements/permission_altering_statement.cc',
@@ -389,9 +453,10 @@ scylla_core = (['database.cc',
'cql3/statements/revoke_statement.cc',
'cql3/statements/alter_type_statement.cc',
'cql3/statements/alter_keyspace_statement.cc',
'cql3/statements/role-management-statements.cc',
'cql3/update_parameters.cc',
'cql3/ut_name.cc',
'cql3/user_options.cc',
'cql3/role_name.cc',
'thrift/handler.cc',
'thrift/server.cc',
'thrift/thrift_validation.cc',
@@ -433,15 +498,16 @@ scylla_core = (['database.cc',
'db/commitlog/commitlog.cc',
'db/commitlog/commitlog_replayer.cc',
'db/commitlog/commitlog_entry.cc',
'db/hints/manager.cc',
'db/config.cc',
'db/extensions.cc',
'db/heat_load_balance.cc',
'db/index/secondary_index.cc',
'db/marshal/type_parser.cc',
'db/batchlog_manager.cc',
'db/view/view.cc',
'db/view/row_locking.cc',
'index/secondary_index_manager.cc',
'io/io.cc',
'utils/utils.cc',
'utils/UUID_gen.cc',
'utils/i_filter.cc',
'utils/bloom_filter.cc',
@@ -451,6 +517,7 @@ scylla_core = (['database.cc',
'utils/dynamic_bitset.cc',
'utils/managed_bytes.cc',
'utils/exceptions.cc',
'utils/config_file.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
'gms/gossiper.cc',
@@ -476,7 +543,6 @@ scylla_core = (['database.cc',
'locator/network_topology_strategy.cc',
'locator/everywhere_replication_strategy.cc',
'locator/token_metadata.cc',
'locator/locator.cc',
'locator/snitch_base.cc',
'locator/simple_snitch.cc',
'locator/rack_inferring_snitch.cc',
@@ -484,6 +550,7 @@ scylla_core = (['database.cc',
'locator/production_snitch_base.cc',
'locator/ec2_snitch.cc',
'locator/ec2_multi_region_snitch.cc',
'locator/gce_snitch.cc',
'message/messaging_service.cc',
'service/client_state.cc',
'service/migration_task.cc',
@@ -510,20 +577,33 @@ scylla_core = (['database.cc',
'lister.cc',
'repair/repair.cc',
'exceptions/exceptions.cc',
'auth/auth.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
'auth/authenticated_user.cc',
'auth/authenticator.cc',
'auth/authorizer.cc',
'auth/common.cc',
'auth/default_authorizer.cc',
'auth/data_resource.cc',
'auth/resource.cc',
'auth/roles-metadata.cc',
'auth/password_authenticator.cc',
'auth/permission.cc',
'auth/permissions_cache.cc',
'auth/service.cc',
'auth/standard_role_manager.cc',
'auth/transitional.cc',
'auth/authentication_options.cc',
'auth/role_or_anonymous.cc',
'tracing/tracing.cc',
'tracing/trace_keyspace_helper.cc',
'tracing/trace_state.cc',
'table_helper.cc',
'range_tombstone.cc',
'range_tombstone_list.cc',
'disk-error-handler.cc'
'disk-error-handler.cc',
'duration.cc',
'vint-serialization.cc',
'utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc',
'querier.cc',
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -619,6 +699,16 @@ pure_boost_tests = set([
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/cartesian_product_test',
'tests/streaming_histogram_test',
'tests/duration_test',
'tests/vint_serialization_test',
'tests/compress_test',
'tests/chunked_vector_test',
'tests/big_decimal_test',
'tests/caching_options_test',
'tests/auth_resource_test',
'tests/enum_set_test',
'tests/cql_auth_syntax_test',
])
tests_not_using_seastar_test_framework = set([
@@ -632,10 +722,12 @@ tests_not_using_seastar_test_framework = set([
'tests/message',
'tests/perf/perf_simple_query',
'tests/perf/perf_fast_forward',
'tests/perf/perf_cache_eviction',
'tests/row_cache_stress_test',
'tests/memory_footprint',
'tests/gossip',
'tests/perf/perf_sstable',
'tests/querier_cache_resource_based_eviction',
]) | pure_boost_tests
for t in tests_not_using_seastar_test_framework:
@@ -645,19 +737,27 @@ for t in tests_not_using_seastar_test_framework:
for t in scylla_tests:
deps[t] = [t + '.cc']
if t not in tests_not_using_seastar_test_framework:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc']
perf_tests_seastar_deps = [
'seastar/tests/perf/perf_tests.cc'
]
for t in perf_tests:
deps[t] = [t + '.cc'] + scylla_tests_dependencies + perf_tests_seastar_deps
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc', 'tests/sstable_utils.cc']
deps['tests/mutation_reader_test'] += ['tests/sstable_utils.cc']
deps['tests/bytes_ostream_test'] = ['tests/bytes_ostream_test.cc', 'utils/managed_bytes.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/input_stream_test'] = ['tests/input_stream_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc', 'utils/uuid.cc', 'utils/managed_bytes.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'tests/murmur_hash_test.cc']
deps['tests/allocation_strategy_test'] = ['tests/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/log_histogram_test'] = ['tests/log_histogram_test.cc']
deps['tests/log_heap_test'] = ['tests/log_heap_test.cc']
deps['tests/anchorless_list_test'] = ['tests/anchorless_list_test.cc']
warnings = [
@@ -671,14 +771,28 @@ warnings = [
'-Wno-return-stack-address',
'-Wno-missing-braces',
'-Wno-unused-lambda-capture',
'-Wno-misleading-indentation',
'-Wno-overflow',
'-Wno-noexcept-type',
'-Wno-nonnull-compare'
]
warnings = [w
for w in warnings
if warning_supported(warning = w, compiler = args.cxx)]
if flag_supported(flag = w, compiler = args.cxx)]
warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])
optimization_flags = [
'--param inline-unit-growth=300',
]
optimization_flags = [o
for o in optimization_flags
if flag_supported(flag = o, compiler = args.cxx)]
modes['release']['opt'] += ' ' + ' '.join(optimization_flags)
gold_linker_flag = gold_supported(compiler = args.cxx)
dbgflag = debug_flag(args.cxx) if args.debuginfo else ''
tests_link_rule = 'link' if args.tests_debuginfo else 'link_stripped'
@@ -766,13 +880,20 @@ if args.staticcxx:
seastar_flags += ['--static-stdc++']
if args.staticboost:
seastar_flags += ['--static-boost']
if args.staticyamlcpp:
seastar_flags += ['--static-yaml-cpp']
if args.gcc6_concepts:
seastar_flags += ['--enable-gcc6-concepts']
if args.alloc_failure_injector:
seastar_flags += ['--enable-alloc-failure-injector']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--c-compiler', args.cc, '--cflags=%s' % (seastar_cflags)]
seastar_cflags = args.user_cflags
if args.target != '':
seastar_cflags += ' -march=' + args.target
seastar_ldflags = args.user_ldflags
seastar_flags += ['--compiler', args.cxx, '--c-compiler', args.cc, '--cflags=%s' % (seastar_cflags), '--ldflags=%s' %(seastar_ldflags),
'--c++-dialect=gnu++1z', '--optflags=%s' % (modes['release']['opt']),
]
status = subprocess.call([python, './configure.py'] + seastar_flags, cwd = 'seastar')
@@ -803,11 +924,16 @@ for mode in build_modes:
seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_restat.'
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = ' '.join(['-lyaml-cpp', '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt',
libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt', ' -lcryptopp',
maybe_static(args.staticboost, '-lboost_date_time'),
])
xxhash_dir = 'xxHash'
if not os.path.exists(xxhash_dir) or not os.listdir(xxhash_dir):
raise Exception(xxhash_dir + ' is empty. Run "git submodule update --init".')
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
@@ -830,13 +956,14 @@ os.makedirs(outdir, exist_ok = True)
do_sanitize = True
if args.static:
do_sanitize = False
with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
configure_args = {configure_args}
builddir = {outdir}
cxx = {cxx}
cxxflags = {user_cflags} {warnings} {defines}
ldflags = {user_ldflags}
ldflags = {gold_linker_flag} {user_ldflags}
libs = {libs}
pool link_pool
depth = {link_pool_depth}
@@ -865,7 +992,7 @@ with open(buildfile, 'w') as f:
for mode in build_modes:
modeval = modes[mode]
f.write(textwrap.dedent('''\
cxxflags_{mode} = -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
cxxflags_{mode} = {opt} -DXXH_PRIVATE_API -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
rule cxx.{mode}
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} $obj_cxxflags -c -o $out $in
description = CXX $out
@@ -893,7 +1020,8 @@ with open(buildfile, 'w') as f:
&& sed -i -e 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' $
-e '1i using ExceptionBaseType = int;' $
-e 's/^{{/{{ ExceptionBaseType\* ex = nullptr;/; $
s/ExceptionBaseType\* ex = new/ex = new/' $
s/ExceptionBaseType\* ex = new/ex = new/; $
s/exceptions::syntax_exception e/exceptions::syntax_exception\& e/' $
build/{mode}/gen/${{stem}}Parser.cpp
description = ANTLR3 $in
''').format(mode = mode, **modeval))
@@ -912,6 +1040,7 @@ with open(buildfile, 'w') as f:
objs = ['$builddir/' + mode + '/' + src.replace('.cc', '.o')
for src in srcs
if src.endswith('.cc')]
objs.append('$builddir/../utils/arch/powerpc/crc32-vpmsum/crc32.S')
has_thrift = False
for dep in deps[binary]:
if isinstance(dep, Thrift):
@@ -919,25 +1048,13 @@ with open(buildfile, 'w') as f:
objs += dep.objects('$builddir/' + mode + '/gen')
if isinstance(dep, Antlr3Grammar):
objs += dep.objects('$builddir/' + mode + '/gen')
if binary.endswith('.pc'):
vars = modeval.copy()
vars.update(globals())
pc = textwrap.dedent('''\
Name: Seastar
URL: http://seastar-project.org/
Description: Advanced C++ framework for high-performance server applications on modern hardware.
Version: 1.0
Libs: -L{srcdir}/{builddir} -Wl,--whole-archive -lseastar -Wl,--no-whole-archive {dbgflag} -Wl,--no-as-needed {static} {pie} -fvisibility=hidden -pthread {user_ldflags} {libs} {sanitize_libs}
Cflags: -std=gnu++1y {dbgflag} {fpie} -Wall -Werror -fvisibility=hidden -pthread -I{srcdir} -I{srcdir}/{builddir}/gen {user_cflags} {warnings} {defines} {sanitize} {opt}
''').format(builddir = 'build/' + mode, srcdir = os.getcwd(), **vars)
f.write('build $builddir/{}/{}: gen\n text = {}\n'.format(mode, binary, repr(pc)))
elif binary.endswith('.a'):
if binary.endswith('.a'):
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
else:
if binary.startswith('tests/'):
local_libs = '$libs'
if binary not in tests_not_using_seastar_test_framework or binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
@@ -1027,7 +1144,7 @@ with open(buildfile, 'w') as f:
rule configure
command = {python} configure.py $configure_args
generator = 1
build build.ninja: configure | configure.py
build build.ninja: configure | configure.py seastar/configure.py
rule cscope
command = find -name '*.[chS]' -o -name "*.cc" -o -name "*.hh" | cscope -bq -i-
description = CSCOPE
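
The renamed flag_supported() probe in the configure.py diff above rewrites a flag before test-compiling with it. The sketch below isolates only that rewriting step so it can be run without a compiler; the function name probe_flags is hypothetical, and the actual compile call (try_compile in the diff) is deliberately left out.

```python
import re


def probe_flags(flag):
    """Build the flag list that would be handed to the test compile.

    Hypothetical helper mirroring the string handling in flag_supported().
    """
    # gcc silently accepts an unknown -Wno-x, so probe the positive form
    # -Wx instead; with -Werror an unknown -Wx makes the compile fail.
    adjusted = re.sub('^-Wno-', '-W', flag)
    # Flags such as '--param inline-unit-growth=300' contain a space and
    # must reach the compiler as two separate arguments.
    return ['-Werror'] + adjusted.split(' ')


if __name__ == "__main__":
    print(probe_flags('-Wno-overflow'))
    print(probe_flags('--param inline-unit-growth=300'))
```

Splitting on spaces is what lets the same probe serve both the warning list and the new optimization_flags list, which is presumably why the diff generalizes warning_supported() into flag_supported().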


@@ -1,89 +0,0 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/thread.hh>
#include <seastar/core/timer.hh>
#include <chrono>
// Simple proportional controller to adjust shares of memtable/streaming flushes.
//
// Goal is to flush as fast as we can, but not so fast that we steal all the CPU from incoming
// requests, and at the same time minimize user-visible fluctuations in the flush quota.
//
// What that translates to is we'll try to keep virtual dirty's first derivative at 0 (IOW, we keep
// virtual dirty constant), which means that the rate of incoming writes is equal to the rate of
// flushed bytes.
//
// The exact point at which the controller stops determines the desired flush CPU usage. As we
// approach the hard dirty limit, we need to be more aggressive. We will therefore define two
// thresholds, and increase the constant as we cross them.
//
// 1) the soft limit line
// 2) halfway between soft limit and dirty limit
//
// The constants q1 and q2 are used to determine the proportional factor at each stage.
//
// Below the soft limit, we are in no particular hurry to flush, since it means we're set to
// complete flushing before a new memtable is ready. The quota is dirty * q1, and q1 is set to a
// low number.
//
// The first half of the virtual dirty region is where we expect to be usually, so we have a low
// slope corresponding to a sluggish response between q1 * soft_limit and q2.
//
// In the second half, we're getting close to the hard dirty limit so we increase the slope and
// become more responsive, up to a maximum quota of qmax.
//
// For now we'll just set them in the structure not to complicate the constructor. But q1, q2 and
// qmax can easily become parameters if we find another user.
class flush_cpu_controller {
static constexpr float hard_dirty_limit = 0.50;
static constexpr float q1 = 0.01;
static constexpr float q2 = 0.2;
static constexpr float qmax = 1;
float _current_quota = 0.0f;
float _goal;
std::function<float()> _current_dirty;
std::chrono::milliseconds _interval;
timer<> _update_timer;
seastar::thread_scheduling_group _scheduling_group;
seastar::thread_scheduling_group *_current_scheduling_group = nullptr;
void adjust();
public:
seastar::thread_scheduling_group* scheduling_group() {
return _current_scheduling_group;
}
float current_quota() const {
return _current_quota;
}
struct disabled {
seastar::thread_scheduling_group *backup;
};
flush_cpu_controller(disabled d) : _scheduling_group(std::chrono::nanoseconds(0), 0), _current_scheduling_group(d.backup) {}
flush_cpu_controller(std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty);
flush_cpu_controller(flush_cpu_controller&&) = default;
};
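
The three-stage response described in the header comment of the removed flush_cpu_controller can be pictured as a piecewise-linear function of the dirty ratio. The sketch below is an assumption-laden illustration, not the removed adjust() implementation, whose body does not appear in this diff: the function name flush_quota, the linear interpolation between stages, and the clamp at qmax are all choices made for the example. Only the constants (0.50, 0.01, 0.2, 1) come from the class definition.

```python
def flush_quota(dirty, soft_limit, hard_limit=0.50, q1=0.01, q2=0.2, qmax=1.0):
    """Map a dirty-memory ratio to a flush CPU quota (illustrative only)."""
    # Second threshold: halfway between the soft limit and the hard limit.
    halfway = soft_limit + (hard_limit - soft_limit) / 2
    if dirty < soft_limit:
        # No hurry below the soft limit: quota is dirty * q1.
        return dirty * q1
    if dirty < halfway:
        # Low slope: from q1 * soft_limit at the soft limit up to q2.
        start = q1 * soft_limit
        frac = (dirty - soft_limit) / (halfway - soft_limit)
        return start + frac * (q2 - start)
    # Steeper slope near the hard limit, clamped at qmax.
    frac = min((dirty - halfway) / (hard_limit - halfway), 1.0)
    return q2 + frac * (qmax - q2)


if __name__ == "__main__":
    for d in (0.10, 0.30, 0.35, 0.40, 0.45, 0.50):
        print("dirty=%.2f quota=%.3f" % (d, flush_quota(d, soft_limit=0.30)))
```

With these assumptions the quota is continuous and never decreases as dirty grows, which matches the stated goal of minimizing user-visible fluctuations in the flush quota.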


@@ -56,13 +56,16 @@ options {
#include "cql3/statements/index_prop_defs.hh"
#include "cql3/statements/raw/use_statement.hh"
#include "cql3/statements/raw/batch_statement.hh"
#include "cql3/statements/create_user_statement.hh"
#include "cql3/statements/alter_user_statement.hh"
#include "cql3/statements/drop_user_statement.hh"
#include "cql3/statements/list_users_statement.hh"
#include "cql3/statements/grant_statement.hh"
#include "cql3/statements/revoke_statement.hh"
#include "cql3/statements/list_permissions_statement.hh"
#include "cql3/statements/alter_role_statement.hh"
#include "cql3/statements/list_roles_statement.hh"
#include "cql3/statements/grant_role_statement.hh"
#include "cql3/statements/revoke_role_statement.hh"
#include "cql3/statements/drop_role_statement.hh"
#include "cql3/statements/create_role_statement.hh"
#include "cql3/statements/index_target.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "cql3/selection/raw_selector.hh"
@@ -80,6 +83,8 @@ options {
#include "cql3/maps.hh"
#include "cql3/sets.hh"
#include "cql3/lists.hh"
#include "cql3/role_name.hh"
#include "cql3/role_options.hh"
#include "cql3/type_cast.hh"
#include "cql3/tuples.hh"
#include "cql3/user_types.hh"
@@ -89,6 +94,7 @@ options {
#include "core/sstring.hh"
#include "CqlLexer.hpp"
#include <algorithm>
#include <unordered_map>
#include <map>
}
@@ -236,6 +242,12 @@ struct uninitialized {
return res;
}
bool convert_boolean_literal(stdx::string_view s) {
std::string lower_s(s.size(), '\0');
std::transform(s.cbegin(), s.cend(), lower_s.begin(), &::tolower);
return lower_s == "true";
}
void add_raw_update(std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>,::shared_ptr<cql3::operation::raw_update>>>& operations,
::shared_ptr<cql3::column_identifier::raw> key, ::shared_ptr<cql3::operation::raw_update> update)
{
@@ -345,6 +357,12 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st32=createViewStatement { $stmt = st32; }
| st33=alterViewStatement { $stmt = st33; }
| st34=dropViewStatement { $stmt = st34; }
| st35=listRolesStatement { $stmt = st35; }
| st36=grantRoleStatement { $stmt = st36; }
| st37=revokeRoleStatement { $stmt = st37; }
| st38=dropRoleStatement { $stmt = st38; }
| st39=createRoleStatement { $stmt = st39; }
| st40=alterRoleStatement { $stmt = st40; }
;
/*
@@ -369,7 +387,6 @@ selectStatement returns [shared_ptr<raw::select_statement> expr]
}
: K_SELECT ( ( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
| sclause=selectCountClause
)
K_FROM cf=columnFamilyName
( K_WHERE wclause=whereClause )?
@@ -396,9 +413,11 @@ selector returns [shared_ptr<raw_selector> s]
unaliasedSelector returns [shared_ptr<selectable::raw> s]
@init { shared_ptr<selectable::raw> tmp; }
: ( c=cident { tmp = c; }
| K_COUNT '(' countArgument ')' { tmp = selectable::with_function::raw::make_count_rows_function(); }
| K_WRITETIME '(' c=cident ')' { tmp = make_shared<selectable::writetime_or_ttl::raw>(c, true); }
| K_TTL '(' c=cident ')' { tmp = make_shared<selectable::writetime_or_ttl::raw>(c, false); }
| f=functionName args=selectionFunctionArgs { tmp = ::make_shared<selectable::with_function::raw>(std::move(f), std::move(args)); }
| K_CAST '(' arg=unaliasedSelector K_AS t=native_type ')' { tmp = ::make_shared<selectable::with_cast::raw>(std::move(arg), std::move(t)); }
)
( '.' fi=cident { tmp = make_shared<selectable::with_field_selection::raw>(std::move(tmp), std::move(fi)); } )*
{ $s = tmp; }
@@ -411,16 +430,6 @@ selectionFunctionArgs returns [std::vector<shared_ptr<selectable::raw>> a]
')'
;
selectCountClause returns [std::vector<shared_ptr<raw_selector>> expr]
@init{ auto alias = make_shared<cql3::column_identifier>("count", false); }
: K_COUNT '(' countArgument ')' (K_AS c=ident { alias = c; })? {
auto&& with_fn = ::make_shared<cql3::selection::selectable::with_function::raw>(
cql3::functions::function_name::native_function("countRows"),
std::vector<shared_ptr<cql3::selection::selectable::raw>>());
$expr.push_back(make_shared<cql3::selection::raw_selector>(with_fn, alias));
}
;
countArgument
: '*'
| i=INTEGER { if (i->getText() != "1") {
@@ -974,7 +983,7 @@ truncateStatement returns [::shared_ptr<truncate_statement> stmt]
;
/**
* GRANT <permission> ON <resource> TO <username>
* GRANT <permission> ON <resource> TO <grantee>
*/
grantStatement returns [::shared_ptr<grant_statement> stmt]
: K_GRANT
@@ -982,12 +991,12 @@ grantStatement returns [::shared_ptr<grant_statement> stmt]
K_ON
resource
K_TO
username
{ $stmt = ::make_shared<grant_statement>($permissionOrAll.perms, $resource.res, $username.text); }
grantee=userOrRoleName
{ $stmt = ::make_shared<grant_statement>($permissionOrAll.perms, $resource.res, std::move(grantee)); }
;
/**
* REVOKE <permission> ON <resource> FROM <username>
* REVOKE <permission> ON <resource> FROM <revokee>
*/
revokeStatement returns [::shared_ptr<revoke_statement> stmt]
: K_REVOKE
@@ -995,80 +1004,104 @@ revokeStatement returns [::shared_ptr<revoke_statement> stmt]
K_ON
resource
K_FROM
username
{ $stmt = ::make_shared<revoke_statement>($permissionOrAll.perms, $resource.res, $username.text); }
revokee=userOrRoleName
{ $stmt = ::make_shared<revoke_statement>($permissionOrAll.perms, $resource.res, std::move(revokee)); }
;
/**
* GRANT <rolename> to <grantee>
*/
grantRoleStatement returns [::shared_ptr<grant_role_statement> stmt]
: K_GRANT role=userOrRoleName K_TO grantee=userOrRoleName
{ $stmt = ::make_shared<grant_role_statement>(std::move(role), std::move(grantee)); }
;
/**
* REVOKE <rolename> FROM <revokee>
*/
revokeRoleStatement returns [::shared_ptr<revoke_role_statement> stmt]
: K_REVOKE role=userOrRoleName K_FROM revokee=userOrRoleName
{ $stmt = ::make_shared<revoke_role_statement>(std::move(role), std::move(revokee)); }
;
listPermissionsStatement returns [::shared_ptr<list_permissions_statement> stmt]
@init {
std::experimental::optional<auth::data_resource> r;
std::experimental::optional<sstring> u;
std::optional<auth::resource> r;
std::optional<sstring> role;
bool recursive = true;
}
: K_LIST
permissionOrAll
( K_ON resource { r = $resource.res; } )?
( K_OF username { u = sstring($username.text); } )?
( K_OF rn=userOrRoleName { role = sstring(static_cast<cql3::role_name>(rn).to_string()); } )?
( K_NORECURSIVE { recursive = false; } )?
{ $stmt = ::make_shared<list_permissions_statement>($permissionOrAll.perms, std::move(r), std::move(u), recursive); }
{ $stmt = ::make_shared<list_permissions_statement>($permissionOrAll.perms, std::move(r), std::move(role), recursive); }
;
permission returns [auth::permission perm]
: p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE)
: p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE | K_DESCRIBE)
{ $perm = auth::permissions::from_string($p.text); }
;
permissionOrAll returns [auth::permission_set perms]
: K_ALL ( K_PERMISSIONS )? { $perms = auth::permissions::ALL_DATA; }
: K_ALL ( K_PERMISSIONS )? { $perms = auth::permissions::ALL; }
| p=permission ( K_PERMISSION )? { $perms = auth::permission_set::from_mask(auth::permission_set::mask_for($p.perm)); }
;
resource returns [auth::data_resource res]
: r=dataResource { $res = $r.res; }
resource returns [uninitialized<auth::resource> res]
: d=dataResource { $res = std::move(d); }
| r=roleResource { $res = std::move(r); }
;
dataResource returns [auth::data_resource res]
: K_ALL K_KEYSPACES { $res = auth::data_resource(); }
| K_KEYSPACE ks = keyspaceName { $res = auth::data_resource($ks.id); }
dataResource returns [uninitialized<auth::resource> res]
: K_ALL K_KEYSPACES { $res = auth::resource(auth::resource_kind::data); }
| K_KEYSPACE ks = keyspaceName { $res = auth::make_data_resource($ks.id); }
| ( K_COLUMNFAMILY )? cf = columnFamilyName
{ $res = auth::data_resource($cf.name->get_keyspace(), $cf.name->get_column_family()); }
{ $res = auth::make_data_resource($cf.name->get_keyspace(), $cf.name->get_column_family()); }
;
roleResource returns [uninitialized<auth::resource> res]
: K_ALL K_ROLES { $res = auth::resource(auth::resource_kind::role); }
| K_ROLE role = userOrRoleName { $res = auth::make_role_resource(static_cast<const cql3::role_name&>(role).to_string()); }
;
/**
* CREATE USER [IF NOT EXISTS] <username> [WITH PASSWORD <password>] [SUPERUSER|NOSUPERUSER]
*/
createUserStatement returns [::shared_ptr<create_user_statement> stmt]
createUserStatement returns [::shared_ptr<create_role_statement> stmt]
@init {
auto opts = ::make_shared<cql3::user_options>();
bool superuser = false;
cql3::role_options opts;
opts.is_superuser = false;
opts.can_login = true;
bool ifNotExists = false;
}
: K_CREATE K_USER (K_IF K_NOT K_EXISTS { ifNotExists = true; })? username
( K_WITH userOptions[opts] )?
( K_SUPERUSER { superuser = true; } | K_NOSUPERUSER { superuser = false; } )?
{ $stmt = ::make_shared<create_user_statement>($username.text, std::move(opts), superuser, ifNotExists); }
( K_WITH K_PASSWORD v=STRING_LITERAL { opts.password = $v.text; })?
( K_SUPERUSER { opts.is_superuser = true; } | K_NOSUPERUSER { opts.is_superuser = false; } )?
{ $stmt = ::make_shared<create_role_statement>(cql3::role_name($username.text, cql3::preserve_role_case::yes), std::move(opts), ifNotExists); }
;
/**
* ALTER USER <username> [WITH PASSWORD <password>] [SUPERUSER|NOSUPERUSER]
*/
alterUserStatement returns [::shared_ptr<alter_user_statement> stmt]
alterUserStatement returns [::shared_ptr<alter_role_statement> stmt]
@init {
auto opts = ::make_shared<cql3::user_options>();
std::experimental::optional<bool> superuser;
cql3::role_options opts;
}
: K_ALTER K_USER username
( K_WITH userOptions[opts] )?
( K_SUPERUSER { superuser = true; } | K_NOSUPERUSER { superuser = false; } )?
{ $stmt = ::make_shared<alter_user_statement>($username.text, std::move(opts), std::move(superuser)); }
( K_WITH K_PASSWORD v=STRING_LITERAL { opts.password = $v.text; })?
( K_SUPERUSER { opts.is_superuser = true; } | K_NOSUPERUSER { opts.is_superuser = false; } )?
{ $stmt = ::make_shared<alter_role_statement>(cql3::role_name($username.text, cql3::preserve_role_case::yes), std::move(opts)); }
;
/**
* DROP USER [IF EXISTS] <username>
*/
dropUserStatement returns [::shared_ptr<drop_user_statement> stmt]
dropUserStatement returns [::shared_ptr<drop_role_statement> stmt]
@init { bool ifExists = false; }
: K_DROP K_USER (K_IF K_EXISTS { ifExists = true; })? username { $stmt = ::make_shared<drop_user_statement>($username.text, ifExists); }
: K_DROP K_USER (K_IF K_EXISTS { ifExists = true; })? username
{ $stmt = ::make_shared<drop_role_statement>(cql3::role_name($username.text, cql3::preserve_role_case::yes), ifExists); }
;
/**
@@ -1078,12 +1111,67 @@ listUsersStatement returns [::shared_ptr<list_users_statement> stmt]
: K_LIST K_USERS { $stmt = ::make_shared<list_users_statement>(); }
;
userOptions[::shared_ptr<cql3::user_options> opts]
: userOption[opts]
/**
* CREATE ROLE [IF NOT EXISTS] <role_name> [WITH <roleOption> [AND <roleOption>]*]
*/
createRoleStatement returns [::shared_ptr<create_role_statement> stmt]
@init {
cql3::role_options opts;
opts.is_superuser = false;
opts.can_login = false;
bool if_not_exists = false;
}
: K_CREATE K_ROLE (K_IF K_NOT K_EXISTS { if_not_exists = true; })? name=userOrRoleName
(K_WITH roleOptions[opts])?
{ $stmt = ::make_shared<create_role_statement>(name, std::move(opts), if_not_exists); }
;
userOption[::shared_ptr<cql3::user_options> opts]
: k=K_PASSWORD v=STRING_LITERAL { opts->put($k.text, $v.text); }
/**
* ALTER ROLE <rolename> [WITH <roleOption> [AND <roleOption>]*]
*/
alterRoleStatement returns [::shared_ptr<alter_role_statement> stmt]
@init {
cql3::role_options opts;
}
: K_ALTER K_ROLE name=userOrRoleName
(K_WITH roleOptions[opts])?
{ $stmt = ::make_shared<alter_role_statement>(name, std::move(opts)); }
;
/**
* DROP ROLE [IF EXISTS] <rolename>
*/
dropRoleStatement returns [::shared_ptr<drop_role_statement> stmt]
@init {
bool if_exists = false;
}
: K_DROP K_ROLE (K_IF K_EXISTS { if_exists = true; })? name=userOrRoleName
{ $stmt = ::make_shared<drop_role_statement>(name, if_exists); }
;
/**
* LIST ROLES [OF <rolename>] [NORECURSIVE]
*/
listRolesStatement returns [::shared_ptr<list_roles_statement> stmt]
@init {
bool recursive = true;
std::optional<cql3::role_name> grantee;
}
: K_LIST K_ROLES
(K_OF g=userOrRoleName { grantee = std::move(g); })?
(K_NORECURSIVE { recursive = false; })?
{ $stmt = ::make_shared<list_roles_statement>(grantee, recursive); }
;
roleOptions[cql3::role_options& opts]
: roleOption[opts] (K_AND roleOption[opts])*
;
roleOption[cql3::role_options& opts]
: K_PASSWORD '=' v=STRING_LITERAL { opts.password = $v.text; }
| K_OPTIONS '=' m=mapLiteral { opts.options = convert_property_map(m); }
| K_SUPERUSER '=' b=BOOLEAN { opts.is_superuser = convert_boolean_literal($b.text); }
| K_LOGIN '=' b=BOOLEAN { opts.can_login = convert_boolean_literal($b.text); }
;
/** DEFINITIONS **/
@@ -1124,12 +1212,13 @@ userTypeName returns [uninitialized<cql3::ut_name> name]
: (ks=ident '.')? ut=non_type_ident { $name = cql3::ut_name(ks, ut); }
;
#if 0
userOrRoleName returns [RoleName name]
@init { $name = new RoleName(); }
: roleName[name] {return $name;}
userOrRoleName returns [uninitialized<cql3::role_name> name]
: t=IDENT { $name = cql3::role_name($t.text, cql3::preserve_role_case::no); }
| t=STRING_LITERAL { $name = cql3::role_name($t.text, cql3::preserve_role_case::yes); }
| t=QUOTED_NAME { $name = cql3::role_name($t.text, cql3::preserve_role_case::yes); }
| k=unreserved_keyword { $name = cql3::role_name(k, cql3::preserve_role_case::no); }
| QMARK {add_recognition_error("Bind variables cannot be used for role names");}
;
#endif
ksName[::shared_ptr<cql3::keyspace_element_name> name]
: t=IDENT { $name->set_keyspace($t.text, false);}
@@ -1152,21 +1241,13 @@ idxName[::shared_ptr<cql3::index_name> name]
| QMARK {add_recognition_error("Bind variables cannot be used for index names");}
;
#if 0
roleName[RoleName name]
: t=IDENT { $name.setName($t.text, false); }
| t=QUOTED_NAME { $name.setName($t.text, true); }
| k=unreserved_keyword { $name.setName(k, false); }
| QMARK {addRecognitionError("Bind variables cannot be used for role names");}
;
#endif
constant returns [shared_ptr<cql3::constants::literal> constant]
@init{std::string sign;}
: t=STRING_LITERAL { $constant = cql3::constants::literal::string(sstring{$t.text}); }
| t=INTEGER { $constant = cql3::constants::literal::integer(sstring{$t.text}); }
| t=FLOAT { $constant = cql3::constants::literal::floating_point(sstring{$t.text}); }
| t=BOOLEAN { $constant = cql3::constants::literal::bool_(sstring{$t.text}); }
| t=DURATION { $constant = cql3::constants::literal::duration(sstring{$t.text}); }
| t=UUID { $constant = cql3::constants::literal::uuid(sstring{$t.text}); }
| t=HEXNUMBER { $constant = cql3::constants::literal::hex(sstring{$t.text}); }
| { sign=""; } ('-' {sign = "-"; } )? t=(K_NAN | K_INFINITY) { $constant = cql3::constants::literal::floating_point(sstring{sign + $t.text}); }
@@ -1464,6 +1545,7 @@ native_type returns [shared_ptr<cql3_type> t]
| K_COUNTER { $t = cql3_type::counter; }
| K_DECIMAL { $t = cql3_type::decimal; }
| K_DOUBLE { $t = cql3_type::double_; }
| K_DURATION { $t = cql3_type::duration; }
| K_FLOAT { $t = cql3_type::float_; }
| K_INET { $t = cql3_type::inet; }
| K_INT { $t = cql3_type::int_; }
@@ -1503,6 +1585,7 @@ tuple_type returns [shared_ptr<cql3::cql3_type::raw> t]
username
: IDENT
| STRING_LITERAL
| QUOTED_NAME { add_recognition_error("Quoted strings are not supported for user names"); }
;
// Basically the same as cident, but we need to exclude existing CQL3 types
@@ -1541,8 +1624,13 @@ basic_unreserved_keyword returns [sstring str]
| K_ALL
| K_USER
| K_USERS
| K_ROLE
| K_ROLES
| K_SUPERUSER
| K_NOSUPERUSER
| K_LOGIN
| K_NOLOGIN
| K_OPTIONS
| K_PASSWORD
| K_EXISTS
| K_CUSTOM
@@ -1569,6 +1657,7 @@ basic_unreserved_keyword returns [sstring str]
K_SELECT: S E L E C T;
K_FROM: F R O M;
K_AS: A S;
K_CAST: C A S T;
K_WHERE: W H E R E;
K_AND: A N D;
K_KEY: K E Y;
@@ -1633,13 +1722,19 @@ K_OF: O F;
K_REVOKE: R E V O K E;
K_MODIFY: M O D I F Y;
K_AUTHORIZE: A U T H O R I Z E;
K_DESCRIBE: D E S C R I B E;
K_NORECURSIVE: N O R E C U R S I V E;
K_USER: U S E R;
K_USERS: U S E R S;
K_ROLE: R O L E;
K_ROLES: R O L E S;
K_SUPERUSER: S U P E R U S E R;
K_NOSUPERUSER: N O S U P E R U S E R;
K_PASSWORD: P A S S W O R D;
K_LOGIN: L O G I N;
K_NOLOGIN: N O L O G I N;
K_OPTIONS: O P T I O N S;
K_CLUSTERING: C L U S T E R I N G;
K_ASCII: A S C I I;
@@ -1649,6 +1744,7 @@ K_BOOLEAN: B O O L E A N;
K_COUNTER: C O U N T E R;
K_DECIMAL: D E C I M A L;
K_DOUBLE: D O U B L E;
K_DURATION: D U R A T I O N;
K_FLOAT: F L O A T;
K_INET: I N E T;
K_INT: I N T;
@@ -1778,6 +1874,20 @@ fragment EXPONENT
: E ('+' | '-')? DIGIT+
;
fragment DURATION_UNIT
: Y
| M O
| W
| D
| H
| M
| S
| M S
| U S
| '\u00B5' S
| N S
;
INTEGER
: '-'? DIGIT+
;
@@ -1802,6 +1912,13 @@ BOOLEAN
: T R U E | F A L S E
;
DURATION
: '-'? DIGIT+ DURATION_UNIT (DIGIT+ DURATION_UNIT)*
| '-'? 'P' (DIGIT+ 'Y')? (DIGIT+ 'M')? (DIGIT+ 'D')? ('T' (DIGIT+ 'H')? (DIGIT+ 'M')? (DIGIT+ 'S')?)? // ISO 8601 "format with designators"
| '-'? 'P' DIGIT+ 'W'
| '-'? 'P' DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT 'T' DIGIT DIGIT ':' DIGIT DIGIT ':' DIGIT DIGIT // ISO 8601 "alternative format"
;
IDENT
: LETTER (LETTER | DIGIT | '_')*
;
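The first alternative of the DURATION lexer rule above (an optional sign followed by one or more number/unit pairs) can be approximated outside the grammar. The following is only a sketch under stated assumptions: `looks_like_duration` is a hypothetical helper that is not part of this patch, it covers only the unit-suffix form (not the ISO 8601 alternatives), and it omits the `'\u00B5' S` unit for simplicity.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not from the patch): accepts strings shaped like
//   '-'? DIGIT+ DURATION_UNIT (DIGIT+ DURATION_UNIT)*
// Units are matched case-insensitively, as the lexer fragments are.
inline bool looks_like_duration(const std::string& text) {
    static const std::regex re(R"(-?(\d+(y|mo|w|d|h|ms|us|ns|m|s))+)",
                               std::regex::icase);
    // regex_match requires the whole string to match the pattern.
    return std::regex_match(text, re);
}

// Small usage example.
inline void duration_self_check() {
    assert(looks_like_duration("12h30m"));
    assert(!looks_like_duration("12"));  // a unit is mandatory
}
```

A bare integer such as `12` is rejected here because every number needs a unit; in the grammar that input is tokenized as INTEGER instead.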


@@ -79,6 +79,7 @@ abstract_marker::raw::raw(int32_t bind_index)
return ::make_shared<maps::marker>(_bind_index, receiver);
}
assert(0);
return shared_ptr<term>();
}
assignment_testable::test_result abstract_marker::raw::test_assignment(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver) {


@@ -79,7 +79,7 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
}
try {
data_type_for<int64_t>()->validate(*tval);
} catch (marshal_exception e) {
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid timestamp value");
}
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(*tval));
@@ -99,7 +99,7 @@ int32_t attributes::get_time_to_live(const query_options& options) {
try {
data_type_for<int32_t>()->validate(*tval);
}
catch (marshal_exception e) {
catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid TTL value");
}


@@ -40,11 +40,29 @@
*/
#include "cql3/column_condition.hh"
#include "statements/request_validations.hh"
#include "unimplemented.hh"
#include "lists.hh"
#include "maps.hh"
#include <boost/range/algorithm_ext/push_back.hpp>
namespace {
void validate_operation_on_durations(const abstract_type& type, const cql3::operator_type& op) {
using cql3::statements::request_validations::check_false;
if (op.is_slice() && type.references_duration()) {
check_false(type.is_collection(), "Slice conditions are not supported on collections containing durations");
check_false(type.is_tuple(), "Slice conditions are not supported on tuples containing durations");
check_false(type.is_user_type(), "Slice conditions are not supported on UDTs containing durations");
// We're a duration.
throw exceptions::invalid_request_exception(sprint("Slice conditions are not supported on durations"));
}
}
}
namespace cql3 {
bool
@@ -95,6 +113,7 @@ column_condition::raw::prepare(database& db, const sstring& keyspace, const colu
}
return column_condition::in_condition(receiver, std::move(terms));
} else {
validate_operation_on_durations(*receiver.type, _op);
return column_condition::condition(receiver, _value->prepare(db, keyspace, receiver.column_specification), _op);
}
}
@@ -129,6 +148,8 @@ column_condition::raw::prepare(database& db, const sstring& keyspace, const colu
| boost::adaptors::transformed(std::bind(&term::raw::prepare, std::placeholders::_1, std::ref(db), std::ref(keyspace), value_spec)));
return column_condition::in_condition(receiver, _collection_element->prepare(db, keyspace, element_spec), terms);
} else {
validate_operation_on_durations(*receiver.type, _op);
return column_condition::condition(receiver,
_collection_element->prepare(db, keyspace, element_spec),
_value->prepare(db, keyspace, value_spec),


@@ -52,14 +52,15 @@ std::ostream&
operator<<(std::ostream&out, constants::type t)
{
switch (t) {
case constants::type::STRING: return out << "STRING";
case constants::type::INTEGER: return out << "INTEGER";
case constants::type::UUID: return out << "UUID";
case constants::type::FLOAT: return out << "FLOAT";
case constants::type::BOOLEAN: return out << "BOOLEAN";
case constants::type::HEX: return out << "HEX";
};
assert(0);
case constants::type::STRING: return out << "STRING";
case constants::type::INTEGER: return out << "INTEGER";
case constants::type::UUID: return out << "UUID";
case constants::type::FLOAT: return out << "FLOAT";
case constants::type::BOOLEAN: return out << "BOOLEAN";
case constants::type::HEX: return out << "HEX";
case constants::type::DURATION: return out << "DURATION";
}
abort();
}
bytes
@@ -145,6 +146,11 @@ constants::literal::test_assignment(database& db, const sstring& keyspace, ::sha
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
break;
case type::DURATION:
if (kind == cql3_type::kind_enum_set::prepare<cql3_type::kind::DURATION>()) {
return assignment_testable::test_result::EXACT_MATCH;
}
break;
}
return assignment_testable::test_result::NOT_ASSIGNABLE;
}


@@ -60,7 +60,7 @@ public:
#endif
public:
enum class type {
STRING, INTEGER, UUID, FLOAT, BOOLEAN, HEX
STRING, INTEGER, UUID, FLOAT, BOOLEAN, HEX, DURATION
};
/**
@@ -123,7 +123,7 @@ public:
// This is a workaround for antlr3 not distinguishing between
// calling in lexer setText() with an empty string and not calling
// setText() at all.
if (text.size() == 1 && text[0] == -1) {
if (text.size() == 1 && text[0] == '\xFF') {
text.reset();
}
return ::make_shared<literal>(type::STRING, text);
@@ -149,6 +149,10 @@ public:
return ::make_shared<literal>(type::HEX, text);
}
static ::shared_ptr<literal> duration(sstring text) {
return ::make_shared<literal>(type::DURATION, text);
}
virtual ::shared_ptr<term> prepare(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver);
private:
bytes parsed_value(data_type validator);
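The string-literal hunk above replaces the comparison against `-1` with a comparison against the character literal `'\xFF'`. One plausible motivation (an assumption, the diff does not state it) is portability: whether plain `char` is signed is implementation-defined, so `text[0] == -1` can never be true where `char` is unsigned, while comparing against a `char` literal behaves the same either way. A minimal sketch, with `is_empty_text_sentinel` as a hypothetical stand-in for the check:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the patch) mirroring the sentinel check:
// a one-character string holding the byte 0xFF.
inline bool is_empty_text_sentinel(const std::string& text) {
    // Both sides have type char, so the result does not depend on
    // whether char is signed on the target platform.
    return text.size() == 1 && text[0] == '\xFF';
}

// Small usage example.
inline void sentinel_self_check() {
    assert(is_empty_text_sentinel(std::string(1, static_cast<char>(0xFF))));
    assert(!is_empty_text_sentinel("a"));
}
```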


@@ -48,6 +48,10 @@ shared_ptr<cql3_type> cql3_type::raw::prepare(database& db, const sstring& keysp
}
}
bool cql3_type::raw::is_duration() const {
return false;
}
bool cql3_type::raw::references_user_type(const sstring& name) const {
return false;
}
@@ -78,6 +82,10 @@ public:
virtual sstring to_string() const {
return _type->to_string();
}
virtual bool is_duration() const override {
return _type->get_type()->equals(duration_type);
}
};
class cql3_type::raw_collection : public raw {
@@ -126,9 +134,15 @@ public:
if (_kind == &collection_type_impl::kind::list) {
return make_shared(cql3_type(to_string(), list_type_impl::get_instance(_values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
} else if (_kind == &collection_type_impl::kind::set) {
if (_values->is_duration()) {
throw exceptions::invalid_request_exception(sprint("Durations are not allowed inside sets: %s", *this));
}
return make_shared(cql3_type(to_string(), set_type_impl::get_instance(_values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
} else if (_kind == &collection_type_impl::kind::map) {
assert(_keys); // "Got null keys type for a collection";
if (_keys->is_duration()) {
throw exceptions::invalid_request_exception(sprint("Durations are not allowed as map keys: %s", *this));
}
return make_shared(cql3_type(to_string(), map_type_impl::get_instance(_keys->prepare_internal(keyspace, user_types)->get_type(), _values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
}
abort();
@@ -138,6 +152,10 @@ public:
return (_keys && _keys->references_user_type(name)) || _values->references_user_type(name);
}
bool is_duration() const override {
return false;
}
virtual sstring to_string() const override {
sstring start = _frozen ? "frozen<" : "";
sstring end = _frozen ? ">" : "";
@@ -329,6 +347,7 @@ thread_local shared_ptr<cql3_type> cql3_type::inet = make("inet", inet_addr_type
thread_local shared_ptr<cql3_type> cql3_type::varint = make("varint", varint_type, cql3_type::kind::VARINT);
thread_local shared_ptr<cql3_type> cql3_type::decimal = make("decimal", decimal_type, cql3_type::kind::DECIMAL);
thread_local shared_ptr<cql3_type> cql3_type::counter = make("counter", counter_type, cql3_type::kind::COUNTER);
thread_local shared_ptr<cql3_type> cql3_type::duration = make("duration", duration_type, cql3_type::kind::DURATION);
const std::vector<shared_ptr<cql3_type>>&
cql3_type::values() {
@@ -354,6 +373,7 @@ cql3_type::values() {
cql3_type::timeuuid,
cql3_type::date,
cql3_type::time,
cql3_type::duration,
};
return v;
}


@@ -75,6 +75,7 @@ public:
virtual bool supports_freezing() const = 0;
virtual bool is_collection() const;
virtual bool is_counter() const;
virtual bool is_duration() const;
virtual bool references_user_type(const sstring&) const;
virtual std::experimental::optional<sstring> keyspace() const;
virtual void freeze();
@@ -102,7 +103,7 @@ private:
public:
enum class kind : int8_t {
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, EMPTY, FLOAT, INT, SMALLINT, TINYINT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID, DATE, TIME
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, EMPTY, FLOAT, INT, SMALLINT, TINYINT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID, DATE, TIME, DURATION
};
using kind_enum = super_enum<kind,
kind::ASCII,
@@ -125,7 +126,8 @@ public:
kind::VARINT,
kind::TIMEUUID,
kind::DATE,
kind::TIME>;
kind::TIME,
kind::DURATION>;
using kind_enum_set = enum_set<kind_enum>;
private:
std::experimental::optional<kind_enum_set::prepared> _kind;
@@ -154,6 +156,7 @@ public:
static thread_local shared_ptr<cql3_type> varint;
static thread_local shared_ptr<cql3_type> decimal;
static thread_local shared_ptr<cql3_type> counter;
static thread_local shared_ptr<cql3_type> duration;
static const std::vector<shared_ptr<cql3_type>>& values();
public:


@@ -68,9 +68,11 @@ class error_collector : public error_listener<RecognizerType, ExceptionBaseType>
const sstring_view _query;
/**
* The error messages.
* An empty bitset to be used as a workaround for AntLR null dereference
* bug.
*/
std::vector<sstring> _error_msgs;
static typename ExceptionBaseType::BitsetListType _empty_bit_list;
public:
/**
@@ -81,7 +83,10 @@ public:
*/
error_collector(const sstring_view& query) : _query(query) {}
virtual void syntax_error(RecognizerType& recognizer, ANTLR_UINT8** token_names, ExceptionBaseType* ex) override {
/**
* Format and throw a new \c exceptions::syntax_exception.
*/
[[noreturn]] virtual void syntax_error(RecognizerType& recognizer, ANTLR_UINT8** token_names, ExceptionBaseType* ex) override {
auto hdr = get_error_header(ex);
auto msg = get_error_message(recognizer, ex, token_names);
std::stringstream result;
@@ -90,22 +95,15 @@ public:
if (recognizer instanceof Parser)
appendQuerySnippet((Parser) recognizer, builder);
#endif
_error_msgs.emplace_back(result.str());
}
virtual void syntax_error(RecognizerType& recognizer, const sstring& msg) override {
_error_msgs.emplace_back(msg);
throw exceptions::syntax_exception(result.str());
}
/**
* Throws the first syntax error found by the lexer or the parser if it exists.
*
* @throws SyntaxException the syntax error.
* Throw a new \c exceptions::syntax_exception.
*/
void throw_first_syntax_error() {
if (!_error_msgs.empty()) {
throw exceptions::syntax_exception(_error_msgs[0]);
}
[[noreturn]] virtual void syntax_error(RecognizerType&, const sstring& msg) override {
throw exceptions::syntax_exception(msg);
}
private:
@@ -152,6 +150,14 @@ private:
break;
}
default:
// AntLR Exception class has a bug of dereferencing a null
// pointer in the displayRecognitionError. The following
// if statement makes sure it will not be null before the
// call to that function (displayRecognitionError).
// bug reference: https://github.com/antlr/antlr3/issues/191
if (!ex->get_expectingSet()) {
ex->set_expectingSet(&_empty_bit_list);
}
ex->displayRecognitionError(token_names, msg);
}
return msg.str();
@@ -353,4 +359,8 @@ private:
#endif
};
template<typename RecognizerType, typename TokenType, typename ExceptionBaseType>
typename ExceptionBaseType::BitsetListType
error_collector<RecognizerType,TokenType,ExceptionBaseType>::_empty_bit_list = typename ExceptionBaseType::BitsetListType();
}


@@ -53,6 +53,7 @@ namespace cql3 {
template<typename RecognizerType, typename ExceptionBaseType>
class error_listener {
public:
virtual ~error_listener() = default;
/**
* Invoked when a syntax error occurs.


@@ -90,6 +90,10 @@ public:
return false;
}
virtual sstring column_name(const std::vector<sstring>& column_names) override {
return sprint("%s(%s)", _name, join(", ", column_names));
}
virtual void print(std::ostream& os) const override;
};


@@ -41,6 +41,7 @@
#pragma once
#include "utils/big_decimal.hh"
#include "aggregate_function.hh"
#include "native_aggregate_function.hh"
@@ -66,6 +67,19 @@ public:
}
};
static const sstring COUNT_ROWS_FUNCTION_NAME = "countRows";
class count_rows_function final : public native_aggregate_function {
public:
count_rows_function() : native_aggregate_function(COUNT_ROWS_FUNCTION_NAME, long_type, {}) {}
virtual std::unique_ptr<aggregate> new_aggregate() override {
return std::make_unique<impl_count_function>();
}
virtual sstring column_name(const std::vector<sstring>& column_names) override {
return "count";
}
};
/**
* The function used to count the number of rows of a result set. This function is called when COUNT(*) or COUNT(1)
* is specified.
@@ -73,7 +87,7 @@ public:
inline
shared_ptr<aggregate_function>
make_count_rows_function() {
return make_native_aggregate_function_using<impl_count_function>("countRows", long_type);
return make_shared<count_rows_function>();
}
template <typename Type>
@@ -111,9 +125,70 @@ make_sum_function() {
return make_shared<sum_function_for<Type>>();
}
template <typename Type>
class impl_div_for_avg {
public:
static Type div(const Type& x, const int64_t y) {
return x/y;
}
};
template <>
class impl_div_for_avg<big_decimal> {
public:
static big_decimal div(const big_decimal& x, const int64_t y) {
return x.div(y, big_decimal::rounding_mode::HALF_EVEN);
}
};
// We need a wider accumulator for average, since summing the inputs can overflow
// the input type
template <typename T>
struct accumulator_for;
template <>
struct accumulator_for<int8_t> {
using type = __int128;
};
template <>
struct accumulator_for<int16_t> {
using type = __int128;
};
template <>
struct accumulator_for<int32_t> {
using type = __int128;
};
template <>
struct accumulator_for<int64_t> {
using type = __int128;
};
template <>
struct accumulator_for<float> {
using type = float;
};
template <>
struct accumulator_for<double> {
using type = double;
};
template <>
struct accumulator_for<boost::multiprecision::cpp_int> {
using type = boost::multiprecision::cpp_int;
};
template <>
struct accumulator_for<big_decimal> {
using type = big_decimal;
};
template <typename Type>
class impl_avg_function_for final : public aggregate_function::aggregate {
Type _sum{};
typename accumulator_for<Type>::type _sum{};
int64_t _count = 0;
public:
virtual void reset() override {
@@ -121,9 +196,9 @@ public:
_count = 0;
}
virtual opt_bytes compute(cql_serialization_format sf) override {
Type ret = 0;
Type ret{};
if (_count) {
ret = _sum / _count;
ret = impl_div_for_avg<Type>::div(_sum, _count);
}
return data_type_for<Type>()->decompose(ret);
}
@@ -152,9 +227,29 @@ make_avg_function() {
return make_shared<avg_function_for<Type>>();
}
template <typename T>
struct aggregate_type_for {
using type = T;
};
template<>
struct aggregate_type_for<simple_date_native_type> {
using type = simple_date_native_type::primary_type;
};
template<>
struct aggregate_type_for<timestamp_native_type> {
using type = timestamp_native_type::primary_type;
};
template<>
struct aggregate_type_for<timeuuid_native_type> {
using type = timeuuid_native_type::primary_type;
};
template <typename Type>
class impl_max_function_for final : public aggregate_function::aggregate {
std::experimental::optional<Type> _max{};
std::experimental::optional<typename aggregate_type_for<Type>::type> _max{};
public:
virtual void reset() override {
_max = {};
@@ -163,13 +258,13 @@ public:
if (!_max) {
return {};
}
return data_type_for<Type>()->decompose(*_max);
return data_type_for<Type>()->decompose(Type{*_max});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
auto val = value_cast<Type>(data_type_for<Type>()->deserialize(*values[0]));
auto val = value_cast<typename aggregate_type_for<Type>::type>(data_type_for<Type>()->deserialize(*values[0]));
if (!_max) {
_max = val;
} else {
@@ -201,7 +296,7 @@ make_max_function() {
template <typename Type>
class impl_min_function_for final : public aggregate_function::aggregate {
std::experimental::optional<Type> _min{};
std::experimental::optional<typename aggregate_type_for<Type>::type> _min{};
public:
virtual void reset() override {
_min = {};
@@ -210,13 +305,13 @@ public:
if (!_min) {
return {};
}
return data_type_for<Type>()->decompose(*_min);
return data_type_for<Type>()->decompose(Type{*_min});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
return;
}
auto val = value_cast<Type>(data_type_for<Type>()->deserialize(*values[0]));
auto val = value_cast<typename aggregate_type_for<Type>::type>(data_type_for<Type>()->deserialize(*values[0]));
if (!_min) {
_min = val;
} else {
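The comment in the hunks above notes that the average needs a wider accumulator because summing the inputs can overflow the input type. The idea can be sketched in isolation as follows. This is an illustration only, not the patch's code: `average_with_wide_accumulator` is a hypothetical helper, and `__int128` is assumed to be available as a GCC/Clang extension, as it is in `accumulator_for` above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the patch): averages 64-bit values using a
// 128-bit running sum, so the intermediate sum cannot wrap around.
inline int64_t average_with_wide_accumulator(const std::vector<int64_t>& values) {
    __int128 sum = 0;   // wider than the input type
    int64_t count = 0;
    for (int64_t v : values) {
        sum += v;
        ++count;
    }
    if (count == 0) {
        return 0;       // same convention as the value-initialized result above
    }
    // Divide in the wide type, then narrow; the quotient always fits in int64_t.
    return static_cast<int64_t>(sum / count);
}

// Small usage example: two maximal inputs would overflow a 64-bit sum.
inline void average_self_check() {
    assert(average_with_wide_accumulator({INT64_MAX, INT64_MAX}) == INT64_MAX);
}
```

With a 64-bit accumulator the sum in the usage example would overflow before the division; keeping the sum in 128 bits avoids that for any realistic row count.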


@@ -0,0 +1,82 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "castas_fcts.hh"
#include "cql3/functions/native_scalar_function.hh"
namespace cql3 {
namespace functions {
namespace {
using bytes_opt = std::experimental::optional<bytes>;
class castas_function_for : public cql3::functions::native_scalar_function {
castas_fctn _func;
public:
castas_function_for(data_type to_type,
data_type from_type,
castas_fctn func)
: native_scalar_function("castas" + to_type->as_cql3_type()->to_string(), to_type, {from_type})
, _func(func) {
}
virtual bool is_pure() override {
return true;
}
virtual void print(std::ostream& os) const override {
os << "cast(" << _arg_types[0]->name() << " as " << _return_type->name() << ")";
}
virtual bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
auto from_type = arg_types()[0];
auto to_type = return_type();
auto&& val = parameters[0];
if (!val) {
return val;
}
auto val_from = from_type->deserialize(*val);
auto val_to = _func(val_from);
return to_type->decompose(val_to);
}
};
shared_ptr<function> make_castas_function(data_type to_type, data_type from_type, castas_fctn func) {
return ::make_shared<castas_function_for>(std::move(to_type), std::move(from_type), std::move(func));
}
} /* Anonymous Namespace */
shared_ptr<function> castas_functions::get(data_type to_type, const std::vector<shared_ptr<cql3::selection::selector>>& provided_args, schema_ptr s) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("Invalid CAST expression");
}
auto from_type = provided_args[0]->get_type();
auto from_type_key = from_type;
if (from_type_key->is_reversed()) {
from_type_key = dynamic_cast<const reversed_type_impl&>(*from_type).underlying_type();
}
auto f = get_castas_fctn(to_type, from_type_key);
return make_castas_function(to_type, from_type, f);
}
}
}

@@ -15,10 +15,11 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Modified by ScyllaDB
*
* Copyright 2016 ScyllaDB
* Copyright (C) 2017 ScyllaDB
*/
/*
@@ -38,26 +39,25 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <string.h>
#pragma once
#include <boost/range/adaptor/map.hpp>
#include <tuple>
#include <unordered_map>
#include "auth/authenticator.hh"
#include "user_options.hh"
#include "cql3/functions/function.hh"
#include "cql3/functions/abstract_function.hh"
#include "exceptions/exceptions.hh"
#include "core/print.hh"
#include "cql3/cql3_type.hh"
#include "cql3/selection/selector.hh"
namespace cql3 {
namespace functions {
class castas_functions {
public:
static shared_ptr<function> get(data_type to_type, const std::vector<shared_ptr<cql3::selection::selector>>& provided_args, schema_ptr s);
};
void cql3::user_options::put(const sstring& name, const sstring& value) {
_options[auth::authenticator::string_to_option(name)] = value;
}
void cql3::user_options::validate() const {
auto& a = auth::authenticator::get();
for (auto o : _options | boost::adaptors::map_keys) {
if (!a.supported_options().contains(o)) {
throw exceptions::invalid_request_exception(
sprint("%s doesn't support %s option",
a.class_name(),
a.option_to_string(o)));
}
}
}

@@ -81,6 +81,15 @@ public:
virtual void print(std::ostream& os) const = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) = 0;
virtual bool has_reference_to(function& f) = 0;
+/**
+* Returns the name of the function to use within a ResultSet.
+*
+* @param column_names the names of the columns used to call the function
+* @return the name of the function to use within a ResultSet
+*/
+virtual sstring column_name(const std::vector<sstring>& column_names) = 0;
friend class function_call;
friend std::ostream& operator<<(std::ostream& os, const function& f);
};

@@ -42,10 +42,16 @@
#pragma once
#include "core/sstring.hh"
-#include "db/system_keyspace.hh"
#include "seastarx.hh"
#include <iosfwd>
#include <functional>
+namespace db {
+sstring system_keyspace_name();
+}
namespace cql3 {
namespace functions {
@@ -56,7 +62,7 @@ public:
sstring name;
static function_name native_function(sstring name) {
-return function_name(db::system_keyspace::NAME, name);
+return function_name(db::system_keyspace_name(), name);
}
function_name() = default; // for ANTLR

@@ -59,6 +59,14 @@ functions::init() {
declare(make_to_blob_function(type->get_type()));
declare(make_from_blob_function(type->get_type()));
}
+declare(aggregate_fcts::make_count_function<int8_t>());
+declare(aggregate_fcts::make_max_function<int8_t>());
+declare(aggregate_fcts::make_min_function<int8_t>());
+declare(aggregate_fcts::make_count_function<int16_t>());
+declare(aggregate_fcts::make_max_function<int16_t>());
+declare(aggregate_fcts::make_min_function<int16_t>());
declare(aggregate_fcts::make_count_function<int32_t>());
declare(aggregate_fcts::make_max_function<int32_t>());
declare(aggregate_fcts::make_min_function<int32_t>());
@@ -67,6 +75,14 @@ functions::init() {
declare(aggregate_fcts::make_max_function<int64_t>());
declare(aggregate_fcts::make_min_function<int64_t>());
+declare(aggregate_fcts::make_count_function<boost::multiprecision::cpp_int>());
+declare(aggregate_fcts::make_max_function<boost::multiprecision::cpp_int>());
+declare(aggregate_fcts::make_min_function<boost::multiprecision::cpp_int>());
+declare(aggregate_fcts::make_count_function<big_decimal>());
+declare(aggregate_fcts::make_max_function<big_decimal>());
+declare(aggregate_fcts::make_min_function<big_decimal>());
declare(aggregate_fcts::make_count_function<float>());
declare(aggregate_fcts::make_max_function<float>());
declare(aggregate_fcts::make_min_function<float>());
@@ -79,6 +95,15 @@ functions::init() {
declare(aggregate_fcts::make_max_function<sstring>());
declare(aggregate_fcts::make_min_function<sstring>());
+declare(aggregate_fcts::make_max_function<simple_date_native_type>());
+declare(aggregate_fcts::make_min_function<simple_date_native_type>());
+declare(aggregate_fcts::make_max_function<timestamp_native_type>());
+declare(aggregate_fcts::make_min_function<timestamp_native_type>());
+declare(aggregate_fcts::make_max_function<timeuuid_native_type>());
+declare(aggregate_fcts::make_min_function<timeuuid_native_type>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());
@@ -88,22 +113,22 @@ functions::init() {
declare(make_varchar_as_blob_fct());
declare(make_blob_as_varchar_fct());
+declare(aggregate_fcts::make_sum_function<int8_t>());
+declare(aggregate_fcts::make_sum_function<int16_t>());
declare(aggregate_fcts::make_sum_function<int32_t>());
declare(aggregate_fcts::make_sum_function<int64_t>());
declare(aggregate_fcts::make_sum_function<float>());
declare(aggregate_fcts::make_sum_function<double>());
-#if 0
-declare(AggregateFcts.sumFunctionForDecimal);
-declare(AggregateFcts.sumFunctionForVarint);
-#endif
+declare(aggregate_fcts::make_sum_function<boost::multiprecision::cpp_int>());
+declare(aggregate_fcts::make_sum_function<big_decimal>());
+declare(aggregate_fcts::make_avg_function<int8_t>());
+declare(aggregate_fcts::make_avg_function<int16_t>());
declare(aggregate_fcts::make_avg_function<int32_t>());
declare(aggregate_fcts::make_avg_function<int64_t>());
declare(aggregate_fcts::make_avg_function<float>());
declare(aggregate_fcts::make_avg_function<double>());
-#if 0
-declare(AggregateFcts.avgFunctionForVarint);
-declare(AggregateFcts.avgFunctionForDecimal);
-#endif
+declare(aggregate_fcts::make_avg_function<boost::multiprecision::cpp_int>());
+declare(aggregate_fcts::make_avg_function<big_decimal>());
// also needed for smp:
#if 0
@@ -342,7 +367,7 @@ function_call::execute_internal(cql_serialization_format sf, scalar_function& fu
fun.return_type()->validate(*result);
}
return result;
-} catch (marshal_exception e) {
+} catch (marshal_exception& e) {
throw runtime_exception(sprint("Return of function %s (%s) is not a valid value for its declared return type %s",
fun, to_hex(result),
*fun.return_type()->as_cql3_type()

@@ -64,23 +64,5 @@ public:
}
};
-template <class Aggregate>
-class native_aggregate_function_using : public native_aggregate_function {
-public:
-native_aggregate_function_using(sstring name, data_type type)
-: native_aggregate_function(std::move(name), type, {}) {
-}
-virtual std::unique_ptr<aggregate> new_aggregate() override {
-return std::make_unique<Aggregate>();
-}
-};
-template <class Aggregate>
-shared_ptr<native_aggregate_function>
-make_native_aggregate_function_using(sstring name, data_type type) {
-return ::make_shared<native_aggregate_function_using<Aggregate>>(name, type);
-}
}
}

@@ -202,12 +202,6 @@ lists::delayed_value::bind(const query_options& options) {
if (bo.is_unset_value()) {
return constants::UNSET_VALUE;
}
-// We don't support value > 64K because the serialization format encode the length as an unsigned short.
-if (bo->size() > std::numeric_limits<uint16_t>::max()) {
-throw exceptions::invalid_request_exception(sprint("List value is too long. List values are limited to %d bytes but %d bytes value provided",
-std::numeric_limits<uint16_t>::max(),
-bo->size()));
-}
buffers.push_back(std::move(to_bytes(*bo)));
}
@@ -305,11 +299,6 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
if (!value) {
mut.cells.emplace_back(eidx, params.make_dead_cell());
} else {
-if (value->size() > std::numeric_limits<uint16_t>::max()) {
-throw exceptions::invalid_request_exception(
-sprint("List value is too long. List values are limited to %d bytes but %d bytes value provided",
-std::numeric_limits<uint16_t>::max(), value->size()));
-}
mut.cells.emplace_back(eidx, params.make_cell(*value));
}
auto smut = ltype->serialize_mutation_form(mut);

@@ -245,11 +245,6 @@ maps::delayed_value::bind(const query_options& options) {
if (value_bytes.is_unset_value()) {
return constants::UNSET_VALUE;
}
-if (value_bytes->size() > std::numeric_limits<uint16_t>::max()) {
-throw exceptions::invalid_request_exception(sprint("Map value is too long. Map values are limited to %d bytes but %d bytes value provided",
-std::numeric_limits<uint16_t>::max(),
-value_bytes->size()));
-}
buffers.emplace(std::move(to_bytes(*key_bytes)), std::move(to_bytes(*value_bytes)));
}
return ::make_shared<value>(std::move(buffers));
@@ -300,12 +295,6 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
if (!key) {
throw invalid_request_exception("Invalid null map key");
}
-if (value && value->size() >= std::numeric_limits<uint16_t>::max()) {
-throw invalid_request_exception(
-sprint("Map value is too long. Map values are limited to %d bytes but %d bytes value provided",
-std::numeric_limits<uint16_t>::max(),
-value->size()));
-}
auto avalue = value ? params.make_cell(*value) : params.make_dead_cell();
map_type_impl::mutation update = { {}, { { std::move(to_bytes(*key)), std::move(avalue) } } };
// should have been verified as map earlier?

@@ -46,15 +46,19 @@
namespace cql3 {
+sstring
+operation::set_element::to_string(const column_definition& receiver) const {
+return format("{}[{}] = {}", receiver.name_as_text(), *_selector, *_value);
+}
shared_ptr<operation>
operation::set_element::prepare(database& db, const sstring& keyspace, const column_definition& receiver) {
using exceptions::invalid_request_exception;
auto rtype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!rtype) {
-throw invalid_request_exception(sprint("Invalid operation (%s) for non collection column %s", receiver, receiver.name()));
+throw invalid_request_exception(sprint("Invalid operation (%s) for non collection column %s", to_string(receiver), receiver.name()));
} else if (!rtype->is_multi_cell()) {
-throw invalid_request_exception(sprint("Invalid operation (%s) for frozen collection column %s", receiver, receiver.name()));
+throw invalid_request_exception(sprint("Invalid operation (%s) for frozen collection column %s", to_string(receiver), receiver.name()));
}
if (&rtype->_kind == &collection_type_impl::kind::list) {
@@ -67,7 +71,7 @@ operation::set_element::prepare(database& db, const sstring& keyspace, const col
return make_shared<lists::setter_by_index>(receiver, idx, lval);
}
} else if (&rtype->_kind == &collection_type_impl::kind::set) {
-throw invalid_request_exception(sprint("Invalid operation (%s) for set column %s", receiver, receiver.name()));
+throw invalid_request_exception(sprint("Invalid operation (%s) for set column %s", to_string(receiver), receiver.name()));
} else if (&rtype->_kind == &collection_type_impl::kind::map) {
auto key = _selector->prepare(db, keyspace, maps::key_spec_of(*receiver.column_specification));
auto mval = _value->prepare(db, keyspace, maps::value_spec_of(*receiver.column_specification));
@@ -83,6 +87,11 @@ operation::set_element::is_compatible_with(shared_ptr<raw_update> other) {
return !dynamic_pointer_cast<set_value>(std::move(other));
}
+sstring
+operation::addition::to_string(const column_definition& receiver) const {
+return format("{} = {} + {}", receiver.name_as_text(), receiver.name_as_text(), *_value);
+}
shared_ptr<operation>
operation::addition::prepare(database& db, const sstring& keyspace, const column_definition& receiver) {
auto v = _value->prepare(db, keyspace, receiver.column_specification);
@@ -90,11 +99,11 @@ operation::addition::prepare(database& db, const sstring& keyspace, const column
auto ctype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!ctype) {
if (!receiver.is_counter()) {
-throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non counter column %s", receiver, receiver.name()));
+throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non counter column %s", to_string(receiver), receiver.name()));
}
return make_shared<constants::adder>(receiver, v);
} else if (!ctype->is_multi_cell()) {
-throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for frozen collection column %s", receiver, receiver.name()));
+throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for frozen collection column %s", to_string(receiver), receiver.name()));
}
if (&ctype->_kind == &collection_type_impl::kind::list) {
@@ -113,19 +122,24 @@ operation::addition::is_compatible_with(shared_ptr<raw_update> other) {
return !dynamic_pointer_cast<set_value>(other);
}
+sstring
+operation::subtraction::to_string(const column_definition& receiver) const {
+return format("{} = {} - {}", receiver.name_as_text(), receiver.name_as_text(), *_value);
+}
shared_ptr<operation>
operation::subtraction::prepare(database& db, const sstring& keyspace, const column_definition& receiver) {
auto ctype = dynamic_pointer_cast<const collection_type_impl>(receiver.type);
if (!ctype) {
if (!receiver.is_counter()) {
-throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non counter column %s", receiver, receiver.name()));
+throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non counter column %s", to_string(receiver), receiver.name()));
}
auto v = _value->prepare(db, keyspace, receiver.column_specification);
return make_shared<constants::subtracter>(receiver, v);
}
if (!ctype->is_multi_cell()) {
throw exceptions::invalid_request_exception(
-sprint("Invalid operation (%s) for frozen collection column %s", receiver, receiver.name()));
+sprint("Invalid operation (%s) for frozen collection column %s", to_string(receiver), receiver.name()));
}
if (&ctype->_kind == &collection_type_impl::kind::list) {
@@ -150,14 +164,19 @@ operation::subtraction::is_compatible_with(shared_ptr<raw_update> other) {
return !dynamic_pointer_cast<set_value>(other);
}
+sstring
+operation::prepend::to_string(const column_definition& receiver) const {
+return format("{} = {} + {}", receiver.name_as_text(), *_value, receiver.name_as_text());
+}
shared_ptr<operation>
operation::prepend::prepare(database& db, const sstring& keyspace, const column_definition& receiver) {
auto v = _value->prepare(db, keyspace, receiver.column_specification);
if (!dynamic_cast<const list_type_impl*>(receiver.type.get())) {
-throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non list column %s", receiver, receiver.name()));
+throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for non list column %s", to_string(receiver), receiver.name()));
} else if (!receiver.type->is_multi_cell()) {
-throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for frozen list column %s", receiver, receiver.name()));
+throw exceptions::invalid_request_exception(sprint("Invalid operation (%s) for frozen list column %s", to_string(receiver), receiver.name()));
}
return make_shared<lists::prepender>(receiver, std::move(v));

@@ -203,6 +203,8 @@ public:
const shared_ptr<term::raw> _selector;
const shared_ptr<term::raw> _value;
const bool _by_uuid;
+private:
+sstring to_string(const column_definition& receiver) const;
public:
set_element(shared_ptr<term::raw> selector, shared_ptr<term::raw> value, bool by_uuid = false)
: _selector(std::move(selector)), _value(std::move(value)), _by_uuid(by_uuid) {
@@ -215,6 +217,8 @@ public:
class addition : public raw_update {
const shared_ptr<term::raw> _value;
+private:
+sstring to_string(const column_definition& receiver) const;
public:
addition(shared_ptr<term::raw> value)
: _value(value) {
@@ -227,6 +231,8 @@ public:
class subtraction : public raw_update {
const shared_ptr<term::raw> _value;
+private:
+sstring to_string(const column_definition& receiver) const;
public:
subtraction(shared_ptr<term::raw> value)
: _value(value) {
@@ -239,6 +245,8 @@ public:
class prepend : public raw_update {
shared_ptr<term::raw> _value;
+private:
+sstring to_string(const column_definition& receiver) const;
public:
prepend(shared_ptr<term::raw> value)
: _value(std::move(value)) {

@@ -71,7 +71,12 @@ private:
, _text(std::move(text))
{}
public:
+operator_type(const operator_type&) = delete;
+operator_type& operator=(const operator_type&) = delete;
const operator_type& reverse() const { return _reverse; }
+bool is_slice() const {
+return (*this == LT) || (*this == LTE) || (*this == GT) || (*this == GTE);
+}
sstring to_string() const { return _text; }
bool operator==(const operator_type& other) const { return this == &other; }
bool operator!=(const operator_type& other) const { return this != &other; }

@@ -49,6 +49,23 @@ thread_local const query_options::specific_options query_options::specific_optio
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, std::experimental::nullopt,
std::vector<cql3::raw_value_view>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
+query_options::query_options(db::consistency_level consistency,
+std::experimental::optional<std::vector<sstring_view>> names,
+std::vector<cql3::raw_value> values,
+std::vector<cql3::raw_value_view> value_views,
+bool skip_metadata,
+specific_options options,
+cql_serialization_format sf)
+: _consistency(consistency)
+, _names(std::move(names))
+, _values(std::move(values))
+, _value_views(value_views)
+, _skip_metadata(skip_metadata)
+, _options(std::move(options))
+, _cql_serialization_format(sf)
+{
+}
query_options::query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
@@ -82,18 +99,29 @@ query_options::query_options(db::consistency_level consistency,
{
}
-query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values)
+query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values, specific_options options)
: query_options(
cl,
{},
std::move(values),
false,
-query_options::specific_options::DEFAULT,
+std::move(options),
cql_serialization_format::latest()
)
{
}
+query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<service::pager::paging_state> paging_state)
+: query_options(qo->_consistency,
+std::move(qo->_names),
+std::move(qo->_values),
+std::move(qo->_value_views),
+qo->_skip_metadata,
+std::move(query_options::specific_options{qo->_options.page_size, paging_state, qo->_options.serial_consistency, qo->_options.timestamp}),
+qo->_cql_serialization_format) {
+}
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, std::move(values))
@@ -181,19 +209,18 @@ void query_options::prepare(const std::vector<::shared_ptr<column_specification>
}
auto& names = *_names;
-std::vector<cql3::raw_value> ordered_values;
+std::vector<cql3::raw_value_view> ordered_values;
ordered_values.reserve(specs.size());
for (auto&& spec : specs) {
auto& spec_name = spec->name->text();
for (size_t j = 0; j < names.size(); j++) {
if (names[j] == spec_name) {
-ordered_values.emplace_back(_values[j]);
+ordered_values.emplace_back(_value_views[j]);
break;
}
}
}
-_values = std::move(ordered_values);
-fill_value_views();
+_value_views = std::move(ordered_values);
}
void query_options::fill_value_views()

@@ -108,6 +108,13 @@ public:
bool skip_metadata,
specific_options options,
cql_serialization_format sf);
+explicit query_options(db::consistency_level consistency,
+std::experimental::optional<std::vector<sstring_view>> names,
+std::vector<cql3::raw_value> values,
+std::vector<cql3::raw_value_view> value_views,
+bool skip_metadata,
+specific_options options,
+cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
@@ -140,7 +147,8 @@ public:
// forInternalUse
explicit query_options(std::vector<cql3::raw_value> values);
-explicit query_options(db::consistency_level, std::vector<cql3::raw_value> values);
+explicit query_options(db::consistency_level, std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
+explicit query_options(std::unique_ptr<query_options>, ::shared_ptr<service::pager::paging_state> paging_state);
db::consistency_level get_consistency() const;
cql3::raw_value_view get_value_at(size_t idx) const;

@@ -38,19 +38,19 @@
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
-#include <seastar/core/metrics.hh>
+#define CRYPTOPP_ENABLE_NAMESPACE_WEAK 1
#include "cql3/query_processor.hh"
+#include <cryptopp/md5.h>
+#include <seastar/core/metrics.hh>
#include "cql3/CqlParser.hpp"
#include "cql3/error_collector.hh"
#include "cql3/statements/batch_statement.hh"
#include "cql3/util.hh"
#include "transport/messages/result_message.hh"
-#define CRYPTOPP_ENABLE_NAMESPACE_WEAK 1
-#include <cryptopp/md5.h>
namespace cql3 {
using namespace statements;
@@ -68,9 +68,8 @@ const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono
class query_processor::internal_state {
service::query_state _qs;
public:
-internal_state()
-: _qs(service::client_state{service::client_state::internal_tag()})
-{ }
+internal_state() : _qs(service::client_state{service::client_state::internal_tag()}) {
+}
operator service::query_state&() {
return _qs;
}
@@ -92,74 +91,102 @@ api::timestamp_type query_processor::next_timestamp() {
return _internal_state->next_timestamp();
}
-query_processor::query_processor(distributed<service::storage_proxy>& proxy,
-distributed<database>& db)
-: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
-, _proxy(proxy)
-, _db(db)
-, _internal_state(new internal_state())
-, _prepared_cache(prep_cache_log)
-{
+query_processor::query_processor(distributed<service::storage_proxy>& proxy, distributed<database>& db)
+: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
+, _proxy(proxy)
+, _db(db)
+, _internal_state(new internal_state())
+, _prepared_cache(prep_cache_log) {
namespace sm = seastar::metrics;
-_metrics.add_group("query_processor", {
-sm::make_derive("statements_prepared", _stats.prepare_invocations,
-sm::description("Counts a total number of parsed CQL requests.")),
-});
+_metrics.add_group(
+"query_processor",
+{
+sm::make_derive(
+"statements_prepared",
+_stats.prepare_invocations,
+sm::description("Counts a total number of parsed CQL requests."))});
-_metrics.add_group("cql", {
-sm::make_derive("reads", _cql_stats.reads,
-sm::description("Counts a total number of CQL read requests.")),
+_metrics.add_group(
+"cql",
+{
+sm::make_derive(
+"reads",
+_cql_stats.reads,
+sm::description("Counts a total number of CQL read requests.")),
-sm::make_derive("inserts", _cql_stats.inserts,
-sm::description("Counts a total number of CQL INSERT requests.")),
+sm::make_derive(
+"inserts",
+_cql_stats.inserts,
+sm::description("Counts a total number of CQL INSERT requests.")),
-sm::make_derive("updates", _cql_stats.updates,
-sm::description("Counts a total number of CQL UPDATE requests.")),
+sm::make_derive(
+"updates",
+_cql_stats.updates,
+sm::description("Counts a total number of CQL UPDATE requests.")),
-sm::make_derive("deletes", _cql_stats.deletes,
-sm::description("Counts a total number of CQL DELETE requests.")),
+sm::make_derive(
+"deletes",
+_cql_stats.deletes,
+sm::description("Counts a total number of CQL DELETE requests.")),
-sm::make_derive("batches", _cql_stats.batches,
-sm::description("Counts a total number of CQL BATCH requests.")),
+sm::make_derive(
+"batches",
+_cql_stats.batches,
+sm::description("Counts a total number of CQL BATCH requests.")),
-sm::make_derive("statements_in_batches", _cql_stats.statements_in_batches,
-sm::description("Counts a total number of sub-statements in CQL BATCH requests.")),
+sm::make_derive(
+"statements_in_batches",
+_cql_stats.statements_in_batches,
+sm::description("Counts a total number of sub-statements in CQL BATCH requests.")),
-sm::make_derive("batches_pure_logged", _cql_stats.batches_pure_logged,
-sm::description("Counts a total number of LOGGED batches that were executed as LOGGED batches.")),
+sm::make_derive(
+"batches_pure_logged",
+_cql_stats.batches_pure_logged,
+sm::description(
+"Counts a total number of LOGGED batches that were executed as LOGGED batches.")),
-sm::make_derive("batches_pure_unlogged", _cql_stats.batches_pure_unlogged,
-sm::description("Counts a total number of UNLOGGED batches that were executed as UNLOGGED batches.")),
+sm::make_derive(
+"batches_pure_unlogged",
+_cql_stats.batches_pure_unlogged,
+sm::description(
+"Counts a total number of UNLOGGED batches that were executed as UNLOGGED "
+"batches.")),
-sm::make_derive("batches_unlogged_from_logged", _cql_stats.batches_unlogged_from_logged,
-sm::description("Counts a total number of LOGGED batches that were executed as UNLOGGED batches.")),
+sm::make_derive(
+"batches_unlogged_from_logged",
+_cql_stats.batches_unlogged_from_logged,
+sm::description("Counts a total number of LOGGED batches that were executed as UNLOGGED "
+"batches.")),
-sm::make_derive("prepared_cache_evictions", [] { return prepared_statements_cache::shard_stats().prepared_cache_evictions; },
-sm::description("Counts a number of prepared statements cache entries evictions.")),
+sm::make_derive(
+"prepared_cache_evictions",
+[] { return prepared_statements_cache::shard_stats().prepared_cache_evictions; },
+sm::description("Counts a number of prepared statements cache entries evictions.")),
-sm::make_gauge("prepared_cache_size", [this] { return _prepared_cache.size(); },
-sm::description("A number of entries in the prepared statements cache.")),
+sm::make_gauge(
+"prepared_cache_size",
+[this] { return _prepared_cache.size(); },
+sm::description("A number of entries in the prepared statements cache.")),
-sm::make_gauge("prepared_cache_memory_footprint", [this] { return _prepared_cache.memory_footprint(); },
-sm::description("Size (in bytes) of the prepared statements cache.")),
-});
+sm::make_gauge(
+"prepared_cache_memory_footprint",
+[this] { return _prepared_cache.memory_footprint(); },
+sm::description("Size (in bytes) of the prepared statements cache."))});
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
}
-query_processor::~query_processor()
-{}
+query_processor::~query_processor() {
+}
-future<> query_processor::stop()
-{
+future<> query_processor::stop() {
service::get_local_migration_manager().unregister_listener(_migration_subscriber.get());
return make_ready_future<>();
}
future<::shared_ptr<result_message>>
-query_processor::process(const sstring_view& query_string, service::query_state& query_state, query_options& options)
-{
+query_processor::process(const sstring_view& query_string, service::query_state& query_state, query_options& options) {
log.trace("process: \"{}\"", query_string);
tracing::trace(query_state.get_trace_state(), "Parsing a statement");
auto p = get_statement(query_string, query_state.get_client_state());
@@ -179,14 +206,10 @@ query_processor::process(const sstring_view& query_string, service::query_state&
}
future<::shared_ptr<result_message>>
-query_processor::process_statement(::shared_ptr<cql_statement> statement,
-service::query_state& query_state,
-const query_options& options)
-{
-#if 0
-logger.trace("Process {} @CL.{}", statement, options.getConsistency());
-#endif
+query_processor::process_statement(
+::shared_ptr<cql_statement> statement,
+service::query_state& query_state,
+const query_options& options) {
return statement->check_access(query_state.get_client_state()).then([this, statement, &query_state, &options]() {
auto& client_state = query_state.get_client_state();
@@ -210,38 +233,50 @@ query_processor::process_statement(::shared_ptr<cql_statement> statement,
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
-query_processor::prepare(sstring query_string, service::query_state& query_state)
-{
+query_processor::prepare(sstring query_string, service::query_state& query_state) {
auto& client_state = query_state.get_client_state();
return prepare(std::move(query_string), client_state, client_state.is_thrift());
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
-query_processor::prepare(sstring query_string, const service::client_state& client_state, bool for_thrift)
-{
+query_processor::prepare(sstring query_string, const service::client_state& client_state, bool for_thrift) {
using namespace cql_transport::messages;
if (for_thrift) {
-return prepare_one<result_message::prepared::thrift>(std::move(query_string), client_state, compute_thrift_id, prepared_cache_key_type::thrift_id);
+return prepare_one<result_message::prepared::thrift>(
+std::move(query_string),
+client_state,
+compute_thrift_id, prepared_cache_key_type::thrift_id);
} else {
-return prepare_one<result_message::prepared::cql>(std::move(query_string), client_state, compute_id, prepared_cache_key_type::cql_id);
+return prepare_one<result_message::prepared::cql>(
+std::move(query_string),
+client_state,
+compute_id,
+prepared_cache_key_type::cql_id);
}
}
::shared_ptr<cql_transport::messages::result_message::prepared>
-query_processor::get_stored_prepared_statement(const std::experimental::string_view& query_string,
-const sstring& keyspace,
-bool for_thrift)
-{
+query_processor::get_stored_prepared_statement(
+const std::experimental::string_view& query_string,
+const sstring& keyspace,
+bool for_thrift) {
using namespace cql_transport::messages;
if (for_thrift) {
-return get_stored_prepared_statement_one<result_message::prepared::thrift>(query_string, keyspace, compute_thrift_id, prepared_cache_key_type::thrift_id);
+return get_stored_prepared_statement_one<result_message::prepared::thrift>(
+query_string,
+keyspace,
+compute_thrift_id,
+prepared_cache_key_type::thrift_id);
} else {
-return get_stored_prepared_statement_one<result_message::prepared::cql>(query_string, keyspace, compute_id, prepared_cache_key_type::cql_id);
+return get_stored_prepared_statement_one<result_message::prepared::cql>(
+query_string,
+keyspace,
+compute_id,
+prepared_cache_key_type::cql_id);
}
}
-static bytes md5_calculate(const std::experimental::string_view& s)
-{
+static bytes md5_calculate(const std::experimental::string_view& s) {
constexpr size_t size = CryptoPP::Weak1::MD5::DIGESTSIZE;
CryptoPP::Weak::MD5 hash;
unsigned char digest[size];
@@ -253,13 +288,15 @@ static sstring hash_target(const std::experimental::string_view& query_string, c
return keyspace + query_string.to_string();
}
-prepared_cache_key_type query_processor::compute_id(const std::experimental::string_view& query_string, const sstring& keyspace)
-{
+prepared_cache_key_type query_processor::compute_id(
+const std::experimental::string_view& query_string,
+const sstring& keyspace) {
return prepared_cache_key_type(md5_calculate(hash_target(query_string, keyspace)));
}
prepared_cache_key_type query_processor::compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace)
{
prepared_cache_key_type query_processor::compute_thrift_id(
const std::experimental::string_view& query_string,
const sstring& keyspace) {
auto target = hash_target(query_string, keyspace);
uint32_t h = 0;
for (auto&& c : hash_target(query_string, keyspace)) {
@@ -269,11 +306,7 @@ prepared_cache_key_type query_processor::compute_thrift_id(const std::experiment
}
std::unique_ptr<prepared_statement>
query_processor::get_statement(const sstring_view& query, const service::client_state& client_state)
{
#if 0
Tracing.trace("Parsing {}", queryStr);
#endif
query_processor::get_statement(const sstring_view& query, const service::client_state& client_state) {
::shared_ptr<raw::parsed_statement> statement = parse_statement(query);
// Set keyspace for statements that require login
@@ -281,16 +314,12 @@ query_processor::get_statement(const sstring_view& query, const service::client_
if (cf_stmt) {
cf_stmt->prepare_keyspace(client_state);
}
#if 0
Tracing.trace("Preparing statement");
#endif
++_stats.prepare_invocations;
return statement->prepare(_db.local(), _cql_stats);
}
::shared_ptr<raw::parsed_statement>
query_processor::parse_statement(const sstring_view& query)
{
query_processor::parse_statement(const sstring_view& query) {
try {
auto statement = util::do_with_parser(query, std::mem_fn(&cql3_parser::CqlParser::query));
if (!statement) {
@@ -307,12 +336,14 @@ query_processor::parse_statement(const sstring_view& query)
}
}
query_options query_processor::make_internal_options(const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>& values,
db::consistency_level cl)
{
query_options query_processor::make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>& values,
db::consistency_level cl,
int32_t page_size) {
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(sprint("Invalid number of values. Expecting %d but got %d", p->bound_names.size(), values.size()));
throw std::invalid_argument(
sprint("Invalid number of values. Expecting %d but got %d", p->bound_names.size(), values.size()));
}
auto ni = p->bound_names.begin();
std::vector<cql3::raw_value> bound_values;
@@ -326,11 +357,19 @@ query_options query_processor::make_internal_options(const statements::prepared_
bound_values.push_back(cql3::raw_value::make_value(n->type->decompose(v)));
}
}
if (page_size > 0) {
::shared_ptr<service::pager::paging_state> paging_state;
db::consistency_level serial_consistency = db::consistency_level::SERIAL;
api::timestamp_type ts = api::missing_timestamp;
return query_options(
cl,
bound_values,
cql3::query_options::specific_options{page_size, std::move(paging_state), serial_consistency, ts});
}
return query_options(cl, bound_values);
}
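The new `page_size` parameter changes how the options are built: the value-count check is unchanged, and paging-specific options are attached only when the page size is positive. A minimal standalone sketch of that decision follows. `options_sketch`, `make_options_sketch` and `options_example` are hypothetical names for illustration and do not model the real `cql3::query_options`; the sketch assumes C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified options type; not the real cql3::query_options.
struct options_sketch {
    std::vector<int> bound_values;
    std::optional<std::int32_t> page_size;  // empty means "no paging"
};

// Reject a wrong number of bound values, then enable paging only when a
// positive page size is requested (the default of -1 disables it).
inline options_sketch make_options_sketch(std::size_t expected_count,
        std::vector<int> values, std::int32_t page_size = -1) {
    if (expected_count != values.size()) {
        throw std::invalid_argument(
            "Invalid number of values. Expecting " + std::to_string(expected_count)
            + " but got " + std::to_string(values.size()));
    }
    options_sketch opts{std::move(values), std::nullopt};
    if (page_size > 0) {
        opts.page_size = page_size;
    }
    return opts;
}

// Usage: the default page size leaves paging off, a positive one turns it on.
inline void options_example() {
    options_sketch plain = make_options_sketch(2, {1, 2});
    options_sketch paged = make_options_sketch(1, {7}, 100);
    assert(!plain.page_size.has_value());
    assert(paged.page_size.has_value());
}
```

In the patch itself the paged branch also carries a paging state, a serial consistency level and a timestamp; those are left out here to keep the sketch short.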
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string)
{
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
auto& p = _internal_statements[query_string];
if (p == nullptr) {
auto np = parse_statement(query_string)->prepare(_db.local(), _cql_stats);
@@ -341,33 +380,128 @@ statements::prepared_statement::checked_weak_ptr query_processor::prepare_intern
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(const sstring& query_string,
const std::initializer_list<data_value>& values)
{
query_processor::execute_internal(const sstring& query_string, const std::initializer_list<data_value>& values) {
if (log.is_enabled(logging::log_level::trace)) {
log.trace("execute_internal: \"{}\" ({})", query_string, ::join(", ", values));
}
return execute_internal(prepare_internal(query_string), values);
}
struct internal_query_state {
sstring query_string;
std::unique_ptr<query_options> opts;
statements::prepared_statement::checked_weak_ptr p;
bool more_results = true;
};
::shared_ptr<internal_query_state> query_processor::create_paged_state(const sstring& query_string,
const std::initializer_list<data_value>& values, int32_t page_size) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, db::consistency_level::ONE, page_size);
::shared_ptr<internal_query_state> res = ::make_shared<internal_query_state>(
internal_query_state{
query_string,
std::make_unique<cql3::query_options>(std::move(opts)), std::move(p),
true});
return res;
}
bool query_processor::has_more_results(::shared_ptr<cql3::internal_query_state> state) const {
if (state) {
return state->more_results;
}
return false;
}
future<> query_processor::for_each_cql_result(
::shared_ptr<cql3::internal_query_state> state,
std::function<stop_iteration(const cql3::untyped_result_set::row&)>&& f) {
return do_with(seastar::shared_ptr<bool>(), [f, this, state](auto& is_done) mutable {
is_done = seastar::make_shared<bool>(false);
auto stop_when = [is_done]() {
return *is_done;
};
auto do_resuls = [is_done, state, f, this]() mutable {
return this->execute_paged_internal(
state).then([is_done, state, f, this](::shared_ptr<cql3::untyped_result_set> msg) mutable {
if (msg->empty()) {
*is_done = true;
} else {
if (!this->has_more_results(state)) {
*is_done = true;
}
for (auto& row : *msg) {
if (f(row) == stop_iteration::yes) {
*is_done = true;
break;
}
}
}
});
};
return do_until(stop_when, do_resuls);
});
}
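The control flow of `for_each_cql_result` can be easier to follow without the futures. The sketch below is a synchronous approximation under stated assumptions: `paged_source`, `for_each_row` and `paging_example` are hypothetical names, rows are plain integers, and `stop_iteration` is redefined locally rather than taken from Seastar. It stops in the same three cases as the code above: an empty page, no more pages, or the callback returning `stop_iteration::yes`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins; none of these names come from the patch.
enum class stop_iteration { no, yes };

struct paged_source {
    std::vector<int> rows;          // every row the "query" would return
    std::size_t page_size;          // rows fetched per page
    std::size_t next = 0;           // index of the first unread row
    std::size_t pages_fetched = 0;  // how many pages were requested

    bool has_more() const { return next < rows.size(); }

    // Return the next page (possibly empty) and advance the cursor.
    std::vector<int> fetch_page() {
        ++pages_fetched;
        std::size_t end = std::min(next + page_size, rows.size());
        std::vector<int> page(rows.begin() + next, rows.begin() + end);
        next = end;
        return page;
    }
};

// Visit rows page by page until a page comes back empty, the source reports
// no more pages, or the callback asks to stop. Returns the rows visited.
inline std::size_t for_each_row(paged_source& src,
        const std::function<stop_iteration(int)>& f) {
    std::size_t visited = 0;
    bool done = false;
    while (!done) {
        std::vector<int> page = src.fetch_page();
        if (page.empty() || !src.has_more()) {
            done = true;  // this is the last page to process
        }
        for (int row : page) {
            ++visited;
            if (f(row) == stop_iteration::yes) {
                done = true;
                break;
            }
        }
    }
    return visited;
}

// Usage: stopping at row 3 means the third page is never requested.
inline void paging_example() {
    paged_source src{{1, 2, 3, 4, 5}, 2};
    std::size_t n = for_each_row(src, [](int r) {
        return r == 3 ? stop_iteration::yes : stop_iteration::no;
    });
    assert(n == 3u);
    assert(src.pages_fetched == 2u);
}
```

The real implementation keeps the "done" flag in a shared pointer because it has to outlive each continuation; a local `bool` is enough in this synchronous version.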
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(statements::prepared_statement::checked_weak_ptr p,
const std::initializer_list<data_value>& values)
{
auto opts = make_internal_options(p, values);
query_processor::execute_paged_internal(::shared_ptr<internal_query_state> state) {
return state->p->statement->execute_internal(_proxy, *_internal_state, *state->opts).then(
[state, this](::shared_ptr<cql_transport::messages::result_message> msg) mutable {
class visitor : public result_message::visitor_base {
::shared_ptr<internal_query_state> _state;
query_processor& _qp;
public:
visitor(::shared_ptr<internal_query_state> state, query_processor& qp) : _state(state), _qp(qp) {
}
virtual ~visitor() = default;
void visit(const result_message::rows& rmrs) override {
auto& rs = rmrs.rs();
if (rs.get_metadata().paging_state()) {
bool done = !rs.get_metadata().flags().contains<cql3::metadata::flag::HAS_MORE_PAGES>();
if (done) {
_state->more_results = false;
} else {
const service::pager::paging_state& st = *rs.get_metadata().paging_state();
shared_ptr<service::pager::paging_state> shrd = ::make_shared<service::pager::paging_state>(st);
_state->opts = std::make_unique<query_options>(std::move(_state->opts), shrd);
_state->p = _qp.prepare_internal(_state->query_string);
}
} else {
_state->more_results = false;
}
}
};
visitor v(state, *this);
if (msg != nullptr) {
msg->accept(v);
}
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
statements::prepared_statement::checked_weak_ptr p,
const std::initializer_list<data_value>& values) {
query_options opts = make_internal_options(p, values);
return do_with(std::move(opts), [this, p = std::move(p)](auto& opts) {
return p->statement->execute_internal(_proxy, *_internal_state, opts).then([stmt = p->statement](auto msg) {
return p->statement->execute_internal(
_proxy,
*_internal_state,
opts).then([&opts, stmt = p->statement](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
}
future<::shared_ptr<untyped_result_set>>
query_processor::process(const sstring& query_string,
db::consistency_level cl,
const std::initializer_list<data_value>& values,
bool cache)
{
query_processor::process(
const sstring& query_string,
db::consistency_level cl,
const std::initializer_list<data_value>& values,
bool cache) {
if (cache) {
return process(prepare_internal(query_string), cl, values);
} else {
@@ -379,10 +513,10 @@ query_processor::process(const sstring& query_string,
}
future<::shared_ptr<untyped_result_set>>
query_processor::process(statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
const std::initializer_list<data_value>& values)
{
query_processor::process(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
const std::initializer_list<data_value>& values) {
auto opts = make_internal_options(p, values, cl);
return do_with(std::move(opts), [this, p = std::move(p)](auto & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then([](auto msg) {
@@ -392,10 +526,10 @@ query_processor::process(statements::prepared_statement::checked_weak_ptr p,
}
future<::shared_ptr<cql_transport::messages::result_message>>
query_processor::process_batch(::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options)
{
query_processor::process_batch(
::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options) {
return batch->check_access(query_state.get_client_state()).then([this, &query_state, &options, batch] {
batch->validate();
batch->validate(_proxy, query_state.get_client_state());
@@ -403,101 +537,90 @@ query_processor::process_batch(::shared_ptr<statements::batch_statement> batch,
});
}
query_processor::migration_subscriber::migration_subscriber(query_processor* qp)
: _qp{qp}
{
query_processor::migration_subscriber::migration_subscriber(query_processor* qp) : _qp{qp} {
}
void query_processor::migration_subscriber::on_create_keyspace(const sstring& ks_name)
{
void query_processor::migration_subscriber::on_create_keyspace(const sstring& ks_name) {
}
void query_processor::migration_subscriber::on_create_column_family(const sstring& ks_name, const sstring& cf_name)
{
void query_processor::migration_subscriber::on_create_column_family(const sstring& ks_name, const sstring& cf_name) {
}
void query_processor::migration_subscriber::on_create_user_type(const sstring& ks_name, const sstring& type_name)
{
void query_processor::migration_subscriber::on_create_user_type(const sstring& ks_name, const sstring& type_name) {
}
void query_processor::migration_subscriber::on_create_function(const sstring& ks_name, const sstring& function_name)
{
void query_processor::migration_subscriber::on_create_function(const sstring& ks_name, const sstring& function_name) {
log.warn("{} event ignored", __func__);
}
void query_processor::migration_subscriber::on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name)
{
void query_processor::migration_subscriber::on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) {
log.warn("{} event ignored", __func__);
}
void query_processor::migration_subscriber::on_create_view(const sstring& ks_name, const sstring& view_name)
{
void query_processor::migration_subscriber::on_create_view(const sstring& ks_name, const sstring& view_name) {
}
void query_processor::migration_subscriber::on_update_keyspace(const sstring& ks_name)
{
void query_processor::migration_subscriber::on_update_keyspace(const sstring& ks_name) {
}
void query_processor::migration_subscriber::on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool columns_changed)
{
void query_processor::migration_subscriber::on_update_column_family(
const sstring& ks_name,
const sstring& cf_name,
bool columns_changed) {
// #1255: Ignoring columns_changed deliberately.
log.info("Column definitions for {}.{} changed, invalidating related prepared statements", ks_name, cf_name);
remove_invalid_prepared_statements(ks_name, cf_name);
}
void query_processor::migration_subscriber::on_update_user_type(const sstring& ks_name, const sstring& type_name)
{
void query_processor::migration_subscriber::on_update_user_type(const sstring& ks_name, const sstring& type_name) {
}
void query_processor::migration_subscriber::on_update_function(const sstring& ks_name, const sstring& function_name)
{
void query_processor::migration_subscriber::on_update_function(const sstring& ks_name, const sstring& function_name) {
}
void query_processor::migration_subscriber::on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name)
{
void query_processor::migration_subscriber::on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) {
}
void query_processor::migration_subscriber::on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed)
{
void query_processor::migration_subscriber::on_update_view(
const sstring& ks_name,
const sstring& view_name, bool columns_changed) {
}
void query_processor::migration_subscriber::on_drop_keyspace(const sstring& ks_name)
{
void query_processor::migration_subscriber::on_drop_keyspace(const sstring& ks_name) {
remove_invalid_prepared_statements(ks_name, std::experimental::nullopt);
}
void query_processor::migration_subscriber::on_drop_column_family(const sstring& ks_name, const sstring& cf_name)
{
void query_processor::migration_subscriber::on_drop_column_family(const sstring& ks_name, const sstring& cf_name) {
remove_invalid_prepared_statements(ks_name, cf_name);
}
void query_processor::migration_subscriber::on_drop_user_type(const sstring& ks_name, const sstring& type_name)
{
void query_processor::migration_subscriber::on_drop_user_type(const sstring& ks_name, const sstring& type_name) {
}
void query_processor::migration_subscriber::on_drop_function(const sstring& ks_name, const sstring& function_name)
{
void query_processor::migration_subscriber::on_drop_function(const sstring& ks_name, const sstring& function_name) {
log.warn("{} event ignored", __func__);
}
void query_processor::migration_subscriber::on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name)
{
void query_processor::migration_subscriber::on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) {
log.warn("{} event ignored", __func__);
}
void query_processor::migration_subscriber::on_drop_view(const sstring& ks_name, const sstring& view_name)
{
void query_processor::migration_subscriber::on_drop_view(const sstring& ks_name, const sstring& view_name) {
remove_invalid_prepared_statements(ks_name, view_name);
}
void query_processor::migration_subscriber::remove_invalid_prepared_statements(sstring ks_name, std::experimental::optional<sstring> cf_name)
{
void query_processor::migration_subscriber::remove_invalid_prepared_statements(
sstring ks_name,
std::experimental::optional<sstring> cf_name) {
_qp->_prepared_cache.remove_if([&] (::shared_ptr<cql_statement> stmt) {
return this->should_invalidate(ks_name, cf_name, stmt);
});
}
bool query_processor::migration_subscriber::should_invalidate(sstring ks_name, std::experimental::optional<sstring> cf_name, ::shared_ptr<cql_statement> statement)
{
bool query_processor::migration_subscriber::should_invalidate(
sstring ks_name,
std::experimental::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement) {
return statement->depends_on_keyspace(ks_name) && (!cf_name || statement->depends_on_column_family(*cf_name));
}
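The predicate in `should_invalidate` treats a missing table name as "any table in the keyspace", which is why `on_drop_keyspace` can pass an empty optional. A minimal standalone sketch of that rule follows. `stub_statement`, `should_invalidate_sketch` and `invalidation_example` are hypothetical names, the stub depends on exactly one table, and `std::optional` (C++17) stands in for the `std::experimental::optional` used in the patch.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for a prepared statement that touches one table.
struct stub_statement {
    std::string keyspace;
    std::string table;

    bool depends_on_keyspace(const std::string& ks) const {
        return keyspace == ks;
    }
    bool depends_on_column_family(const std::string& cf) const {
        return table == cf;
    }
};

// A statement is invalidated when it uses the keyspace and either no table
// was named (keyspace-wide event) or it uses the named table.
inline bool should_invalidate_sketch(const std::string& ks_name,
        const std::optional<std::string>& cf_name,
        const stub_statement& stmt) {
    return stmt.depends_on_keyspace(ks_name)
        && (!cf_name || stmt.depends_on_column_family(*cf_name));
}

// Usage: dropping the keyspace hits the statement, dropping another table
// in the same keyspace does not.
inline void invalidation_example() {
    stub_statement stmt{"ks1", "t1"};
    assert(should_invalidate_sketch("ks1", std::nullopt, stmt));
    assert(!should_invalidate_sketch("ks1", std::string("t2"), stmt));
}
```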


@@ -43,21 +43,22 @@
#include <experimental/string_view>
#include <unordered_map>
#include <seastar/core/metrics_registration.hh>
#include "core/shared_ptr.hh"
#include "exceptions/exceptions.hh"
#include <seastar/core/distributed.hh>
#include <seastar/core/metrics_registration.hh>
#include <seastar/core/shared_ptr.hh>
#include "cql3/prepared_statements_cache.hh"
#include "cql3/query_options.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/statements/raw/parsed_statement.hh"
#include "cql3/statements/raw/cf_statement.hh"
#include "cql3/untyped_result_set.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "service/migration_manager.hh"
#include "service/query_state.hh"
#include "log.hh"
#include "core/distributed.hh"
#include "statements/prepared_statement.hh"
#include "transport/messages/result_message.hh"
#include "untyped_result_set.hh"
#include "prepared_statements_cache.hh"
namespace cql3 {
@@ -65,14 +66,22 @@ namespace statements {
class batch_statement;
}
class prepared_statement_is_too_big : public std::exception {
public:
static constexpr int max_query_prefix = 100;
class untyped_result_set;
class untyped_result_set_row;
private:
/*!
* \brief Holds the internal state needed for paging,
* which must be passed along when the statement is executed.
*
*/
struct internal_query_state;
class prepared_statement_is_too_big : public std::exception {
sstring _msg;
public:
static constexpr int max_query_prefix = 100;
prepared_statement_is_too_big(const sstring& query_string)
: _msg(seastar::format("Prepared statement is too big: {}", query_string.substr(0, max_query_prefix)))
{
@@ -107,15 +116,33 @@ private:
class internal_state;
std::unique_ptr<internal_state> _internal_state;
public:
query_processor(distributed<service::storage_proxy>& proxy, distributed<database>& db);
~query_processor();
prepared_statements_cache _prepared_cache;
// A map for prepared statements used internally (which we don't want to mix with user statements; in particular we
// don't bother with expiration on those).
std::unordered_map<sstring, std::unique_ptr<statements::prepared_statement>> _internal_statements;
public:
static const sstring CQL_VERSION;
static prepared_cache_key_type compute_id(
const std::experimental::string_view& query_string,
const sstring& keyspace);
static prepared_cache_key_type compute_thrift_id(
const std::experimental::string_view& query_string,
const sstring& keyspace);
static ::shared_ptr<statements::raw::parsed_statement> parse_statement(const std::experimental::string_view& query);
query_processor(distributed<service::storage_proxy>& proxy, distributed<database>& db);
~query_processor();
distributed<database>& db() {
return _db;
}
distributed<service::storage_proxy>& proxy() {
return _proxy;
}
@@ -124,125 +151,6 @@ public:
return _cql_stats;
}
#if 0
public static final QueryProcessor instance = new QueryProcessor();
#endif
private:
#if 0
private static final Logger logger = LoggerFactory.getLogger(QueryProcessor.class);
private static final MemoryMeter meter = new MemoryMeter().withGuessing(MemoryMeter.Guess.FALLBACK_BEST).ignoreKnownSingletons();
private static final long MAX_CACHE_PREPARED_MEMORY = Runtime.getRuntime().maxMemory() / 256;
private static EntryWeigher<MD5Digest, ParsedStatement.Prepared> cqlMemoryUsageWeigher = new EntryWeigher<MD5Digest, ParsedStatement.Prepared>()
{
@Override
public int weightOf(MD5Digest key, ParsedStatement.Prepared value)
{
return Ints.checkedCast(measure(key) + measure(value.statement) + measure(value.boundNames));
}
};
private static EntryWeigher<Integer, ParsedStatement.Prepared> thriftMemoryUsageWeigher = new EntryWeigher<Integer, ParsedStatement.Prepared>()
{
@Override
public int weightOf(Integer key, ParsedStatement.Prepared value)
{
return Ints.checkedCast(measure(key) + measure(value.statement) + measure(value.boundNames));
}
};
#endif
prepared_statements_cache _prepared_cache;
std::unordered_map<sstring, std::unique_ptr<statements::prepared_statement>> _internal_statements;
#if 0
// A map for prepared statements used internally (which we don't want to mix with user statement, in particular we don't
// bother with expiration on those.
private static final ConcurrentMap<String, ParsedStatement.Prepared> internalStatements = new ConcurrentHashMap<>();
// Direct calls to processStatement do not increment the preparedStatementsExecuted/regularStatementsExecuted
// counters. Callers of processStatement are responsible for correctly notifying metrics
public static final CQLMetrics metrics = new CQLMetrics();
private static final AtomicInteger lastMinuteEvictionsCount = new AtomicInteger(0);
static
{
preparedStatements = new ConcurrentLinkedHashMap.Builder<MD5Digest, ParsedStatement.Prepared>()
.maximumWeightedCapacity(MAX_CACHE_PREPARED_MEMORY)
.weigher(cqlMemoryUsageWeigher)
.listener(new EvictionListener<MD5Digest, ParsedStatement.Prepared>()
{
public void onEviction(MD5Digest md5Digest, ParsedStatement.Prepared prepared)
{
metrics.preparedStatementsEvicted.inc();
lastMinuteEvictionsCount.incrementAndGet();
}
}).build();
thriftPreparedStatements = new ConcurrentLinkedHashMap.Builder<Integer, ParsedStatement.Prepared>()
.maximumWeightedCapacity(MAX_CACHE_PREPARED_MEMORY)
.weigher(thriftMemoryUsageWeigher)
.listener(new EvictionListener<Integer, ParsedStatement.Prepared>()
{
public void onEviction(Integer integer, ParsedStatement.Prepared prepared)
{
metrics.preparedStatementsEvicted.inc();
lastMinuteEvictionsCount.incrementAndGet();
}
})
.build();
ScheduledExecutors.scheduledTasks.scheduleAtFixedRate(new Runnable()
{
public void run()
{
long count = lastMinuteEvictionsCount.getAndSet(0);
if (count > 0)
logger.info("{} prepared statements discarded in the last minute because cache limit reached ({} bytes)",
count,
MAX_CACHE_PREPARED_MEMORY);
}
}, 1, 1, TimeUnit.MINUTES);
}
public static int preparedStatementsCount()
{
return preparedStatements.size() + thriftPreparedStatements.size();
}
// Work around initialization dependency
private static enum InternalStateInstance
{
INSTANCE;
private final QueryState queryState;
InternalStateInstance()
{
ClientState state = ClientState.forInternalCalls();
try
{
state.setKeyspace(SystemKeyspace.NAME);
}
catch (InvalidRequestException e)
{
throw new RuntimeException();
}
this.queryState = new QueryState(state);
}
}
private static QueryState internalQueryState()
{
return InternalStateInstance.INSTANCE.queryState;
}
private QueryProcessor()
{
MigrationManager.instance.register(new MigrationSubscriber());
}
#endif
public:
statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
auto it = _prepared_cache.find(key);
if (it == _prepared_cache.end()) {
@@ -251,128 +159,69 @@ public:
return *it;
}
#if 0
public static void validateKey(ByteBuffer key) throws InvalidRequestException
{
if (key == null || key.remaining() == 0)
{
throw new InvalidRequestException("Key may not be empty");
}
future<::shared_ptr<cql_transport::messages::result_message>>
process_statement(
::shared_ptr<cql_statement> statement,
service::query_state& query_state,
const query_options& options);
// check that key can be handled by FBUtilities.writeShortByteArray
if (key.remaining() > FBUtilities.MAX_UNSIGNED_SHORT)
{
throw new InvalidRequestException("Key length of " + key.remaining() +
" is longer than maximum of " + FBUtilities.MAX_UNSIGNED_SHORT);
}
}
future<::shared_ptr<cql_transport::messages::result_message>>
process(
const std::experimental::string_view& query_string,
service::query_state& query_state,
query_options& options);
public static void validateCellNames(Iterable<CellName> cellNames, CellNameType type) throws InvalidRequestException
{
for (CellName name : cellNames)
validateCellName(name, type);
}
public static void validateCellName(CellName name, CellNameType type) throws InvalidRequestException
{
validateComposite(name, type);
if (name.isEmpty())
throw new InvalidRequestException("Invalid empty value for clustering column of COMPACT TABLE");
}
public static void validateComposite(Composite name, CType type) throws InvalidRequestException
{
long serializedSize = type.serializer().serializedSize(name, TypeSizes.NATIVE);
if (serializedSize > Cell.MAX_NAME_LENGTH)
throw new InvalidRequestException(String.format("The sum of all clustering columns is too long (%s > %s)",
serializedSize,
Cell.MAX_NAME_LENGTH));
}
#endif
public:
future<::shared_ptr<cql_transport::messages::result_message>> process_statement(::shared_ptr<cql_statement> statement,
service::query_state& query_state, const query_options& options);
#if 0
public static ResultMessage process(String queryString, ConsistencyLevel cl, QueryState queryState)
throws RequestExecutionException, RequestValidationException
{
return instance.process(queryString, queryState, QueryOptions.forInternalCalls(cl, Collections.<ByteBuffer>emptyList()));
}
#endif
future<::shared_ptr<cql_transport::messages::result_message>> process(const std::experimental::string_view& query_string,
service::query_state& query_state, query_options& options);
#if 0
public static ParsedStatement.Prepared parseStatement(String queryStr, QueryState queryState) throws RequestValidationException
{
return getStatement(queryStr, queryState.getClientState());
}
public static UntypedResultSet process(String query, ConsistencyLevel cl) throws RequestExecutionException
{
try
{
ResultMessage result = instance.process(query, QueryState.forInternalCalls(), QueryOptions.forInternalCalls(cl, Collections.<ByteBuffer>emptyList()));
if (result instanceof ResultMessage.Rows)
return UntypedResultSet.create(((ResultMessage.Rows)result).result);
else
return null;
}
catch (RequestValidationException e)
{
throw new RuntimeException(e);
}
}
private static QueryOptions makeInternalOptions(ParsedStatement.Prepared prepared, Object[] values)
{
if (prepared.boundNames.size() != values.length)
throw new IllegalArgumentException(String.format("Invalid number of values. Expecting %d but got %d", prepared.boundNames.size(), values.length));
List<ByteBuffer> boundValues = new ArrayList<ByteBuffer>(values.length);
for (int i = 0; i < values.length; i++)
{
Object value = values[i];
AbstractType type = prepared.boundNames.get(i).type;
boundValues.add(value instanceof ByteBuffer || value == null ? (ByteBuffer)value : type.decompose(value));
}
return QueryOptions.forInternalCalls(boundValues);
}
private static ParsedStatement.Prepared prepareInternal(String query) throws RequestValidationException
{
ParsedStatement.Prepared prepared = internalStatements.get(query);
if (prepared != null)
return prepared;
// Note: if 2 threads prepare the same query, we'll live so don't bother synchronizing
prepared = parseStatement(query, internalQueryState());
prepared.statement.validate(internalQueryState().getClientState());
internalStatements.putIfAbsent(query, prepared);
return prepared;
}
#endif
private:
query_options make_internal_options(const statements::prepared_statement::checked_weak_ptr& p, const std::initializer_list<data_value>&, db::consistency_level = db::consistency_level::ONE);
public:
future<::shared_ptr<untyped_result_set>> execute_internal(
const sstring& query_string,
const std::initializer_list<data_value>& = { });
future<::shared_ptr<untyped_result_set>>
execute_internal(const sstring& query_string, const std::initializer_list<data_value>& = { });
statements::prepared_statement::checked_weak_ptr prepare_internal(const sstring& query);
future<::shared_ptr<untyped_result_set>> execute_internal(
statements::prepared_statement::checked_weak_ptr p,
const std::initializer_list<data_value>& = { });
future<::shared_ptr<untyped_result_set>>
execute_internal(statements::prepared_statement::checked_weak_ptr p, const std::initializer_list<data_value>& = { });
/*!
* \brief iterate over all cql results using paging
*
* Create a statement with optional parameters and pass
* a function that goes over the results.
*
* The passed function is called for each result row; return stop_iteration::yes
* to stop the iteration early.
*
* For example:
return query("SELECT * from system.compaction_history",
[&history] (const cql3::untyped_result_set::row& row) mutable {
....
....
return stop_iteration::no;
});
* You can use placeholders in the query; the statement is prepared only once.
*
* query_string - the CQL string, can contain placeholders
* f - a function to be run on each row of the query result; if the function returns
* stop_iteration::yes the iteration stops
* args - arbitrary number of query parameters
*/
template<typename... Args>
future<> query(
const sstring& query_string,
std::function<stop_iteration(const cql3::untyped_result_set_row&)>&& f,
Args&&... args) {
return for_each_cql_result(
create_paged_state(query_string, { data_value(std::forward<Args>(args))... }), std::move(f));
}
future<::shared_ptr<untyped_result_set>> process(
const sstring& query_string,
db::consistency_level, const std::initializer_list<data_value>& = { }, bool cache = false);
const sstring& query_string,
db::consistency_level,
const std::initializer_list<data_value>& = { },
bool cache = false);
future<::shared_ptr<untyped_result_set>> process(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level, const std::initializer_list<data_value>& = { });
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level,
const std::initializer_list<data_value>& = { });
/*
* This function provides a timestamp that is guaranteed to be higher than any timestamp
@@ -384,115 +233,110 @@ public:
*/
api::timestamp_type next_timestamp();
#if 0
public static UntypedResultSet executeInternalWithPaging(String query, int pageSize, Object... values)
{
try
{
ParsedStatement.Prepared prepared = prepareInternal(query);
if (!(prepared.statement instanceof SelectStatement))
throw new IllegalArgumentException("Only SELECTs can be paged");
SelectStatement select = (SelectStatement)prepared.statement;
QueryPager pager = QueryPagers.localPager(select.getPageableCommand(makeInternalOptions(prepared, values)));
return UntypedResultSet.create(select, pager, pageSize);
}
catch (RequestValidationException e)
{
throw new RuntimeException("Error validating query" + e);
}
}
/**
* Same than executeInternal, but to use for queries we know are only executed once so that the
* created statement object is not cached.
*/
public static UntypedResultSet executeOnceInternal(String query, Object... values)
{
try
{
ParsedStatement.Prepared prepared = parseStatement(query, internalQueryState());
prepared.statement.validate(internalQueryState().getClientState());
ResultMessage result = prepared.statement.executeInternal(internalQueryState(), makeInternalOptions(prepared, values));
if (result instanceof ResultMessage.Rows)
return UntypedResultSet.create(((ResultMessage.Rows)result).result);
else
return null;
}
catch (RequestExecutionException e)
{
throw new RuntimeException(e);
}
catch (RequestValidationException e)
{
throw new RuntimeException("Error validating query " + query, e);
}
}
public static UntypedResultSet resultify(String query, Row row)
{
return resultify(query, Collections.singletonList(row));
}
public static UntypedResultSet resultify(String query, List<Row> rows)
{
try
{
SelectStatement ss = (SelectStatement) getStatement(query, null).statement;
ResultSet cqlRows = ss.process(rows);
return UntypedResultSet.create(cqlRows);
}
catch (RequestValidationException e)
{
throw new AssertionError(e);
}
}
#endif
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(sstring query_string, service::query_state& query_state);
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(sstring query_string, const service::client_state& client_state, bool for_thrift);
static prepared_cache_key_type compute_id(const std::experimental::string_view& query_string, const sstring& keyspace);
static prepared_cache_key_type compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace);
future<> stop();
future<::shared_ptr<cql_transport::messages::result_message>>
process_batch(::shared_ptr<statements::batch_statement>, service::query_state& query_state, query_options& options);
std::unique_ptr<statements::prepared_statement> get_statement(
const std::experimental::string_view& query,
const service::client_state& client_state);
friend class migration_subscriber;
private:
query_options make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>&,
db::consistency_level = db::consistency_level::ONE,
int32_t page_size = -1);
/*!
* \brief create a state object for paging
*
* When using paging internally, a state object is needed.
*/
::shared_ptr<internal_query_state> create_paged_state(
const sstring& query_string,
const std::initializer_list<data_value>& = { },
int32_t page_size = 1000);
/*!
* \brief run a query using paging
*/
future<::shared_ptr<untyped_result_set>> execute_paged_internal(::shared_ptr<internal_query_state> state);
/*!
* \brief iterate over all results using paging
*/
future<> for_each_cql_result(
::shared_ptr<cql3::internal_query_state> state,
std::function<stop_iteration(const cql3::untyped_result_set_row&)>&& f);
/*!
* \brief check, based on the state if there are additional results
* Users of the paging, should not use the internal_query_state directly
*/
bool has_more_results(::shared_ptr<cql3::internal_query_state> state) const;
///
/// \tparam ResultMsgType type of the returned result message (CQL or Thrift)
/// \tparam PreparedKeyGenerator a function that generates the prepared statement cache key for given query and keyspace
/// \tparam IdGetter a function that returns the corresponding prepared statement ID (CQL or Thrift) for a given prepared statement cache key
/// \tparam PreparedKeyGenerator a function that generates the prepared statement cache key for given query and
/// keyspace
/// \tparam IdGetter a function that returns the corresponding prepared statement ID (CQL or Thrift) for a given
//// prepared statement cache key
/// \param query_string
/// \param client_state
/// \param id_gen prepared ID generator, called before the first deferring
/// \param id_getter prepared ID getter, passed to deferred context by reference. The caller must ensure its liveness.
/// \param id_getter prepared ID getter, passed to deferred context by reference. The caller must ensure its
//// liveness.
/// \return
template <typename ResultMsgType, typename PreparedKeyGenerator, typename IdGetter>
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare_one(sstring query_string, const service::client_state& client_state, PreparedKeyGenerator&& id_gen, IdGetter&& id_getter) {
return do_with(id_gen(query_string, client_state.get_raw_keyspace()), std::move(query_string), [this, &client_state, &id_getter] (const prepared_cache_key_type& key, const sstring& query_string) {
prepare_one(
sstring query_string,
const service::client_state& client_state,
PreparedKeyGenerator&& id_gen,
IdGetter&& id_getter) {
return do_with(
id_gen(query_string, client_state.get_raw_keyspace()),
std::move(query_string),
[this, &client_state, &id_getter](const prepared_cache_key_type& key, const sstring& query_string) {
return _prepared_cache.get(key, [this, &query_string, &client_state] {
auto prepared = get_statement(query_string, client_state);
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Too many markers(?). %d markers exceed the allowed maximum of %d", bound_terms, std::numeric_limits<uint16_t>::max()));
throw exceptions::invalid_request_exception(
sprint("Too many markers(?). %d markers exceed the allowed maximum of %d",
bound_terms,
std::numeric_limits<uint16_t>::max()));
}
assert(bound_terms == prepared->bound_names.size());
prepared->raw_cql_statement = query_string;
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
}).then([&key, &id_getter] (auto prep_ptr) {
return make_ready_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(::make_shared<ResultMsgType>(id_getter(key), std::move(prep_ptr)));
return make_ready_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(
::make_shared<ResultMsgType>(id_getter(key), std::move(prep_ptr)));
}).handle_exception_type([&query_string] (typename prepared_statements_cache::statement_is_too_big&) {
return make_exception_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(prepared_statement_is_too_big(query_string));
return make_exception_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(
prepared_statement_is_too_big(query_string));
});
});
};
template <typename ResultMsgType, typename KeyGenerator, typename IdGetter>
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement_one(const std::experimental::string_view& query_string, const sstring& keyspace, KeyGenerator&& key_gen, IdGetter&& id_getter)
{
get_stored_prepared_statement_one(
const std::experimental::string_view& query_string,
const sstring& keyspace,
KeyGenerator&& key_gen,
IdGetter&& id_getter) {
auto cache_key = key_gen(query_string, keyspace);
auto it = _prepared_cache.find(cache_key);
if (it == _prepared_cache.end()) {
@@ -503,55 +347,15 @@ private:
}
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace, bool for_thrift);
#if 0
public ResultMessage processPrepared(CQLStatement statement, QueryState queryState, QueryOptions options)
throws RequestExecutionException, RequestValidationException
{
List<ByteBuffer> variables = options.getValues();
// Check to see if there are any bound variables to verify
if (!(variables.isEmpty() && (statement.getBoundTerms() == 0)))
{
if (variables.size() != statement.getBoundTerms())
throw new InvalidRequestException(String.format("there were %d markers(?) in CQL but %d bound variables",
statement.getBoundTerms(),
variables.size()));
// at this point there is a match in count between markers and variables that is non-zero
if (logger.isTraceEnabled())
for (int i = 0; i < variables.size(); i++)
logger.trace("[{}] '{}'", i+1, variables.get(i));
}
metrics.preparedStatementsExecuted.inc();
return processStatement(statement, queryState, options);
}
#endif
public:
future<::shared_ptr<cql_transport::messages::result_message>> process_batch(::shared_ptr<statements::batch_statement>,
service::query_state& query_state, query_options& options);
std::unique_ptr<statements::prepared_statement> get_statement(const std::experimental::string_view& query,
const service::client_state& client_state);
static ::shared_ptr<statements::raw::parsed_statement> parse_statement(const std::experimental::string_view& query);
#if 0
private static long measure(Object key)
{
return meter.measureDeep(key);
}
#endif
public:
future<> stop();
friend class migration_subscriber;
get_stored_prepared_statement(
const std::experimental::string_view& query_string,
const sstring& keyspace,
bool for_thrift);
};
class query_processor::migration_subscriber : public service::migration_listener {
query_processor* _qp;
public:
migration_subscriber(query_processor* qp);
@@ -575,9 +379,14 @@ public:
virtual void on_drop_function(const sstring& ks_name, const sstring& function_name) override;
virtual void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override;
virtual void on_drop_view(const sstring& ks_name, const sstring& view_name) override;
private:
void remove_invalid_prepared_statements(sstring ks_name, std::experimental::optional<sstring> cf_name);
bool should_invalidate(sstring ks_name, std::experimental::optional<sstring> cf_name, ::shared_ptr<cql_statement> statement);
bool should_invalidate(
sstring ks_name,
std::experimental::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement);
};
extern distributed<query_processor> _the_query_processor;

Some files were not shown because too many files have changed in this diff