Compare commits


682 Commits

Author SHA1 Message Date
Amos Kong
44ec73cfc4 schema.cc/describe: fix invalid compaction options in schema
There is a typo in the snapshot's schema.cql: a comma is missing after the
compaction strategy, so restoring the schema from the file fails:

    AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}

The map_as_cql_param() function has a `first` parameter that controls whether
a leading comma is added; compaction_strategy_options is never the first entry.
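The comma logic described above can be sketched as follows (a hypothetical reconstruction, not the actual Scylla code; the function signature and rendering details are assumptions based on the message):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical sketch of the `first`-flag pattern: a comma is emitted
// before every entry except the very first one, so a map that is never
// first (like compaction_strategy_options) must be rendered with
// first=false, otherwise the comma separating it from the previous
// entry is lost.
void map_as_cql_param(const std::map<std::string, std::string>& m,
                      std::ostringstream& ss, bool first) {
    for (const auto& [k, v] : m) {
        if (!first) {
            ss << ", ";
        }
        first = false;
        ss << "'" << k << "': '" << v << "'";
    }
}
```

Rendering the compaction options from the example with the correct flags yields the comma that the buggy output was missing.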

Fixes #7741

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7734

(cherry picked from commit 6b1659ee80)
2021-03-24 12:58:11 +02:00
Tomasz Grabiec
df6f9a200f sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmless to our sstable reader, which recognizes both as
belonging to the current clustering row fragment and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. This is very unlikely to happen in practice: to trigger
promoted index entries for both atoms, the clustering key would have to
be so large that the size of the marker cell exceeds the desired
promoted index block size, which is 64KB by default (but
user-controlled via the column_index_size_in_kb option). 64KB is also
the limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods, slicing using clustering key restrictions, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, so both are
wrong!

They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.

The explanation for why the test started to fail after dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which the cursor touches during binary search depend
on the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.
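The way binary search misbehaves over order-violating entries can be illustrated with a toy analogue, plain std::lower_bound over an array that breaks the sorted-range precondition (this is not the actual promoted index code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy analogue of the corrupted promoted index: the positions should be
// sorted {1, 3, 5, 7}, but the entries standing in for the "row marker"
// (3) and the "row tombstone" (5) are swapped, violating the ordering
// precondition of binary search.
int find_position(int key) {
    std::vector<int> positions = {1, 5, 3, 7};
    auto it = std::lower_bound(positions.begin(), positions.end(), key);
    return *it;
}
```

With the usual midpoint-halving implementations, searching for 3 lands on 5 and skips the 3 that is actually present, just as the reader skipped the row marker.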

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>

(cherry picked from commit 9272e74e8c)
2021-03-24 10:42:11 +02:00
Nadav Har'El
2f4a3c271c storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command,
which used std::current_exception() - which wasn't relevant at that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
2021-03-21 10:51:36 +02:00
Raphael S. Carvalho
6a11c20b4a LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Reshape's relaxed mode, used during initialization, only tolerates
min_threshold (default: 4) sstables in L0. However, relaxed mode should
tolerate more sstables in level 0, otherwise boot will have to reshape
level 0 every time it crosses the min threshold. So let's make LCS reshape
tolerate the maximum of max_threshold and 32. This change is beneficial
because once the table is populated, LCS regular compaction can decide to
merge those sstables in level 0 into level 1 instead, therefore reducing
write amplification (WA).
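The new tolerance rule can be sketched like this (hypothetical function and parameter names; only the max(max_threshold, 32) computation is taken from the message):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: in relaxed mode, L0 triggers a reshape only once
// it holds more sstables than the maximum of max_threshold and 32,
// instead of the old min_threshold limit.
bool needs_reshape_level0(size_t l0_sstables, size_t max_threshold) {
    size_t tolerance = std::max(max_threshold, size_t(32));
    return l0_sstables > tolerance;
}
```

So with the default max_threshold, boot no longer reshapes L0 just for crossing the min threshold of 4.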

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
2021-03-18 19:20:10 +02:00
Raphael S. Carvalho
cccdd6aaae compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the aforementioned patch.

As a result, the space required for cleanup to succeed can be up to the size
of the keyspace, increasing the chances of the node running out of space.

The node could also run out of memory if there are many tables in the keyspace.
The memory requirement is at least #_of_tables * 128k (not taking into account
write-behind, etc.). With 5k tables, that's ~0.64G per shard.

Also, all tables being cleaned up in parallel compete for the same disk
and CPU bandwidth, making them all much slower and consequently making
the operation time significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
2021-03-18 14:29:38 +02:00
Raphael S. Carvalho
92871a88c3 compaction: Prevent cleanup and regular from compacting the same sstable
Due to a regression introduced by 463d0ab, regular compaction can compact, in
parallel, an sstable being compacted by cleanup, scrub or upgrade.

This redundancy wastes resources and increases write amplification,
operation time, etc.

It is also a potential source of data resurrection: the no-longer-owned data
from an sstable being compacted by both cleanup and regular compaction will
still exist in the node afterwards, so resurrection can happen if the node
regains ownership.
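The exclusion described above can be sketched as follows (hypothetical names; a toy model of tracking sstables claimed by cleanup/scrub/upgrade so regular compaction skips them):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model: sstables claimed by cleanup/scrub/upgrade are registered in
// a "compacting" set, and regular compaction filters its candidates
// against it, so the same sstable is never compacted twice in parallel.
std::set<int> compacting;  // sstable generations currently being rewritten

std::vector<int> eligible_for_regular_compaction(const std::vector<int>& candidates) {
    std::vector<int> out;
    for (int sst : candidates) {
        if (compacting.count(sst) == 0) {  // skip claimed sstables
            out.push_back(sst);
        }
    }
    return out;
}
```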

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
(cherry picked from commit 2cf0c4bbf1)

Includes fixup patch:

compaction_manager: Fix use-after-free in rewrite_sstables()

Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
(cherry picked from commit f7cc431477)
2021-03-11 08:24:56 +02:00
Benny Halevy
85bbf6751d repair: repair_writer: do not capture lw_shared_ptr cross-shard
The shared_from_this lw_shared_ptr must not be accessed
across shards.  Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that, since the captured lw_shared_ptr
is copied on other shards, and ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out of sync when incremented/decremented in parallel
on other shards with no synchronization).

This was introduced in 289a08072a.

The writer is not needed in the body of this lambda anyway,
so it doesn't need to capture it.  It is already held
by the continuations until the end of the chain.

Fixes #7535

Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
(cherry picked from commit f93fb55726)
2021-03-03 21:27:44 +02:00
Hagit Segev
0ac069fdcc release: prepare for 4.2.4 2021-03-02 14:52:31 +02:00
Avi Kivity
738f8eaccd Update seastar submodule
* seastar 1266e42c82...0fba7da929 (1):
  > io_queue: Fix "delay" metrics

Fixes #8166.
2021-03-01 13:59:02 +02:00
Avi Kivity
5d32e91e16 Update seastar submodule
* seastar f760efe0a0...1266e42c82 (1):
  > rpc: streaming sink: order outgoing messages

Fixes #7552.
2021-03-01 12:22:17 +02:00
Benny Halevy
6c5f6b3f69 large_data_handler: disable deletion of large data entries
Currently we decide whether to delete large data entries
based on the overall sstable data_size, since the entries
themselves are typically much smaller than the whole sstable
(especially cells and rows), this causes overzealous
deletions (#7668) and inefficiency in the rows cache
due to the large number of range tombstones created.

Refs #7575

Test: sstable_3_x_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

This patch is targeted for branch-4.3 or earlier.
In 4.4, the problem was fixed in #7669, but the fix
is out of scope for backporting.

Branch: 4.3
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201203130018.1920271-1-bhalevy@scylladb.com>
(cherry picked from commit bb99d7ced6)
2021-03-01 10:54:33 +02:00
Raphael S. Carvalho
fba26b78d2 sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (this can happen with data segregation).
When resizing candidates, the size of multi_window was incorrectly used;
it is always empty in this path, which means candidates were always
cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
(cherry picked from commit 21608bd677)
2021-02-28 16:43:02 +02:00
Pavel Solodovnikov
06e785994f large_data_handler: fix segmentation fault when constructing data_value from a nullptr
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:

In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.

These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.

This uses the `const char*` overload, which delegates to the
`std::string_view` ctor overload. It is UB to pass a `nullptr`
pointer to the `std::string_view` ctor, leading to
segmentation faults in the aforementioned large data reporting
code.

What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.
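The bug and fix can be modeled in a few lines (a simplified stand-in for data_value, not the actual class):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Simplified model of the bug and fix: the const char* constructor goes
// through std::string_view, whose constructor has undefined behavior for
// a null pointer; the dedicated std::nullptr_t overload builds a null
// value without touching string_view at all.
struct value {
    std::optional<std::string> text;

    value(const char* s)                                  // UB if s == nullptr
        : text(std::string(std::string_view(s))) {}
    value(std::nullptr_t)                                 // safe null value
        : text(std::nullopt) {}

    bool is_null() const { return !text.has_value(); }
};
```

Overload resolution picks the `std::nullptr_t` constructor for a literal `nullptr` (an exact match beats the `const char*` conversion), so the dangerous path is never taken.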

A regression test is provided for the issue (written in
`cql-pytest` framework).

Tests: test/cql-pytest/test_large_cells_rows.py

Fixes: #6780

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 219ac2bab5)
2021-02-23 12:14:12 +02:00
Takuya ASADA
5bc48673aa scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_partitions() reports that / is /dev/root, aws_instance
mistakenly reports the root partition as part of the ephemeral disks, and RAID
construction fails.
This patch prevents the error and reports the correct free disks.

Fixes #8055

Closes #8040

(cherry picked from commit 32d4ec6b8a)
2021-02-21 16:23:45 +02:00
Nadav Har'El
59a01b2981 alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted it a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always says that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.

The old duplicated code for check_BEGINS_WITH() - which had a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying whether each of its operands came from an item or from the query.

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 653610f4bc)
2021-02-21 10:06:50 +02:00
Nadav Har'El
5dd49788c1 alternator: fix UpdateItem ADD for non-existent attribute
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).

We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.

Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(

Fixes #7763.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
(cherry picked from commit a8fdbf31cd)
2021-02-21 08:58:49 +02:00
Benny Halevy
56cbc9f3ed stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
(cherry picked from commit d01e7e7b58)
2021-02-14 13:11:43 +02:00
Avi Kivity
7469896017 table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
on_compaction_completion() updates _sstables_compacted_but_not_deleted
through a temporary to avoid an exception causing a partial update:

  1. copy _sstables_compacted_but_not_deleted to a temporary
  2. update temporary
  3. do dangerous stuff
  4. move temporary to _sstables_compacted_but_not_deleted

This is racy when we have parallel compactions, since step 3 yields.
We can have two invocations running in parallel, taking snapshots
of the same _sstables_compacted_but_not_deleted in step 1, each
modifying it in different ways, and only one of them winning the
race and assigning in step 4. With the right timing we can end
with extra sstables in _sstables_compacted_but_not_deleted.

Before a5369881b3, this was a benign race (only resulting in
deleted file space not being reclaimed until the service is shut
down), but afterwards, extra sstable references result in the service
refusing to shut down. This was observed in database_test in debug
mode, where the race more or less reliably happens for system.truncated.

Fix by using a different method to protect
_sstables_compacted_but_not_deleted. We unconditionally update it,
and also unconditionally fix it up (on success or failure) using
seastar::defer(). The fixup includes a call to rebuild_statistics()
which must happen every time we touch the sstable list.
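The shape of the fix can be sketched with a minimal scope guard in the spirit of seastar::defer (a toy model; here the fixup simply undoes the update, standing in for the real removal plus rebuild_statistics()):

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <utility>

// Minimal scope guard: runs the stored callable at scope exit,
// on both the success and the failure path.
template <typename Func>
struct defer {
    Func f;
    explicit defer(Func fn) : f(std::move(fn)) {}
    ~defer() { f(); }
};

std::set<int> compacted_but_not_deleted;

void on_compaction_completion(int sstable, bool fail) {
    compacted_but_not_deleted.insert(sstable);       // unconditional update
    defer fixup{[&] {
        compacted_but_not_deleted.erase(sstable);    // unconditional fixup
    }};
    if (fail) {
        // "dangerous stuff" (step 3) may yield or throw; the guard still runs
        throw std::runtime_error("dangerous stuff failed");
    }
}
```

Because the update and the fixup both run unconditionally, there is no window in which a racing invocation can snapshot and later clobber a half-applied state.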

Ref #7331.
Fixes #8038.

BACKPORT NOTES:
- Turns out this race prevented deletion of expired sstables, because the leaked
deleted sstables would be accounted for when checking whether an expired sstable
can be purged.
- Switch to unordered_set<>::count(), as contains() is not supported by older
compilers.

(cherry picked from commit a43d5079f3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210212203832.45846-1-raphaelsc@scylladb.com>
2021-02-14 11:35:57 +02:00
Piotr Wojtczak
c7e2711dd4 Validate ascii values when creating from CQL
Although the code for it already existed, the validation function
wasn't being invoked properly. This change fixes that, adding
a validating check when converting from text to the specific value
type and throwing a marshal exception if some characters
are not ASCII.

Fixes #5421

Closes #7532

(cherry picked from commit caa3c471c0)
2021-02-10 19:37:56 +02:00
Piotr Dulikowski
a2355a35db hinted handoff: use default timeout for sending orphaned hints
This patch causes orphaned hints (hints that were written towards a node
that is no longer their replica) to be sent with a default write
timeout. This is what is currently done for non-orphaned hints.

Previously, the timeout was hardcoded to one hour. This could cause a
long delay while shutting down, as the hints manager waits until all ongoing
hint sending operations finish before stopping itself.

Fixes: #7051
(cherry picked from commit b111fa98ca)
2021-02-10 10:15:01 +02:00
Piotr Sarna
9e225ab447 Merge 'select_statement: Fix aggregate results on indexed selects (timeouts fixed) ' from Piotr Grabowski
Overview
Fixes #7355.

Before these changes, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).

Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR makes progress toward fixing that issue, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.

It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page).

GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
 pk
----
  1
  1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.

Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).

The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
 count
-------
     2
```
The second commit fixes this issue by properly trimming `row_ranges`.

The second fixed problem is related to setting the `paging_state` in
`internal_options`. It was improperly set to the value just after reading from
the index, making the base query start from an invalid `paging_state`.

The second commit fixes the first two failing examples from issue #7355.

Tests (fourth commit)
Extensively tests queries on tables with secondary indices with aggregates and `GROUP BY`s.

Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).

I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact on those fixed bugs, therefore the tests validate a large number of those scenarios.

Configurable internal_paging_size (third commit)
Before this change, the internal `page_size` used for aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). This change adds a new internal_paging_size variable, which is configurable via the `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.

Closes #7497

* github.com:scylladb/scylla:
  tests: Add secondary index aggregates tests
  select_statement: Introduce internal_paging_size
  select_statement: Fix paging on indexed selects
  select_statement: Fix GROUP BY on indexed select

(cherry picked from commit 8c645f74ce)
2021-02-08 20:32:36 +02:00
Amnon Heiman
e1205d1d5b API: Fix aggregation in column_familiy
A few methods in the column_family API were doing the aggregation wrong,
specifically bloom filter disk size.

The issue is not always visible, it happens when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007

(cherry picked from commit 4498bb0a48)
2021-02-08 17:04:27 +02:00
Avi Kivity
a78402efae Merge 'Add waiting for flushes on table drops' from Piotr Sarna
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables by default have `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. This leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush  (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault
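The waiting scheme the series adds can be sketched with a simple pending-flush counter (a hypothetical model, not the actual phaser/gate code):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Toy model of the waiting scheme: the table counts in-flight flushes,
// and dropping the table blocks until the count reaches zero, so a
// flush that already started can never touch freed memory.
class flush_tracker {
    std::mutex m;
    std::condition_variable cv;
    int pending = 0;
public:
    void flush_started() {
        std::lock_guard<std::mutex> g(m);
        ++pending;
    }
    void flush_finished() {
        std::lock_guard<std::mutex> g(m);
        --pending;
        cv.notify_all();
    }
    // Called before dropping the table: blocks until no flush is in flight.
    void wait_for_pending_flushes() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return pending == 0; });
    }
    int pending_flushes() {
        std::lock_guard<std::mutex> g(m);
        return pending;
    }
};
```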

Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)

Fixes #7792

Closes #7798

* github.com:scylladb/scylla:
  database: add flushes to waiting for pending operations
  table: unify waiting for pending operations
  database: add a phaser for flush operations
  database: add waiting for pending streams on table drop

(cherry picked from commit 7636799b18)
2021-02-02 17:23:34 +02:00
Avi Kivity
9fcf790234 row_cache: linearize key in cache_entry::do_read()
do_read() does not linearize cache_entry::_key; this can cause a crash
with keys larger than 13k.

Fixes #7897.

Closes #7898

(cherry picked from commit d508a63d4b)
2021-01-17 09:30:44 +02:00
Hagit Segev
24346215c2 release: prepare for 4.2.3 2021-01-04 19:51:12 +02:00
Benny Halevy
918ec5ecb3 compaction: compaction_writer: destroy shared_sstable after the sstable_writer
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer, we might hit a use-after-free
as in the following case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
 (inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
 (inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
 (inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
 (inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
 (inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
 (inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
 (inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
 (inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
 (inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
 (inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
 (inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
 (inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
 (inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
 (inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
 (inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
 (inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
 (inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
 (inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```

What happens here is that:

    compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
            : _out(std::move(out))
            , _compression_metadata(cm)
            , _offsets(_compression_metadata->offsets.get_writer())
            , _compression(lc)
            , _full_checksum(ChecksumType::init_checksum())

_compression_metadata points to a buffer held by the sstable object,
and _compression_metadata->offsets.get_writer() returns a writer that
keeps a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.

Fixes #7821

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
(cherry picked from commit 8a745a0ee0)
2021-01-04 15:04:34 +02:00
Avi Kivity
7457683328 Revert "Merge 'Move temporaries to value view' from Piotr S"
This reverts commit d1fa0adcbe. It causes
a regression when processing some bind variables.

Fixes #7761.
2020-12-24 12:40:46 +02:00
Gleb Natapov
567889d283 mutation_writer: pass exceptions through feed_writer
feed_writer() eats the exception and transforms it into an end-of-stream
instead. Downstream validators hate when this happens.

Fixes #7482
Message-Id: <20201216090038.GB3244976@scylladb.com>

(cherry picked from commit 61520a33d6)
2020-12-16 17:20:11 +02:00
Aleksandr Bykov
c605ed73bf dist: scylla_util: fix aws_instance.ebs_disks method
The aws_instance.ebs_disks() method should return EBS disks
instead of ephemeral ones.

Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com>

Closes #7780

(cherry picked from commit e74dc311e7)
2020-12-16 11:58:47 +02:00
Takuya ASADA
d0530d8ac2 node_exporter_install: stop service before force installing
Stop node-exporter.service before re-installing it, to avoid a 'Text file busy' error.

Fixes #6782

(cherry picked from commit ef05ea8e91)
2020-12-15 16:28:25 +02:00
Hagit Segev
696ef24226 release: prepare for 4.2.2 2020-12-13 20:34:03 +02:00
Avi Kivity
b8fe144301 dist: rpm: uninstall tuned when installing scylla-kernel-conf
tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).

Not needed for .deb, since Debian/Ubuntu do not install tuned by
default.

Fixes #7696

Closes #7776

(cherry picked from commit 615b8e8184)
2020-12-12 14:30:38 +02:00
Nadav Har'El
62f783be87 alternator: fix broken Scan/Query paging with bytes keys
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail: we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page). This function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).
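The needed distinction can be sketched in Python (a minimal illustration; Alternator's real code is C++ and the function name here is hypothetical):

```python
import base64

def encode_key_attr(attr_type, value):
    # For LastEvaluatedKey, strings ("S") and numbers ("N") can be
    # emitted as-is, but bytes ("B") must be base64-encoded for the
    # JSON wire format.
    if attr_type == "B":
        return base64.b64encode(value).decode("ascii")
    return value
```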

This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.

Fixes #7768

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
(cherry picked from commit 86779664f4)
2020-12-09 15:16:41 +02:00
Piotr Sarna
863e784951 db: fix getting local ranges for size estimates table
When getting local ranges, an assumption is made that
if a range does not contain an end, or when its end is the maximum token,
then it must contain a start. This assumption proved untrue
during manual tests, so the code is now fortified with an additional check.

Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:

(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
            _data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
      _start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
            _kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}

Closes #7764

(cherry picked from commit 1cc4ed50c1)
2020-12-09 15:16:14 +02:00
Nadav Har'El
e5a6199b4d alternator, test: make test_fetch_from_system_tables faster
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.

This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.

As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.

Fixes #7706

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
(cherry picked from commit 220d6dde17)
2020-12-09 15:15:15 +02:00
Nadav Har'El
abaf6c192a alternator: fix query with both projection and filtering
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.

The solution in this patch is to add the generated JSON item also
the extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.
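The fixed order of operations can be sketched in Python (names hypothetical; the real implementation is C++):

```python
def scan_item(item, projected, filter_attrs, filter_fn):
    # Keep the attributes the filter needs alongside the projected ones,
    needed = projected | filter_attrs
    working = {k: v for k, v in item.items() if k in needed}
    # run the filter on the combined item,
    if not filter_fn(working):
        return None  # item filtered out
    # and only then strip the filter-only attributes from the result.
    return {k: v for k, v in working.items() if k in projected}
```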

The two tests

 test_query_filter.py::test_query_filter_and_attributes_to_get
 test_filter_expression.py::test_filter_expression_and_projection_expression

which failed before this patch now pass, so we drop their "xfail" tag.

Fixes #6951.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 282742a469)
2020-12-09 14:39:17 +02:00
Eliran Sinvani
ef2f5ed434 consistency level: fix wrong quorum calculation when RF = 0
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0, in this situation quorum should
also be 0.
This commit adds the missing corner case.
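The corrected calculation, as a Python sketch (the real code is C++):

```python
def quorum_for(rf):
    # The naive (rf // 2) + 1 wrongly yields 1 for RF = 0;
    # with RF = 0 there are no replicas, so quorum must be 0.
    return 0 if rf == 0 else (rf // 2) + 1
```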

Tests: Unit Tests (dev)
Fixes #6905

Closes #7296

(cherry picked from commit 925cdc9ae1)
2020-11-29 16:45:14 +02:00
Raphael S. Carvalho
bac40e2512 sstable_directory: Fix 50% space requirement for resharding
This is a regression caused by aebd965f0.

After the sstable_directory changes, resharding now waits for all sstables
to be exhausted before releasing the reference to them, which prevents
resources like disk space and fds from being released. Let's restore the
old behavior of incrementally releasing resources, reducing the space
requirement significantly.

Fixes #7463.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com>
(cherry picked from commit 6f805bd123)
2020-11-29 15:26:14 +02:00
Asias He
681c4d77bb repair: Make repair_writer a shared pointer
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:

class repair_writer {
   _writer_done[node_idx] =
      mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
         ...
      }).handle_exception([this] {
         ...
      });
}

The fiber accesses the repair_writer object in the error handling path. We
wait for _writer_done to finish before we destroy the repair_meta
object, which contains the repair_writer object, to avoid the fiber
accessing an already-freed repair_writer object.

To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.

Fixes #7406

Closes #7430

(cherry picked from commit 289a08072a)
2020-11-29 13:30:49 +02:00
Pavel Emelyanov
8572ee9da2 query_pager: Fix continuation handling for noop visitor
Before updating the _last_[cp]key (for subsequent .fetch_page())
the pager checks 'if the pager is not exhausted OR the result
has data'.

The check is broken: if the pager is not exhausted but the
result is empty, the call for keys will unconditionally try to
reference the last element of an empty vector. The not-exhausted
condition with an empty result can happen if short_read is set,
which, in turn, unconditionally happens upon meeting the partition
end when visiting the partition with the result builder.

The correct check is 'if the pager is not exhausted AND
the result has data': the _last_[pc]key-s should be taken for
continuation (not exhausted), but can only be taken if the result is
not empty (has data).
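The corrected condition, sketched in Python (names hypothetical; the pager itself is C++):

```python
def should_update_last_keys(exhausted, result_rows):
    # Take keys for the next fetch_page() only when the pager is not
    # exhausted AND the result has data; 'not exhausted OR has data'
    # dereferences the last element of an empty result on short reads.
    return (not exhausted) and len(result_rows) > 0
```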

fixes: #7263
tests: unit(dev), but tests don't trigger this corner case

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921124329.21209-1-xemul@scylladb.com>
(cherry picked from commit 550fc734d9)
2020-11-29 12:01:37 +02:00
Takuya ASADA
95dbac56e5 install.sh: set PATH for relocatable CLI tools in python thunk
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but that doesn't work for perftune.py, since it's not part of
Scylla and does not use the scylla_util module.
We can set PATH in the python thunk instead, which sets PATH for all python scripts.

Fixes #7350

(cherry picked from commit 5867af4edd)
2020-11-29 11:54:42 +02:00
Bentsi Magidovich
eeadeff0dc scylla_util.py: fix exception handling in curl
The retry mechanism didn't work when a URLError happened. For example:

  urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>

Let's catch URLError instead of HTTPError, since URLError is the base
exception for all exceptions in the urllib module.

Fixes: #7569

Closes #7567

(cherry picked from commit 956b97b2a8)
2020-11-29 11:48:30 +02:00
Takuya ASADA
62f3caab18 dist/redhat: packaging dependencies.conf as normal file, not ghost
When we introduced dependencies.conf, we mistakenly added it to the rpm as %ghost,
but it should be a normal file, installed normally on package installation.

Fixes #7703

Closes #7704

(cherry picked from commit ba4d54efa3)
2020-11-29 11:40:22 +02:00
Takuya ASADA
1a4869231a install.sh: apply sysctl.d files on non-packaging installation
We don't apply sysctl.d files on non-packaging installations; apply them
just like the rpm/deb packaging does.

Fixes #7702

Closes #7705

(cherry picked from commit 5f81f97773)
2020-11-29 11:35:37 +02:00
Avi Kivity
3568d0cbb6 dist: sysctl: configure more inotify instances
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.

This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.

Increase to 1200, which is enough for 6 instances * 200 shards.
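The arithmetic behind the numbers, as a Python sketch (a simplified model assuming one inotify instance per shard per hot-reloaded TLS listener):

```python
def inotify_instances_needed(shards, tls_listeners):
    # One inotify instance per shard per certificate-reloading
    # TLS listener (simplified model of the usage described above).
    return shards * tls_listeners

# The default fs.inotify.max_user_instances of 128 covers only
# 64 shards with both rpc and cql encrypted (2 listeners);
# the raised limit of 1200 covers 6 listeners on 200 shards.
```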

Fixes #7700.

Closes #7701

(cherry picked from commit 390e07d591)
2020-11-29 11:04:45 +02:00
Raphael S. Carvalho
030c2e3270 compaction: Make sure a partition is filtered out only by producer
If an interposer consumer is enabled, partition filtering will be done by the
consumer instead, but that's not possible, because only the producer is able
to skip to the next partition if the current one is filtered out; so scylla
crashes with a bad function call in queue_reader when that happens.
This is a regression which started here: 55a8b6e3c9

To fix this problem, let's make sure that partition filtering will only
happen on the producer side.

Fixes #7590.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
(cherry picked from commit 13fa2bec4c)
2020-11-19 14:08:25 +02:00
Piotr Dulikowski
37a5e9ab15 hints: don't read hint files when it's not allowed to send
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:

- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
  target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
  this operation after sleeping 1 second.

This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.

This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.

First observed in #6964

Tests:
- unit(dev)
- hinted handoff dtests

Closes #7407

(cherry picked from commit 77a0f1a153)
2020-11-16 14:30:07 +02:00
Botond Dénes
a15b5d514d mutation_reader: queue_reader: don't set EOS flag on abort
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.

Additionally, make sure the producer is aborted too on abort. In theory
this is not needed, as they are the ones initiating the abort, but better
to be safe than sorry.

Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
(cherry picked from commit f5323b29d9)
2020-11-15 11:07:38 +02:00
Botond Dénes
064f8f8bcf types: validate(): linearize values lazily
Instead of eagerly linearizing all values as they are passed to
validate(), defer linearization to those validators that actually need
linearized values. Linearizing large values puts pressure on the memory
allocator with large contiguous allocation requests. This is something
we are trying to actively avoid, especially if it is not really needed.
It turns out the types whose validators really want linearized values are
a minority, as most validators just look at the size of the value, and
some, like bytes, don't need validation at all, while usually having large
values.

This is achieved by templating the validator struct on the view and
using the FragmentedRange concept to treat all passed in views
(`bytes_view` and `fragmented_temporary_buffer_view`) uniformly.
This patch makes no attempt at converting existing validators to work
with fragmented buffers, only trivial cases are converted. The major
offenders still left are ascii/utf8 and collections.

Fixes: #7318

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201007054524.909420-1-bdenes@scylladb.com>
(cherry picked from commit db56ae695c)
2020-11-11 10:55:54 +02:00
Amnon Heiman
04fe0a7395 scyllatop/livedata.py: Safe iteration over metrics
This patch changes the code that iterates over the metrics to use a copy
of the metric names, making it safe to remove metrics from the
metrics object.
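The copy-the-keys pattern, sketched in Python (names hypothetical):

```python
def prune_metrics(metrics, is_stale):
    # Iterating over list(metrics) -- a copy of the names -- makes it
    # safe to delete from the dict inside the loop; iterating the dict
    # directly would raise RuntimeError when it is mutated.
    for name in list(metrics):
        if is_stale(metrics[name]):
            del metrics[name]
    return metrics
```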

Fixes #7488

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 52db99f25f)
2020-11-08 19:16:13 +02:00
Calle Wilund
ef26d90868 partition_version: Change range_tombstones() to return chunked_vector
Refs #7364

The number of tombstones can be large. As a stopgap measure, short of
returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.

Closes #7433

(cherry picked from commit 4b65d67a1a)
2020-11-08 14:38:32 +02:00
Tomasz Grabiec
790f51c210 sstables: ka/la: Fix abort when next_partition() is called with certain reader state
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().

The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:

  scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.

This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.

The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.

Fixes #7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit fb9b5cae05)
2020-11-08 14:25:47 +02:00
Yaron Kaikov
4fb8ebccff release: prepare for 4.2.1 2020-11-08 12:41:06 +02:00
Avi Kivity
d1fa0adcbe Merge 'Move temporaries to value view' from Piotr S
"
Issue https://github.com/scylladb/scylla/issues/7019 describes a problem of an ever-growing map of temporary values stored in query_options. In order to mitigate this kind of problem, the storage for temporary values is moved from an external data structure to the value views themselves. This way, a temporary lives only as long as it's accessible and is automatically destroyed once a request finishes. The downside is that each temporary is now allocated separately, while previously they were bundled in a single byte stream.

Tests: unit(dev)
Fixes https://github.com/scylladb/scylla/issues/7019
"

7055297649 ("cql3: remove query_options::linearize and _temporaries")
is reverted from this backport since linearize() is still used in
this branch.

* psarna-move_temporaries_to_value_view:
  cql3: remove query_options::linearize and _temporaries
  cql3: remove make_temporary helper function
  cql3: store temporaries in-place instead of in query_options
  cql3: add temporary_value to value view
  cql3: allow moving data out of raw_value
  cql3: split values.hh into a .cc file

(cherry picked from commit 2b308a973f)
2020-11-05 19:24:23 +02:00
Piotr Sarna
46b56d885e schema_tables: fix fixing old secondary index schemas
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.

Fixes #7515

Closes #7516

(cherry picked from commit b66c285f94)
2020-11-05 17:53:08 +02:00
Yaron Kaikov
94597e38e2 release: prepare for 4.2.0 2020-10-25 09:12:38 +02:00
Piotr Sarna
c74ba1bc36 Merge 'Backport PR #7469 to 4.2' from Eliran Sinvani
This is a backport of PR #7469 that did not apply cleanly to 4.2 with a trivial conflict, another commit that touched one of the files but in a completely different region.

Closes #7480

* github.com:scylladb/scylla:
  materialized views: add a base table reference if missing
  view info: support partial match between base and view for only reading from view.
  view info: guard against null dereference of the base info
2020-10-23 17:18:02 +02:00
Eliran Sinvani
06cfc63c59 materialized views: add a base table reference if missing
Schema pointers can be obtained from two distinct entities:
one is the database, where schemas are obtained from the table
objects, and the other is the schema registry.
When a new schema is attached to a table object that
represents a base table for views, all of the corresponding attached
view schemas are guaranteed to have their base info in sync.
However, if an older schema is inserted into the registry by the
migration manager, i.e. loaded from another node, it will be
missing this info.
This becomes a problem when this schema is published through the
schema registry, as it can be obtained for an obsolete read command,
for example, and then eventually cause a segmentation fault by
dereferencing the null _base_info ptr.

Refs #7420
2020-10-23 18:09:45 +03:00
Eliran Sinvani
56d25930ec view info: support partial match between base and view for
only reading from view.

The current implementation of materialized views does
not keep the base-table version to which a specific version of a
materialized view schema corresponds. This complicates things, especially
for old view versions that the schema doesn't support anymore. However,
a view, being also an independent table, should allow reading from
it as long as it exists, even if the base table has changed since then.
For reading purposes, we don't need to know the exact composition
of view primary key columns that are not part of the base primary
key; we only need to know that there are some, and this is a much
looser constraint on the schema.
We can rely on table invariants, such as the fact that pk columns are
not going to disappear in newer versions of the table.
This means that if we don't find a view column in the base table, it is
not a part of the base table primary key.
This information is enough for us to perform reads on the view.
This commit adds support for relying on such partial
information, along with a validation that it is not going to be used for
writes. If it is, we simply abort, since this means that our schema
integrity is compromised.
2020-10-23 18:08:56 +03:00
Eliran Sinvani
fa1cd048d7 view info: guard against null dereference of the base info
The change's purpose is to guard against segfault that is the
result of dereferencing the _base_info member when it is
uninitialized. We already know this can happen (#7420).
The only purpose of this change is to treat this condition as
an internal error, the reason is that it indicates a schema integrity
problem.
Besides this change, other measures should be taken to ensure that
the _base_table member is initialized before calling methods that
rely on it.
We call the internal_error as a last resort.
2020-10-23 18:08:56 +03:00
Nadav Har'El
94b754eee5 alternator: change name of Alternator's SSL options
When Alternator is enabled over HTTPS - by setting the
"alternator_https_port" option - it needs to know some SSL-related options,
most importantly where to pick up the certificate and key.

Before this patch, we used the "server_encryption_options" option for that.
However, this was a mistake: Although it sounds like these are the "server's
options", in fact prior to Alternator this option was only used when
communicating with other servers - i.e., connections between Scylla nodes.
For CQL connections with the client, we used a different option -
"client_encryption_options".

This patch introduces a third option "alternator_encryption_options", which
controls only Alternator's HTTPS server. Making it separate from the
existing CQL "client_encryption_options" allows both Alternator and CQL to
be active at the same time but with different certificates (if the user
so wishes).

For backward compatibility, we temporarily continue to allow
server_encryption_options to control the Alternator HTTPS server if
alternator_encryption_options is not specified. However, this generates
a warning in the log, urging the user to switch. This temporary workaround
should be removed in a future version.
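A minimal scylla.yaml fragment illustrating the new option (the sub-keys and paths shown are an assumption, mirroring client_encryption_options, and are not confirmed by this commit message):

```yaml
# Hypothetical example; key names assumed to mirror client_encryption_options.
alternator_https_port: 8043
alternator_encryption_options:
  certificate: /etc/scylla/scylla.crt
  keyfile: /etc/scylla/scylla.key
```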

This patch also:
1. fixes the test run code (which has an "--https" option to test over
   https) to use the new name of the option.
2. Adds documentation of the new option in alternator.md and protocols.md -
   previously the information on how to control the location of the
   certificate was missing from these documents.

Fixes #7204.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930123027.213587-1-nyh@scylladb.com>
(cherry picked from commit 509a41db04)
2020-10-18 18:05:04 +03:00
Takuya ASADA
e02defb3bf install.sh: set LC_ALL=en_US.UTF-8 on python3 thunk
scylla-python3 causes a segfault when a non-default locale is specified.
As a workaround, we need to set LC_ALL=en_US.UTF-8 in the python3 thunk.

Fixes #7408

Closes #7414

(cherry picked from commit ff129ee030)
2020-10-18 15:02:32 +03:00
Botond Dénes
95e712e244 reader_permit: reader_resources: make true RAII class
Currently in all cases we first deduct the to-be-consumed resources,
then construct the `reader_resources` class to protect it (release it on
destruction). This is error prone as it relies on no exception being
thrown while constructing the `reader_resources`. Although the
`reader_resources` constructor is `noexcept` right now, this might change
in the future, and as the call sites relying on this are disconnected
from the declaration, whoever modifies them might not notice.
To make this safe going forward, make the `reader_resources` a true RAII
class, consuming the units in its constructor and releasing them in its
destructor.
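A Python analog of the RAII pattern (a sketch only; the real class is C++ and consumes reader-permit units):

```python
class Semaphore:
    def __init__(self, available):
        self.available = available

class ReaderResources:
    # Consume the units in the constructor and release them exactly
    # once on close(), so there is no window where resources are
    # deducted but not yet guarded against leaks.
    def __init__(self, sem, units):
        sem.available -= units   # consume on construction
        self._sem = sem
        self._units = units

    def close(self):
        self._sem.available += self._units   # release on destruction
        self._units = 0                      # idempotent: safe to call twice
```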

Fixes: #7256

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200922150625.1253798-1-bdenes@scylladb.com>
(cherry picked from commit a0107ba1c6)
2020-10-18 15:00:47 +03:00
Avi Kivity
a9109f068c Update seastar submodule
* seastar 61b88d1da4...f760efe0a0 (1):
  > append_challenged_posix_file_impl: allow destructing file with no queued work

Fixes #7285.
2020-10-12 15:11:46 +03:00
Piotr Sarna
2e71779970 Merge 'Fix view_builder lockup and crash on shutdown' from Pavel
The lockup:

When view_builder starts, all shards at some point get to a
barrier, waiting for each other to pass. If any shard misses
this checkpoint, all others get stuck forever. As this barrier
lives inside the _started future, which in turn is waited on
on stop, the stop gets stuck as well.

Reasons to miss the barrier -- exception in the middle of the
fun^w start or explicit abort request while waiting for the
schema agreement.

Fix the "exception" case by unlocking the barrier promise with
exception and fix the "abort request" case by turning it into
an exception.

The bug can be reproduced by hand by making one shard never
see the schema agreement and continue looping until the abort
request.

The crash:

If the background start up fails, then the _started future is
resolved into exception. The view_builder::stop then turns this
future into a real exception caught-and-rethrown by main.cc.

It seems wrong that a failure in a background fiber aborts
the regular shutdown, which may otherwise proceed.

tests: unit(dev), manual start-stop
branch: https://github.com/xemul/scylla/tree/br-view-builder-shutdown-fix-3
fixes: #7077

Patch #5 leaves the seastar::async() in the 1st phase of the
start(), although it can also be tuned not to produce a thread.
However, there's one more (painless) issue with the _sem usage,
so this change appears too large for the part of the bug-fix
and will come as a followup.

* 'br-view-builder-shutdown-fix-3' of git://github.com/xemul/scylla:
  view_builder: Add comment about builder instances life-times
  view_builder: Do sleep abortable
  view_builder: Wakeup barrier on exception
  view_builder: Always resolve started future to success
  view_builder: Re-futurize start
  view_builder: Split calculate_shard_build_step into two
  view_builder: Populate the view_builder_init_state
  view_builder: Fix indentation after previous patch
  view_builder: Introduce view_builder_init_state

(cherry picked from commit ca9422ca73)
2020-10-07 15:05:12 +03:00
Gleb Natapov
5ed7de81ad lwt: do not return unavailable exception from the 'learn' stage
An unavailable exception means that the operation was not started and can be
retried safely. If LWT fails in the learn stage, though, it most
certainly means that its effect will already be observable. The patch
returns a timeout exception instead, which means uncertainty.

Fixes #7258

Message-Id: <20201001130724.GA2283830@scylladb.com>
(cherry picked from commit 3e8dbb3c09)
2020-10-07 10:59:30 +02:00
Juliusz Stasiewicz
5e21c9bd8a tracing: Fix error on slow batches
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is example of failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```

In such case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```

The fix here is to leave the query empty if not found. The users can still
retrieve the query contents from the existing info.

Fixes #5843

Closes #7293

(cherry picked from commit 0afa738a8f)
2020-10-04 18:04:22 +03:00
Avi Kivity
8e22fddc9e Merge 'Fix ignoring cells after null in appending hash' from Piotr Sarna
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes #4567
Based on #4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for #4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible

(cherry picked from commit 0e03c979d2)
2020-10-01 23:23:00 +02:00
Avi Kivity
6cf2d998e3 Merge "Fix race in schema version recalculation leading to stale schema version in gossip" from Tomasz
"
Migration manager installs several cluster feature change listeners.
The listeners will call update_schema_version_and_announce() when cluster
features are enabled, which does this:

    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });

It first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.

The fix is to serialize schema digest calculation and publishing.

Refs #7200

This problem also brought my attention to initialization code, which could be
prone to the same problem.

The storage service computes gossiper states before it starts the
gossiper. Among them, node's schema version. There are two problems with that.

First is that computing the schema version and publishing it is not
atomic, so is not safe against concurrent schema changes or schema
version recalculations. It will not exclude with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.

The second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.

Maybe we're not doing concurrent schema changes/recalculations now,
but it is easy to imagine that this could change for whatever reason
in the future.

The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.

Tests:

  - unit (dev)
  - manual (3 node ccm)
"

Fixes #7291

* tag 'fix-schema-digest-calculation-race-v1' of github.com:tgrabiec/scylla:
  db, schema: Hide update_schema_version_and_announce()
  db, storage_service: Do not call into gossiper from the database layer
  db: Make schema version observable
  utils: updateable_value_source: Introduce as_observable()
  schema: Fix race in schema version recalculation leading to stale schema version in gossip

(cherry picked from commit dcaf4ea4dd)
2020-10-01 17:44:37 +02:00
Hagit Segev
5fcc1f205c release: prepare for 4.2.rc5 2020-09-30 20:40:44 +03:00
Avi Kivity
08c35c1aad Revert "Revert "config: Do not enable repair based node operations by default""
This reverts commit 71d0d58f8c. Repair-based
node operations still have a significant regression (see #7249).
2020-09-30 14:18:37 +03:00
Tomasz Grabiec
54a913d452 Merge "evictable_reader: validate buffer on reader recreation" from Botond
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader.
The intent is to prevent corrupt data from getting into the system as
well as to help catch these bugs as close to the source as possible.
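The validation idea can be sketched as a monotonicity check on the recreated reader's output (illustrative; positions are modeled as plain integers here, not real position_in_partition values):

```python
def validate_resumed_stream(last_emitted_pos, resumed_fragments):
    """Reject a recreated reader that does not continue from the
    expected position: every fragment must sort at or after the last
    position already emitted to the stream."""
    pos = last_emitted_pos
    for frag in resumed_fragments:
        if frag < pos:
            raise ValueError(f"out-of-order fragment {frag} after {pos}")
        pos = frag
    return pos
```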

Fixes: #7208

Tests: unit(dev), mutation_reader_test:debug (v4)

* botond/evictable-reader-validate-buffer/v5:
  mutation_reader_test: add unit test for evictable reader self-validation
  evictable_reader: validate buffer after recreation the underlying
  evictable_reader: update_next_position(): only use peek'd position on partition boundary
  mutation_reader_test: add unit test for evictable reader range tombstone trimming
  evictable_reader: trim range tombstones to the read clustering range
  position_in_partition_view: add position_in_partition_view before_key() overload
  flat_mutation_reader: add buffer() accessor

(cherry picked from commit 97c99ea9f3)
2020-09-30 13:13:09 +02:00
Piotr Dulikowski
8f9cd98c45 hinted handoff: fix race - decomission vs. endpoint mgr init
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.

The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decommissioned or removed, it instead drains hints managers for all
endpoints.

In the case of decommission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.

End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.

To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.
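A minimal sketch of the flag-based fix (illustrative names, not the real hints manager API):

```python
class HintsManager:
    """While draining all endpoint managers, refuse new hints so no
    ep manager can be created between the drain and the clear()."""

    def __init__(self):
        self._ep_managers = {}
        self._draining_all = False

    def store_hint(self, endpoint, hint):
        if self._draining_all:
            # refuse: a new ep manager created here would be cleared
            # without ever being drained
            return False
        self._ep_managers.setdefault(endpoint, []).append(hint)
        return True

    def drain_for_all(self):
        self._draining_all = True
        drained = {ep: len(hints) for ep, hints in self._ep_managers.items()}
        self._ep_managers.clear()  # safe: no new managers can appear now
        return drained
```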

Fixes #7257

Closes #7278

(cherry picked from commit 39771967bb)
2020-09-29 14:18:48 +03:00
Avi Kivity
2893f6e43b Update seastar submodule
* seastar 0c289412a9...61b88d1da4 (1):
  > lz4_fragmented_compressor: Fix buffer requirements

Fixes #6925.
2020-09-23 11:04:22 +03:00
Avi Kivity
18d6c27b05 Merge 'storage_proxy: add a separate smp_group for hints' from Eliran
Hints writes are handled by storage_proxy in the exact same way
regular writes are, which in turn means that the same smp service
group is used for both. The problem is that it can lead to a priority
inversion where writes of the lower-priority kind occupy a lot of
the semaphore's units, making the higher-priority writes wait for an
empty slot.
This series adds a separate smp group for hints as well as a field
to pass the correct smp group to mutate_locally functions, and
then uses this field to properly classify the writes.

Fixes #7177

* eliransin-hint_priority_inversion:
  Storage proxy: use hints smp group in mutate locally
  Storage proxy: add a dedicated smp group for hints

(cherry picked from commit c075539fea)
2020-09-22 14:06:14 +03:00
Pavel Solodovnikov
97d7f6990c storage_proxy: un-hardcode force sync flag for mutate_locally(mutation) overload
Corresponding overload of `storage_proxy::mutate_locally`
was hardcoded to pass `db::commitlog::force_sync::no` to the
`database::apply`. Unhardcode it and substitute `force_sync::no`
at all existing call sites (as it was before).

`force_sync::yes` will be used later for paxos learn writes
when trying to apply mutations upgraded from an obsolete
schema version (similar to the current case when applying
locally a `frozen_mutation` stored in accepted proposal).

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200716124915.464789-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 5ff5df1afd)

Prerequisite for #7177.
2020-09-22 14:05:39 +03:00
Nadav Har'El
9855e18c0d alternator: fix corruption of PutItem operation in case of contention
This patch fixes a bug noted in issue #7218 - where PutItem operations
sometimes lose part of the item's data - some attributes were lost,
and the names of other attributes were replaced by empty strings. The problem
happened when the write-isolation policy was LWT and there was contention
of writes to the same partition (not necessarily the same item).

To use CAS (a.k.a. LWT), Alternator builds an alternator::rmw_operation
object with an apply() function which takes the old contents of the item
(if needed) and a timestamp, and builds a mutation that the CAS should
apply. In the case of the PutItem operation, we wrongly assumed that apply()
will be called only once - so as an optimization the strings saved in the
put_item_operation were moved into the returned mutation. But this
optimization is wrong - when there is contention, apply() may be called
again when the change proposed by the previous call was not accepted by
the Paxos protocol.

The fix is to change the one place where put_item_operation *moved* strings
out of the saved operations into the mutations, to be a copy. But to prevent
this sort of bug from recurring in future code, this patch enlists the
compiler to help us verify that it can't happen: The apply() function is
marked "const" - it can use the information in the operation to build the
mutation, but it can never modify this information or move things out of it,
so it will be fine to call this function twice.

The single output field that apply() does write (_return_attributes) is
marked "mutable" to allow the const apply() to write to it anyway. Because
apply() might be called twice, it is important that if some apply()
implementation sometimes sets _return_attributes, then it must always
set it (even if to the default, empty, value) on every call to apply().

The const apply() means that the compiler verifies for us that I didn't
forget to fix additional wrong std::move()s. Additionally, a test I wrote
to easily reproduce issue #7218 (which I will submit as a dtest later)
passes after this fix.
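The move-vs-copy bug class can be sketched like this (illustrative names; Scylla's actual fix is C++ const-correctness, modeled here with a mutating vs. non-mutating method):

```python
class PutItemOperation:
    """apply() may be called again when Paxos rejects the previous
    proposal, so it must not move state out of the operation."""

    def __init__(self, attrs):
        self._attrs = attrs

    def apply_moving(self):
        # buggy optimization: "moves" the attributes into the mutation,
        # leaving nothing behind for a contention retry
        attrs, self._attrs = self._attrs, {}
        return attrs

    def apply_const(self):
        # the fix: copy, so repeated calls build the same mutation
        return dict(self._attrs)
```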

Fixes #7218.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200916064906.333420-1-nyh@scylladb.com>
(cherry picked from commit 5e8bdf6877)
2020-09-16 19:17:52 +03:00
Avi Kivity
5d5ddd3539 Merge "materialized views: Fix undefined behavior on base table schema changes" from Tomasz
"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.

The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).

Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.

Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may not longer match the current base table schema, which is used
to generate the view updates.

Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.

It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.

Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.

Fixes #7061.

Tests:

  - unit (dev)
  - [v1] manual (reproduced using scylla binary and cqlsh)
"

* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
  db: view: Refactor view_info::initialize_base_dependent_fields()
  tests: mv: Test dropping columns from base table
  db: view: Fix incorrect schema access during view building after base table schema changes
  schema: Call on_internal_error() when out of range id is passed to column_at()
  db: views: Fix undefined behavior on base table schema changes
  db: views: Introduce has_base_non_pk_columns_in_view_pk()

(cherry picked from commit 3daa49f098)
2020-09-16 16:42:02 +03:00
Benny Halevy
0a72893fef test: cql_query_test: test_cache_bypass: use table stats
The test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.

Use the table's row_cache stats instead.

Fixes #6773

Test: cql_query_test.test_cache_bypass(dev, debug)

Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
(cherry picked from commit 6deba1d0b4)
2020-09-16 16:05:53 +03:00
Dejan Mircevski
1e45557d2a cql3: Fix NULL reference in get_column_defs_for_filtering
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing.  Add a test exposing the NULL
dereference and fix the typo.

Tests: unit (dev)

Fixes #7198.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 9d02f10c71)
2020-09-16 15:46:58 +03:00
Avi Kivity
96e1e95c1d reconcilable_result_builder: don't aggravate out-of-memory condition during recovery
Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.

During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accommodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.

Fix by making _result the last member, so it is freed first.

Fixes #7240.

(cherry picked from commit 9421cfded4)
2020-09-16 15:40:40 +03:00
Raphael S. Carvalho
338196eab6 storage_service: Fix use-after-free when calculating effective ownership
Use-after-free happens because we take a ref to keyspace_name, which
is stack-allocated and ceases to exist after the next deferring
action.

Fixes #7209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200909210741.104397-1-raphaelsc@scylladb.com>
(cherry picked from commit 86b9ea6fb2)
2020-09-12 13:58:45 +03:00
Asias He
71cbec966b storage_service: Fix a TOKENS update race for replace operation
In commit 7d86a3b208 (storage_service:
Make replacing node take writes), application state of TOKENS of the
replacing node is added into gossip and propagated to the cluster after
the initial start of gossip service. This can cause a race below

1. The replacing node replaces the old dead node with the same ip address
2. The replacing node starts gossip without application state of the TOKENS
3. Other nodes in the cluster replace the application states of old dead node's
   version with the new replacing node's version
4. replacing node dies
5. replace operation is performed again, the TOKENS application state is
   not present and the replace operation fails.

To fix, we can always add TOKENS application state when the
gossip service starts.

Fixes: #7166
Backports: 4.1 and 4.2
(cherry picked from commit 3ba6e3d264)
2020-09-10 13:12:56 +03:00
Avi Kivity
067a065553 Merge "Fix repair stalls in get_sync_boundary and apply_rows_on_master_in_thread" from Asias
"
This patch set fixes stalls in repair that are caused by std::list merge and clear operations during the test_latency_read_with_nemesis test.

Fixes #6940
Fixes #6975
Fixes #6976
"

* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
  repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
  repair: Use clear_gently in get_sync_boundary to avoid stall
  utils: Add clear_gently
  repair: Use merge_to_gently to merge two lists
  utils: Add merge_to_gently
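The clear_gently idea can be sketched as a generator that destroys a large container a chunk at a time (illustrative only; in Scylla the yield is a seastar preemption point, not a Python generator):

```python
def clear_gently(items, chunk=1024):
    """Clear a large list incrementally, yielding between chunks so
    other tasks can run instead of stalling the reactor."""
    while items:
        del items[:chunk]
        yield  # preemption point

big = list(range(10))
steps = sum(1 for _ in clear_gently(big, chunk=3))
# the list is cleared in ceil(10/3) = 4 preemptible steps
```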

(cherry picked from commit 4547949420)
2020-09-10 13:12:53 +03:00
Avi Kivity
e00bdc4f57 repair: apply_rows_on_follower(): remove copy of repair_rows list
We copy a list, which was reported to generate a 15ms stall.

This is easily fixed by moving it instead, which is safe since this is
the last use of the variable.

Fixes #7115.

(cherry picked from commit 6ff12b7f79)
2020-09-10 11:53:05 +03:00
Juliusz Stasiewicz
ad40f9222c cdc: Retry generation fetching after read_failure_exception
While fetching CDC generations, various exceptions can occur. They
are divided into "fatal" and "nonfatal", where "fatal" ones prevent
retrying of the fetch operation.

This patch makes `read_failure_exception` "non-fatal", because such
error may appear during restart. In general this type of error can
mean a few different things (e.g. an error code in a response from
replica, but also a broken connection) so retrying seems reasonable.

Fixes #6804

(cherry picked from commit d1dec3fcd7)
2020-09-09 15:10:50 +03:00
Kamil Braun
5d90fa17d6 cdc: fix deadlock inside check_and_repair_cdc_streams
check_and_repair_cdc_streams, in case it decides to create a new CDC
generation, updates the STATUS application state so that other nodes
gossiped with pick up the generation change.

The node which runs check_and_repair_cdc_streams also learns about a
generation change: the STATUS update causes a change notification.
This happens during the add_local_application_state call
which caused the STATUS update; it leads to calling
handle_cdc_generation, which detects a generation change and calls
add_local_application_state with the new generation's timestamp.

Thus, we get a recursive add_local_application_state call. Unfortunately,
the function takes a lock before doing on_change notifications, so we
get a deadlock.

This commit prevents the deadlock.
We update the local variable which stores the generation timestamp
before updating STATUS, so handle_cdc_generation won't consider
the observed generation to be new, hence it won't perform the recursive
add_local_application_state call.
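The recursion-breaking fix can be sketched like this (illustrative names; the key point is updating the local timestamp before publishing STATUS):

```python
class Node:
    """Record the new generation timestamp *before* updating STATUS,
    so the change notification triggered by the update does not
    recurse into another state update."""

    def __init__(self):
        self._gen_timestamp = None
        self.notifications = 0

    def handle_cdc_generation(self, ts):
        self.notifications += 1
        if ts == self._gen_timestamp:
            return  # generation already known: no recursive update
        self._gen_timestamp = ts  # update the local variable first...
        self.update_status(ts)    # ...then publish STATUS

    def update_status(self, ts):
        # publishing STATUS re-triggers the on_change notification
        self.handle_cdc_generation(ts)
```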

(cherry picked from commit 42fb4fe37c)
2020-09-09 10:14:18 +03:00
Yaron Kaikov
bf0c493c28 release: prepare for 4.2.rc4 2020-09-07 14:56:32 +03:00
Raphael S. Carvalho
26cb0935f0 sstables/LCS: increase per-level overlapping tolerance in reshape
LCS can have its overlapping invariant broken after operations that can
proceed in parallel to regular compaction like cleanup. That's because
there could be two compactions in parallel placing data in overlapping
token ranges of a given level > 0.
After reshape, the whole table will be rewritten, on restart, if a
given level has more than (fan_out*2)=20 overlaps.
That may sound like enough, but that's not taking into account the
exponential growth in # of SSTables per level, so 20 overlaps may
sound like a lot for level 2 which can afford 100 sstables, but it's
only 2% of level 3, and 0.2% of level 4. So let's change the
overlapping tolerance from the constant of fan_out*2 to 10% of level
limit on # of SSTables, or fan_out, whichever is higher.
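The new tolerance rule can be written down directly (sketch; fan_out of 10 as described above):

```python
FAN_OUT = 10  # LCS size ratio between adjacent levels

def level_limit(level):
    # max number of sstables a level can hold: 10, 100, 1000, ...
    return FAN_OUT ** level

def overlap_tolerance(level):
    # old rule: the constant fan_out * 2 == 20 for every level
    # new rule: 10% of the level's sstable limit, but at least fan_out
    return max(FAN_OUT, level_limit(level) // 10)
```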

Refs #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200810154510.32794-1-raphaelsc@scylladb.com>
(cherry picked from commit 7d7f9e1c54)
2020-09-06 18:28:55 +03:00
Raphael S. Carvalho
4e97d562eb compaction: Prevent non-regular compaction from picking compacting SSTables
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but the manager still needs to filter out the
compacting SSTables from the returned set. So it's being renamed.

When the same SSTable is compacted in parallel, the strategy invariant
can be broken like overlapping being introduced in LCS, and also
some deletion failures as more than one compaction process would try
to delete the same files.

Let's fix scrub, cleanup and upgrade by calling the manager function
which gets the correct candidates for compaction.
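The filtering step can be sketched in one line (illustrative names, not the real compaction_manager API):

```python
def sstables_for_rewrite(all_sstables, compacting):
    # exclude sstables already under compaction, so cleanup/scrub/upgrade
    # never process a file that a parallel compaction may delete
    return [s for s in all_sstables if s not in compacting]
```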

Fixes #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
(cherry picked from commit 11df96718a)
2020-09-06 18:26:43 +03:00
Takuya ASADA
3f1b932c04 aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6991

(cherry picked from commit 7cccb018b8)
2020-09-06 18:21:12 +03:00
Avi Kivity
b9498ab947 Update seastar submodule
* seastar 7816796dd1...0c289412a9 (1):
  > TLS: Use "known" (precalculated) DH parameters if available

Fixes #6191.
2020-09-06 17:38:05 +03:00
Avi Kivity
e1b3d6d0a2 Update seastar submodule
* seastar adaabdfbc...7816796dd (1):
  > core/reactor: complete_timers(): restore previous scheduling group

Fixes #7117.
2020-09-03 23:47:22 +03:00
Avi Kivity
67378cda03 Merge "Fix TWCS compaction aggressiveness due to data segregation" from Raphael
"
After data segregation feature, anything that cause out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
feature. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.

Fixes #6928.
"

* 'fix_twcs_agressiveness_after_data_segregation_v2' of github.com:raphaelsc/scylla:
  compaction/twcs: improve further debug messages
  compaction/twcs: Improve debug log which shows all windows
  test: Check that TWCS properly performs size-tiered compaction on past windows
  compaction/twcs: Make task estimation take into account the size-tiered behavior
  compaction/stcs: Export static function that estimates pending tasks
  compaction/stcs: Make get_buckets() static
  compact/twcs: Perform size-tiered compaction on past time windows
  compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
  compaction/twcs: Make newest_bucket() non-static
  compaction/twcs: Move TWCS implementation into source file

(cherry picked from commit 6f986df458)
2020-09-02 12:53:45 +03:00
Nadav Har'El
6ab3965465 redis: fix another use-after-free crash in "exists" command
Never trust Occam's Razor - it turns out that the use-after-free bug in the
"exists" command was caused by two separate bugs. We fixed one in commit
9636a33993, but there is a second one fixed in
this patch.

The problem fixed here was that a "service_permit" object, which is designed to
be copied around from place to place (it contains a shared pointer, so is cheap
to copy), was saved by reference, and the reference was to a function argument
and was destroyed prematurely.

This time I tested *many times* that that test_strings.py passes on both dev and
debug builds.

Note that test/run/redis still fails in a debug build, but due to a different
problem.

Fixes #6469

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200825183313.120331-1-nyh@scylladb.com>
(cherry picked from commit 868194cd17)
2020-08-27 12:16:19 +03:00
Nadav Har'El
ca22461a9b redis: fix use-after-free crash in "exists" command
A missing "&" caused the key stored in a long-living command to be copied
and the copy quickly freed - and then used after freed.
This caused the test test_strings.py::test_exists_multiple_existent_key for
this feature to frequently crash.

Fixes #6469

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200823190141.88816-1-nyh@scylladb.com>
(cherry picked from commit 9636a33993)
2020-08-27 12:16:19 +03:00
Asias He
b3d83ad073 compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:

scylla: Reactor stalled for 16262 ms on shard 4.

| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
|  (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
|  (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) operator() at ./sstables/compaction_manager.cc:691
|  (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158

To fix, we futurize the function to get local ranges and sstables.

In addition, this patch removes the dependency to global storage_service object.

Fixes #6662

(cherry picked from commit 07e253542d)
2020-08-27 12:16:19 +03:00
Raphael S. Carvalho
7e6f47fbce sstables: optimize procedure that checks if a sstable needs cleanup
needs_cleanup() returns true if a sstable needs cleanup.

Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
	O(num_sstables * local_ranges)

We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.

So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).

With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.
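The log-time membership test over sorted, non-overlapping ranges can be sketched with a binary search (illustrative; ranges treated as inclusive (start, end) pairs for simplicity):

```python
import bisect

def token_in_local_ranges(sorted_ranges, token):
    """Binary-search membership test, relying on get_ranges() returning
    a sorted, non-overlapping list of ranges."""
    starts = [start for start, _ in sorted_ranges]
    i = bisect.bisect_right(starts, token) - 1
    return i >= 0 and token <= sorted_ranges[i][1]
```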

Fixes #6730.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
(cherry picked from commit cf352e7c14)
2020-08-27 12:16:16 +03:00
Asias He
9ca49cba6b abstract_replication_strategy: Add get_ranges_in_thread
Add a version that runs inside a seastar thread. The benefit is that
get_ranges can yield to avoid stalls.

Refs #6662

(cherry picked from commit 94995acedb)
2020-08-27 12:15:33 +03:00
Raphael S. Carvalho
6da8ba2d3f sstables: export needs_cleanup()
May be needed elsewhere, like in an unit test.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-1-raphaelsc@scylladb.com>
(cherry picked from commit a9eebdc778)
2020-08-27 12:15:29 +03:00
Asias He
366c1c2c59 gossip: Fix race between shutdown message handler and apply_state_locally
1. The node1 is shutdown
2. The node1 sends shutdown message to node2
3. The node2 receives gossip shutdown message but the handler yields
4. The node1 is restarted
5. The node1 sends new gossip endpoint_state to node2, node2 applies the state
   in apply_state_locally and calls gossiper::handle_major_state_change
   and then calls gossiper::mark_alive
6. The shutdown message handler in step 3 resumes and sets status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber in step 5 resumes and calls gossiper::real_mark_alive,
   node2 will skip to mark node1 as alive because the status of node1 is
   SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.

To fix, we serialize the two operations.

Fixes #7032

(cherry picked from commit e6ceec1685)
2020-08-27 11:15:48 +03:00
Nadav Har'El
05cdb173f3 Alternator: allow CreateTable with SSESpecification explicitly disabled
While Alternator doesn't yet support creating a table with a different
"server-side encryption" (a.k.a. encryption-at-rest) parameters, the
SSESpecification option with Enabled=false should still be allowed, as
it is just the default, and means exactly the same as would a missing
SSESpecification.

This patch also adds a test for this case, which failed on Alternator
before this patch.

Fixes #7031.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812205853.173846-1-nyh@scylladb.com>
(cherry picked from commit 4c73d43153)
2020-08-26 20:15:19 +03:00
Nadav Har'El
8c929a96cf alternator: CreateTable with bad Tags shouldn't create a table
Currently, if a user tries to CreateTable with a forbidden set of tags,
e.g., the Tags list is too long or contains an invalid value for
system:write_isolation, then the CreateTable request fails but the table
is still created. Without the tag of course.

This patch fixes this bug, and adds two test cases for it that fail
before this patch, and succeed with it. One of the test cases is
scylla_only because it checks the Scylla-specific system:write_isolation
tag, but the second test case works on DynamoDB as well.

What this patch does is to split the update_tags() function into two
parts - the first part just parses the Tags, validates them, and builds
a map. Only the second part actually writes the tags to the schema.
CreateTable now does the first part early, before creating the table,
so failure in parsing or validating the Tags will not leave a created
table behind.
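The two-phase structure can be sketched as follows (illustrative; the 50-tag limit is just an example constraint, not necessarily the exact validation performed):

```python
def parse_and_validate_tags(tags):
    # first phase: parse and validate only; raises before anything exists
    if len(tags) > 50:
        raise ValueError("too many tags")
    return {t["Key"]: t["Value"] for t in tags}

tables = {}

def create_table(name, tags):
    tag_map = parse_and_validate_tags(tags)  # validate *before* creating
    tables[name] = {"tags": tag_map}         # second phase: create + write
```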

Fixes #6809.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713120611.767736-1-nyh@scylladb.com>
(cherry picked from commit 35f7048228)
2020-08-26 19:52:49 +03:00
Avi Kivity
a9aa10e8de Merge "Unregister RPC verbs on stop" from Pavel E
"
There are 5 services that register their RPC handlers in the messaging
service, but not all of them unregister them on stop.

Unregistering is somewhat critical, not just because it makes the
code look clean, but also because unregistration does wait for the
message processing to complete, thus avoiding use-after-free's in
the handlers.

In particular, several handlers call service::get_schema_for_write()
which, in turn, may end up in service::maybe_sync() calling for
the local migration manager instance. All those handlers' processing
must be waited for before stopping the migration manager.

The set brings the RPC handlers unregistration in sync with the
registration part.

tests: unit (dev)
       dtest (dev: simple_boot_shutdown, repair)
       start-stop by hands (dev)
fixes: #6904
"

* 'br-rpc-unregister-verbs' of https://github.com/xemul/scylla:
  main: Add missing calls to unregister RPC handlers
  messaging: Add missing per-service unregistering methods
  messaging: Add missing handlers unregistration helpers
  streaming: Do not use db->invoke_on_all in vain
  storage_proxy: Detach rpc unregistration from stop
  main: Shorten call to storage_proxy::init_messaging_service

(cherry picked from commit 01b838e291)
2020-08-26 14:41:04 +03:00
Raphael S. Carvalho
989d8fe636 cql3/statements: verify that counter column cannot be added into non-counter table
A check, to validate that counter column cannot be added into non-counter table,
is missing for alter table statement. Validation is performed when building new
schema, but it's limited to checking that a schema will not contain both counter
and non-counter columns.

Due to lack of validation, the added counter column could be incorrectly
persisted to the schema, but this results in a crash when setting the new
schema to its table. On restart, it can be confirmed that the schema change
was indeed persisted when describing the table.
This problem is fixed by doing proper validation for the alter table statement,
which consists of making sure a new counter column cannot be added to a
non-counter table.

The test cdc_disallow_cdc_for_counters_test is adjusted because one of its tests
was built on the assumption that counter column can be added into a non-counter
table.

Fixes #7065.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200824155709.34743-1-raphaelsc@scylladb.com>
(cherry picked from commit 1c29f0a43d)
2020-08-25 18:44:42 +03:00
Takuya ASADA
48d79a1d9f dist/debian: disable debuginfo compression on .deb
Since older binutils on some distributions is not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, the Debian packager forces debuginfo compression since debian/compat = 9,
so we have to uncompress the files after they are compressed automatically.

Fixes #6982

(cherry picked from commit 75c2362c95)
2020-08-23 19:01:00 +03:00
Botond Dénes
4c65413413 scylla-gdb.py: find_db(): don't return current shard's database for shard=0
The `shard` parameter of `find_db()` is optional and is defaulted to
`None`. When missing, the current shard's database instance is returned.
The problem is that the if condition checking this uses `not shard`,
which also evaluates to `True` if `shard == 0`, resulting in returning
the current shard's database instance for shard 0. Change the condition
to `shard is None` to avoid this.
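
The pitfall is plain Python truthiness; a minimal sketch, with a hypothetical `databases` list standing in for the per-shard instances:

```python
databases = ["db-shard-0", "db-shard-1", "db-shard-2"]
current_shard = 2  # pretend gdb is currently stopped on shard 2

def find_db_buggy(shard=None):
    # `not shard` is True for both None and 0, so shard 0 is unreachable.
    if not shard:
        shard = current_shard
    return databases[shard]

def find_db_fixed(shard=None):
    # Only fall back to the current shard when no shard was passed at all.
    if shard is None:
        shard = current_shard
    return databases[shard]
```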

Fixes: #7016
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812091546.1704016-1-bdenes@scylladb.com>
(cherry picked from commit 4cfab59eb1)
2020-08-23 18:56:19 +03:00
Hagit Segev
e931d28673 release: prepare for 4.2.rc3 2020-08-19 14:39:08 +03:00
Botond Dénes
ec71688ff2 view_update_generator: fix race between registering and processing sstables
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested in sstables that were added concurrently with processing a
batch for the same sstables being dropped and the semaphore units
associated with them not returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.

This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.

Fixes: #6892

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
(cherry picked from commit 22a6493716)
2020-08-19 00:11:48 +03:00
Botond Dénes
9710a91100 table: get_sstables_by_partition_key(): don't make a copy of selected sstables
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keeping the reference.

Fixes: #7060

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
(cherry picked from commit 78f94ba36a)
2020-08-19 00:01:36 +03:00
Nadav Har'El
b052f3f5ce Update Seastar submodule
> http: add "Expect: 100-continue" handling

Refs #6844.
2020-08-11 13:06:03 +03:00
Calle Wilund
d70cab0444 database: Do not assert on replay positions if truncate does not flush
Fixes #6995

In c2c6c71 the assert on replay positions in flushed sstables discarded by
truncate was broken by the fact that we no longer flush all sstables
unless auto snapshot is enabled.

This means the low_mark assertion does not hold, because we probably
never got around to creating the sstables that would hold said mark.

Note that the (old) change to not create sstables and then just delete
them is in itself good. But in that case we should not try to verify
the rp mark.

(cherry picked from commit 9620755c7f)
2020-08-11 00:00:43 +03:00
Avi Kivity
0ce3799187 Update seastar submodule
* seastar 4641f4f2d3...2775a54dcb (1):
  > memory: fix small aligned free memory corruption

Fixes #6831
2020-08-09 18:35:44 +03:00
Avi Kivity
ee113eca52 Merge 'hinted handoff: fix commitlog memory leak' from Piotr D
"
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.

This PR prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.

Fixes: #6409, Fixes #6776
"

* piodul-fix-commitlog-memory-leak-in-hinted-handoff:
  hinted handoff: disable warnings about segments left on disk
  hinted handoff: release memory on commitlog termination

(cherry picked from commit 4c221855a1)
2020-08-09 17:25:20 +03:00
Tomasz Grabiec
be11514985 thrift: Fix crash on unsorted column names in SlicePredicate
The column names in SlicePredicate can be passed in arbitrary order.
We converted them to clustering ranges in read_command preserving the
original order. As a result, the clustering ranges in read command may
appear out of order. This violates the storage engine's assumptions and
leads to undefined behavior.

It was seen manifesting as a SIGSEGV or an abort in sstable reader
when executing a get_slice() thrift verb:

scylla: sstables/consumer.hh:476: seastar::future<> data_consumer::continuous_data_consumer<StateProcessor>::fast_forward_to(size_t, size_t) [with StateProcessor = sstables::data_consume_rows_context_m; size_t = long unsigned int]: Assertion `end >= _stream_position.position' failed.
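
The ordering invariant can be illustrated with a toy model; whether the real fix sorts, merges, or rejects the ranges is a detail of the commit, and the helper below is hypothetical:

```python
def ranges_from_column_names(names):
    # Clustering ranges must reach the storage engine in sorted order,
    # regardless of the order the thrift client listed the column names in.
    ranges = [(n, n) for n in names]  # one single-point range per name
    ranges.sort()                     # fix: do not preserve caller order
    return ranges
```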

Fixes #6486.

Tests:

   - added a new dtest to thrift_tests.py which reproduces the problem

Message-Id: <1596725657-15802-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit bfd129cffe)
2020-08-08 19:47:57 +03:00
Rafael Ávila de Espíndola
ec874bdc31 alternator: Fix use after return
Avoid a copy of timeout so that we don't end up with a reference to a
stack allocated variable.

Fixes #6897

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200721184939.111665-1-espindola@scylladb.com>
(cherry picked from commit e83e91e352)
2020-08-03 22:24:12 +03:00
Nadav Har'El
43169ffa2c alternator: fix Expected's "NULL" operator with missing AttributeValueList
The "NULL" operator in Expected (old-style conditional operations) doesn't
have any parameters, so we insisted that the AttributeValueList be empty.
However, we forgot to allow it to also be missing - a possibility which
DynamoDB allows.

This patch adds a test to reproduce this case (the test passes on DyanmoDB,
fails on Alternator before this patch, and succeeds after this patch), and
a fix.

Fixes #6816.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709161254.618755-1-nyh@scylladb.com>
(cherry picked from commit f549d147ea)
2020-08-03 20:39:01 +03:00
Yaron Kaikov
c5ed14bff6 release: prepare for 4.2.rc2 2020-08-03 16:50:38 +03:00
Takuya ASADA
8366eda943 scylla_util.py: always use relocatable CLI tools
On some CLI tools, command options may differ between the latest version
and older versions.
To maximize compatibility of the setup scripts, we should always use the
relocatable CLI tools instead of the distribution's version of the tool.

Related #6954

(cherry picked from commit a19a62e6f6)
2020-08-03 10:39:26 +03:00
Takuya ASADA
5d0b0dd4c4 create-relocatable-package.py: add lsblk for relocatable CLI tools
We need the latest version of lsblk, which supports partition type UUIDs.

Fixes #6954

(cherry picked from commit 6ba2a6c42e)
2020-08-03 10:39:07 +03:00
Juliusz Stasiewicz
6f259be5f1 aggregate_fcts: Use per-type comparators for dynamic types
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of arguments.

This patch uses specific per-type comparators to provide semantically
sensible, dynamically created aggregates.

Fixes #6768

(cherry picked from commit 5b438e79be)
2020-08-03 10:26:02 +03:00
Calle Wilund
16e512e21c cql3::lists: Fix setter_by_uuid not handling null value
Fixes #6828

When using the scylla list index from UUID extension,
null values were not handled properly, causing throws
from the underlying layer.
(cherry picked from commit 3b74b9585f)
2020-08-03 10:19:13 +03:00
Avi Kivity
c61dc4e87d tools: toolchain: regenerate for gcc 10.2
Fixes #6813.

As a side effect, this also brings in xxhash 0.7.4.

(matches commit 66c2b4c8bf)
2020-07-31 08:48:12 +03:00
Takuya ASADA
af76a3ba79 scylla_post_install.sh: generate memory.conf for CentOS7
On CentOS7, systemd does not support percentage-based parameters.
To apply the memory parameter on CentOS7, we need to override it
in bytes instead of a percentage.

Fixes #6783

(cherry picked from commit 3a25e7285b)
2020-07-30 16:41:10 +03:00
Tomasz Grabiec
8fb5ebb2c6 commitlog: Fix use-after-free on mutation object during replay
The mutation object may be freed prematurely during commitlog replay
in the schema upgrading path. We will hit the problem if the memtable
is full and apply_in_memory() needs to defer.

This will typically manifest as a segfault.

Fixes #6953

Introduced in 79935df

Tests:
  - manual using scylla binary. Reproduced the problem then verified the fix makes it go away

Message-Id: <1596044010-27296-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3486eba1ce)
2020-07-30 16:36:42 +03:00
Takuya ASADA
bfb11defdd scylla_setup: skip boot partition
On GCE, /dev/sda14 reported as unused disk but it's BIOS boot partition,
should not use for scylla data partition, also cannot use for it since it's
too small.

It's better to exclude such partiotion from unsed disk list.

Fixes #6636

(cherry picked from commit d7de9518fe)
2020-07-29 09:48:10 +03:00
Asias He
2d1ddcbb6a repair: Fix race between create_writer and wait_for_writer_done
We saw scylla hit use after free in repair with the following procedure during tests:

- n1 and n2 in the cluster

- n2 ran decommission

- n2 sent data to n1 using repair

- n2 was killed forcibly

- n1 tried to remove repair_meta for n1

- n1 hit use after free on repair_meta object

This was what happened on n1:

1) data was received -> do_apply_rows was called -> yield before create_writer() was called

2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
   with _writer_done[node_idx] not engaged

3) step 1 resumed, create_writer() was called and _repair_writer object was referenced

4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed

5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object

To fix, we should call wait_for_writer_done() after any pending
operations protected by repair_meta::_gate are done. This
prevents wait_for_writer_done() from finishing before the writer
is even in the process of being created.

Fixes: #6853
Fixes: #6868
Backports: 4.0, 4.1, 4.2
(cherry picked from commit e6f640441a)
2020-07-29 09:48:10 +03:00
Raphael S. Carvalho
4c560b63f0 sstable: index_reader: Make sure streams are all properly closed on failure
Turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to the context, but it misses
all the input streams that can be opened in the consumer for promoted index
reading. The consumer stores a list of indexes, where each of them has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes, as failing to do so could
cause memory corruption due to read-ahead.

Fixes #6924.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
(cherry picked from commit 0d70efa58e)
2020-07-29 09:48:10 +03:00
Nadav Har'El
00155e32b1 merge: db/view: view_update_generator: make staging reader evictable
Merged patch set by Botond Dénes:

The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The

staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.

This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.

  test/boost: view_build_test: add test_view_update_generator_buffering
  test/boost: view_build_test: add test test_view_update_generator_deadlock
  reader_permit: reader_resources: add operator- and operator+
  reader_concurrency_semaphore: add initial_resources()
  test: cql_test_env: allow overriding database_config
  mutation_reader: expose new_reader_base_cost
  db/view: view_updating_consumer: allow passing custom update pusher
  db/view: view_update_generator: make staging reader evictable
  db/view: view_updating_consumer: move implementation from table.cc to view.cc
  database: add make_restricted_range_sstable_reader()

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

(cherry picked from commit f488eaebaf)

Fixes #6892.
2020-07-28 17:02:09 +03:00
Avi Kivity
b06dffcc19 Merge "messaging: make verb handler registering independent of current scheduling group" from Botond
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall back to the per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.

In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle between readers and that of
the slice and range they were referencing.

This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.

I manually checked that each individual defense added is already
preventing the crash from happening.

Fixes: #6613
Fixes: #6907
Fixes: #6908

Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"

* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: use cached semaphore
  messaging: make verb handler registering independent of current scheduling group
  multishard_mutation_query: validate the semaphore of the looked-up reader
  reader_concurrency_semaphore: make inactive read handles unique across semaphores
  reader_concurrency_semaphore: add name() accessor
  reader_concurrency_semaphore: allow passing name to no-limit constructor

(cherry picked from commit 3f84d41880)
2020-07-27 17:41:51 +03:00
Botond Dénes
508e58ef9e sstables: clamp estimated_partitions to [1, +inf) in writers
In some cases the estimated number of partitions can be 0, which, albeit a
legitimate estimation result, breaks a lot of low-level sstable writer code, so
some of it has assertions to ensure the estimated partition count is > 0.
To avoid hitting these asserts, all users of the sstable writers do the
clamping, to ensure the estimated partition count is at least 1. However, leaving
this to the callers is error prone, as #6913 has shown. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
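
The clamping itself is a one-liner; a sketch of the rule the writers now apply, with a hypothetical helper name:

```python
def clamp_estimated_partitions(estimated):
    # 0 is a legitimate estimate, but low-level writer code asserts > 0,
    # so the writer clamps to at least 1 itself instead of trusting callers.
    return max(1, estimated)
```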

Fixes #6913

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
(cherry picked from commit fe127a2155)
2020-07-27 15:00:00 +03:00
Piotr Sarna
776faa809f Merge 'view_update_generator: use partitioned sstable set'
from Botond.

Recently it was observed (#6603) that since 4e6400293ea, the staging
reader is reading from a lot of sstables (200+). This consumes a lot of
memory, and after this reaches a certain threshold -- the entire memory
amount of the streaming reader concurrency semaphore -- it can cause a
deadlock within the view update generation. To reduce this memory usage,
we exploit the fact that the staging sstables are usually disjoint, and
use the partitioned sstable set to create the staging reader. This
should ensure that only the minimum number of sstable readers will be
opened at any time.

Refs: #6603
Fixes: #6707

Tests: unit(dev)

* 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla:
  db/view: view_update_generator: use partitioned sstable set
  sstables: make_partitioned_sstable_set(): return an sstable_set

(cherry picked from commit e4b74356bb)
2020-07-21 15:40:02 +03:00
Raphael S. Carvalho
7037f43a17 table: Fix Staging SSTables being incorrectly added or removed from the backlog tracker
Staging SSTables can be incorrectly added or removed from the backlog tracker,
after an ALTER TABLE or TRUNCATE, because the add and removal don't take
into account whether the SSTable requires view building, so a Staging SSTable can
be added to the tracker after an ALTER TABLE, or removed after a TRUNCATE,
even though not added previously, potentially causing the backlog to
become negative.

Fixes #6798.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
(cherry picked from commit b67066cae2)
2020-07-21 12:57:09 +03:00
Avi Kivity
bd713959ce Update seastar submodule
* seastar 8aad24a5f8...4641f4f2d3 (4):
  > httpd: Don't warn on ECONNABORTED
  > httpd: Avoid calling future::then twice on the same future
Fixes #6709.
  > httpd: Use handle_exception instead of then_wrapped
  > httpd: Use std::unique_ptr instead of a raw pointer
2020-07-19 11:49:02 +03:00
Rafael Ávila de Espíndola
b7c5a918cb mutation_reader_test: Wait for a future
Nothing was waiting for this future. Found while testing another
patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630183929.1704908-1-espindola@scylladb.com>
(cherry picked from commit 6fe7706fce)

Fixes #6858.
2020-07-16 14:44:31 +03:00
Asias He
fb2ae9e66b repair: Relax node selection in bootstrap when nodes are less than RF
Consider a cluster with two nodes:

 - n1 (dc1)
 - n2 (dc2)

A third node is bootstrapped:

 - n3 (dc2)

The n3 fails to bootstrap as follows:

 [shard 0] init - Startup failed: std::runtime_error
 (bootstrap_with_repair: keyspace=system_distributed,
 range=(9183073555191895134, 9196226903124807343], no existing node in
 local dc)

The system_distributed keyspace is using SimpleStrategy with RF 3. For
the keyspace that does not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.

Fixes: #6744
Backports: 4.0 4.1, 4.2
(cherry picked from commit 38d964352d)
2020-07-16 12:02:38 +03:00
Asias He
7a7ed8c65d repair: Relax size check of get_row_diff and set_diff
In case of a row hash conflict, a hash in set_diff can map to more than one
row returned by get_row_diff.

For example,

Node1 (Repair master):
row1  -> hash1
row2  -> hash2
row3  -> hash3
row3' -> hash3

Node2 (Repair follower):
row1  -> hash1
row2  -> hash2

We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:

   repair - Got error in row level repair: std::runtime_error
   (row_diff.size() != set_diff.size())

In this case, node1 should send both row3 and row3' to the peer node
instead of failing the whole repair: node2 has neither row3 nor
row3', otherwise node1 wouldn't have sent the rows with hash3 to node2 in
the first place.
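
The relaxed invariant can be reproduced with a toy model of the example above:

```python
from collections import defaultdict

# Rows on the repair master; two of them collide on the same hash.
master_rows = {"row1": "h1", "row2": "h2", "row3": "h3", "row3'": "h3"}
follower_hashes = {"h1", "h2"}

set_diff = set(master_rows.values()) - follower_hashes

rows_by_hash = defaultdict(list)
for row, h in master_rows.items():
    rows_by_hash[h].append(row)

row_diff = [row for h in set_diff for row in rows_by_hash[h]]

# With a hash collision, one missing hash can map to several rows, so the
# old check `len(row_diff) == len(set_diff)` is too strict; `>=` holds.
assert len(row_diff) >= len(set_diff)
```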

Refs: #6252
(cherry picked from commit a00ab8688f)
2020-07-15 14:48:49 +03:00
Nadav Har'El
7b9be752ec alternator test: configurable temporary directory
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.

But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectory is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.

The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.
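
A minimal sketch of the lookup, using a hypothetical helper and Python's `tempfile` in place of the actual run script:

```python
import os
import tempfile

def make_test_dir():
    # Honor TMPDIR when set, fall back to /tmp otherwise. The caller is
    # responsible for deleting the subdirectory when the test ends.
    base = os.environ.get("TMPDIR", "/tmp")
    return tempfile.mkdtemp(prefix="scylla-test-", dir=base)
```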

Fixes #6750

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
(cherry picked from commit 8e3be5e7d6)
2020-07-14 12:34:26 +03:00
Konstantin Osipov
903e967a16 Export TMPDIR pointing at subdir of testlog/
Export TMPDIR environment variable pointing at a subdir of testlog.
This variable is used by seastar/scylla tests to create a
a subdirectory with temporary test data. Normally a test cleans
up the temporary directory, but if it crashes or is killed the
directory remains.

By resetting the default location from /tmp to testlog/{mode}
we allow test.py to consolidate all test artefacts in a single
place.

Fixes #6062, "test.py uses tmpfs"

(cherry picked from commit e628da863d)
2020-07-14 12:34:06 +03:00
Avi Kivity
b84946895c Update seastar submodule
* seastar 1e762652c4...8aad24a5f8 (2):
  > futures: Add a test for a broken promise in a parallel_for_each
  > future: Call set_to_broken_promise earlier

Fixes #6749 (probably).
2020-07-13 20:08:16 +03:00
Asias He
a27188886a repair: Switch to btree_set for repair_hash.
In one of the longevity tests, we observed a 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered a big memory allocation and
reclaim.

I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one that the seastar
oversized-allocation checker did not warn about in my tests, where around 300K
repair hashes were inserted into the container.

- unordered_set:
hash_sets=295634, time=333029199 ns

- flat_hash_set:
hash_sets=295634, time=312484711 ns

- node_hash_set:
hash_sets=295634, time=346195835 ns

- btree_set:
hash_sets=295634, time=341379801 ns

The btree_set is a bit slower than unordered_set, but it does not make
huge memory allocations. I did not measure a real difference in the total time
to finish repair of the same dataset with unordered_set vs. btree_set.

To fix, switch to absl btree_set container.

Fixes #6190

(cherry picked from commit 67f6da6466)
2020-07-13 10:09:23 +03:00
Dmitry Kropachev
51d4efc321 dist/common/scripts/scylla-housekeeping: wrap urllib.request with try ... except
We could hit "cannot serialize '_io.BufferedReader' object" when the request got a 404 error from the server.
Now you will get a legitimate error message in that case.

Fixes #6690

(cherry picked from commit de82b3efae)
2020-07-09 18:24:55 +03:00
Avi Kivity
0847eea8d6 Update seastar submodule
* seastar 11e86172ba...1e762652c4 (1):
  > sharded: Do not hang on never set freed promise

Fixes #6606.
2020-07-09 15:52:26 +03:00
Avi Kivity
35ad57cb9c Point seastar submodule at scylla-seastar.git
This allows us to backport seastar patches to 4.2.
2020-07-09 15:50:25 +03:00
Hagit Segev
42b0b9ad08 release: prepare for 4.2.rc1 2020-07-08 23:01:10 +03:00
Dejan Mircevski
68b95bf2ac cql/restrictions: Handle WHERE a>0 AND a<0
WHERE clauses with start point above the end point were handled
incorrectly.  When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0.  This is clearly incorrect, as the user's intent was to filter out
all possible values of a.

Fix it by explicitly short-circuiting to false when start > end.  Add
a test case.
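
A toy model of the fix; the names are hypothetical and the real code operates on CQL interval bounds:

```python
def slice_to_interval(start, end):
    # With start > end, interpreting (start, end) as a wrap-around interval
    # would match everything *except* the middle, the opposite of the user's
    # intent. Short-circuit to "no interval" (an empty match) instead.
    if start > end:
        return None
    return (start, end)

def contains(interval, value):
    if interval is None:
        return False
    lo, hi = interval
    return lo < value < hi  # exclusive bounds, as in WHERE a > lo AND a < hi
```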

Fixes #5799.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 921dbd0978)
2020-07-08 13:20:10 +03:00
Botond Dénes
fea83f6ae0 db/view: view_update_generator: re-balance wait/signal on the register semaphore
The view update generator has a semaphore to limit concurrency. This
semaphore is waited on in `register_staging_sstable()` and later the
unit is returned after the sstable is processed in the loop inside
`start()`.
This was broken by 4e64002, which changed the loop inside `start()` to
process sstables in per table batches, however didn't change the
`signal()` call to return the amount of units according to the number of
sstables processed. This can cause the semaphore units to dry up, as the
loop can process multiple sstables per table but return just a single
unit. This can also block callers of `register_staging_sstable()`
indefinitely, as some waiters will never be released: under the right
circumstances the units on the semaphore can permanently go below 0.
In addition to this, 4e64002 introduced another bug: table entries from
the `_sstables_with_tables` are never removed, so they are processed
every turn. If the sstable list is empty, there won't be any update
generated but due to the unconditional `signal()` described above, this
can cause the units on the semaphore to grow to infinity, allowing
future staging sstables producers to register a huge amount of sstables,
causing memory problems due to the amount of sstable readers that have
to be opened (#6603, #6707).
Both outcomes are equally bad. This patch fixes both issues and modifies
the `test_view_update_generator` unit test to reproduce them and hence
to verify that this doesn't happen in the future.
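
The accounting bug can be modeled with a toy non-blocking semaphore (all names hypothetical):

```python
class CountingSemaphore:
    # Simplified, non-blocking model of the semaphore's unit accounting.
    def __init__(self, units):
        self.units = units
    def wait(self, n=1):
        self.units -= n
    def signal(self, n=1):
        self.units += n

def register(sem, sstables):
    # register_staging_sstable(): one unit taken per registered sstable.
    for _ in sstables:
        sem.wait(1)

def process_batch_buggy(sem, batch):
    sem.signal(1)           # bug: one unit returned per *batch*

def process_batch_fixed(sem, batch):
    sem.signal(len(batch))  # fix: one unit returned per processed sstable
```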

Fixes: #6774
Refs: #6707
Refs: #6603

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200706135108.116134-1-bdenes@scylladb.com>
(cherry picked from commit 5ebe2c28d1)
2020-07-08 11:13:24 +03:00
Takuya ASADA
76618a7e06 scylla_setup: don't add same disk device twice
We shouldn't accept adding the same disk twice at the RAID prompt.

Fixes #6711

(cherry picked from commit 835e76fdfc)
2020-07-07 13:07:59 +03:00
Takuya ASADA
189a08ac72 scylla_setup: follow hugepages package name change on Ubuntu 20.04LTS
The hugepages package has been renamed to libhugetlbfs-bin; we need to follow
the change.

Fixes #6673

(cherry picked from commit 03ce19d53a)
2020-07-05 14:41:33 +03:00
Takuya ASADA
a3e9915a83 dist/debian: apply generated package version for .orig.tar.gz file
We are currently not able to apply the version number fixup for the .orig.tar.gz file,
even when we apply the correct fixup on debian/changelog, because it just reads
SCYLLA-VERSION-FILE.
We should parse debian/{changelog,control} instead.

Fixes #6736

(cherry picked from commit a107f086bc)
2020-07-05 14:08:37 +03:00
Asias He
e4bc14ec1a boot_strapper: Ignore node to be replaced explicitly as stream source
After commit 7d86a3b208 (storage_service:
Make replacing node take writes), during replace operation, tokens in
_token_metadata for node being replaced are updated only after the replace
operation is finished. As a result, in range_streamer::add_ranges, the
node being replaced will be considered as a source to stream data from.

Before commit 7d86a3b208, the node being
replaced would not be considered as a source node, because it was already
replaced by the replacing node before the replace operation finished.
This is the reason why it worked in the past.

To fix, filter out the node being replaced as a source node explicitly.

Tests: replace_first_boot_test and replace_stopped_node_test
Backports: 4.1
Fixes: #6728
(cherry picked from commit e338028b7e22b0a80be7f80c337c52f958bfe1d7)
2020-07-01 14:36:43 +03:00
Takuya ASADA
972acb6d56 scylla_swap_setup: handle <1GB environment
Show a better error message and exit with a non-zero status when the memory size is <1GB.

Fixes #6659

(cherry picked from commit a9de438b1f)
2020-07-01 12:40:25 +03:00
Yaron Kaikov
7fbfedf025 dist/docker/redhat/Dockerfile: update 4.2 params
Set SCYLLA_REPO and VERSION values for scylla-4.2
2020-06-30 13:09:06 +03:00
Avi Kivity
5f175f8103 Merge "Fix handling of decimals with negative scales" from Rafael
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.

Fixes #6720
"

* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
  big_decimal: Add a test for a corner case
  big_decimal: Correctly handle negative scales
  big_decimal: Add a as_rational member function
  big_decimal: Move constructors out of line

(cherry picked from commit 3e2eeec83a)
2020-06-29 12:05:17 +03:00
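
The intended semantics of the negative-scale fix can be sketched with exact rational arithmetic; this is an illustrative model, not the big_decimal implementation:

```python
from fractions import Fraction

def decimal_to_float(unscaled, scale):
    # big_decimal semantics: value = unscaled * 10**(-scale).
    # A negative scale is legal and simply means multiplying by a positive
    # power of ten; treating it as a digit count would never terminate.
    return float(Fraction(unscaled) * Fraction(10) ** (-scale))
```
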
Benny Halevy
674ad6656a compaction: restore % in compaction completion message
The % sign fell off in c4841fa735

Fixes #6727.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200625151352.736561-1-bhalevy@scylladb.com>
(cherry picked from commit a843945115)
2020-06-28 12:10:21 +03:00
Hagit Segev
58498b4b6c release: prepare for 4.2.rc0 2020-06-26 13:06:07 +03:00
Raphael S. Carvalho
b17d20b5f4 reshape: LCS: avoid unnecessary work on level 0
No need to sort level 0 as we only check if levels > 0 are disjoint.

Also taking the opportunity to avoid copies when sorting.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200624151921.20160-1-raphaelsc@scylladb.com>
2020-06-24 18:27:22 +03:00
Rafael Ávila de Espíndola
67c22c8697 commitlog::read_log_file: Don't discard a future
This makes the code a bit easier to read: there are no discarded
futures and no references to having to keep a subscription alive,
which is no longer necessary with current seastar.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>

Message-Id: <20200527013120.179763-1-espindola@scylladb.com>
2020-06-24 17:22:29 +03:00
Botond Dénes
5ff6ac52b2 scylla-gdb.py: collection element func: accept references and pointers to collections
Add support for references (both lvalue and rvalue) and pointers to
collections, in addition to plain values.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200624101305.428925-1-bdenes@scylladb.com>
2020-06-24 13:31:18 +03:00
Avi Kivity
a9c7a1a86c Merge "repair: row_level: prevent deadlocks when repairing homogenous nodes" from Botond
"
Row level repair, when using a local reader, is prone to deadlocking on
the streaming reader concurrency semaphore. This has been observed to
happen with at least two participating nodes, running more concurrent
repairs than the maximum allowed amount of reads by the concurrency
semaphore. In this situation, it is possible that two repair instances,
competing for the last available permits on both nodes, get a permit on
one of the nodes and get queued on the other one respectively. As
neither will let go of the permit it already acquired, nor give up
waiting on the failed-to-acquire permit, a deadlock happens.

To prevent this, we make the local repair reader evictable. For this we
reuse the already existing evictable reader mechanism of the multishard
combining reader. This patchset refactors this evictable reader
mechanism into a standalone flat mutation reader, then exposes it to the
outside world.
The repair reader is paused after the repair buffer is filled, which is
currently 32MB, so the cost of a possible reader recreation is amortized
over a 32MB read.

The repair reader is said to be local, when it can use the shard-local
partitioner. This is the case if the participating nodes are homogeneous
(their shard configuration is identical), that is the repair instance
has to read just from one shard. A non-local reader uses the multishard
reader, which already makes its shard readers evictable and hence is not
prone to the deadlock described here.

Fixes: #6272

Tests: unit(dev, release, debug)
"

* 'repair-row-level-evictable-local-reader/v3' of https://github.com/denesb/scylla:
  repair: row_level: destroy reader on EOS or error
  repair: row_level: use evictable_reader for local reads
  mutation_reader: expose evictable_reader
  mutation_reader: evictable_reader: add auto_pause flag
  mutation_reader: make evictable_reader a flat_mutation_reader
  mutation_reader: s/inactive_shard_read/inactive_evictable_reader/
  mutation_reader: move inactive_shard_reader code up
  mutation_reader: fix indentation
  mutation_reader: shard_reader: extract remote_reader as evictable_reader
  mutation_reader: reader_lifecycle_policy: make semaphore() available early
2020-06-24 12:55:34 +03:00
Piotr Sarna
c2939c67b2 test: add a case for local altering of distributed tables
Local altering, which does not propagate the change to other nodes,
should not be allowed for a non-local table.

Refs #6700
Message-Id: <34a2b191c0e827f296e6d720dc31bf8bda0fd160.1592990796.git.sarna@scylladb.com>
2020-06-24 12:51:41 +03:00
Piotr Sarna
835734c99d cql3: disallow altering non-local tables with local queries
The database has a mechanism of performing internal CQL queries,
mainly to edit its own local tables. Unfortunately, it's easy
to use the interface incorrectly - e.g. issuing an `ALTER TABLE`
statement on a non-local table will result in not propagating
the schema change to other nodes, which in turn leads to
inconsistencies. In order to avoid such mistakes (one of them
was a root cause of #6513), when an attempt to alter a distributed
table via a local interface is performed, it results in an error.

Tests: unit(dev)
Fixes #6700
Message-Id: <61be3defb57be79f486e6067ceff4f4c965e34cb.1592990796.git.sarna@scylladb.com>
2020-06-24 12:51:40 +03:00
Raphael S. Carvalho
864eb20002 reshape: Fix reshaping procedure for LCS
The function that determines if a level L, where L > 0, is disjoint
is returning false if the level is disjoint.
That's because it incorrectly counts an overlapping SSTable in
the level as a disjoint SSTable. So we need to invert the logic.

The side effect is that boot will always try to reshape levels
greater than 0, because the reshape procedure incorrectly thinks that
levels are overlapping when they're actually disjoint.

Fixes #6695.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200623180221.229695-1-raphaelsc@scylladb.com>
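The disjointness predicate in question can be sketched like this (illustrative Python, not the actual C++; the real code compares SSTable first/last keys). The bug amounted to returning the negation of this result:

```python
def level_is_disjoint(ranges):
    """True if no two SSTables in the level overlap.

    ranges: (first_key, last_key) pairs, one per SSTable in the level.
    """
    ordered = sorted(ranges)  # order by first key
    for (_, prev_last), (cur_first, _) in zip(ordered, ordered[1:]):
        if cur_first <= prev_last:  # a neighbor starts before we end
            return False
    return True
```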
2020-06-24 12:50:19 +03:00
Avi Kivity
1398628e8a Update seastar submodule
cql3/functions/error_injection_fcts.cc adjusted for
smp::invoke_on_all() now requiring nothrow move
constructible functions.

* seastar 7664f991b9...11e86172ba (4):
  > Merge "smp: make submit_to noexcept" from Benny
  > memory: Fix clang build
  > Fix a debug build with SEASTAR_TASK_BACKTRACE
  > manual_clock: Add missing includes
2020-06-24 12:49:50 +03:00
Botond Dénes
be452b1f91 service: storage_proxy: log exception returned from replica with more context
Currently the message only mentions the endpoint and the error message
returned from the replica. Add the keyspace and table to this message to
provide more context. This should greatly help investigations of such
errors: in tests, where there is usually a single table, we can already
guess what exactly is timing out based on this.
We should add even more context, like the kind of query (single
partition or range scan), but this information is not readily available
in the surrounding scope, so this patch defers it.

Refs: #6548
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200624054647.413256-1-bdenes@scylladb.com>
2020-06-24 11:30:37 +03:00
Piotr Sarna
df91e9a4c7 alternator: clean up string view conversions
Manual translation from JSON to string_view is replaced
with rjson::to_string_view helper function. In one place,
a redundant string_view intermediary is removed
in favor of creating the string straight from JSON.
Message-Id: <2aa9d9fedd73f14b7640870d14db4f2f0bd7bd8a.1592936139.git.sarna@scylladb.com>
2020-06-23 21:45:27 +03:00
Piotr Sarna
4558401aee alternator: drop using global migration manager
As part of the "war on globals", the unneeded usage of the global
migration manager instance is dropped.
Message-Id: <c9b2fab57e62185daa2441458f9a3a5e7e0a3908.1592936139.git.sarna@scylladb.com>
2020-06-23 21:43:57 +03:00
Piotr Sarna
f4e8cfe03b alternator: fix propagating tags
Updating tags was erroneously done locally, which means that
the schema change was not propagated to other nodes.
The new code announces new schema globally.

Fixes #6513
Branches: 4.0,4.1
Tests: unit(dev)
       dtest(alternator_tests.AlternatorTest.test_update_condition_expression_and_write_isolation)
Message-Id: <3a816c4ecc33c03af4f36e51b11f195c231e7ce1.1592935039.git.sarna@scylladb.com>
2020-06-23 21:27:55 +03:00
Botond Dénes
fbbc86e18c repair: row_level: destroy reader on EOS or error
To avoid having to make it an optional with all the additional checks,
we just replace it with an empty reader instead. This also achieves
the desired effect of releasing the read permit and all the associated
resources early.
2020-06-23 21:08:21 +03:00
Botond Dénes
080f00b99a repair: row_level: use evictable_reader for local reads
Row level repair, when using a local reader, is prone to deadlocking on
the streaming reader concurrency semaphore. This has been observed to
happen with at least two participating nodes, running more concurrent
repairs than the maximum allowed amount of reads by the concurrency
semaphore. In this situation, it is possible that two repair instances,
competing for the last available permits on both nodes, get a permit on
one of the nodes and get queued on the other one respectively. As
neither will let go of the permit it already acquired, nor give up
waiting on the failed-to-acquire permit, a deadlock happens.

To prevent this, we make the local repair reader evictable. For this we
reuse the newly exposed evictable reader.
The repair reader is paused after the repair buffer is filled, which is
currently 32MB, so the cost of a possible reader recreation is amortized
over a 32MB read.

The repair reader is said to be local, when it can use the shard-local
partitioner. This is the case if the participating nodes are homogeneous
(their shard configuration is identical), that is the repair instance
has to read just from one shard. A non-local reader uses the multishard
reader, which already makes its shard readers evictable and hence is not
prone to the deadlock described here.
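The circular wait described above is the classic two-lock deadlock; it can be modeled as a cycle in a wait-for graph (a minimal Python sketch with illustrative names, not Scylla code). Evicting a waiting reader corresponds to removing that repair's entry from `holds` (it drops its permit while waiting), which breaks the cycle:

```python
def deadlocked(holds: dict, waits: dict) -> bool:
    """Detect a circular wait between repair instances.

    holds maps a repair to the read permit it already acquired;
    waits maps a repair to the permit it is queued on.
    """
    holder_of = {permit: repair for repair, permit in holds.items()}
    for start in holds:
        seen, current = set(), start
        while current is not None and current not in seen:
            seen.add(current)
            # follow the edge: who holds the permit `current` waits for?
            current = holder_of.get(waits.get(current))
            if current == start:
                return True
    return False
```

With `holds = {"repair_a": "n1_permit", "repair_b": "n2_permit"}` and each repair waiting on the other's permit, a cycle is found; drop either holder's entry and it is not.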
2020-06-23 21:08:21 +03:00
Botond Dénes
542d9c3711 mutation_reader: expose evictable_reader
Expose functions for the outside world to create evictable readers. We
expose two functions, which create an evictable reader with
`auto_pause::yes` and `auto_pause::no` respectively. The function
creating the latter also returns a handle in addition to the reader,
which can be used to pause the reader.
2020-06-23 21:08:21 +03:00
Botond Dénes
1cc31deff9 mutation_reader: evictable_reader: add auto_pause flag
Currently the evictable reader unconditionally pauses the underlying
reader after each use (`fill_buffer()` or `fast_forward_to()` call).
This is fine for current users (the multishard reader), but the future
user we are doing all this refactoring for -- repair -- will want to
control when the underlying reader is paused "manually". Both these
behaviours can easily be supported in a single implementation, so we
add an `auto_pause` flag to allow the creator of the evictable reader
to control this.
2020-06-23 21:08:21 +03:00
Botond Dénes
af9e1c23e1 mutation_reader: make evictable_reader a flat_mutation_reader
The `evictable_reader` class is almost a proper flat mutation reader
already, it roughly offers the same interface. This patch makes this
formal: changing the class to inherit from `flat_mutation_reader::impl`,
and implement all virtual methods. This also entails a departure from
using the lifecycle policy to pause/resume and create readers, instead
using more general building blocks like the reader concurrency semaphore
and a mutation source.
2020-06-23 21:08:21 +03:00
Rafael Ávila de Espíndola
64c8164e6c everywhere: Update to seastar api v4 (when_all_succeed returning a tuple)
We now just need to replace a few calls to then with then_unpack.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200618172100.111147-1-espindola@scylladb.com>
2020-06-23 19:40:18 +03:00
Raphael S. Carvalho
47f63d021a sstables/sstable_directory: improve log message in reshape()
We were blind about which table needed reshape and what its
compaction strategy was, so let's improve the log message.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200622192502.187532-4-raphaelsc@scylladb.com>
2020-06-23 19:40:18 +03:00
Raphael S. Carvalho
39f96a5572 distributed_loader: Don't mutate levels to zero when populating column family
Unlike refresh on the upload dir, column family population shouldn't mutate
the level of SSTables to level 0. Otherwise, LCS will have to regenerate all
levels by rewriting the data multiple times, badly hurting write
amplification and consequently node performance. That also affects
the time for a node to boot, because reshape may be triggered as a
result.

Refs #6695.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200622192502.187532-2-raphaelsc@scylladb.com>
2020-06-23 19:40:18 +03:00
Benny Halevy
2d7c39de88 storage_service: set_tables_autocompaction: fix not-initialized-yet logic
Typo introduced in bb07678346,
set_tables_autocompaction should reject too-early requests
if !_initialized rather than if _initialized.

Fixes a bunch of compaction dtests. For example:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/530/testReport/compaction_test/TestCompaction_with_DateTieredCompactionStrategy/disable_autocompaction_twice_test/
```
True is not false : Expected to have autocompaction disabled but got it is enabled
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Tests:
- unit(dev),
- compaction_test:TestCompaction_with_DateTieredCompactionStrategy.disable_autocompaction_twice_test(dev)
Message-Id: <20200623151418.439534-1-bhalevy@scylladb.com>
2020-06-23 19:40:18 +03:00
Avi Kivity
c72365d862 thrift: switch csharp backend to netstd
The thrift compiler (since 0.13 at least) complains that
the csharp target is deprecated and recommends replacing it
with netstd. Since we don't use either, humor it.

I suspect that this warning caused some spurious rebuilds,
but have not proven it.
2020-06-23 19:40:18 +03:00
Piotr Sarna
6d224ae131 cql3: add missing filtering stats bump
In a single case of indexed queries, the filtered_rows_read_total
metric was not updated, which could result in inconsistencies between
filtered_rows_read_total and filtered_rows_matched_total later.

Message-Id: <9a5a741da4c6cf030329610ba8b8e340be85c8e6.1592902295.git.sarna@scylladb.com>
2020-06-23 19:40:18 +03:00
Piotr Sarna
7480015721 cql3, service: decouple cql_stats from query pagers
Pager belongs to a different layer than CQL and thus should not be
coupled with CQL stats - if any different frontends want to use paging,
they shouldn't be forced to instantiate CQL stats at all.

Same goes with CQL restrictions, but that will require much bigger
refactoring, so is left for later.

Message-Id: <5585eb470949e3457334ffd6dba80742abf3a631.1592902295.git.sarna@scylladb.com>
2020-06-23 19:40:18 +03:00
Nadav Har'El
428e8b5c96 docker readme: remove outdated warning
In the section explaining how to build a docker image for a self-built
Scylla executable, we have a warning that even if you already built
Scylla, build_reloc.sh will re-run configure.py and rebuild the executable
with slightly different options.

The re-run of configure.py and ninja still happens (see issue #6547) but
we no longer pass *different* options to configure.py, so the rebuild
usually doesn't do anything and finishes in seconds, and the paragraph
warning about the rebuild is no longer relevant.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200621093049.975044-1-nyh@scylladb.com>
2020-06-23 19:40:18 +03:00
Avi Kivity
8d67537178 Update seastar submodule
* seastar a6c8105443...7664f991b9 (13):
  > gate: add try_enter and try_with_gate
  > Merge "Manage reference counts in the file API" from Rafael
  > cmake: Refactor a bit of duplicated code
  > stream: Delete _sub
  > future: Add a rethrow_exception to future_state_base
  > future: Use a new seastar::nested_exception in finally
  > cmake: only apply C++ compile options to C++ language
  > testing: Enable fail-on-abandoned-failed-futures by default
  > future: Correct a few hypercorrect uses of std::forward
  > futures_test: Test using future::then with functions
  > Merge "io-queue: A set of cleanups collected so far" from Pavel E
  > tmp_file: Replace futurize_apply with futurize_invoke
  > future: Replace promise::set_coroutine with forward_state_and_schedule

Contains update to tests from Rafael:

tests: Update for fail-on-abandoned-failed-futures's new default

This depends on the corresponding change in seastar.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-23 19:39:54 +03:00
Botond Dénes
4485864ada mutation_reader: s/inactive_shard_read/inactive_evictable_reader/
Rename `inactive_shard_read` to `inactive_evictable_reader` to reflect
the fact that the evictable reader is going to be of general use,
not specific to the multishard reader.
2020-06-23 10:01:38 +03:00
Botond Dénes
b6ed054c08 mutation_reader: move inactive_shard_reader code up
It will be used by the `evictable_reader` code too in the next patches.
2020-06-23 10:01:38 +03:00
Botond Dénes
e3ea1c9080 mutation_reader: fix indentation
Deferred from the previous patch.
2020-06-23 10:01:38 +03:00
Botond Dénes
f9d1916499 mutation_reader: shard_reader: extract remote_reader as evictable_reader
We want to make the evictable reader mechanism used in the multishard
reader pipeline available for general (re)use, as a standalone
flat mutation reader implementation. The first step is extracting
`shard_reader::remote_reader`, the class implementing this logic, into a
top-level class, also renamed to `evictable_reader`.
2020-06-23 10:01:38 +03:00
Botond Dénes
63309f925c mutation_reader: reader_lifecycle_policy: make semaphore() available early
Currently all reader lifecycle policy implementations assume that
`semaphore()` will only be called after at least one call to
`make_reader()`. This assumption will soon not hold, so make sure
`semaphore()` can be called at any time, including before any calls are
made to `make_reader()`.
2020-06-23 10:01:38 +03:00
Raphael S. Carvalho
9033fa82d7 compaction: Reduce boilerplate to create new compaction type
Run id and compaction type can now be figured out from the base class.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200622160645.177707-1-raphaelsc@scylladb.com>
2020-06-22 20:27:57 +02:00
Takuya ASADA
2d25697873 scylla_swap_setup: fix systemd-escape path
On Ubuntu 18.04 and earlier & Debian 10 and earlier, the /usr merge is not
done, so /usr/bin/systemd-escape and /bin/systemd-escape are in different
places; we call /usr/bin, but Debian variants try to install the command in
/bin. Drop the full path, just call the command name and resolve it via the
default PATH.

Fixes: #6650
2020-06-22 17:42:06 +03:00
Raphael S. Carvalho
2a171ee470 reshape: LCS: fix the target level of reshaping job
The LCS reshape job may pick a wrong level because we iterate through
levels from index 1 and stop the iteration as soon as the current
level is NOT disjoint, so we may never reach the upper
levels, meaning the first NOT disjoint level is used
instead of the actual maximum filled level. That's fixed by doing
the iteration in the reverse order.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200618154112.8335-1-raphaelsc@scylladb.com>
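The fix can be sketched as follows (illustrative Python, names made up): walk levels from the top down and return the first — i.e. highest — disjoint one, instead of walking up and bailing at the first non-disjoint level:

```python
def reshape_target_level(level_disjoint: list) -> int:
    """Pick the highest level > 0 that is disjoint; 0 if none is.

    level_disjoint[i] says whether level i is disjoint.
    """
    for level in range(len(level_disjoint) - 1, 0, -1):
        if level_disjoint[level]:
            return level
    return 0
```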
2020-06-22 16:40:57 +03:00
Avi Kivity
de38091827 priority_manager: merge streaming_read and streaming_write classes into one class
Streaming is handled by just one group for CPU scheduling, so
separating it into read and write classes for I/O is artificial, and
inflates the resources we allow for streaming if both reads and writes
happen at the same time.

Merge both classes into one class ("streaming") and adjust callers. The
merged class has 200 shares, so it reduces streaming bandwidth if both
directions are active at the same time (which is rare; I think it only
happens in view building).
2020-06-22 15:09:04 +03:00
Takuya ASADA
9e51acec1f reloc: simplified .deb build process
We don't really need to have two build_deb.sh; merge it into reloc.
2020-06-22 14:03:13 +03:00
Takuya ASADA
67c0439c7d reloc: simplified .rpm build process
We don't really need to have two build_rpm.sh; merge it into reloc.
2020-06-22 14:03:13 +03:00
Takuya ASADA
90e28c5fcf scylla_raid_setup: daemon-reload after mounts.conf installed
systemd requires a daemon-reload after adding a drop-in file, so we need
to do that after writing mounts.conf.

Fixes #6674
2020-06-22 14:03:13 +03:00
Takuya ASADA
d6165bc1c3 dist/debian/python3: drop dependency on pystache
Same as 287d6e5, we need to drop pystache from package build script
since Fedora 32 dropped it.
2020-06-22 14:03:13 +03:00
Juliusz Stasiewicz
a35b71c247 cdc: Handling of timeout/unavailable exceptions in streams fetching
Retrying the operation of fetching generations does not always make
sense. In this patch only the lightest exceptions (timeout and
unavailable) trigger retrying, while the heavy, unrecoverable ones
abort the operation and get logged at ERROR level.

Fixes #6557
2020-06-22 14:03:13 +03:00
Raphael S. Carvalho
52180f91d4 compaction: Fix the 2x disk space requirement in SSTable upgrade
SSTable upgrade requires 2x the space of the input SSTables because
we aren't releasing references to the SSTables that were already
upgraded. So if we're upgrading 1TB, it means that up to 2TB may be
required for the upgrade operation to succeed.

That can be fixed by moving all input SSTables when rewrite_sstables()
asks for the set of SSTables to be compacted, allowing their space
to be released as soon as there is no longer any ref to them.

Spotted while auditing the code.

Fixes #6682.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619205701.92891-1-raphaelsc@scylladb.com>
2020-06-22 14:03:13 +03:00
Rafael Ávila de Espíndola
a67f5b2de1 sstable_3_x_test: Call await_background_jobs on every test
Now every test starts by deferring a call to
await_background_jobs. That can be verified with:

$ git grep -B 1 await_background test/boost/sstable_3_x_test.cc  | grep THREAD | wc -l
90
$ git grep -A 1 SEASTAR_THREAD_TEST_CASE test/boost/sstable_3_x_test.cc  | grep await_background | wc -l
90

Thanks to Raphael Carvalho for noticing it.

Refs #6624

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619220048.1091630-1-espindola@scylladb.com>
2020-06-22 14:03:13 +03:00
Raphael S. Carvalho
a82afa68aa test/lib/cql_test_env: reenable auto compaction
After e40aa042a7, auto compaction is explicitly disabled on all
tables being populated and only enabled later on in the boot
process. We forgot to update cql_test_env to also reenable
auto compaction, so unit tests based on cql_test_env were not
compacting at all.
database_test, for example, was running out of file descriptors
because their number kept growing unboundedly due to lack of compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200618225621.15937-1-raphaelsc@scylladb.com>
2020-06-22 14:03:13 +03:00
Benny Halevy
a3918bdc96 distributed_loader: reenable verify_owner_and_mode when loading new sstables
The call to `verify_owner_and_mode` from `flush_upload_dir`
fell between the cracks in b34c0c2ff6
(distributed_loader: rework uploading of SSTables).

It causes https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/528/testReport/nodetool_additional_test/TestNodetool/nodetool_refresh_with_wrong_upload_modes_test/
to fail like this:
```
/Directory cannot be accessed .* write/ not found in 'Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/7351db7cab7bbf907172940d0bbf8b90afde90ba/scylla-tools-java/bin/nodetool -h 127.0.87.1 -p 7187 refresh -- keyspace1 standard1' failed; exit status: 1; stdout: nodetool: Scylla API server HTTP POST to URL '/storage_service/sstables/keyspace1' failed: Failed to load new sstables: std::filesystem::__cxx11::filesystem_error (error system:13, filesystem error: remove failed: Permission denied [/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-rqzo7km7/test/node1/data/keyspace1/standard1-8a57a660b29611eabf0c000000000000/upload/mc-3-big-TOC.txt])
```

Reenabling it in this patch makes the dtest pass again.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200621140439.85843-1-bhalevy@scylladb.com>
2020-06-22 14:03:13 +03:00
Benny Halevy
aa4b4311e2 configure: do not define SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION in debug mode
Seastar uses the default allocator in debug mode so it can't inject
allocation failures in this mode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Test: mutation_test(debug)
Message-Id: <20200621131819.72108-1-bhalevy@scylladb.com>
2020-06-22 14:03:13 +03:00
Nadav Har'El
e4eca5211a docker: add option to start Alternator with HTTPS
We already have a docker image option to enable alternator on an unencrypted
port, "--alternator-port", but we forgot to also allow the similar option
for enabling alternator on an encrypted (HTTPS) port: "--alternator-https-port".
This patch adds the missing option and documents how to use it.

Note that using this option is not enough. When this option is used,
Alternator also requires two files, /etc/scylla/scylla.crt and
/etc/scylla/scylla.key, to be inserted into the image. These files should
contain the SSL certificate, and key, respectively. If these files are
missing, you will get an error in the log about the missing file.

Fixes #6583.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200621125219.12274-1-nyh@scylladb.com>
2020-06-22 14:03:13 +03:00
Avi Kivity
7351db7cab Merge "Reshape upload files and reshard+reshape at boot" from Glauber
"

This patchset adds a reshape operation to each compaction strategy;
that is, a strategy-specific way of detecting whether SSTables are in-strategy
or off-strategy, and, if they are off-strategy, moving them back in-strategy.

Oftentimes the number of SSTables in a particular slice of the sstable set
matters for that decision (number of SSTables in the same time window for TWCS,
number of SSTables per tier for STCS, number of L0 SSTables for LCS). We want
to be more lenient for operations that keep the node offline, like reshape at
boot, and stricter for operations like upload, which run in maintenance
mode. To accommodate that, the threshold for considering a slice of the SSTable
set off-strategy is passed as a parameter.

Once this patchset is applied, the upload directory will reshape the SSTables
before moving them to the main directory (if needed). One side effect of it
is that it is no longer necessary to take locks for the refresh operation nor
disable writes in the table.

With the infrastructure that we have built in the upload directory, we can
apply the same set of steps to populate_column_family. Using the sstable_directory
to scan the files, we can reshard and reshape (usually, if we resharded, a reshape
will be necessary) with the node still offline. This has the benefit of never
adding shared SSTables to the table.

Applying this patchset will unlock a host of cleanups:
- we can get rid of all testing for shared sstables, sstable_need_rewrite, etc.
- we can remove the resharding backlog tracker.

and many others. Most cleanups are deferred for a later patchset, though.
"

* 'reshard-reshape-v4' of github.com:glommer/scylla:
  distributed_loader: reshard before the node is made online
  distributed_loader: rework uploading of SSTables
  sstable_directory: add helper to reshape existing unshared sstables
  compaction_strategy: add method to reshape SSTables
  compaction: add a new compaction type, Reshape
  compaction: add a size and throughput pretty printer.
  compaction: add default implementation for some pure functions
  tests: fix fragile database tests
  distributed_loader.cc: add a helper function to extract the highest SSTable version found
  distributed_loader.cc : extract highest_generation_seen code
  compaction_manager: rename run_resharding_job
  distributed_loader: assume populate_column_families is run in shard 0
  api: do not allow user to meddle with auto compaction too early
  upload: use custom error handler for upload directory
  sstable_directory: fix debug message
2020-06-18 17:04:53 +03:00
Glauber Costa
e40aa042a7 distributed_loader: reshard before the node is made online
This patch moves the resharding process to use the new
directory_with_sstables_handler infrastructure. There is no longer
a clear reshard step, and that just becomes a natural part of
populate_column_family.

In main.cc, a couple of changes are necessary to make that happen.
The first one obviously is to stop calling reshard. We also need to
make sure that:
 - The compaction manager is started much earlier, so we can register
   resharding jobs with it.
 - auto compactions are disabled in the populate method, so resharding
   doesn't have to fight for bandwidth with auto compactions.

Now that we are resharding through the sstable_directory, the old
resharding code can be deleted. There is also no need to deal with
the resharding backlog either, because the SSTables are not yet
added to the sstable set at this point.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:37:18 -04:00
Glauber Costa
b34c0c2ff6 distributed_loader: rework uploading of SSTables
Uploading of SSTables is problematic: for historical reasons it takes a
lock that may have to wait for ongoing compactions to finish, then it
disables writes in the table, and then it goes loading SSTables as if it
knew nothing about them.

With the sstable_directory infrastructure we can do much better:

* we can reshard and reshape the SSTables in place, keeping the number
  of SSTables in check. Because this is a background process we can be
  fairly aggressive and set the reshape mode to strict.

* we can then move the SSTables directly into the main directory.
  Because we know they are few in number we can call the more elegant
  add_sstable_and_invalidate_cache instead of the open coding currently
  done by load_new_sstables.

* we know they are not shared (if they were, we resharded them),
  simplifying the load process even further.

The major change after this patch is applied is that all compactions
(resharding and reshape) needed to make the SSTables in-strategy are
done in the streaming class, which reduces the impact of this operation
on the node. When the SSTables are loaded, subsequent reads will not
suffer as we will not be adding shared SSTables in potential high
numbers, nor will we reshard in the compaction class.

There is also no more need for a lock in the upload process so in the
fast path where users are uploading a set of SSTables from a backup this
should essentially be instantaneous. The lock, as well as the code to
disable and enable table writes is removed.

A future improvement is to bypass the staging directory too, in which
case the reshaping compaction would already generate the view updates.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:37:18 -04:00
Glauber Costa
4d6aacb265 sstable_directory: add helper to reshape existing unshared sstables
Before moving SSTables to the main directory, we may need to reshape them
to be in-strategy. This patch provides helper code that reshapes the SSTables
that are known to be unshared and local in the sstable directory, and updates the
sstable directory with the result.

Reshaping can be made more or less aggressive by passing a reshape mode
(relaxed or strict), which will influence the number of SSTables reshape
can tolerate before considering a particular slice of the SSTable set
off-strategy.

Because the compaction code expects a std::vector everywhere, we changed
our chunked vector for the unshared sstables to a std::vector so we
can more easily pass it around without conversions.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:37:18 -04:00
Glauber Costa
3c254dd49d compaction_strategy: add method to reshape SSTables
Some SSTable sets are considered to be off-strategy: they are in a shape
that is at best not optimal and at worst adversarial to the current
compaction strategy.

This patch introduces the compaction strategy-specific method
get_reshaping_job(). Given an SSTable set, it returns one compaction
that can be done to bring the table closer to being in-strategy. The
caller can then call this repeatedly until the table is fully
in-strategy.

As an example of how this is supposed to work, consider TWCS: some
SSTables will belong to a single window, in which case they are
already in-strategy and don't need to be compacted, while others span
multiple windows, in which case they are considered off-strategy and
have to be compacted.
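The single-window check described above can be sketched as follows (a
minimal illustration, assuming a hypothetical 1-hour window and
epoch-second timestamps; TWCS's real window logic has more options):

```python
def window_of(write_time, window_seconds=3600):
    """Map a write timestamp (epoch seconds) to its TWCS window index;
    the 1-hour window size is a hypothetical example."""
    return int(write_time) // window_seconds

def is_off_strategy(min_ts, max_ts, window_seconds=3600):
    """In-strategy: all data falls in one time window.  Off-strategy:
    the SSTable straddles windows and is a candidate for reshaping."""
    return window_of(min_ts, window_seconds) != window_of(max_ts, window_seconds)

assert not is_off_strategy(100, 3500)   # single window: in-strategy
assert is_off_strategy(3500, 3700)      # straddles two windows: reshape
```

get_reshaping_job() would then bundle the off-strategy SSTables of one
slice into a compaction descriptor, and the caller repeats until nothing
is off-strategy.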

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:37:18 -04:00
Glauber Costa
0467bd0a94 compaction: add a new compaction type, Reshape
From the point of view of SSTable selection and expected output,
Reshaping really is just a normal compaction. However, there are some
key differences that we would like to uphold:

- Reshaping is done separately from the main SSTable set. It can be
  done with the node offline, or it can be done in a separate priority
  class. Either way, we don't want those SSTables to count towards
  backlog. For reads, because the SSTables are not yet registered in
  the backlog tracker (if offline or coming from upload), if we were
  to deduct compaction charges from it we would go negative. For writes,
  we don't want to deal with backlog management here because we will add
  the SSTable at once when reshaping is finished.

- We don't need to do early replacements.

- We would like to clearly mark the Reshaping compactions as such in the
  logs

For the reasons above, it is nicer to add a new Reshape compaction type,
a subclass of compaction, that upholds such properties.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:37:18 -04:00
Glauber Costa
c4841fa735 compaction: add a size and throughput pretty printer.
This is so we don't always use MB. Sometimes it is best
to report GB, TB, and their equivalent throughput metrics.
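The unit-picking logic can be sketched like this (the helper names
`pretty_size` and `pretty_throughput` are hypothetical, and decimal
1000-based units are assumed for simplicity; the real printer is C++):

```python
def pretty_size(n_bytes):
    """Pick the largest unit so the number stays human-readable."""
    value = float(n_bytes)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if value < 1000 or unit == "TB":
            return f"{value:.2f}{unit}"
        value /= 1000

def pretty_throughput(n_bytes, seconds):
    """Report throughput in the same auto-scaled units."""
    return pretty_size(n_bytes / seconds) + "/s"

assert pretty_size(2_500_000_000) == "2.50GB"
assert pretty_throughput(1_000_000, 2) == "500.00kB/s"
```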

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:37:18 -04:00
Takuya ASADA
76af112c7d scylla_swap_setup: don't create sparse file
fallocate creates a sparse file, and XFS rejects such files for use as a
swapfile:
https://bugzilla.redhat.com/show_bug.cgi?id=1129205

Use dd instead.

Fixes #6650
2020-06-18 16:26:17 +03:00
Avi Kivity
7129662edb Update seastar submodule
* seastar b515d63735...a6c8105443 (15):
  > Merge "Move thread_wake_task out of line" from Rafael
  > future: Fix result_of_apply instantiation
  > future: Move the function in then/then_wrapped only once
  > io-queue: Dont leak desc
  > fair-queue: Keep request queues self-consistent
  > app: Do not coredump on missing options
  > future: promise: mark set_value as noexcept
  > future: future_state: mark set as noexcept
  > fair_queue_perf: Remove unused captures
  > file_io_test: Add missing override
  > Merge "tmp_dir: handle remove failure in do_with_thread" from Benny
  > api-level: Add missing api_v4 namespace
  > future: Fix CanApplyTuple
  > http: use logger instead of stderr for erro reporting
  > sstring: Generalize make_sstring a bit
2020-06-18 16:16:05 +03:00
Glauber Costa
ef85a2cec5 compaction: add default implementation for some pure functions
There are some functions that are today pure that have an obvious
implementation (for example on_new_partition, do nothing). We'll add
default implementations to the compaction class, which reduces the
boilerplate needed to add a new compaction type.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:28 -04:00
Glauber Costa
96abf80c5e tests: fix fragile database tests
This test wants to make sure that an SSTable with generation number 4,
which is incomplete, gets deleted.

While that works today, the way the test verifies that is fragile
because new SSTables can and will be created, especially in the local
directory that sees a lot of activity on startup.

It works if generations don't go that far, but with SMP, even a single
SSTable in the right shard can end up having generation 4. In practice
this isn't an issue today because the code calls
cf.update_sstables_known_generation() as soon as it sees a file, before
deciding whether or not the file has to be deleted. However this
behavior is not guaranteed and is changing.

The best way to fix this would be to check if the file is the same,
including its inode. But given that this is just a unit test (which
is almost always if not always single node), I am just moving to use
the peers table instead. Again, we could have created a user table,
but it's just not worth the hassle.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:28 -04:00
Glauber Costa
072d0d3073 distributed_loader.cc: add a helper function to extract the highest SSTable version found
Using a map reduce in a shared sstable directory, finds the highest
version seen across all shards.
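The map-reduce described above can be sketched as follows (a toy,
single-threaded stand-in for seastar's sharded map_reduce; the shard
data and version ordering are made up for illustration):

```python
def map_reduce(shards, mapper, reducer, initial):
    """Apply mapper to each shard's state, fold the results with reducer."""
    acc = initial
    for shard in shards:
        acc = reducer(acc, mapper(shard))
    return acc

# Highest sstable version across shards: each shard maps to its local
# maximum, the reduction keeps the global maximum.
order = {"ka": 0, "la": 1, "mc": 2}
shard_versions = [["ka", "la"], ["mc"], ["la"]]
highest = map_reduce(shard_versions,
                     lambda vs: max(vs, key=order.get),
                     lambda a, b: max(a, b, key=order.get),
                     "ka")
assert highest == "mc"
```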

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:28 -04:00
Glauber Costa
baa82b3a26 distributed_loader.cc : extract highest_generation_seen code
We'll use it in one other location, so extract it to common code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:27 -04:00
Glauber Costa
9902af894a compaction_manager: rename run_resharding_job
It will be used to run any custom job where the caller provides a
function. One such example is indeed resharding, but reshaping SSTables
can also fall here.

The semaphore is also renamed, and we'll allow only one custom job at a
time (across all possible types).

We also remove the assumption of the scheduling group. The caller has to
have already placed the code in the correct CPU scheduling group.  The
I/O priority class comes from the descriptor.

To make sure that we don't regress, we wrap the entire reshard-at-boot
code in the compaction class. Currently the setup would be done in the
main group, and the actual resharding in the compaction group. Note that
this is temporary, as this code is about to change.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:27 -04:00
Glauber Costa
45f3bc679e distributed_loader: assume populate_column_families is run in shard 0
This is already the case, since main.cc calls it from shard 0 and
relies on it to spread the information to the other shards. We will
turn this branch - which is always taken - into an assert for the
sake of future-proofing and soon add even more code that relies on this
being executed in shard 0.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:27 -04:00
Glauber Costa
bb07678346 api: do not allow user to meddle with auto compaction too early
We are about to use the auto compaction property during the
populate/reshard process. If the user toggles it, the database can be
left in a bad state.

There should be no reason why a user would want to set that up this
early. So we'll disallow it.

To do that properly, it is better if the check of whether or not
the storage service is ready to accommodate this request is local
to the storage service itself. We therefore move the logic of
set_tables_autocompaction from the API to the storage service. The API
layer now merely translates the table names and passes them along.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:25 -04:00
Pekka Enberg
3d128f5b51 scripts: Rename sync-submodules.sh to refresh-submodules.sh
Rename the script as per Nadav's suggestion and update documentation
within the script.
Message-Id: <20200618123446.32496-1-penberg@scylladb.com>
2020-06-18 15:39:23 +03:00
Rafael Ávila de Espíndola
f6e407ecd2 everywhere: Prepare for seastar api v4 (when_all_succeed return value)
The seastar api v4 changes the return type of when_all_succeed. This
patch adds discard_result when that is best solution to handle the
change.

This doesn't do the actual update to v4 since there are still a few
issues left to fix in seastar. A patch doing just the update will
follow.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200617233150.918110-1-espindola@scylladb.com>
2020-06-18 15:13:56 +03:00
Amnon Heiman
bc854342e7 approx_exponential_histogram: Makes the implementation clearer
This patch aims to make the implementation and usage of the
approx_exponential_histogram clearer.

The approx_exponential_histogram uses a combination of Min, Max,
Precision and number of buckets, where the user needs to pick 3.

Most of the changes in the patch are about documenting the class and
its methods, but following the review there are two functionality changes:

1. The user now picks Min, Max and Precision, and the number of buckets
   is calculated from these values.
2. The template restrictions are now stated in a requires clause, so
   violations are stopped at compile time.
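One plausible way the bucket count can follow from Min, Max and
Precision, assuming Precision is the relative width of adjacent
exponential buckets (a hypothetical relationship for illustration; the
class documentation has the real one):

```python
import math

def bucket_count(min_val, max_val, precision):
    """With bucket bounds growing geometrically as min * (1 + precision)^k,
    covering [min, max] needs ceil(log(max/min) / log(1 + precision))
    buckets."""
    return math.ceil(math.log(max_val / min_val) / math.log(1 + precision))

# precision 1.0 means doubling buckets: 2^10 = 1024 needs 10 buckets.
assert bucket_count(1, 1024, 1.0) == 10
```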
2020-06-18 14:18:21 +03:00
Tomasz Grabiec
17ee1a2eed utils: cached_file: Fix compilation error
Fix field initialization order problem.

In file included from ./sstables/mc/bsearch_clustered_cursor.hh:28,
                 from sstables/index_reader.hh:32,
                 from sstables/sstables.cc:49:
./utils/cached_file.hh: In constructor 'cached_file::stream::stream(cached_file&, const seastar::io_priority_class&, tracing::trace_state_ptr, cached_file::page_idx_type, cached_file::offset_type)':
./utils/cached_file.hh:119:34: error: 'cached_file::stream::_trace_state' will be initialized after [-Werror=reorder]
  119 |         tracing::trace_state_ptr _trace_state;
      |                                  ^~~~~~~~~~~~
./utils/cached_file.hh:117:23: error:   'cached_file::page_idx_type cached_file::stream::_page_idx' [-Werror=reorder]
  117 |         page_idx_type _page_idx;
      |                       ^~~~~~~~~
./utils/cached_file.hh:127:9: error:   when initialized here [-Werror=reorder]
  127 |         stream(cached_file& cf, const io_priority_class& pc, tracing::trace_state_ptr trace_state,
      |         ^~~~~~
Message-Id: <1592478082-22505-1-git-send-email-tgrabiec@scylladb.com>
2020-06-18 14:08:29 +03:00
Raphael S. Carvalho
03db448a92 sstables/backlog_tracker: Fix incorrect calculation of Compaction backlog
When debugging this for first time c412a7a, I thought the problem,
which causes backlog to be negative, was a bug in the implementation of the
formula, but it turns out that the bug is actually in the formula itself.
Not limiting the scope of this bug to STCS because its tracker is inherited
by the trackers of other strategies, meaning they're also affected by this.

The backlog for a SSTable is known to be
	Bi = Ei * log(T / Si)
Where T = total Size minus compacted bytes for a table,
      Ci = Compacted Bytes for a SSTable,
      Si = Size of a SStable
      Ei = Si - Ci

The problem was that we were assuming T > Si, but it can happen that T
is lower than Si if the table in question is decreasing in size.

If we rewrite SSTable backlog as
	Bi = Ei * log (T) - Ei * log(Si)
It becomes even clearer why T cannot be lower than Si whatsoever,
or the backlog calculation can go wrong because the first term becomes
lower than the second.

Fixing the formula consists of changing it to
	Bi = Ei * log (T / Ei)
which expands to
	Bi = Ei * log (T) - Ei * log (Si - Ci)

After this change, the backlog still behaves in a very similar way
to before, which can be confirmed via this graph:
https://user-images.githubusercontent.com/1409139/79627762-71afdf80-8111-11ea-9ebc-0831c4e3d9c6.png
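The fix can be illustrated with a short sketch (illustrative only, not
Scylla's implementation; `sstable_backlog`/`sstable_backlog_old` are
hypothetical names):

```python
import math

def sstable_backlog(total_bytes, sstable_size, compacted_bytes):
    """Corrected per-SSTable backlog: Bi = Ei * log(T / Ei), Ei = Si - Ci.
    T is the sum of all SSTables' remaining bytes, so T >= Ei holds and
    the logarithm is never negative."""
    remaining = sstable_size - compacted_bytes          # Ei
    if remaining <= 0 or total_bytes <= 0:
        return 0.0
    return remaining * math.log(total_bytes / remaining)

def sstable_backlog_old(total_bytes, sstable_size, compacted_bytes):
    """Old, buggy formula: Bi = Ei * log(T / Si) goes negative when the
    table shrinks so that T < Si."""
    remaining = sstable_size - compacted_bytes
    return remaining * math.log(total_bytes / sstable_size)

# Table with 100 bytes left, one half-compacted SSTable of 150 bytes (Ei = 70):
assert sstable_backlog_old(100, 150, 80) < 0    # bug: negative backlog
assert sstable_backlog(100, 150, 80) > 0        # fix: stays positive
```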

Fixes #6021.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200616174712.16505-1-raphaelsc@scylladb.com>
2020-06-18 13:56:47 +03:00
Avi Kivity
5d99d667ec Merge "Build system improvements for packaging" from Pekka
"
This patch series attempts to decouple package build and release
infrastructure, which is internal to Scylla (the company). The goal of
this series is to make it easy for humans and machines to build the full
Scylla distribution package artifacts, and make it easy to quickly
verify them.

The improvements to build system are done in the following steps.

1. Make scylla.git a super-module, which has git submodules for
   scylla-jmx and scylla-tools.  A clone of scylla.git is now all that
   is needed to access all source code of all the different components
   that make up a Scylla distribution, which is a preparational step to
   adding "dist" ninja build target. A scripts/sync-submodules.sh helper
   script is included, which allows easy updating of the submodules to the
   latest head of the respective git repositories.

2. Make builds reproducible by moving the remaining relocatable package
   specific build options from reloc/build_reloc.sh to the build system.
   After this step, you can build the exact same binaries from the git
   repository by using the dbuild version from scylla.git.

3. Add a "dist" target to ninja build, which builds all .rpm and .deb
   packages with one command. To build a release, run:

   $ ./tools/toolchain/dbuild ./configure.py --mode release

   $ ./tools/toolchain/dbuild ninja-build dist

   and you will now have .rpm and .deb packages to all the components of
   a Scylla distribution.

4. Add a "dist-check" target to ninja build for verification of .rpm and
   .deb packages in one command. To verify all the built packages, run:

   $  ninja-build dist-check

   Please note that you must run this step on the host, because the
   target uses Docker under the hood to verify packages by installing
   them on different Linux distributions.

   Currently only CentOS 7 verification is supported.

All these improvements are done so that backward compatibility is
retained. That is, any existing release infrastructure or other build
scripts are completely unaffected.

Future improvements to consider:

- Package repository generation: add a "ninja repo" command to generate
  a .rpm and .deb repositories, which can be uploaded to a web site.
  This makes it possible to build a downloadable Scylla distribution
  from scylla.git. The target requires some configuration, which user
  has to provide. For example, download URL locations and package
  signing keys.

- Amazon Machine Image (AMI) support: add a "ninja ami" command to
  simplify the steps needed to generate a Scylla distribution AMI.

- Docker image support: add a "ninja docker" command to simplify the
  steps needed to generate a Scylla distribution Docker image.

- Simplify and unify package build: simplify and unify the various shell
  scripts needed to build packages in different git repositories. This
  step will break backward compatibility and can be done only after
  relevant build scripts and release infrastructure is updated.
"

* 'penberg/packaging/v5' of github.com:penberg/scylla:
  docs: Update packaging documentation
  build: Add "dist-check" target
  scripts/testing: Add "dist-check" for package verification
  build: Add "dist" target
  reloc: Add '--builddir' option to build_deb.sh
  build: Add "-ffile-prefix-map" to cxxflags
  docs: Document sync-submodules.sh script in maintainer.md
  sync-submodules.sh: Add script for syncing submodules
  Add scylla-tools submodule
  Add scylla-jmx submodule
2020-06-18 12:59:52 +03:00
Dejan Mircevski
aec1acd1d5 range_test: Add cases for singular intersection
Intersection was previously not tested for singular ranges.  This
ensures it will always work for singular ranges, too.
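A sketch of why singular (point) ranges need no special casing when
intersecting closed ranges (illustrative only; Scylla's range type also
models open bounds and wrap-around):

```python
def intersect(a, b):
    """Intersect two closed ranges (lo, hi); a singular range has lo == hi.
    Returns None when the ranges don't overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

assert intersect((5, 5), (0, 10)) == (5, 5)   # singular within ordinary
assert intersect((5, 5), (6, 9)) is None      # disjoint singular range
```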

Tests: unit(dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-18 12:38:31 +03:00
Yaron Kaikov
e9d5852b0c dbuild: Add an option to run dbuild using podman
Following https://github.com/scylladb/scylla/pull/5333, we want to be
able to run dbuild using either podman or docker by setting an
environment variable named DBUILD_TOOL.

DBUILD_TOOL defaults to docker unless we explicitly set the tool to
podman.

Fixes: https://github.com/scylladb/scylla/pull/6644
2020-06-18 12:13:39 +03:00
Avi Kivity
9322c07c71 Merge "Use binary search in sstable promoted index" from Tomasz
"
The "promoted index" is how the sstable format calls the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into memory.
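The derivation above can be sketched as follows, assuming for
illustration fixed 4-byte big-endian offsets (the real 'mc' encoding
differs):

```python
def promoted_index_entry_offset(index_bytes, offsets_end, num_entries, i):
    """Locate the offset of promoted-index entry i via the trailing
    offset map: the map is a vector of fixed-size offsets ending at
    offsets_end, so entry i's slot sits at a computable position and can
    be read without loading the whole map."""
    pos = offsets_end - (num_entries - i) * 4
    return int.from_bytes(index_bytes[pos:pos + 4], "big")

# Toy promoted index: 3 entries starting at offsets 0, 7 and 19,
# followed by the offset map.
entries_blob = b"x" * 19
offset_map = b"".join(o.to_bytes(4, "big") for o in (0, 7, 19))
blob = entries_blob + offset_map
assert promoted_index_entry_offset(blob, len(blob), 3, 1) == 7
```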

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to promoted index
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with eviction
policy.

Fixes #4007.

TESTING RESULTS

 * Point reads, large promoted index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

      time: 1.9ms vs 22.9ms
      CPU utilization: 8.9% vs 92.3%
      I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

      It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller.

    - Slicing at the front (offset=0) is a mixed bag.

      time is similar: 1.8ms
      CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
      disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

      bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.

      scan is issuing:

         2 * 128 KB (index page)
         2 * 32 KB (data file)

      bsearch is issuing:

         1 * 64 KB (index page)
         15 * 4 KB (promoted index)
         1 * 64 KB (data file)

      The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
      32 KB is the minimum I/O currently.

      Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
      so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

  Command:

        perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001836          172         1        545          9        563        175        4.0      4        320       2       2        0        1        1        0        0        0  57.7%      0
    0       32        0.001858          502        32      17220        126      17776      11526        3.2      3        324       2       1        0        1        1        0        0        0  56.4%      0
    0       256       0.002833          339       256      90374        427      91757      85931        7.0      7        776       3       1        0        1        1        0        0        0  41.1%      0
    0       4096      0.017211           58      4096     237984       2011     241802     233870       66.1     66       8376      59       2        0        1        1        0        0        0  21.4%      0
    5000000 1         0.022952           42         1         44          1         45         41       29.2     29       3520      22       2        0        1        1        0        0        0  92.3%      0
    5000000 32        0.023052           43        32       1388         14       1414       1331       31.1     32       3588      26       2        0        1        1        0        0        0  91.7%      0
    5000000 256       0.024795           41       256      10325        129      10721       9993       43.1     39       4544      29       2        0        1        1        0        0        0  86.4%      0
    5000000 4096      0.038856           27      4096     105414        398     106918     103162       95.2     95      12160      78       5        0        1        1        0        0        0  61.4%      0

 After (v2):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001831          248         1        546         21        581        252       17.6     17        188       2       0        0        1        1        0        0        0   8.5%      0
    0       32        0.001910          535        32      16751        626      17770      13896       17.9     19        160       3       0        0        1        1        0        0        0   8.8%      0
    0       256       0.003545          266       256      72207       2333      89076      62852       26.9     24        764       7       0        0        1        1        0        0        0   9.7%      0
    0       4096      0.016800           56      4096     243812        524     245430     239736       83.6     83       8700      64       0        0        1        1        0        0        0  16.6%      0
    5000000 1         0.001968          351         1        508         19        538        380       21.3     21        172       2       0        0        1        1        0        0        0   8.9%      0
    5000000 32        0.002273          431        32      14077        436      15503      11551       22.7     22        268       3       0        0        1        1        0        0        0   8.9%      0
    5000000 256       0.003889          257       256      65824       2197      81833      57813       34.0     37        652      18       0        0        1        1        0        0        0  11.2%      0
    5000000 4096      0.017115           54      4096     239324        834     241310     231993       88.3     88       8844      65       0        0        1        1        0        0        0  16.8%      0

 After (v1):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001886          259         1        530          4        545        261       18.0     18        376       2       2        0        1        1        0        0        0   9.1%      0
    0       32        0.001954          513        32      16381         93      16844      15618       19.0     19        408       3       2        0        1        1        0        0        0   9.3%      0
    0       256       0.003266          318       256      78393       1820      81567      61663       30.8     26       1272       7       2        0        1        1        0        0        0  10.4%      0
    0       4096      0.017991           57      4096     227666        855     231915     225781       83.1     83       8888      55       5        0        1        1        0        0        0  15.5%      0
    5000000 1         0.002353          232         1        425          2        432        232       23.0     23        396       2       2        0        1        1        0        0        0   8.7%      0
    5000000 32        0.002573          384        32      12437         47      12571        429       25.0     25        460       4       2        0        1        1        0        0        0   8.5%      0
    5000000 256       0.003994          259       256      64101       2904      67924      51427       37.0     35       1484      11       2        0        1        1        0        0        0  10.6%      0
    5000000 4096      0.018567           56      4096     220609        448     227395     219029       89.8     89       9036      59       5        0        1        1        0        0        0  15.1%      0

 * Point reads, small promoted index (two blocks):

  Config: rows: 400, value size: 200
  Partition size: 84 KiB
  Index size: 65 B

  Notes:
     - No significant difference in time
     - the same disk utilization
     - similar CPU utilization

  Command:

      perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000279          470         1       3587         31       3829        478        3.0      3         68       2       1        0        1        1        0        0        0  21.1%      0
    0       32        0.000276         3498        32     116038        811     122756     104033        3.0      3         68       2       1        0        1        1        0        0        0  24.0%      0
    0       256       0.000412         2554       256     621044       1778     732150     559221        2.0      2         72       2       0        0        1        1        0        0        0  32.6%      0
    0       4096      0.000510         1901       400     783883       4078     819058     665616        2.0      2         88       2       0        0        1        1        0        0        0  36.4%      0
    200     1         0.000339         2712         1       2951          8       3001       2569        2.0      2         72       2       0        0        1        1        0        0        0  17.8%      0
    200     32        0.000352         2586        32      91019        266      92427      83411        2.0      2         72       2       0        0        1        1        0        0        0  20.8%      0
    200     256       0.000458         2073       200     436503       1618     453945     385501        2.0      2         88       2       0        0        1        1        0        0        0  29.4%      0
    200     4096      0.000458         2097       200     436475       1676     458349     381558        2.0      2         88       2       0        0        1        1        0        0        0  29.0%      0

  After (v1):

    Testing slicing of large partition using clustering keys:
    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000278          492         1       3598         30       3831        500        3.0      3         68       2       1        0        1        1        0        0        0  19.4%      0
    0       32        0.000275         3433        32     116153        753     122915      92559        3.0      3         68       2       1        0        1        1        0        0        0  22.5%      0
    0       256       0.000458         2576       256     559437       2978     728075     504375        2.1      2         88       2       0        0        1        1        0        0        0  29.0%      0
    0       4096      0.000506         1888       400     790064       3306     822360     623109        2.0      2         88       2       0        0        1        1        0        0        0  36.6%      0
    200     1         0.000382         2493         1       2619         10       2675       2268        2.0      2         88       2       0        0        1        1        0        0        0  16.3%      0
    200     32        0.000398         2393        32      80422        333      84759      22281        2.0      2         88       2       0        0        1        1        0        0        0  19.0%      0
    200     256       0.000459         2096       200     435943       1608     453989     380749        2.0      2         88       2       0        0        1        1        0        0        0  30.5%      0
    200     4096      0.000458         2097       200     436410       1651     455779     382485        2.0      2         88       2       0        0        1        1        0        0        0  29.2%      0

 * Scan with skips, large index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)

    - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

      Binary search reads 828 KB more and issues 1719 more IOs.
      It does more I/O to read the promoted index offset map.

    - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
      we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
      last "mem" column.

  Command:

    perf_fast_forward --datasets=large-part-ds1 \
       --run-tests=large-partition-skips -c1 --test-case-duration=1

  Before:

      read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
      1       1        36.103451            4   5000000     138491         38     138601     138453   153932.0 153932   19703260  153561       1        0        1        1        0        0        0  31.5% 502690

  After (v2):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        37.000145            4   5000000     135135          6     135146     135128   155651.0 155651   19704088  138968       0        0        1        1        0        0        0  34.2%      0

  After (v1):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        36.965520            4   5000000     135261         30     135311     135231   155628.0 155628   19704216  139133       1        0        1        1        0        0        0  33.9% 248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

  - unit (all modes)
  - manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals
2020-06-18 12:09:39 +03:00
Pekka Enberg
4d48f22827 docs: Update packaging documentation 2020-06-18 10:20:08 +03:00
Pekka Enberg
9e279ec2a9 build: Add "dist-check" target
This adds a "dist-check" target to ninja build. The target needs to be
run on the host because package verification is done with Docker.
2020-06-18 10:20:08 +03:00
Pekka Enberg
584c7130a1 scripts/testing: Add "dist-check" for package verification
This adds a "dist-check.sh" script in tools/testing, which performs
distribution package verification by installing packages under Docker.
2020-06-18 10:16:46 +03:00
Pekka Enberg
8e1a561fba build: Add "dist" target 2020-06-18 10:16:46 +03:00
Pekka Enberg
7b7c91a34b reloc: Add '--builddir' option to build_deb.sh
The build system will call this script. It needs control over where the
packages are built to allow building packages for the different build
modes.
2020-06-18 09:54:37 +03:00
Pekka Enberg
013f87f388 build: Add "-ffile-prefix-map" to cxxflags
This patch adds "-ffile-prefix-map" to cxxflags for all build modes.
This has two benefits:

1. Relocatable packages no longer have any special build flags, which
   makes deeper integration with the build system possible (e.g.
   targets for packages).

2. Builds are now reproducible, which makes debugging easier in case you
  only have a backtrace, but no artifacts. Rafael explains:

  "BTW, I think I found another argument for why we should always build
   with -ffile-prefix-map=.

   There was a use-after-free test failure on next promotion. I am unable
   to reproduce it locally, so it would be super nice to be able to
   decode the backtrace.

   I was able to do it, but I had to create a
   /jenkins/workspace/scylla-master/next/ directory and build from there
   to get the same results as the bot."

Acked-by: Botond Dénes <bdenes@scylladb.com>
Acked-by: Nadav Har'El <nyh@scylladb.com>
Acked-by: Rafael Avila de Espindola <espindola@scylladb.com>
2020-06-18 09:54:37 +03:00
Pekka Enberg
71da4e6e79 docs: Document sync-submodules.sh script in maintainer.md 2020-06-18 09:54:37 +03:00
Pekka Enberg
e3376472e8 sync-submodules.sh: Add script for syncing submodules 2020-06-18 09:54:37 +03:00
Pekka Enberg
d759d7567b Add scylla-tools submodule 2020-06-18 09:54:37 +03:00
Pekka Enberg
9edf858d30 Add scylla-jmx submodule 2020-06-18 09:54:37 +03:00
Benny Halevy
5926cfc298 CMakeLists.txt: Update to C++20
Following 427398641a

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200618052956.570260-1-bhalevy@scylladb.com>
2020-06-18 09:51:23 +03:00
Pekka Enberg
02b733c22b Revert "dbuild: Add an option to run with 'docker' or 'podman'"
This reverts commit ac7237f991. The logic
is wrong and always picks "podman" if it's installed on the system even
if user asks for "docker" with the DBUILD_TOOL environment variable.
This wreaks havoc on machines that have both docker and podman packages
installed, but podman is not configured correctly.
2020-06-18 09:22:33 +03:00
Juliusz Stasiewicz
8628ede009 cdc: Fix segfault when stream ID key is too short
When a token is calculated for stream_id, we check that the key is
exactly 16 bytes long. If it's not, `minimum_token` is returned
and the client receives an empty result.

This used to be the expected behavior for empty keys; now it's
extended to keys of any incorrect length.

Fixes #6570
2020-06-17 18:19:37 +03:00
Nadav Har'El
095ddf0d41 alternator test: use ConsistentRead=True where missing
All tests that write some data and then read it back need to use
ConsistentRead=True, otherwise the test may sporadically fail on a multi-
node cluster.

In the previous patch we fixed the full_query()/full_scan() convenience
functions. In this patch, I audited the calls to the boto3 read methods -
get_item(), batch_get_item(), query(), scan(), and although most of them
did use ConsistentRead=True as needed, I found some missing and this patch
fixes them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616080334.825893-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Nadav Har'El
c298088375 alternator test: use ConsistentRead=True for full_query/scan
Many of the Alternator tests use the convenience functions full_query()/
full_scan() to read from the table. Almost all these tests need to be able
to read their own writes, i.e., want ConsistentRead=True, but none of them
explicitly specified this parameter. Such tests may sporadically fail when
running on cluster with multiple nodes.

So this patch follows a TODO in the code, and makes ConsistentRead=True
the default for the full_*() functions. The caller can still override it
with ConsistentRead=False - and this is necessary in the GSI tests, because
ConsistentRead=True is not allowed in GSIs.

Note that while ConsistentRead=True is now the default for the full_*()
convenience functions, it is still not the default for the lower-level
boto3 functions scan(), query() and get_item() - so usages of those should
be evaluated as well and missing ConsistentRead=True, if any, should be
added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616073821.824784-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Raphael S. Carvalho
2f680b3458 size_tiered_backlog_tracker: Rename total_bytes
Reader can assume total_bytes and _total_bytes have the same meaning,
but they don't, so let's give the former a more descriptive name.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200616175055.16771-1-raphaelsc@scylladb.com>
2020-06-17 13:39:30 +03:00
Avi Kivity
d2ab6a24a1 Update seastar submodule
* seastar 8f0858cfd7...b515d63735 (2):
  > do_with: replace seastar::apply() calls with std::apply()
  > Merge "Resolve various http fixmes" from Piotr
2020-06-17 12:59:16 +03:00
Glauber Costa
1c70a7c54e upload: use custom error handler for upload directory
SSTables created for the upload directory should be using its custom error
handler.

There is one user of the custom error handler in tree, which is the current
upload directory function. As we will use a free function instead of a lambda
in our implementation we also use the opportunity to fix it for consistency.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-16 19:42:19 -04:00
Glauber Costa
c188aef088 sstable_directory: fix debug message
I just noticed while working on the reshape patches that there
is an extra format bracket in two of the debug messages. As they
are debug messages, I've seen them less often than the others, and that slipped through.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-16 19:42:19 -04:00
Nadav Har'El
ba59034402 merge: Use std::string_view in a few more apis
Merged patch series by Rafael Ávila de Espíndola:

The main advantage is that callers now don't have to construct
sstrings. It is also a 0.09% win in text size (from 41804308 to
41766484 bytes) and the tps reported by

perf_simple_query --duration 16 --smp 1  -m4G >> log 2>err

in 500 randomized runs goes up by 0.16% (from 162259 to 162517).

Rafael Ávila de Espíndola (3):
  service: Pass a std::string_view to client_state::set_keyspace
  cql3: Use a flat_hash_map in untyped_result_set_row
  cql3: Pass std::string_view to various untyped_result_set member
    functions

 cql3/untyped_result_set.hh | 30 ++++++++++++++++--------------
 service/client_state.hh    |  2 +-
 cql3/untyped_result_set.cc |  6 +++---
 service/client_state.cc    |  4 ++--
 4 files changed, 22 insertions(+), 20 deletions(-)
2020-06-16 20:31:36 +03:00
Avi Kivity
b608af870b dist: debian: do not require root during package build
Debian package builds provide a root environment for the installation
scripts, since that's what typical installation scripts expect. To
avoid providing actual root, a "fakeroot" system is used where syscalls
are intercepted and any effect that requires root (like chown) is emulated.

However, fakeroot sporadically fails for us, aborting the package build.
Since our install scripts don't really require root (when operating in
the --packaging mode), we can just tell dpkg-buildpackage that we don't
need fakeroot. This ought to fix the sporadic failures.

As a side effect, package builds are faster.

Fixes #6655.
2020-06-16 20:27:04 +03:00
Tomasz Grabiec
266e3f33d1 sstables: Add promoted index cache metrics 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
9885d0e806 position_in_partition: Introduce external_memory_usage() 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
58532cdf11 cached_file, sstables: Add tracing to index binary search and page cache 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
ecb6abe717 sstables: Dynamically adjust I/O size for index reads
Currently, index reader uses 128 KiB I/O size with read-ahead. That is
a waste of bandwidth if index entries contain large promoted index and
binary search will be used within the promoted index, which may not
need to access as much.

The read-ahead is wasted both when using binary search and when using
the scanning cursor.

On the other hand, large I/O is optimal if there is no promoted index
and we're going to parse the whole page.

There is no way to predict which case it is up front before reading
the index.

Attaching dynamic adjustments (per-sstable) lets the system auto adjust
to the workload from past history.

The large promoted index workload will settle on reading 32 KiB (with
read-ahead). This is still not optimal, we should lower the buffer
size even more. But that requires a seastar change, so is deferred.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
19501d9ef2 sstables, tests: Allow disabling binary search in promoted index from perf tests 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
c0ee997614 sstables: mc: Use binary search over the promoted index
Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This patch implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to the promoted index.
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
c95dd67d11 utils: Introduce cached_file
It is a read-through cache of a file.

Will be used to cache contents of the promoted index area from the
index file.

Currently, cached pages are evicted manually using the invalidate_*()
method family, or when the object is destroyed.

The cached_file represents a subset of the file. The reason for this
is to satisfy two requirements. One is that we have page-aligned
caching, where pages are aligned relative to the start of the
underlying file. This matches the requirements of the seastar I/O engine
on I/O requests.  Another requirement is to have an effective way to
populate the cache using an unaligned buffer which starts in the
middle of the file when we know that we won't need to access bytes
located before the buffer's position. See populate_front(). If we
couldn't assume that, we wouldn't be able to insert an unaligned
buffer into the cache.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
ab274b8203 sstables: clustered_index: Relax scope of validity of entry_info
entry_info holds views, which may get invalidated when the containing
index blocks are removed. The current implementation of next_entry() keeps
the blocks in memory as long as the cursor is alive, but that will
change in new implementations of the cursor.

Adjust the assumption of tests accordingly.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
ea2fbcc2cd sstables: index_entry: Introduce owning promoted_index_block_position 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
714da3c644 compound_compat: Allow constructing composite from a view 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
f2e52c433f sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
101fd613c5 sstables: mc: Extract parser for promoted index block
It will be reused in binary search over the index.
2020-06-16 16:15:14 +02:00
Tomasz Grabiec
a557c374fd sstables: mc: Extract parser for clustering out of the promoted index block parser
This parser will be used stand-alone when doing a binary search over
promoted index blocks. We will only parse the start key not the whole
block.
2020-06-16 16:14:31 +02:00
Tomasz Grabiec
95df7126a7 sstables: consumer: Extract primitive_consumer
This change extracts the parser for primitive types out of
continuous_data_consumer so that it can be used stand-alone
or embedded in other parsers.
2020-06-16 16:14:30 +02:00
Tomasz Grabiec
d5bf540079 sstables: Abstract the clustering index cursor behavior
In preparation for supporting more than one algorithm for lookups in
the promoted index, extract relevant logic out of the index_reader
(which is a partition index cursor).

The clustered index cursor implementation is now hidden behind
abstract interface called clustered_index_cursor.

The current implementation is put into the
scanning_clustered_index_cursor. It's mostly code movement with minor
adjustments.

In order to encapsulate iteration over promoted index entries,
clustered_index_cursor::next_entry() was introduced.

No change in behavior intended in this patch.
2020-06-16 16:14:17 +02:00
Tomasz Grabiec
a858f87b11 sstables: index_reader: Rearrange to reduce branching and optionals
No change in logic.

Will make it easier to make further refactoring.
2020-06-16 16:13:39 +02:00
Yaron Kaikov
ac7237f991 dbuild: Add an option to run with 'docker' or 'podman'
This adds support for configuring whether to run dbuild with 'docker' or
'podman' via a new environment variable, DBUILD_TOOL. While at it, check
if 'podman' exists, and prefer that by default as the tool for dbuild.
2020-06-16 15:18:46 +03:00
Gleb Natapov
7ca937778d cql transport: do not log broken pipe error when a client closes its side of a connection abruptly
Fixes #5661

Message-Id: <20200615075958.GL335449@scylladb.com>
2020-06-16 13:59:12 +02:00
Nadav Har'El
41a049d906 README: better explanation of dependencies and build
In this patch I rewrote the explanations in both README.md and HACKING.md
about Scylla's dependencies, and about dbuild.

README.md used to mention only dbuild. It now explains better (I think)
why dbuild is needed in the first place, and that the alternative is
explained in HACKING.md.

HACKING.md used to explain *only* install-dependencies.sh - and now explains
why it is needed, what install-dependencies.sh does, and that it ONLY works on
very recent distributions (e.g., Fedora releases older than 32 are not supported),
and now also mentions the alternative - dbuild.

Mentions of incorrect requirements (like "gcc > 8.1") were fixed or dropped.

Mention of the archaic 'scripts/scylla_current_repo' script, which we used
to need to install additional packages on non-Fedora systems, was dropped.
The script itself is also removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616100253.830139-1-nyh@scylladb.com>
2020-06-16 13:26:04 +02:00
Avi Kivity
bd794629f9 range: rename range template family to interval
nonwrapping_range<T> and related templates represent mathematical
intervals, and are different from C++ ranges. This causes confusion,
especially when C++ ranges and the range templates are used together.

As the first step to disentangle this, introduce a new interval.hh
header with the contents of the old range.hh header, renaming as
follows:

  range_bound  -> interval_bound
  nonwrapping_range -> nonwrapping_interval
  wrapping_range -> wrapping_interval
  Range -> Interval (concepts)

The range alias, which previously aliased wrapping_range, did
not get renamed - instead the interval alias now aliases
nonwrapping_interval, which is the natural interval type. I plan
to follow up making interval the template, and nonwrapping_interval
the alias (or perhaps even remove it).

To avoid churn, a new range.hh header is provided with the old names
as aliases (range, nonwrapping_range, wrapping_range, range_bound,
and Range) with the same meaning as their former selves.

Tests: unit (dev)
2020-06-16 13:36:20 +03:00
Piotr Sarna
3bcc2e8f09 Merge 'hinted handoff: improve segment replay logic' from PiotrD
This series contains two improvements to hint file replay logic
in hints manager:

- During replay of a hint file, keeping track of the first hint that fails
  to be sent is now done via a simple std::optional variable instead of an
  unordered_set. This slightly reduces complexity of next replay position
  calculation.
- A corner case is handled: if reading the commitlog fails but there is no
  error related to sending hints, the starting position wouldn't be updated. This
  could cause us to replay more hints than necessary.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)

* piodul-hints-manager-handle-commitlog-failure-in-replay-position-calculation:
  hinted handoff: use bool instead of send_state_set
  hinted handoff: update replay position on commitlog failure
  hinted handoff: remove rps_set, use first_failed_rp instead
2020-06-16 12:24:55 +02:00
Avi Kivity
6ba7b8f3f5 Update seastar submodule
* seastar 81242ccc3f...8f0858cfd7 (18):
  > Merge 'future, future-utils: stop returning a variadic future from when_all_succeed'
  > file: introduce layered_file_impl, a helper for layered files
  > net: packet: mark move assignment operator as noexcept
  > core: weak_ptr, weakly_referencable: implement empty default constructor
  > circular_buffer: Fix build with gcc 11 (avoid template parameters in d'tor declaration)
  > test: weak_ptr_test: fix static asserts about nothrow constructibility
  > coroutines: Fix clang build
  > cmake: Delete SEASTAR_COROUTINES_TS
  > Merge "future-util: Mark a few more functions as noexcept" from Rafael
  > tests: add a perf test to measure the fair_queue performance
  > Merge "iostream: make iostream stack nothrow move constructible" from Benny
  > future: Move most of rethrow_with_nested out of line.
  > future_test: Add test for nested exceptions in finally
  > core: Add noexcept to unaligned members functions
  > Merge "core: make weak_ptr and checked_ptr default and move nothrow constructible" from Benny
  > core: file: Fix typo in a comment
  > byteorder: Mark functions as noexcept
  > future: replace CanInvoke concepts with std::invocable
2020-06-16 13:19:36 +03:00
Piotr Sarna
e59d41dad6 alternator: use plain function pointer instead of std::function
Since all function handlers are plain functions without any state,
there's no need for wrapping them with a 32-byte std::function
when a plain function pointer would suffice.

Reported-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <913c1de7d02c252b40dc0c545989ec83fe74e5a9.1592291413.git.sarna@scylladb.com>
2020-06-16 12:08:21 +03:00
Raphael S. Carvalho
238ba899c0 compaction_manager: use double for backlog everywhere
Avi says:
"The backlog is a large number that changes slowly, so float
might not have enough resolution to track small changes.

For example, if the backlog is 800GB and changes less than 100kB, then
we won't see a change (float resolution is 2^23 ~ 1:8,000,000).

This is outside the normal range of values (usually the backlog changes
a lot more than 100kB per 15-second period), so it will work, but better
to be more careful."

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200615150621.17543-1-raphaelsc@scylladb.com>
2020-06-16 12:05:05 +03:00
Rafael Ávila de Espíndola
3e1307a6d1 cql3: Pass std::string_view to various untyped_result_set member functions
Taking a std::string_view is a bit more flexible.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:47:15 -07:00
Rafael Ávila de Espíndola
3a9b4e7d26 cql3: Use a flat_hash_map in untyped_result_set_row
No functionality changed. This just makes it possible to use
heterogeneous lookups, which the next patch will add.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:46:25 -07:00
Rafael Ávila de Espíndola
65d56095d0 service: Pass a std::string_view to client_state::set_keyspace
No change in the implementation since it was already copying the
string. Taking a std::string_view is just a bit more flexible.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:46:25 -07:00
Piotr Sarna
45bf039357 alternator: use has_function instead of try-catch
With the new interface available, the try-catch idiom
can be removed, thus resolving a TODO.

Tests: unit(dev)

Message-Id: <788a29f8f9d7bcf952b28a6148670dbadb97a619.1592233511.git.sarna@scylladb.com>
2020-06-15 23:55:20 +03:00
Piotr Sarna
911dee5417 schema: add has_column utility function
With this simple helper function, a code snippet in alternator
can be transformed from try-catch to a simple condition.

Message-Id: <553debf4e91c0511566e53e2c8a5e8e6ee6552e2.1592233511.git.sarna@scylladb.com>
2020-06-15 23:55:06 +03:00
Piotr Sarna
b1684cf2e1 alternator: move function handlers to a lookup map
Instead of a long chain of `if` statements, handlers are now
created in a static map.
Fixes a TODO in the code.

Tests: unit(dev)

Message-Id: <0ea577a44dd56859da170fe82c16c8f810f9d695.1592232448.git.sarna@scylladb.com>
2020-06-15 23:44:45 +03:00
Piotr Sarna
e76fba6f86 alternator: remove outdated TODO for adding timeouts
The TODO is already fixed, not to mention that it had
an incorrect ordinal number (:
Message-Id: <006dc3061e0f30641c2e63ff471686f4c2e82829.1592230155.git.sarna@scylladb.com>
2020-06-15 23:04:42 +03:00
Tomasz Grabiec
1c5db178dd Merge "logalloc: Get rid of segments migration" from Pavel
But not compaction.

When reclaiming segments back to seastar, non-empty segments are copied
as-is to some other place. Instead of doing this, the reclaimer can copy
only the allocated objects and leave the freed holes behind, i.e. -- do
regular compaction. This would be the same or better from the
timing perspective, and will help to avoid yet another compaction
pass over the same set of objects in the future.

The current migration code checks that the free segment reserve is
above the minimum before proceeding with migration, and so does the code
after this patch. Thus segment compaction is called with a non-empty set
of free segments and is guaranteed not to fail the new segment
allocation (if one is required at all).

Plus some bikeshedding patches for the run-up.

tests: unit(dev)

* https://github.com/xemul/scylla/tree/br-logalloc-compact-on-reclaim-2:
  logalloc: Compact segments on reclaim instead of migration
  logallog: Introduce RAII allocation lock
  logalloc: Shuffle code around region::impl::compact
  logalloc: Do not lock reclaimer twice
  logalloc: Do not calculate object size twice
  logalloc: Do not convert obj_desc to migrator back and forth
2020-06-15 16:28:16 +02:00
Glauber Costa
093328741d compaction: test that sstable set is not null in update_pending_ranges
SSTable_set is now an optional, and if we don't want to expire data
it will be empty. We need to check that it is not empty before dereferencing
it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200610170647.142817-1-glauber@scylladb.com>
2020-06-15 15:43:08 +02:00
Tomasz Grabiec
e81fc1f095 row_cache: Fix undefined behavior on key linearization
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).

There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:

 1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.

 2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.

Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:

  auto prev = std::prev(it);
  if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
     it->set_continuous(true);

but handle_end_of_stream() is not invoked under allocating section.

Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:

  return with_linearized_managed_bytes([&] {
  ...
    return _worker_state->alloc_section(region, [&] {

Fixes #6637.
Refs #6108.

Tests:

  - unit (all)

Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
2020-06-15 16:03:33 +03:00
Nadav Har'El
86a4dfcd29 merge: api: Command to check and repair cdc streams
Merged pull request https://github.com/scylladb/scylla/pull/6551
from Juliusz Stasiewicz:

The command regenerates streams when:

    generations corresponding to a gossiped timestamp cannot be
    fetched from system_distributed table,
    or when generation token ranges do not align with token metadata.

In such a case the streams are regenerated and a new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether streams needed regeneration or not.

Fixes #6498
Accompanied by: scylladb/scylla-jmx#109, scylladb/scylla-tools-java#172
2020-06-15 14:17:35 +03:00
Takuya ASADA
ecc83e83e5 scylla_cpuscaling_setup: move the unit file to /etc/systemd
Since scylla-cpupower.service isn't installed by the .rpm package, but created
by the setup script, it's better not to use the /usr/lib directory; use /etc instead.

We already do the same for scylla-server.service.d/*.conf, *.mount, and
*.swap files created by setup scripts.
2020-06-15 11:36:20 +03:00
Asias He
61e4387811 repair: Relax node selection in decommission for non network topology strategy
In decommission operation, current code requires a node in local dc to
sync data with. This requirement is too strong for a non network topology
strategy. For example, consider:

   n1 dc1
   n2 dc1
   n3 dc2

n2 runs decommission operation. For a keyspace with simple strategy and
RF = 2, it is possible n3 is the new owner but n3 is not in the same dc
as n2.

To fix, perform the dc check only for the network topology strategy.

Fixes #6564
2020-06-15 11:26:02 +03:00
Avi Kivity
d17b05e911 Merge 'Adding Optimized pseudo floating point estimated histogram' from Amnon
"
This series Adds a pseudo-floating-point histogram implementation.
The histogram is used for time_estimated_histogram a histogram for latency tracking and then used in storage_proxy as a more efficient with a higher resolution histogram.

Follow up series would use the new histogram in other places in the system and will add an implementation that supports lower values.
Fixes #5815
Fixes #4746
"

* amnonh-quicker_estimated_histogram:
  storage_proxy: use time_estimated_histogram for latencies
  test/boost/estimated_histogram_test
  utils/histogram_metrics_helper Adding histogram converter
  utils/estimated_histogram: Adding approx_exponential_histogram
2020-06-15 10:19:36 +03:00
Avi Kivity
493d16e800 build: fix --enable-dpdk/--disable-dpdk configure switch
5ceb20c439 switched --enable-dpdk
to a tristate switch, but forgot that add_tristate() prepends
--enable and --disable itself; so now the switch looks like
--enable-enable-dpdk and --disable-enable-dpdk.

Fix by removing the "enable-" prefix.
2020-06-15 09:37:45 +03:00
Amnon Heiman
6e1f042b93 storage_proxy: use time_estimated_histogram for latencies
This patch changes storage_proxy to use time_estimated_histogram.

Besides the type, it changes how values are inserted and how the
histogram is used by the API.

An example of how a metric looks after the change:
scylla_storage_proxy_coordinator_write_latency_bucket{le="640.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="896.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1024.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1280.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 0
scylla_storage_proxy_coordinator_write_latency_bucket{le="1792.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2048.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2
scylla_storage_proxy_coordinator_write_latency_bucket{le="2560.000000",scheduling_group_name="statement",shard="0",type="histogram"} 3
scylla_storage_proxy_coordinator_write_latency_bucket{le="3072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="3584.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5
scylla_storage_proxy_coordinator_write_latency_bucket{le="4096.000000",scheduling_group_name="statement",shard="0",type="histogram"} 7
scylla_storage_proxy_coordinator_write_latency_bucket{le="5120.000000",scheduling_group_name="statement",shard="0",type="histogram"} 8
scylla_storage_proxy_coordinator_write_latency_bucket{le="6144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9
scylla_storage_proxy_coordinator_write_latency_bucket{le="7168.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="8192.000000",scheduling_group_name="statement",shard="0",type="histogram"} 11
scylla_storage_proxy_coordinator_write_latency_bucket{le="10240.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19
scylla_storage_proxy_coordinator_write_latency_bucket{le="12288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 49
scylla_storage_proxy_coordinator_write_latency_bucket{le="14336.000000",scheduling_group_name="statement",shard="0",type="histogram"} 132
scylla_storage_proxy_coordinator_write_latency_bucket{le="16384.000000",scheduling_group_name="statement",shard="0",type="histogram"} 294
scylla_storage_proxy_coordinator_write_latency_bucket{le="20480.000000",scheduling_group_name="statement",shard="0",type="histogram"} 1035
scylla_storage_proxy_coordinator_write_latency_bucket{le="24576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 2790
scylla_storage_proxy_coordinator_write_latency_bucket{le="28672.000000",scheduling_group_name="statement",shard="0",type="histogram"} 5788
scylla_storage_proxy_coordinator_write_latency_bucket{le="32768.000000",scheduling_group_name="statement",shard="0",type="histogram"} 9815
scylla_storage_proxy_coordinator_write_latency_bucket{le="40960.000000",scheduling_group_name="statement",shard="0",type="histogram"} 19821
scylla_storage_proxy_coordinator_write_latency_bucket{le="49152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 30063
scylla_storage_proxy_coordinator_write_latency_bucket{le="57344.000000",scheduling_group_name="statement",shard="0",type="histogram"} 38642
scylla_storage_proxy_coordinator_write_latency_bucket{le="65536.000000",scheduling_group_name="statement",shard="0",type="histogram"} 44987
scylla_storage_proxy_coordinator_write_latency_bucket{le="81920.000000",scheduling_group_name="statement",shard="0",type="histogram"} 51821
scylla_storage_proxy_coordinator_write_latency_bucket{le="98304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 54197
scylla_storage_proxy_coordinator_write_latency_bucket{le="114688.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55054
scylla_storage_proxy_coordinator_write_latency_bucket{le="131072.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55363
scylla_storage_proxy_coordinator_write_latency_bucket{le="163840.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55520
scylla_storage_proxy_coordinator_write_latency_bucket{le="196608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55545
scylla_storage_proxy_coordinator_write_latency_bucket{le="229376.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="262144.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="327680.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="393216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="458752.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="524288.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="655360.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="786432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="917504.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1048576.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1310720.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1572864.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="1835008.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2097152.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="2621440.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3145728.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="3670016.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="4194304.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="5242880.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="6291456.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="7340032.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="8388608.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="10485760.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="12582912.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="14680064.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="16777216.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="20971520.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="25165824.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="29360128.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="33554432.000000",scheduling_group_name="statement",shard="0",type="histogram"} 55549
scylla_storage_proxy_coordinator_write_latency_bucket{le="+Inf",scheduling_group_name="statement",shard="0",type="histogram"} 55549
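
The bucket values above are cumulative: each "le" bucket counts all samples at or
below its upper bound, as Prometheus histograms require. A minimal sketch of
producing such buckets from raw per-bucket counts (hypothetical names; this is
not Scylla's actual converter):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy sketch: turn per-bucket counts into cumulative Prometheus-style
// "le" buckets by keeping a running sum, one entry per upper bound.
std::vector<std::pair<double, uint64_t>> to_cumulative_buckets(
        const std::vector<double>& upper_bounds,   // e.g. {640, 768, 896, ...}
        const std::vector<uint64_t>& counts) {     // raw count per bucket
    std::vector<std::pair<double, uint64_t>> out;
    uint64_t running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        running += counts[i];                      // cumulative sum
        out.emplace_back(upper_bounds[i], running);
    }
    return out;
}
```

The "+Inf" bucket is then simply the last running total, i.e. the total sample
count.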

Fixes #4746

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:23:02 +03:00
Amnon Heiman
1cbc2e3d3e test/boost/estimated_histogram_test
This patch adds basic testing for the approx_exponential_histogram
implementations.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:57 +03:00
Amnon Heiman
f30f926703 utils/histogram_metrics_helper Adding histogram converter
This patch adds a helper converter function to convert from a
approx_exponential_histogram histogram to a seastar::metrics::histogram

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:49 +03:00
Amnon Heiman
3319756f36 utils/estimated_histogram: Adding approx_exponential_histogram
This patch adds an efficient histogram implementation.
The implementation chooses efficiency over flexibility.
That is why templates are used.

How the approx_exponential_histogram pseudo-floating-point histogram
works: it splits the range [MIN, MAX] into log2(MAX/MIN) ranges, and
then splits each of those ranges linearly according to a given
resolution.

For example, using a resolution of 4 is similar to using an
exponentially growing histogram with a coefficient of 1.2.

All values are uint64. To prevent handling of corner cases, it is not
allowed to set the MIN to be lower than the resolution.

The approx_exponential_histogram will probably not be used directly;
its first user is time_estimated_histogram, a histogram for durations.

It should be compared to the estimated_histogram.

Performance comparison:
Comparison was done by inserting 2^20 values into
time_estimated_histogram and estimated_histogram.

In debug mode on a local machine, an insert operation took an average
of 26.0 nanoseconds vs 342.2 nanoseconds.

In release mode, an insert operation took an average of 1.90 vs 8.28
nanoseconds.

Fixes #5815

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:43 +03:00
Piotr Sarna
23c63ec19d Merge 'alternator: implement FilterExpression' from Nadav
The main goal of this series is to implement FilterExpression - the
newer syntax for filtering results of Query and Scan requests.

This feature itself is just one simple patch - it just needs to have the
already-existing filtering code call the already-existing expression
evaluation code. However, before we can do this, we need a patch to
refactor the expression-evaluation interface (this patch also fixes
pre-existing bugs). Then we need three additional patches to fix pre-
existing bugs in the various corner cases of expressions (this bugs
already existed in ConditionExpression but now became visible in
tests for FilerExpression). Finally, in the end of the series, we also
do a bit of code cleanup.

After this series, the FilterExpression feature is complete, and all
tests for this feature pass.

Tests: unit(dev)

* 'alternator-filterexpression' of git://github.com/nyh/scylla:
  alternator: avoid unnecessary conversion to string
  alternator: move some code out of executor.cc
  alternator: implement FilterExpression
  alternator: improve error path of attribute_type() function
  alternator: fix begins_with() error path
  alternator: fix corner case of contains() function in conditions
  alternator: refactor resolving of references in expressions
2020-06-14 19:42:46 +02:00
Avi Kivity
4220ed849b Merge "Use abseil's hash map in a couple places" from Rafael
"
This is part of the work for replacing global sstring variables with
constexpr std::string_view ones.

To have std::string_view values we have to convert a few APIs to take
std::string_view instead of sstring references.

The API conversions are complicated by the fact that
std::unordered_map doesn't support heterogeneous lookup, so we need
another hash map.

The one provided by abseil seems like a natural choice since it has an
API that looks like what is being proposed for c++
(http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1690r0.html)
but is also much faster.

A nice side effect is that this series is a 0.46% win in

perf_simple_query --duration 16 --smp 1  -m4G

Over 500 runs with randomized section layout and environment on each
run.
"

* 'espindola/absl-v10' of https://github.com/espindola/scylla:
  database: Use a flat_hash_map for _ks_cf_to_uuid
  database: Use flat_hash_map for _keyspaces
  Add absl wrapper headers
  build: Link with abseil
  cofigure: Don't overwrite seastar_cflags
  Add abseil as a submodule
2020-06-14 18:26:59 +03:00
Rafael Ávila de Espíndola
336d541f58 database: Use a flat_hash_map for _ks_cf_to_uuid
Given that the key is a std::pair, we have to explicitly mark the hash
and eq types as transparent for heterogeneous lookup to work.

With that, pass std::string_view to a few functions that just check if
a value is in the map.

This increases the .text section by 11 KiB (0.03%).

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
6da9eef25f database: Use flat_hash_map for _keyspaces
This changes the hash map used for _keyspaces. Using a flat_hash_map
allows using std::string_view in has_keyspace thanks to the
heterogeneous lookup support.

This adds 200 KiB to .text, since this is the first use of absl and
brings in files from the .a.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
dd0d4ae217 Add absl wrapper headers
Using these instead of the absl headers directly adds support
for heterogeneous lookup with sstring as key.

There is no gain from having the hash function inline, so this
implements it in a .cc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
7d1f6725dd build: Link with abseil
It is a pity we have to list so many libraries, but abseil doesn't
provide a .pc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
2ad09aefb6 cofigure: Don't overwrite seastar_cflags
The variable seastar_cflags was being used for flags passed to seastar
and for flags extracted from the seastar.pc file.

This introduces a new variable for the flags extracted from the
seastar.pc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Rafael Ávila de Espíndola
383a9c6da9 Add abseil as a submodule
This adds the https://abseil.io library as a submodule. The patch
series that follows needs a hash table that supports heterogeneous
lookup, and abseil has a really good hash table that supports that
(https://abseil.io/blog/20180927-swisstables).

The library is still not available in Fedora, but it is fairly easy to
use it directly from a submodule.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:37 -07:00
Avi Kivity
08313106ce Merge 'Repair use table id instead of table name' from Asias
"
Use table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.

Refs: #5942
"

* asias-repair_use_table_id_instead_of_table_name:
  repair: Do not pass table names to repair_info
  repair: Add table_id to row_level_repair
  repair: Use table id to find a table in get_sharder_for_tables
  repair: Add table_ids to repair_info
  repair: Make func in tracker::run run inside a thread
2020-06-14 14:58:46 +03:00
Raphael S. Carvalho
9983fa8766 compaction_manager: Export backlog metric
This backlog metric holds the sum of backlog for all the tables
in the system. This is very useful for understanding the behavior
of the backlog trackers. That's how we managed to fix most of the
backlog bugs, like #6054, #6021, etc.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200612194908.39909-1-raphaelsc@scylladb.com>
2020-06-14 14:07:53 +03:00
Avi Kivity
76d082c2b2 Merge "Decouple client services from storage_service" from Pavel E
"
The cql_server and thrift are "owned" by storage_service for
the sake of managing those, i.e. starting and stopping. Since
other services (still) need the storage_service this creates
dependencies loops.

This set makes the client services independent from the storage
service. As a consequence of it the auth service is also removed
from storage_service and put standalone. This, in turn, sets
some tests free from the need to start and stop auth and makes
one step towards NOT join_cluster()-ing in unit tests.

Also, the set fixes a few weird races on scylla start and stop
that can trigger local_is_initialized() asserts, and one case of
an unclear aborted shutdown where client services remain running
until the scylla process exits.

Yet another benefit is localization of "isolating" functionality
that sits deeper in storage_service than it should.

One thing that's not completely clean after it is the need for cql
server to continue referencing the service_memory_limiter semaphore
from the storage_service, but this will go away with one of the
next sets.

tests: unit(debug), manual start-stop,
       nodetool check of cql/thrift start/stop
"

* 'br-split-transport-1' of https://github.com/xemul/scylla:
  storage_service: Isolate isolator
  auth: Move away from storage_service
  auth: Move start-stop code into main
  main: Don't forget to stop cql/thrift when start is aborted
  thrift_controller: Switch on standalone
  thrift_controller: Pass one through management API
  thrift_controller: Move the code into thrift/
  thrift_controller: Introduce own lock for management
  thrift: Wrap start/stop/is_running code into a class
  cql_controller: Switch on standalone
  cql_controller: Pass one through management API
  cql_controller: Move the code into transport/
  cql_controller: Introduce own lock for management
  cql: Wrap start/stop/is_running code into a class
  api: Tune reg/unreg of client services control endpoints
2020-06-14 13:49:23 +03:00
Takuya ASADA
863293576c scylla_setup: add swapfile setup
Adding swapfile setup on scylla_setup.

Fixes #6539
2020-06-14 13:18:51 +03:00
Amnon Heiman
06510a4752 service/storage_service.cc: Make effective_ownership preemptable
A lot is going on when calculating effective ownership.
For each node in the cluster, we need to go over all the ranges belonging
to that node and see if that node is the owner or not.

This patch uses futurized loops with do_for_each so that it preempts
when needed.

The patch replaces the current for-loops with do_for_each and do_with
but keeps the logic.

Fixes #6380
2020-06-14 12:56:07 +03:00
Nadav Har'El
493d7e6716 alternator: avoid unnecessary conversion to string
In a couple of places where we already have a std::string_view, there
is no need to convert it to a std::string (which requires an allocation).

One cool observation (by Piotr Sarna) is that a map over std::string_view
is fine when the strings in the map are always string constants.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:26 +03:00
Nadav Har'El
8c026b9f10 alternator: move some code out of executor.cc
The source file alternator/executor.cc has grown too much, reaching almost
4,000 lines. In this patch I move about 400 lines out of executor.cc:

1. Some functions related to serialization of sets and lists were moved to
   serialization.cc,
2. Functions related to evaluating parsed expressions were moved to
   expressions.cc.

The header file expressions_eval.hh was also removed - the calculate_value()
functions now live in expressions.cc, so we can just define them in
expressions.hh; no need for a separate header file.

This patch just moves code around. It doesn't make any functional changes.

Refs #5783.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:26 +03:00
Nadav Har'El
0b9f25ab50 alternator: implement FilterExpression
This patch provides a complete implementation for the FilterExpression
parameter - the newer syntax for filtering the results of the Query or
Scan operations.

The implementation is pretty straightforward - we already added earlier
a result-filtering framework to Alternator, and used it for the older
filtering syntax - QuryFilter and ScanFilter. All we had to do now was
to run the FilterExpression (which has the same syntax as a
ConditionExpression) on each individual items. The previous cleanup
patches were important to reduce the friction of running these expressions
on the items.

After the previous patches fixing small esoteric bugs in a few expression
functions, with this patch *all* the tests in test_filter_expression.py
now pass, and so do the two FilterExpression tests in test_query.py and
test_scan.py. As far as I know (and of course minus any bugs we'll discover
later), this marks the FilterExpression feature complete.

Fixes #5038.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:26 +03:00
Nadav Har'El
f87259a762 alternator: improve error path of attribute_type() function
The attribute_type() function, which can be used in expressions like
ConditionExpression and FilterExpression, is supposed to generate an
error if its second parameter is not one of the known types. What we
did until now was to just report a failed check in this case.

We already had a reproducing test with FilterExpression, but in this patch
we also add a test with ConditionExpression - which fails before this
patch and passes afterwards (and of course, passes with DynamoDB).

Fixes #6641.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:16:20 +03:00
Nadav Har'El
11d86dfb06 alternator: fix begins_with() error path
The begins_with() function should report an error if a constant is
passed to it which isn't one of the supported types - string or bytes
(e.g., a number).

The code we had for checking this had the wrong logic, though. If the item
attribute was also a number, we silently returned false, and didn't
go on to detect that the second parameter - a constant - was a number
too and should generate an error - not be silent.

Fixed and added a reproducing test case and another test to validate
my understanding of the type of parameters that begins_with() accepts.

Fixes #6640.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:13:23 +03:00
Nadav Har'El
f79a4e0e78 alternator: fix corner case of contains() function in conditions
It turns out that the contains() function in the new syntax of
conditions (ConditionExpression, FilterExpression) is not identical
to the CONTAINS operator in the old-syntax conditions (Expected).

In the new syntax, one can check whether *any* constant object is contained
in a list. In the old syntax, the constant object must be of specific
types.

So we need to move this testing out of the check_CONTAINS() function
that both implementations used, and into just the implementation of
the old syntax (in conditions.cc).

This bug broke one of the FilterExpression tests, but this patch also
adds new tests for the different behaviour of ConditionExpression and
Expected - tests which also reproduce this issue and verify its fix.

Fixes #6639.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 12:02:14 +03:00
Nadav Har'El
13ef31f38b alternator: refactor resolving of references in expressions
In the DynamoDB API, expressions (e.g., ConditionExpression and many more)
may contain references to column names ("#name") or to values (":val")
given in a separate part of the request - ExpressionAttributeNames and
ExpressionAttributeValues respectively.

Before this patch, we resolved these references as part of the expression's
evaluation. This approach had two downsides:

1. It often misdiagnosed (both false negatives and false positives) cases
   of unused names and values in expressions. We already had two xfailing
   tests with examples - which pass after this patch. This patch also
   adds two additional tests, which failed before this patch and pass
   with it.

2. In one of the following patches we will add support for FilterExpression,
   where the same expression is used repeatedly on many items. It is a waste
(and makes the code uglier) to resolve the same references again
   and again each time the expression is evaluated. We should be able
   to do it just once.

So this patch introduces an intermediate step between parsing and evaluating
an expression - "resolving" the expression. The new resolve_*() functions
modify the already parsed expression, replacing references to attribute
names and constant values by the actual names and values taken from the
request. The resolve_*() functions also keep track which references were
used, making it very easy to check (as DynamoDB does) if there are any
unused names or values, before starting the evaluation.

The interface of the evaluate() functions becomes much simpler - they no longer
need to know the original request (which was previously needed for
ExpressionAttributeNames/Values) or the table's schema (which was previously
needed only for some error checking), or keep track of which references were
used. This simplification is helpful for using the expressions in contexts
where these things (request and schema) are no longer conveniently available,
namely in FilterExpression.

A small side-benefit of this patch is that it moves a bit of code, which
handled resolving of references in expressions, from executor.cc to
expressions.cc. This is just the first step in a bigger effort to
reduce the size of executor.cc by moving code to smaller source files.
There is no attempt in this patch to move as much code as we can.
We will move more code in a separate patch in this series.

Fixes #6572.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-06-14 11:57:13 +03:00
Glauber Costa
b0a0c207c3 twcs: move implementations to its own file
LCS and STCS already have their own files, reducing the clutter in
compaction_strategy.cc. Do the same for TWCS. I am doing this in
preparation to add more functions.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200611230906.409023-6-glauber@scylladb.com>
2020-06-14 11:50:08 +03:00
Pavel Emelyanov
514a1580da storage_service: Isolate isolator
There is a code that isolates a node on disk error. After all the previous
changes this code can be collected in one place (better to move it away from
storage_service at all, but still).

This simplifies the stop_transport(): now it can avoid rescheduling itself
on shard 0 for the 2nd time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
60e283b23e auth: Move away from storage_service
Now after the auth start/stop is standalone, we can remove
reference from storage service to it. This frees some tests
from the need to carry the auth service around for nothing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
6a46721fb7 auth: Move start-stop code into main
The auth service management is currently sitting in storage
service, but it was needed there just for cql/thrift start
code. After the latter has been moved away, there are no
other reasons for the auth to be integrated with the storage
service, so move it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
3eaf6b3ec7 main: Don't forget to stop cql/thrift when start is aborted
The defer action for stopping the storage_service is registered
very late, after the cql and thrift started. If an error happens
in between, these client-shutdown hooks will not be called.

This is a problem with the hooks, but fixing it in hooks place
is a big rework, so for now put fuses for cql and thrift
individually -- both their stopping codes are re-entrable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
a1df24621c thrift_controller: Switch on standalone
Remove the on-storage_service instance and make everybody use
the standalone one.

Stopping the thrift is done by registering the controller in
client service shutdown hooks. This automatically wires the
stopping into drain, decommission and isolation codes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
c26943e7b5 thrift_controller: Pass one through management API
The goal is to make the relevant endpoints work on standalone
thrift controller instead of the storage_service's one, so
prepare this controller (dummy for now) and pass it all the
way down the API code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
3786bc40ec thrift_controller: Move the code into thrift/
Pure moving, no functional changes. Also fix the
indentation left unclean two patches back.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:18 +03:00
Pavel Emelyanov
98ccf9bccb thrift_controller: Introduce own lock for management
Currently start/stop of thrift is protected with storage_service's
run_with_api_lock, but this protection is purely needed to
guard start and stop against each other, not from anything else.

For the sake of thrift management isolation it's worth having its own
start-stop lock. This also decouples thrift code from storage_service's
"isolated" thing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:18 +03:00
Pavel Emelyanov
1dfcd63d34 thrift: Wrap start/stop/is_running code into a class
The plan is to decouple thrift management code from
storage_service and move into thrift/ directory, so
prepare for that by introducing a controller class.

This leaves some unclean indentation in start/stop helpers
to reduce the churn, it will be fixed two patches ahead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:09 +03:00
Pavel Emelyanov
1d5cdfe3c6 cql_controller: Switch on standalone
Remove the on-storage_service instance and make everybody use
the standalone one.

Stopping the server is done by registering the controller in
client service shutdown hooks. This automatically wires the
stopping into drain, decommission and isolation codes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:09 +03:00
Pavel Emelyanov
7ebe44f33d cql_controller: Pass one through management API
The goal is to make the relevant endpoints work on standalone
cql controller instead of the storage_service's one, so
prepare this controller (dummy for now) and pass it all the
way down the API code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:09 +03:00
Pavel Emelyanov
f048f3434f cql_controller: Move the code into transport/
Pure moving, no functional changes. Also fix the
indentation left unclean two patches back.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:13:41 +03:00
Pavel Emelyanov
2282a27f26 cql_controller: Introduce own lock for management
Currently start/stop of cql is protected with storage_service's
run_with_api_lock, but this protection is purely needed to
guard start and stop against each other, not from anything else.

For the sake of cql server isolation it's worth having its own
start-stop lock. This also decouples cql code from storage_service's
"isolated" thing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:13:41 +03:00
Pavel Emelyanov
7de23f44d2 cql: Wrap start/stop/is_running code into a class
The plan is to decouple cql server management code from
storage_service and move into transport/ directory, so
prepare for that by introducing a controller class.

This leaves some unclean indentation in start/stop helpers
to reduce the churn, it will be fixed two patches ahead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:12:19 +03:00
Pavel Emelyanov
6a89c987e4 api: Tune reg/unreg of client services control endpoints
Currently, API endpoints to start and stop cql_server and thrift
are registered right after the storage service is started, but
much earlier than those services are. In between these two
points a lot of other stuff gets initialized. This opens a small
window during which cql_server and thrift can be started by
hand too early.

The most obvious problem is -- the storage_service::join_cluster()
may not yet be called, the auth service is thus not started, but
starting cql/thrift needs auth.

Another problem is that those endpoints are not unregistered on stop,
thus creating another way to start cql/thrift at the wrong time.

Also the endpoints registration change helps further patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 18:47:24 +03:00
Piotr Dulikowski
e5b2218ad4 hinted handoff: use bool instead of send_state_set
After restart_segment was removed from send_state enum, send_state_set
now has only one possible element: segment_replay_failed.

This patch removes send_state_set and uses a bool in its place.
2020-06-12 16:10:20 +02:00
Piotr Dulikowski
6b34bb1a43 hinted handoff: update replay position on commitlog failure
Hints manager uses commitlog framework to store and replay hints.

The commitlog::read_log_file function is used for replaying hints. It
reads commitlog entries and passes them to a callback. In case of hints
manager, the callback calls manager::send_one_hint function.

In case something goes wrong during this process, sending of that file
is attempted again later. If the error was caused by hints that failed
to be sent (e.g. due to network error), then we also advance
_last_not_complete_rp field to the position of the first hint that
failed. In the next retry, we will start reading from the commitlog from
that position.

However, the current logic does not account for the case when an error
occurs in the commitlog::read_log_file function itself. If,
coincidentally, all hints sent by send_one_hint succeed, then we won't
advance the _last_not_complete_rp field and we may unnecessarily resend
some of the hints that already succeeded.

This patch adds the send_one_file_ctx::last_sent_rp field, which keeps
track of the last commitlog position for which a hint was attempted to
be sent. In case read_log_file throws an error but all send_one_hint
calls succeed, then it will be used to update _last_not_complete_rp.
This will reduce the number of hints that are resent in this case to
only one.
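The retry bookkeeping described above can be sketched roughly like this (the simplified replay_position type and the next_start helper are illustrative assumptions, not the real hints-manager code):

```cpp
#include <optional>

// Simplified stand-in for db::replay_position (assumption; the real
// type also carries a segment id and shard).
struct replay_position {
    unsigned long pos = 0;
};

struct send_one_file_ctx {
    std::optional<replay_position> first_failed_rp; // first hint that failed to send
    std::optional<replay_position> last_sent_rp;    // last position a send was attempted for
};

// Decide where the next retry should resume from, mirroring the logic
// described above: prefer the first failed hint; if all sends succeeded
// but reading the file itself failed, resume right after the last
// attempted position instead of re-reading the whole file.
replay_position next_start(const send_one_file_ctx& ctx, bool read_failed,
                           replay_position file_start) {
    if (ctx.first_failed_rp) {
        return *ctx.first_failed_rp;
    }
    if (read_failed && ctx.last_sent_rp) {
        return replay_position{ctx.last_sent_rp->pos + 1};
    }
    return file_start;
}
```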

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
2020-06-12 16:10:20 +02:00
Piotr Dulikowski
d369b538f0 hinted handoff: remove rps_set, use first_failed_rp instead
When sending hints from one file, rps_set is used to keep track of
positions of hints that are currently sent. If sending of a hint fails,
its position is not removed from rps_set. If some hints fail to be sent
while handling a hints file, the lowest position from rps_set is used
to calculate the position from where to start when sending of the file
is retried.

Keeping track of commitlog positions this way isn't necessary to
calculate this starting position. This patch removes rps_set and replaces
it with first_failed_rp - just a single
std::optional<db::replay_position>, updated whenever a hint
send failure is detected.

This simplifies the calculation of the starting position for the next
retry, and allowed removing some error-handling logic for the edge case
where inserting into rps_set fails.
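The set-to-optional replacement can be sketched as follows (the retry_tracker type and simplified replay_position are hypothetical, for illustration only): instead of inserting every in-flight position into a set and later taking its minimum, only the smallest failed position is kept.

```cpp
#include <optional>

// Simplified stand-in for db::replay_position (assumption).
struct replay_position {
    unsigned long pos = 0;
};

// Sketch of the first_failed_rp idea: a single optional that tracks the
// minimum failed position, replacing a whole set of positions.
struct retry_tracker {
    std::optional<replay_position> first_failed_rp;

    void on_send_failed(replay_position rp) {
        if (!first_failed_rp || rp.pos < first_failed_rp->pos) {
            first_failed_rp = rp;
        }
    }
};
```

Unlike inserting into a set, updating the optional cannot fail with an allocation error, which is what removes the edge-case handling mentioned above.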

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
2020-06-12 16:10:19 +02:00
Botond Dénes
218b7d5b85 docs/debugging.md: expand section about troubleshooting thread debugging
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200612065604.215204-1-bdenes@scylladb.com>
2020-06-12 09:54:02 +02:00
Avi Kivity
4e79296090 tracked_file_impl: inherit disk and memory alignment from underlying file
tracked_file_impl is a wrapper around another file that tracks
memory allocated for buffers in order to control memory consumption.

However, it neglects to inherit the disk and memory alignment settings
from the wrapped file, which can cause unnecessarily-large buffers
to be read from disk, reducing throughput.

Fix by copying the alignment parameters.
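The fix amounts to forwarding the wrapped file's alignment parameters; a minimal sketch, with a hypothetical simplified file interface standing in for seastar's file_impl:

```cpp
#include <cstddef>

// Hypothetical, trimmed-down file interface (the real code wraps
// seastar::file_impl, which exposes these alignment accessors).
struct file_intf {
    virtual ~file_intf() = default;
    virtual size_t disk_read_dma_alignment() const = 0;
    virtual size_t memory_dma_alignment() const = 0;
};

// A concrete file with fixed alignments, used here only for illustration.
struct fixed_file : file_intf {
    size_t d, m;
    fixed_file(size_t d_, size_t m_) : d(d_), m(m_) {}
    size_t disk_read_dma_alignment() const override { return d; }
    size_t memory_dma_alignment() const override { return m; }
};

class tracking_file : public file_intf {
    file_intf& _wrapped;
    size_t _disk_alignment;
    size_t _memory_alignment;
public:
    // The fix described above: copy the wrapped file's alignment
    // parameters instead of falling back to conservative defaults,
    // so reads are not padded into unnecessarily large buffers.
    explicit tracking_file(file_intf& wrapped)
        : _wrapped(wrapped)
        , _disk_alignment(wrapped.disk_read_dma_alignment())
        , _memory_alignment(wrapped.memory_dma_alignment()) {}
    size_t disk_read_dma_alignment() const override { return _disk_alignment; }
    size_t memory_dma_alignment() const override { return _memory_alignment; }
};
```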

Fixes #6290.
2020-06-11 17:43:50 +03:00
Avi Kivity
5ceb20c439 build: default enable dpdk in release mode
To reduce special cases for the build bots, default dpdk to enabled
in release mode, keeping it disabled for debug and dev.

To allow release modes without dpdk to be built, the --enable-dpdk
switch is converted to a tri-state. When disabled, dpdk is disabled
across all modes. Similarly when enabled the effect is global. When
unspecified, dpdk is enabled for release mode only.

After this change, reloc/build_reloc.sh no longer needs to specify
--enable-dpdk, so remove it.
2020-06-11 17:24:16 +03:00
Avi Kivity
0dc78d38f1 build: remove zstd submodule
Now that Fedora provides the zstd static library, we can remove the
submodule.

The frozen toolchain is regenerated to include the new package.
2020-06-11 17:12:49 +03:00
Eliran Sinvani
14520e843a messaging service: fix execution order in messaging_service constructor
The messaging service constructor's body does two main things in this
order:
1. it registers the CLIENT_ID verb with rpc.
2. it initializes the scheduling mechanism in charge of locating the
right scheduling group for each verb.

The registration function uses the scheduling mechanism to determine
the scheduling group for the verb.
This commit simply reverses the order of execution.

Fixes #6628
2020-06-11 12:14:10 +03:00
Raphael S. Carvalho
72ae76fb09 compaction: Fix a potential source of stalls for run-based strategies
When compaction A completes, a request is issued so that all parallel compactions
will replace compaction A's input sstables by respective output sstables, in the
SSTable set snapshot used for expiration purposes.
That's done to allow the space of input SSTables to be released as soon as
possible, which helps incremental compaction a lot, but also the
non-incremental approach.

Recently I came to the realization that we're copying the SSTable set when
doing the replacement, to make the code exception safe, but it turns out that
if an exception is triggered, the compaction will fail anyway. So this copy
is useless and a potential source of reactor stalls when strategies like
LCS are used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200608192614.9354-1-raphaelsc@scylladb.com>
2020-06-10 18:44:44 +03:00
Avi Kivity
9afd599d7c Merge 'range_streamer: Handle table of RF 1 in get_range_fetch_map' from Asias
"
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:

```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is that the system_auth keyspace has a default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.

In the past, the replace operation with a keyspace of RF 1 passed, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```

Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.

After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```

Since 127.0.0.5 was dead, the source node check failed, and so did the
bootstrap operation.

To fix this, we ignore tables with RF 1 when it is impossible to find a
source node to stream from.

Fixes #6351
"

* asias-fix_bootstrap_with_rf_one_in_range_streamer:
  range_streamer: Handle table of RF 1 in get_range_fetch_map
  streaming: Use separate streaming reason for replace operation
2020-06-10 16:03:13 +03:00
Rafael Ávila de Espíndola
555d8fe520 build: Be consistent about system versus regular headers
We were not consistent about using '#include "foo.hh"' instead of
'#include <foo.hh>' for scylla's own headers. This patch fixes that
inconsistency and, to enforce it, changes the build to use -iquote
instead of -I to find those headers.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200608214208.110216-1-espindola@scylladb.com>
2020-06-10 15:49:51 +03:00
Gleb Natapov
d5b0cf975a cql transport: get rid of unneeded shared_ptr
There is no point in holding prepared_metadata in result_message::prepared
as a shared_ptr since their lifetimes match.

Message-Id: <20200610113217.GF335449@scylladb.com>
2020-06-10 15:48:40 +03:00
Nadav Har'El
65d3e3992f alternator test: small fixes for test_key_condition_expression_multi
The test test_key_condition_expression_multi() had a small typo, which
was hidden by the fact that the request was expected to fail for other
reasons, but nevertheless should be fixed.

Moreover, it appears that the Amazon DynamoDB changed their error message
for this case, so running the test with "--aws" failed. So this patch
makes it work again by being more forgiving on the exact error message.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200609205628.562351-1-nyh@scylladb.com>
2020-06-10 07:34:20 +02:00
Nadav Har'El
0c460927bf alternator: cleanup - don't use unique_ptr when not needed
In the existing Alternator code, we used std::unique_ptr<rjson::value> for
passing the optional old value of an item read for a RMW operation.
The benfit of this type over the simpler "const rjson::value*" is that it
gives the callee ownership of the item, and thus the ability to move parts
of it into the response without copying them. We only used this ability in a
handful of obscure cases involving ReturnedValues, but I am not going to
break this dubious feature in this patch.

Nevertheless, a lot of internal code, like condition checks, just needs
read-only access to that previous item, so we passed a reference to the
unique_ptr, i.e., "const std::unique_ptr<rjson::value>&". This is ugly,
and also forces new code that wants to use the same condition checks (i.e.,
filtering code), to artificially allocate a unique_ptr just because that
is what these functions expect.

So in this patch, we change the utility functions such as
verify_condition_expression() and everything they use, to pass around a
"const rjson::value*" instead of a "const std::unique_ptr<rjson::value>&".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604131352.436506-1-nyh@scylladb.com>
2020-06-10 07:33:31 +02:00
Takuya ASADA
5bdd09d08a supervisor: drop unused Upstart code, always use libsystemd
Since we don't support Ubuntu 14.04 anymore, we can drop the Upstart-related
code from supervisor.[cc|hh].
Also, "#ifdef HAVE_LIBSYSTEMD" was for compiling Scylla on older distributions
which do not provide libsystemd; we no longer need this since we always build
Scylla on the latest Fedora.
Dropping HAVE_LIBSYSTEMD also means removing libsystemd from optional_packages
in configure.py, making it a required library.

Note that we may still run Scylla without systemd, such as in our Docker image,
but sd_notify() does nothing when systemd is not detected, so we can ignore
that case.
Reference: https://www.freedesktop.org/software/systemd/man/sd_notify.html
Reference: https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-daemon/sd-daemon.c
2020-06-10 08:17:35 +03:00
Takuya ASADA
06bcbfc4c3 scylla_cpuscaling_setup: support Amazon Linux 2
Amazon Linux 2 has /usr/bin/cpupower, but unlike CentOS 7 it does not have
cpupower.service.
We need to provide the .service file when the distribution is Amazon Linux 2.

Fixes #5977
2020-06-10 08:12:53 +03:00
Dejan Mircevski
9027b6636f Use sstring_view in execute_cql and assertions
This lets the functions operate on a wider variety of arguments and
may also be faster.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-10 08:10:43 +03:00
Takuya ASADA
4c9369e449 scylla_bootparam_setup: support Amazon Linux 2
CentOS 7 uses GRUB_CMDLINE_LINUX in /etc/default/grub, but Amazon Linux 2 only
has GRUB_CMDLINE_LINUX_DEFAULT, so we need to support both.
2020-06-10 08:05:12 +03:00
Raphael S. Carvalho
8663824589 sstable_directory: fix off-by-one when calculating number of jobs
The number of jobs can be off by one if the number of sstables is divisible
by the max threshold (max_sstables_per_job), which results in one extra,
unneeded resharding job.
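The corrected count is plain ceiling division; a minimal sketch (num_jobs is an illustrative name, not the actual function):

```cpp
#include <cstddef>

// Number of resharding jobs needed to process num_sstables sstables,
// at most max_sstables_per_job per job. The buggy form
// `num_sstables / max_sstables_per_job + 1` yields one extra job
// whenever num_sstables is exactly divisible by the threshold.
size_t num_jobs(size_t num_sstables, size_t max_sstables_per_job) {
    return (num_sstables + max_sstables_per_job - 1) / max_sstables_per_job;
}
```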

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200609163430.14155-1-raphaelsc@scylladb.com>
2020-06-09 19:36:40 +03:00
Asias He
a521c429e1 streaming: Do not send end of stream in case of error
Currently the sender sends stream_mutation_fragments_cmd::end_of_stream to
the receiver when an error is received from a peer node. To be safe, send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to prevent end_of_stream from
being written into the sstable when a partition is not closed yet.

In addition, use mutation_fragment_stream_validator to validate the
mutation fragments emitted from the reader, e.g., check if
partition_start and partition_end are paired when the reader is done. If
not, fail the stream session and send
stream_mutation_fragments_cmd::error instead of
stream_mutation_fragments_cmd::end_of_stream to isolate the problematic
sstables on the sender node.

Refs: #6478
2020-06-09 18:46:12 +03:00
Glauber Costa
4025b22d13 distributed_loader: remove self-move assignment
By mistake I ended up spilling the lambda capture idiom of x = std::move(x)
into the function parameter list, which is invalid.

Fix it.
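A small sketch of the distinction (the make_reader helper is hypothetical): the `x = std::move(x)` spelling is valid only inside a lambda capture list, not in a parameter list.

```cpp
#include <memory>
#include <utility>

// The lambda-capture idiom the commit refers to: moving a variable into
// a closure with `x = std::move(x)` is valid *inside a capture list*.
// Writing the same thing in a function parameter list, e.g.
//     void f(std::unique_ptr<int> x = std::move(x));   // invalid
// is ill-formed, since the parameter would be initialized from itself.
auto make_reader(std::unique_ptr<int> x) {
    return [x = std::move(x)] { return *x; };
}
```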

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200609141608.103665-1-glauber@scylladb.com>
2020-06-09 17:22:57 +03:00
Avi Kivity
94634d9945 Merge "Reshard SSTables before moving them from upload directory" from Glauber
"
This series allows for resharding SSTables (if needed) before SSTables are
moved from the upload directory, instead of after.

The infrastructure is supposed to be used soon to also load SSTables at boot.
That, however, will take a bit longer as we need to reshape resharded SSTables
for maximum benefit. That should benefit the upload directory as well, however
the current series already presents high incremental value for upload directory
and could be merged sooner (so I can focus on reshaping).

For now, this series still keeps the actual moving from the upload directory
to the main directory untouched. Once reshaping is ready, it will take
care of this too.

A new file with tests is introduced that tests the process of reading
SSTables from an existing directory.

dtests executed: migration_test.py (--smp 4), which previously failed
"

* 'upload-reshard-v8.1' of github.com:glommer/scylla:
  load_new_sstables: reshard before scanning the upload directory
  distributed_load: initial handling of off-strategy SSTables
  remove manifest_file filter from table.
  sstables: move open-related structures to their own file.
  sstables: store data size in foreign_sstable_open_info
  compaction: split compaction.hh header
2020-06-09 17:06:22 +03:00
Glauber Costa
8021d12371 load_new_sstables: reshard before scanning the upload directory
In a later patch we will be able to move files directly from upload
into the main directory. However for now, for the benefit of doing
this incrementally, we will first reshard in place with our new
reshard infrastructure.

load_new_sstables can then move the SSTables directly, without having
to worry about resharding. This has the immediate benefit that the
resharding happens:

- in the streaming group, without affecting compaction work
- without waiting for the current locks (which are held by compactions)
  in load_new_sstables to release.

We could, at this point, just move the SSTables to the main directory
right away.

I am not doing this in this patch, and opting to keep the rest of upload
process unchanged. This will be fixed later when we enable offstrategy
compactions: we'll then compact the SSTables generated into the main
directory.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-09 09:02:35 -04:00
Pekka Enberg
a41258b116 dist/ami: Remove obsolete AMI files
The Scylla AMI has moved to the scylla-machine-image.git repository so
let's remove the obsolete files from scylla.git.

Suggested-by: Konstantin Osipov <kostja@scylladb.com>
Acked-by: Bentsi Magidovich <bentsi@scylladb.com>
Message-Id: <20200609105424.30237-1-penberg@scylladb.com>
2020-06-09 13:55:41 +03:00
Calle Wilund
5105e9f5e1 cdc::log: Missing "preimage" check in row deletion pre-image
Fixes #6561

Pre-image generation in the row deletion case only checked if we had a
pre-image result set row. But that row can come from the post-image. Also
check the actual existence of the pre-image CK.
Message-Id: <20200608132804.23541-1-calle@scylladb.com>
2020-06-09 10:56:41 +03:00
Glauber Costa
aebd965f0e distributed_load: initial handling of off-strategy SSTables
Off-strategy SSTables are SSTables that do not conform to the invariants
that the compaction strategies define. Examples of off-strategy SSTables
are SSTables acquired during bootstrap, produced by resharding when the cpu
count changes, or imported from other databases through our upload directory.

This patch introduces a new class, sstable_directory, that will
handle SSTables that are present in a directory that is not one of the
directories where the table expects its SSTables.

There is much to be done to support off-strategy compactions fully. To
make sure we make incremental progress, this patch implements enough
code to handle resharding of SSTables in the upload directory. SSTables
are resharded in place, before we start accessing the files.

Later, we will take other steps before we finally move the SSTables into
the main directory. But for now, starting with resharding will not only
allow us to start small, but it will also allow us to start unleashing
much needed cleanups in many places. For instance, once we start
resharding on boot before making the SSTables available, we will be able
to expunge all places in Scylla where, during normal operations, we have
extra handler code for the fact that SSTables could be shared.

Tests: a new test is added and it passes in debug mode.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-08 16:06:00 -04:00
Glauber Costa
e48ad3dc23 remove manifest_file filter from table.
When we are scanning an sstable directory, we want to filter out the
manifest file in most situations. The table class has a filter for that,
but it is a static filter that doesn't depend on table for anything. We
are better off removing it and putting it in another, independent location.

While it seems wasteful to use a new header just for that, this header
will soon be populated with the sstable_directory class.

Tests: unit (dev)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-08 16:06:00 -04:00
Glauber Costa
fd89e9f740 sstables: move open-related structures to their own file.
sstables/sstables.hh is one of our heaviest headers and it's better that we don't
include it if possible. For some users, like distributed_loader, we are mostly
interested in knowing the shape of structures used to open an SSTable.

They are:
- the entry_descriptor, representing an SSTable that we are scanning on-disk
- the sstable_open_info, representing information about a local, opened SSTable
- the foreign_sstable_open_info, representing information about an opened SSTable
  that can cross shard boundaries.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-08 16:06:00 -04:00
Glauber Costa
8698221dd2 sstables: store data size in foreign_sstable_open_info
In the new version of resharding we'll want to spread SSTables around
the many shards based on their total size. This means we also need to
know the size of each SSTable individually.

We could wrap foreign_sstable_open_info in another structure that
keeps track of that, but because this structure exists mostly for
resharding purposes anyway, we will just add the data_size to it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-08 16:06:00 -04:00
Glauber Costa
3972628fc0 compaction: split compaction.hh header
compaction.hh is one of our heavy headers, but some users just want to
use the information in it about how to describe a compaction, not how to
perform one.

For that reason this patch splits the compaction_descriptor into a new
header.

The compaction_descriptor has, as a member type, compaction_options.
That is moved too, and brings with it the compaction_type. Both of those
structures would make sense in a separate header anyway.

The compaction_descriptor also wants the creator_fn and replacer_fn
functions. We also take this opportunity to rename them to something
more descriptive.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-08 16:06:00 -04:00
Piotr Sarna
2746a3597f Update seastar submodule
* seastar 42e77050...81242ccc (7):
  > demos: coroutine_demo: fix for SEASTAR_API_LEVEL >= 3
  > core: Avoid warning on disable_backtrace_temporarily::_old being unused
  > future: Add a couple of friend declarations
  > Merge "net: make socket stack nothrow move constructible" from Benny
  > reactor: Avoid declaring _Unwind_RaiseException
  > future-util: Delete SEASTAR__WAIT_ALL__AVOID_ALLOCATION_WHEN_ALL_READY
  > file: io_priority_class: specify constructor as noexcept
2020-06-08 19:38:28 +02:00
Takuya ASADA
1e2509ffec dist/offline_installer/debian: fix umask error
Same as on redhat, the makeself script changes the current umask, so
scylla_setup fails with a "scylla does not work with current umask setting
(0077)" error.
To fix that we need to use the latest version of makeself and specify the
--keep-umask option.

See #6243
2020-06-08 20:06:21 +03:00
Takuya ASADA
4eae7f66eb dist/offline_installer/debian: support cross build
Unlike the redhat version, the debian version already supported cross builds,
since it uses debootstrap, but the shell script refused to continue the build
on non-debian distributions, so drop those lines to allow building on Fedora.

[avi: regenerate toolchain]
2020-06-08 19:54:09 +03:00
Takuya ASADA
058da69a3b dist/debian/python3: cleanup build/debian, rename build directory
This is the scylla-python3 version of #6611, but we also need to rename the
.deb build directory for scylla-python3: since we currently share the build
directory, we may lose a .deb when building both the scylla and
scylla-python3 .deb packages.
So rename it to build/python3/debian.
2020-06-08 15:49:22 +03:00
Takuya ASADA
260d264d3c dist/debian: cleanup build/debian before building .deb
In 287d6e5, we stopped running rm -rf debian/ in build_deb.sh, since we now
have a prebuilt debian/ directory.
However, this can cause a .deb build error when the debian package source is
modified, since the directory is never cleaned up.

To prevent the build error, we need to clean up build/debian in
reloc/build_deb.sh before extracting contents from the relocatable package.
2020-06-08 15:18:42 +03:00
Pavel Emelyanov
d908646b28 logalloc: Compact segments on reclaim instead of migration
When reclaiming segments back to seastar, the code tries to free the segments
sequentially. For this it walks the segments from left to right and frees
them, but every time a non-empty segment is met it gets migrated to another
segment that is allocated from the right end of the list.

This is sometimes a waste of cycles. The destination segment inherits the
holes from the source one, and thus it will be compacted some time in the
future. Why not compact it right at reclamation time? It will take the
same time or less, but will result in better compaction.

To achieve this, the segment to be reclaimed is compacted with the existing
compact_segment_locked() code, with some special care around it:

1. The allocation of new segments from seastar is locked
2. The reclaiming of segments with evict-and-compact is locked as well
3. The emergency pool is opened (the compaction is called with non-empty
   reserve to avoid bad_alloc exception throw in the middle of compaction)
4. The segment is forcibly removed from the histogram and the closed_occupancy
   is updated just like it is with general compaction

The segments-migration auxiliary code can be removed after this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:07:35 +03:00
Pavel Emelyanov
4db6ef7b6d logalloc: Introduce RAII allocation lock
The lock prevents the segment_pool from calling for more segments from
the underlying allocator.

To be used in next patch.
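The RAII idea can be sketched generically (the names below are illustrative, not the real segment_pool interface): while a lock object is alive, the pool refuses to request more segments.

```cpp
// While any allocation_lock is alive, the pool reports that it may not
// allocate more segments from the underlying allocator; the lock is
// released automatically when the object goes out of scope.
class segment_pool {
    int _allocation_lock_count = 0;
public:
    class allocation_lock {
        segment_pool& _pool;
    public:
        explicit allocation_lock(segment_pool& p) : _pool(p) {
            ++_pool._allocation_lock_count;
        }
        ~allocation_lock() {
            --_pool._allocation_lock_count;
        }
        allocation_lock(const allocation_lock&) = delete;
        allocation_lock& operator=(const allocation_lock&) = delete;
    };
    bool can_allocate_more_segments() const {
        return _allocation_lock_count == 0;
    }
};
```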

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:07:30 +03:00
Pavel Emelyanov
2005aca444 logalloc: Shuffle code around region::impl::compact
This includes 3 small changes to facilitate next patching:
- rename region::impl::compact into compact_segment_locked
- merging former compact with compact_single_segment_locked
- moving log print and stats update into compact_segment_locked

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:06:45 +03:00
Kamil Braun
013330199d cdc/storage_proxy: keep cdc_service alive in storage_proxy operations
storage_proxy is never deinitialized, so it may still have used cdc_service
after the latter's destructor was called.

This fixes the problem by cdc_service inheriting from
async_sharded_service and storage_proxy calling shared_from_this on
the service whenever it uses it.

cdc_service inherits from async_sharded_service and not simply from
enable_shared_from_this, because there might be other services that
cdc_service depends on. Assuming that these services are
deinitialized after cdc_service (as they should), i.e. after stop() is
called on cdc_service, making cdc_service async_sharded_service will
keep their deinitialization code from being called until all references
to cdc_service disappear (async_sharded_service keeps stop() from
returning until this happens).
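A plain-C++ sketch of the keep-alive pattern, using enable_shared_from_this in place of seastar's async_sharded_service (all names are hypothetical; the real async_sharded_service additionally delays stop() until references drop):

```cpp
#include <memory>

// An in-flight operation grabs a shared_ptr to the service via
// shared_from_this(), so the service cannot be destroyed while the
// operation is running - the essence of the fix described above.
struct cdc_like_service : std::enable_shared_from_this<cdc_like_service> {
    int augment_calls = 0;

    void augment_mutation_call() {
        auto self = shared_from_this(); // keeps *this alive for the call
        ++augment_calls;
    }
};
```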

Some more improvements should be possible through some refactoring:
1. Make augment_mutation_call a free function, not a member of
   cdc_service: it doesn't need any state that cdc_service has.
   db_context can be passed down from storage_proxy when it calls the
   function.
2. Remove the storage_proxy -> cdc_service reference. storage_proxy
   only needs augment_mutation_call, which would not be a part of the
   service. This would also get rid of the proxy -> cdc -> proxy
   reference cycle that we have now, and would allow storage_proxy to be
   safely deinitialized after cdc_service.
3. Maybe we could even remove the cdc_service -> storage_proxy
   reference. Is it really needed?
2020-06-08 13:25:51 +03:00
Pavel Emelyanov
8c81c6b7aa logalloc: Do not lock reclaimer twice
The tracker::impl::reclaim is already in reclaim-locked
section, no need for yet another nested lock.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Pavel Emelyanov
0392c5ca77 logalloc: Do not calculate object size twice
When walking objects on compaction the migrator->size() virtual fn is
called twice.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Pavel Emelyanov
81c9c4c7b2 logalloc: Do not convert obj_desc to migrator back and forth
When calling alloc_small, the migrator is passed just to get the
object descriptor, but during compaction the descriptor is already
at hand, so there is no need to get it again.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Takuya ASADA
969c4258cf aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6540
2020-06-08 12:48:36 +03:00
Takuya ASADA
bebaaa038f dist/debian: fix node-exporter.service file name
Since 287d6e5, we have mistakenly been packaging node-exporter.service under
the wrong name in the .deb; it needs to be renamed correctly.

Fixes #6604
2020-06-08 12:39:18 +03:00
Asias He
dddde33512 gossip: Do not send shutdown message when a node is in unknown status
When a replacing node is in early boot-up and is not in HIBERNATE state
yet, if the node is killed by a user, the node will wrongly send a
shutdown message to other nodes. This is because UNKNOWN is not in
SILENT_SHUTDOWN_STATES, so in gossiper::do_stop_gossiping, the node will
send shutdown message. Other nodes in the cluster will call
storage_service::handle_state_normal for this node, since NORMAL and
SHUTDOWN status share the same status handler. As a result, other nodes
will incorrectly think the node is part of the cluster and the replace
operation is finished.

Such problem was seen in replace_node_no_hibernate_state_test dtest:

   n1, n2 are in the cluster
   n2 is dead
   n3 is started to replace n2, but n3 is killed in the middle
   n3 announces SHUTDOWN status wrongly
   n1 runs storage_service::handle_state_normal for n3
   n1 gets tokens for n3, which are empty, because n3 hasn't gossiped tokens yet
   n1 skips updating normal tokens for n3, but thinks n3 has replaced n2
   n4 starts to replace n2
   n4 checks the tokens for n2 in storage_service::join_token_ring (Cannot
      replace token {} which does not exist!) or
      storage_service::prepare_replacement_info (Cannot replace_address {}
      because it doesn't exist in gossip)

To fix, we add UNKNOWN into SILENT_SHUTDOWN_STATES and avoid sending
shutdown message.

Tests: replace_address_test.py:TestReplaceAddress.replace_node_no_hibernate_state_test
Fixes: #6436
2020-06-08 11:32:23 +02:00
Pavel Solodovnikov
6f6e6762ba cql: remove unused functions
It seems that the following functions are never used, delete them:
 * `function::has_reference_to`
 * `functions::get_overload_count`
 * `to_identifiers` in column_identifier.hh
 * `single_column_relation::get_map_key`

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200606115149.1770453-1-pa.solodovnikov@scylladb.com>
2020-06-08 11:28:57 +03:00
Piotr Sarna
3458bd2e32 db,view: fix outdated comments
Some comments still referred to variable names which are no longer
up-to-date.

Follow-up for #6560.
Message-Id: <2b857ccc900dd64f0d9379f5d6c87fd3aaa5d902.1591594042.git.sarna@scylladb.com>
2020-06-08 09:02:10 +03:00
Nadav Har'El
d6626c217a merge: add error injection to mv
Merged pull request https://github.com/scylladb/scylla/pull/6516 from
Piotr Sarna:

This series adds error injection points to materialized view paths:

	view update generation from staging sstables;
	view building;
	generating view updates from user writes.

This series comes with a corresponding dtest pull request which adds some
test cases based on error injection.

Fixes #6488
2020-06-07 19:23:23 +03:00
Avi Kivity
53a19fc1f2 Merge 'Debian version number fix' from Takuya
"
Now we generate dist/changelog on relocatable package generation time,
we cannot run '.rc' fixup on .deb package building time, need to do it
in debian_files_gen.py.

Also, we uses '_'  in version number for some test version packages,
which does not supported in .deb packaging system, need to replaced
with '-'.
"

* syuu1228-debian_version_number_fix:
  dist/debian: support version number containing '_'
  dist/debian: move version number fixup to debian_files_gen.py
2020-06-07 19:14:24 +03:00
Piotr Sarna
b3a6a33487 db,view: ensure that local updates are applied locally
In the current mutate_MV() code it's possible for a local endpoint
to become a target for a network operation. That's the source
of occasional benign `broken promise` error messages,
since the mutation is actually applied locally, so there's no point
in creating a write response handler - the node will not send a response
to itself over the network.
While at it, the code is deduplicated a little bit - with the paths
simplified, it's easier to ensure that a local endpoint is never
listed as a target for remote network operations.

Fixes #5459
Tests: unit(dev),
       dtest(materialized_views_test.TestMaterializedViews.add_dc_during_mv_insert_test)
2020-06-07 19:10:03 +03:00
Kamil Braun
a1e235b1a4 CDC: Don't split collection tombstone away from base update
Overwriting a collection cell using timestamp T is a process with
following steps:
1. inserting a row marker (if applicable) with timestamp T;
2. writing a collection tombstone with timestamp T-1;
3. writing the new collection value with timestamp T.
Since CDC does clustering of the operations by timestamp, this
would result in 3 separate calls to `transform` (in the case of
INSERT, or 2 in the case of UPDATE), which seems excessive,
especially when pre-/postimage is enabled. This patch makes
collection tombstones be treated as if they had the same TS as
the base write, so they are processed in one call to `transform`
(as long as TTLs are not used).

Also, `cdc_test` had to be updated in places that relied on the former
splitting strategy.

Fixes #6084
2020-06-07 17:09:05 +03:00
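The new grouping rule can be sketched like this (a toy model with hypothetical names, not the actual CDC code): a collection tombstone at timestamp T-1 is folded into the base write at timestamp T before clustering.

```python
from collections import defaultdict

def cluster_ops(ops, base_ts):
    # ops: list of (timestamp, op_name) pairs from a single base write.
    groups = defaultdict(list)
    for ts, op in ops:
        if op == 'collection_tombstone' and ts == base_ts - 1:
            ts = base_ts  # treat the tombstone as part of the base write
        groups[ts].append(op)
    return dict(groups)

# An INSERT overwriting a collection at timestamp 10 (steps 1-3 above):
ops = [(10, 'row_marker'), (9, 'collection_tombstone'), (10, 'collection_value')]
```

With the fold, `cluster_ops(ops, 10)` yields a single timestamp group, i.e. one call to `transform` instead of two.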
Tomasz Grabiec
c1df00859e sstables: Make deletion_time printable
Message-Id: <1591387901-7974-12-git-send-email-tgrabiec@scylladb.com>
2020-06-07 13:55:34 +03:00
Raphael S. Carvalho
8e47f61df7 compaction: Enable tombstone expiration based on the presence of the sstable set
For tombstone expiration to proceed correctly without the risk of resurrecting
data, the sstable set must be present.
Regular compaction and derivatives provide the sstable set, so they're able
to expire tombstones with no resurrection risk.
Resharding, on the other hand, can run on any shard, not necessarily on the
same shard that one of the input sstables belongs to, so it currently cannot
provide a sstable set for tombstone expiration to proceed safely.
That being said, let's only do expiration based on the presence of the set.
This makes room for the sstable set to be fed to compaction via the descriptor,
allowing even resharding to do expiration. Currently, compaction thinks that
the sstable set can only come from the table, and that also needs to be changed
for further flexibility.

It's theoretically possible that a given resharding job will resurrect data if
a fully expired SSTable is resharded at a shard which it doesn't belong to.
Resharding will have no way to tell that expiring all that data will lead to
resurrection because the relevant SSTables are at different shards.
This is fixed by checking for fully expired sstables only on presence of
the sstable set.

Fixes #6600.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200605200954.24696-1-raphaelsc@scylladb.com>
2020-06-07 11:46:48 +03:00
Pavel Solodovnikov
5b1b6b1395 cql: pass cql3::operation::raw_deletion by unique_ptr
Another small step towards shared_ptr usage reduction in cql3
code. Also make `raw_deletion` dtor virtual to make address
sanitizer happy in debug builds.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200606104528.1732241-1-pa.solodovnikov@scylladb.com>
2020-06-06 21:04:06 +03:00
Juliusz Stasiewicz
0ad50013ff storage_service: Implementation of API call to repair CDC streams
The command regenerates streams when:
- generations corresponding to a gossiped timestamp cannot be
fetched from `system_distributed` table,
- or when generation token ranges do not align with token metadata.

In such cases the streams are regenerated and a new timestamp is
gossiped around. The returned JSON is always empty, regardless of
whether the streams needed regeneration or not.
2020-06-06 16:52:21 +02:00
Takuya ASADA
9de65f26de dist/debian: support version number containing '_'
The .deb packaging system does not support version numbers containing '_';
it should be replaced with '-'.
2020-06-05 21:35:02 +09:00
Takuya ASADA
509ad875aa dist/debian: move version number fixup to debian_files_gen.py
Now that we generate dist/changelog at relocatable package generation time,
we cannot run the '.rc' fixup at .deb package build time; it needs to be done
in debian_files_gen.py.
2020-06-05 21:34:55 +09:00
Kamil Braun
1b7f1806ac test: improve comments on test_schema_digest_does_not_change
This test tends to cause a lot of discussion resulting from
not understanding what is actually being tested.

Closes https://github.com/scylladb/scylla/issues/6582.
2020-06-05 14:30:02 +02:00
Kamil Braun
d89b7a0548 cdc: rename CDC description tables
Commit 968177da04 has changed the schema
of cdc_topology_description and cdc_description tables in the
system_distributed keyspace.

Unfortunately this was a backwards-incompatible change: these tables
would always be created, irrespective of whether or not "experimental"
was enabled. They just wouldn't be populated with experimental=off.

If the user now tries to upgrade Scylla from a version before this change
to a version after this change, it will work as long as CDC is protected
by the experimental flag and the flag is off.

However, if we drop the flag, or if the user turns experimental on,
weird things will happen, such as nodes refusing to start because they
try to populate cdc_topology_description while assuming a different schema
for this table.

The simplest fix for this problem is to rename the tables. This fix must
get merged in before CDC goes out of experimental.
If the user upgrades his cluster from a pre-rename version, he will simply
have two garbage tables that he is free to delete after upgrading.

sstables and digests need to be regenerated for schema_digest_test since
this commit effectively adds new tables to the system_distributed keyspace.
This doesn't result in schema disagreement because the table is
announced to all nodes through the migration manager.
2020-06-05 09:59:16 +02:00
Piotr Sarna
64b8b77ac2 table: add error injection points to the materialized view path
... in order to be able to test scenarios with failures.
2020-06-05 09:39:58 +02:00
Piotr Sarna
76e89efc1a db,view: add error injection points to view building
... in order to be able to test scenarios with failures.
2020-06-05 09:39:58 +02:00
Piotr Sarna
9d524a7a7e db,view: add error injection points to view update generator
... in order to be able to test scenarios with failures.
2020-06-05 09:39:58 +02:00
Piotr Sarna
9a4394327a Merge 'CDC: Disallowed CDC for tables with counter column(s)'
from Juliusz.

CDC for counters is unimplemented as of now,
therefore any attempt to enable the CDC log on a counter
table needs to be clearly disallowed. This patch does
exactly this.

The check whether schema has counter columns
is performed in `cdc_service::impl` in:
- `on_before_create_column_family`,
- `on_before_update_column_family`
and, if so, results in `invalid_request_exception` being thrown.

Fixes #6553

* jul-stas-6553-disallow-cdc-for-counters:
  test/cql: Check that CDC for counters is disallowed
  CDC: Disallowed CDC for tables with counter column(s)
2020-06-05 07:46:53 +02:00
Nadav Har'El
ace1697aa9 alternator test: reproducer for unjustly refused condition expression
This patch adds a test reproducing issue #6572, where the perfectly
good condition expression:

   #name1 = :val1 OR #name2 = :val2

Gets refused because of the following combination in our implementation:

  1. Short-circuit evaluation, i.e., after we discover #name1 = :val1
     we don't evaluate the second half of the expression.

  2. The list of "used" references is collected at evaluation time,
     instead of at parsing time. Because evaluation never reaches
     #name2 (or :val2) our implementation complains that they are not
     used, and refuses the request - which should have been allowed.

This test xfails on Alternator. It passes on DynamoDB.

Refs #6572

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604171954.444291-1-nyh@scylladb.com>
2020-06-05 07:43:50 +02:00
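The interaction between short-circuiting and eval-time reference collection can be sketched like this (a toy model with a hypothetical tuple AST, not Alternator's actual expression code):

```python
def collect_refs_parse_time(expr):
    # Collect #name / :val references by walking the parsed expression
    # tree, independent of evaluation order - the approach the test
    # implies is needed. AST: ('or', lhs, rhs) | ('eq', name_ref, value_ref).
    op = expr[0]
    if op == 'or':
        return collect_refs_parse_time(expr[1]) | collect_refs_parse_time(expr[2])
    assert op == 'eq'
    return {expr[1], expr[2]}

def eval_short_circuit(expr, item, values, used):
    # Evaluate with short-circuiting, recording only the references it
    # actually reaches - which is why eval-time collection misses
    # #name2/:val2 once '#name1 = :val1' is already true.
    op = expr[0]
    if op == 'or':
        return (eval_short_circuit(expr[1], item, values, used)
                or eval_short_circuit(expr[2], item, values, used))
    _, name, val = expr
    used.update({name, val})
    return item.get(name) == values.get(val)

expr = ('or', ('eq', '#name1', ':val1'), ('eq', '#name2', ':val2'))
```

Evaluating `expr` against an item where the first comparison holds leaves `#name2` and `:val2` out of the eval-time `used` set, while the parse-time walk finds all four references.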
Piotr Sarna
0ba23d2b40 test: add manual test for tagging return value
While not very interesting by itself, the test case shows
that in case of TagResource and UntagResource it's actually correct
to return empty HTTP body instead of an empty JSON object,
which was the case for PutItem.
Message-Id: <6331963179c5174a695f0e9eeed17de6c9f9a3be.1591269516.git.sarna@scylladb.com>
2020-06-04 16:17:24 +03:00
Nadav Har'El
db45ff2733 alternator: clean up usage of describe_item()
The DynamoDB GetItem request returns the requested item in a specific way,
wrapped in a map with a "Item" member. For historic reasons, we used the
same function that returns this (describe_item()) also in other code which
reads items - e.g. for checking conditional operations. The result is
wasteful - after adding this "Item" member we had other code to extract it,
all for no good reason.  It is also ugly and confusing.

Importantly, this situation also makes it harder for me to add support for
FilterExpression. The issue is that the expression evaluator got the item
with the wrapper (from the existing ConditionExpression code) but the
filtering code had it without this wrapper, as it didn't use describe_item().

So this patch uses describe_single_item(), which doesn't add the wrapper
map, instead of describe_item(). The latter function is used just once -
to implement GetItem. The unnecessary code to unwrap the item in multiple
places was then dropped.

All the tests still pass. I also tested test_expected.py in unsafe_rmw write
isolation mode, because code only for this mode had to be modified as well.

Refs #5038.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604092050.422092-1-nyh@scylladb.com>
2020-06-04 12:33:48 +02:00
Nadav Har'El
3d26bde4c1 alternator doc: correct state of filtering support
Correct the compatibility section in docs/alternator/alternator.md:
Filtering of Scan/Query results using the older syntax (ScanFilter,
QueryFilter) is, after commit bea9629031,
now fully supported. The newer syntax (FilterExpression) is not yet.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200604073207.416860-1-nyh@scylladb.com>
2020-06-04 12:33:10 +02:00
Avi Kivity
5b92a6d9e4 build: drop __pycache__ directories from python3 relocatable package
Recently ./reloc/build_deb.sh started failing with

dpkg-source: info: using source format '1.0'
dpkg-source: info: building scylla-python3 using existing scylla-python3_3.8.3-0.20200604.77dfa4f15.orig.tar.gz
dpkg-source: info: building scylla-python3 in scylla-python3_3.8.3-0.20200604.77dfa4f15-1.diff.gz
dpkg-source: error: cannot represent change to scylla-python3/lib64/python3.8/site-packages/urllib3/packages/backports/__pycache__/__init__.cpython-38.pyc:
dpkg-source: error:   new version is plain file
dpkg-source: error:   old version is symlink to /usr/lib/python3.8/site-packages/__pycache__/six.cpython-38.pyc
dpkg-source: error: unrepresentable changes to source
dpkg-buildpackage: error: dpkg-source -b . subprocess returned exit status 1
debuild: fatal error at line 1182:

Those files are not in fact symlinks, so it's clear that dpkg is confused
about something. Rather than debug dpkg, however, it's easier to just
drop __pycache__ directories. These hold the result of bytecode
compilation and are therefore optional, as Python will compile the sources
if the cache is not populated.

Fixes #6584.
2020-06-04 13:04:34 +03:00
Israel Fruchter
a2bb48f44b fix "scylla_coredump_setup: Remove the coredump create by the check"
In 28c3d4, `out()` was used without `shell=True`, so the splitting of arguments
failed because of the complex commands in the cmd (pipes and such)

Fixes #6159
2020-06-04 12:55:10 +03:00
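The essence of the fix can be sketched in Python (the helper below is hypothetical, not the setup scripts' actual `out()`): command strings containing pipes must be handed to a shell rather than split into argv tokens.

```python
import subprocess

def out(cmd: str) -> str:
    # shell=True hands the whole string to /bin/sh, so pipes and other
    # shell syntax work; without it, the string would have to be split
    # into argv tokens, and a command like 'a | b' would break.
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout.strip()
```

Calling `out("echo hi | tr a-z A-Z")` runs the whole pipeline in a shell instead of treating `|` as a literal argument.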
Raphael S. Carvalho
77dfa4f151 sstables: kill unused resharding code
output_sstables is no longer needed after we made resharding
use a special interposer.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200603165324.176665-1-raphaelsc@scylladb.com>
2020-06-03 23:20:15 +03:00
Avi Kivity
0c34e114e2 Merge "Upgrade to seastar api version 3" (make_file_output_stream returns future) from Rafael
"
The new seastar api changes make_file_output_stream and
make_file_data_sink to return futures. This series includes a few
refactoring patches and the actual transition.
"

* 'espindola/api-v3-v3' of https://github.com/espindola/scylla:
  table: Fix indentation
  everywhere: Move to seastar api level 3
  sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream
  sstables: Pass a data_sink to checksummed_file_writer's constructor
  sstables: Convert a file_writer constructor to a static make
  sstables: Move file_writer constructor out of line
2020-06-03 23:09:49 +03:00
Rafael Ávila de Espíndola
686f9220c1 table: Fix indentation
It was broken by the previous commit.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-03 10:32:46 -07:00
Rafael Ávila de Espíndola
e5876f6696 everywhere: Move to seastar api level 3
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-03 10:32:46 -07:00
Rafael Ávila de Espíndola
13282b3d4c sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream
This is a bit simpler as we don't have to pass in the options and
moves the calls to make_file_output_stream to places where we can
handle futures.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-03 10:32:46 -07:00
Rafael Ávila de Espíndola
f6ec7364a7 sstables: Pass a data_sink to checksummed_file_writer's constructor
checksummed_file_writer cannot be moved, so we can't have a
checksummed_file_writer::make that returns a future. So instead we
pass in a data_sink and let the callers call make_file_data_sink.

This is in preparation for make_file_data_sink returning a future in
the seastar api v3.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-03 10:32:46 -07:00
Rafael Ávila de Espíndola
c1f37db72b sstables: Convert a file_writer constructor to a static make
For now it always returns a ready future. This is in preparation for
using seastar v3 api where make_file_output_stream returns a future.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-03 10:32:45 -07:00
Rafael Ávila de Espíndola
0bc4f3683a sstables: Move file_writer constructor out of line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-03 10:21:29 -07:00
Juliusz Stasiewicz
bf4050ed15 test/cql: Check that CDC for counters is disallowed
This test must be removed once we have an implementation of CDC for
tables with counter columns.
2020-06-03 18:31:44 +02:00
Juliusz Stasiewicz
3a079cf21b CDC: Disallowed CDC for tables with counter column(s)
Until we have an implementation of CDC for counters, we explicitly
disallow it. The check is performed in `cdc_service::impl` in:
- `on_before_create_column_family`,
- `on_before_update_column_family`
and results in `invalid_request_exception` being thrown.
2020-06-03 18:29:36 +02:00
Avi Kivity
86d7f2f91b Update seastar submodule
* seastar 9066edd512...42e770508c (15):
  > Revert "sharded: constrain sharded::map_reduce0"
  > tls: Fix race/unhandled case in reloadable_certificates
  > fair_queue: rename operator< to strictly_less
  > future: Add a current_exception_future_marker
  > Merge "Avoid passing non nothrow move constructible lambdas to future::then" from Rafael
  > tls_echo_server_demo: main: capture server post stop()
  > tests: fstream: remove obsolete comments about running in background
  > everywhere: Reopen inline namespaces as inline
  > Merge "Merge the two do_with implementations" from Rafael
  > sharded: constrain sharded::map_reduce0
  > Merge "Backtracing across tasks" from Tomasz
  > posix-stack: fix strict aliasing violations on CMSG_DATA(cmsghdr)
  > sharded: unify invoke_on_*() variants
  > sharded_parameter_demo: Delete unused member variable
  > futures_test: Fix delete of copy constructor
2020-06-03 19:18:27 +03:00
Botond Dénes
72b8a2d147 querier: move common stuff into querier_base
The querier cache expects all querier objects it stores to have certain
methods. To avoid accessing these via `std::visit()` (the querier object
is stored in an `std::variant`), we move all the stuff that is common to
all querier types into a base class. The querier cache now accesses the
members via a reference to this common base. Additionally the variant is
eliminated completely and the cache entry stores an
`std::unique_ptr<querier_base>` instead.

Tests: unit(dev)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200603152544.83704-1-bdenes@scylladb.com>
2020-06-03 18:45:33 +03:00
Raphael S. Carvalho
077b4ee97d table: Don't remove a SSTable from the backlog tracker if not previously added
After 7f1a215, a sstable is only added to backlog tracker if
sstable::shared() returns true.

sstable::shared() can return true for a sstable that is actually owned
by more than one shard, but it can also incorrectly return true for
a sstable which wasn't made explicitly unshared through set_unshared().
A recent work of mine is getting rid of set_unshared() because a
sstable has the knowledge to determine whether or not it's shared.

The problem starts with streaming sstable which hasn't set_unshared()
called for it, so it won't be added to backlog tracker, but it can
be eventually removed from the tracker when that sstable is compacted.
Also, it could happen that a shared sstable, which was resharded, will
be removed from the tracker even though it wasn't previously added.

When those problems happen, backlog tracker will have an incorrect
account of total bytes, which leads it to producing incorrect
backlogs that can potentially go negative.

These problems are fixed by making every add / removal go through
functions which take into account sstable::shared().

Fixes #6227.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-2-raphaelsc@scylladb.com>
2020-06-03 17:35:22 +03:00
Raphael S. Carvalho
fb6976f1b9 Make sure SSTables created by streaming are added to backlog tracker
New SStables are only added to backlog tracker if set_unshared() was
called on their behalf. SStables created for streaming are not being
added to the tracker because make_streaming_sstable_for_write()
doesn't call set_unshared(), nor does its caller. This results in the backlog
not accounting for their existence, which means the backlog will be much
lower than expected.

This problem could be fixed by adding a set_unshared() call but it
turns out we don't even need set_unshared() anymore. It was introduced
when Scylla metadata didn't exist, now a SSTable has built-in knowledge
of whether or not it's shared. Relying on every SSTable creator calling
set_unshared() is bug prone. Let's get rid of it and let the SStable
itself say whether or not it's shared. If an imported SSTable has no
Scylla metadata, Scylla will still be able to compute shards using
token range metadata.

Refs #6021.
Refs #6227.
Fixes #6441.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-1-raphaelsc@scylladb.com>
2020-06-03 17:35:22 +03:00
Tomasz Grabiec
087fa42c1d Merge "utils: inject errors around paxos stages" from Alejo
Add Paxos error injections before/after save promise, proposal, decision,
paxos_response_handler, delete decision.

Adds a method to inject an error by providing a lambda, while avoiding
adding a continuation when the error injection is disabled.

For this, provide an error exception and enter() to allow flow control
(i.e. return) on simple error injections without lambdas.

Also includes Pavel's patch for CQL API for error injections, updated to
current error injection API and added one_shot support. Also added some
basic CQL API boost tests.

For CQL API there's a limitation of the current grammar not supporting
f(<terminal>) so values have to be inserted in a table until this is
resolved. See #5411

* https://github.com/alecco/scylla/tree/error_injection_v11:
  paxos: fix indentation
  paxos: add error injections
  utils: add timeout error injection with lambda
  utils: error injection add enter() for control flow
  utils: error injections provide error exceptions
  failure_injector: implement CQL API for failure injector class
  lwt: fix disabled error injection templates
2020-06-03 15:42:10 +02:00
Piotr Sarna
8fc3ca855e alternator: fix the return type of PutItem
Even if there are no attributes to return from PutItem requests,
we should return a valid JSON object, not an empty string.

Fixes #6568
Tests: unit(dev)
2020-06-03 16:03:13 +03:00
Piotr Sarna
3aff52f56e alternator: fix returning UnprocessedKeys unconditionally
Client libraries (e.g. PynamoDB) expect the UnprocessedKeys
and UnprocessedItems attributes to appear in the response
unconditionally - it's hereby added, along with a simple test case.

Fixes #6569
Tests: unit(dev)
2020-06-03 15:48:16 +03:00
Alejo Sanchez
59d60ae672 paxos: fix indentation
Fix indentation

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:47:18 +02:00
Alejo Sanchez
019c96cfda paxos: add error injections
Adds error injections on critical points for:

    prepare
    accept
    learn
    release_semaphore_for_key

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:44:53 +02:00
Alejo Sanchez
a8b14b0227 utils: add timeout error injection with lambda
Even though calling then() on a ready future does not allocate a
continuation, calling then on the result of it will allocate.

This error injection only adds a continuation in the dependency
chain if error injections are enabled at compile time and this particular
error injection is enabled.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:44:00 +02:00
Alejo Sanchez
0321172677 utils: error injection add enter() for control flow
For control flow (i.e. return) and simplicity add enter() method.

For disabled injections, this method is const returning false,
therefore it has no overhead.

Add boost test.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:42:48 +02:00
Nadav Har'El
bea9629031 alternator: implement remaining QueryFilter / ScanFilter functionality
This patch implements the missing QueryFilter (and ScanFilter)
functionality:

1. All operators. Previously, only the "EQ" operator was implemented.
2. Either "OR" or "AND" of conditions (previously only "AND").
3. Correctly returning Count and ScannedCount for post-filter and
   pre-filter item counts, respectively.

All of the previously-xfailing tests in test_query_filter.py are now
passing.

The implementation in this patch abandons our previous attempts to
translate the DynamoDB API filters into Scylla's CQL filters.
Doing this correctly for all operators would have been exceedingly
difficult (for reasons explained in #5028), and simply not worth the
effort: CQL's filters receive a page of results and then filter them,
and we can do exactly the same without CQL's filters:

The new code just retrieves an unfiltered page of items, and then for
each of these items checks whether it passes the filters. The great thing
is that we already had code for this checking - the QueryFilter syntax is
identical to the "Expected" syntax (for conditional operations) that
we already supported, so we already had code for checking these conditions,
including all the different operators.

This patch prepares for the future need to support also the newer
FilterExpression syntax (see issue #5038), and the "filter" class
supports either type of filter - the implementation for the second
syntax is just missing and can be added (fairly easily) later.

Fixes #5028.
Refs #5038.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200603110118.399325-1-nyh@scylladb.com>
2020-06-03 13:16:45 +02:00
Piotr Dulikowski
97cb2892b2 cdc: include information about all PKs in trace
This fixes a bug in CDC mutation augmentation logic. A lambda that is
called for each partition key in a batch captures a trace state pointer,
but moves it out after being called for the first time. This caused CDC
tracing information to be included only for one of the partition keys
of the batch.

Fixes #6575
2020-06-03 11:07:57 +02:00
Nadav Har'El
f6b1f45d69 alternator: fix order conditions on binary attributes
We implemented the order operators (LT, GT, LE, GE, BETWEEN) incorrectly
for binary attributes: DynamoDB requires that the bytes be treated as
unsigned for the purpose of order (so byte 128 is higher than 127), but
our implementation uses Scylla's "bytes" type which has signed bytes.

The solution is simple - we can continue to use the "bytes" type, but
we need to use its compare_unsigned() function, not its "<" operator.

This bug affected conditional operations ("Expected" and
"ConditionExpression") and also filters ("QueryFilter", "ScanFilter",
"FilterExpression"). The bug did *not* affect Query's key conditions
("KeyConditions", "KeyConditionExpression") because those already
used Scylla's key comparison functions - which correctly compare binary
blobs as unsigned bytes (in fact, this is why we have the
compare_unsigned() function).

The patch also adds tests that reproduce the bugs in conditional
operations, and show that the bug did not exist in key conditions.

Fixes #6573

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200603084257.394136-1-nyh@scylladb.com>
2020-06-03 10:55:50 +02:00
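The signed-vs-unsigned distinction can be demonstrated in a few lines (a sketch, not Scylla's actual `bytes` type; note Python's own `bytes` comparison is already unsigned):

```python
def compare_signed(a: bytes, b: bytes) -> int:
    # The buggy order: interpret each byte as *signed*, under which
    # 0x80 (-128) sorts below 0x7f (127).
    sa = [x - 256 if x >= 128 else x for x in a]
    sb = [x - 256 if x >= 128 else x for x in b]
    return (sa > sb) - (sa < sb)

def compare_unsigned(a: bytes, b: bytes) -> int:
    # The order DynamoDB requires: unsigned bytes, so 0x80 > 0x7f.
    return (a > b) - (a < b)
```

Under signed comparison `b'\x80'` unjustly sorts below `b'\x7f'`; the unsigned comparison gives the required order.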
Takuya ASADA
536ab4ebe4 reloc-pkg: move all files under project name directory
To make a unified relocatable package easily, we may want to merge the tarballs into a single tarball like this:
zcat *.tar.gz | gzip -c > scylla-unified.tar.xz
But that's not possible with the current relocatable package format, since there are multiple file conflicts: install.sh, SCYLLA-*-FILE, dist/, README.md, etc.

To support this, we need to archive everything in the directory when building relocatable package.

Since this modifies the relocatable package format, we need to provide a way to
detect the format version.
To do this, we added a new file ".relocatable_package_version" at the top of the
archive, and set the version number in the file to "2".

Fixes #6315
2020-06-03 09:52:44 +03:00
Israel Fruchter
28c3d4f8e8 scylla_coredump_setup: Remove the coredump create by the check
We generate a coredump as part of "scylla_coredump_setup" to verify that
coredumps are working. However, we need to *remove* that test coredump
to avoid people and test infrastructure reporting those coredumps.

Fixes #6159
2020-06-03 09:30:45 +03:00
Pekka Enberg
bdd0fcd0b7 Revert "scylla_current_repo: support diffrent $PRODUCT"
This reverts commit e5da79c211 because the
URLs are incorrect: both open source and enterprise repository URLs are
in

  http://downloads.scylladb.com/rpm/centos/

or

  http://downloads.scylladb.com/deb/{debian,ubuntu}
2020-06-02 18:33:02 +03:00
Nadav Har'El
0d337a716b alternator test: confirm understanding of query paging with filtering
This test (which passes successfully on both Alternator and DynamoDB)
was written to confirm our understanding of how the *paging* feature
works.

Our understanding, based on DynamoDB documentation, has been that the
"Limit" parameter determines the number of pre-filtering items, *not*
the actual number of items returned after having passed the filter.
So the number of items actually returned may be lower than Limit - in
some cases even zero.

This test tries an extreme case: We scan a collection of 20 items with
a filter matching only 10 (or so) of them, with Limit=1, and count
the number of pages that we needed to request until collecting all these
10 (or so) matches. We note that the result is 21 - i.e., DynamoDB and
Alternator really went through the 20 pre-filtering items one by one,
and for the items which didn't match the filter returned an empty page.
The last page (the 21st) is always empty: DynamoDB or Alternator doesn't
know whether or not there is a 21st item, and it takes a 21st request
to discover there isn't.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200602145015.361694-1-nyh@scylladb.com>
2020-06-02 16:57:49 +02:00
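The paging model the test confirms can be sketched as a toy simulation (hypothetical names, not the real API): `Limit` bounds the pre-filter items examined per request, and a continuation key is returned whenever a full `limit` worth of items was consumed, so scanning 20 items with Limit=1 takes 21 requests.

```python
def scan_all(items, matches, limit=1):
    requests, results, pos = 0, [], 0
    while True:
        requests += 1
        chunk = items[pos:pos + limit]  # pre-filter slice bounded by Limit
        results += [it for it in chunk if matches(it)]
        pos += len(chunk)
        # A LastEvaluatedKey is returned whenever the request consumed a
        # full `limit` worth of items - even if it was exactly the last
        # one - so one extra, empty request is needed to learn we're done.
        if len(chunk) < limit:
            break
    return requests, results
```

Scanning 20 items with a filter that keeps the 10 even ones and `limit=1` issues 21 requests, many of which return empty pages.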
Nadav Har'El
43138c0e5e alternator test: test Count/ScannedCount return of Query
This test reproduces a bug in the current implementation of
QueryFilter, which returns for ScannedCount the count of
post-filter items, whereas it should return the pre-filter
count.

The test tests both ScannedCount and Count, when QueryFilter
is used and when it isn't used.

The test currently xfails on Alternator, passes on DynamoDB.

Refs #5028

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200602125924.358636-1-nyh@scylladb.com>
2020-06-02 16:57:49 +02:00
Pekka Enberg
0b30df8f23 Merge 'scylla_coredump_setup: fix coredump directory mount' from Amos
"Currently in coredump setup, we enabled a systemd mount to mount the default
 coredump directory to /var/lib/scylla/coredump, but we didn't start it.
 So the coredump will still be saved to the default coredump directory
 before a system reboot, which might hit an ENOSPC problem.

 One patch starts the systemd mount during coredump setup, making the
 mount take effect. Another patch improves the error message of the systemd
 unit, which is confusing when the unit config is invalid."

Fixes #6566

* 'coredump_conf' of git://github.com/amoskong/scylla:
  scylla_util/systemd_unit: improve the error message
  active the coredump directory mount during coredump setup
2020-06-02 17:56:19 +03:00
Juliusz Stasiewicz
e04fd9f774 counters: Read the state under timeout
Counter update is a RMW operation. Until now the "Read" part was
not guarded by a timeout, which is changed in this patch.

Fixes #5069
2020-06-02 15:10:43 +03:00
Amos Kong
b2f59c9516 scylla_util/systemd_unit: improve the error message
We always raise the exception 'Unit xxx not found' when an exception is raised
while executing 'systemctl cat xxx'. Sometimes this error is confusing.

On OEL7, 'systemctl cat var-lib-systemd-coredump.mount' will also verify
the config content; scylla_coredump_setup failed because the config file
was invalid, but the error was 'unit var-lib-systemd-coredump.mount not found'.

This patch improved the error message.

Related issue: https://github.com/scylladb/scylla/issues/6432
2020-06-02 18:03:15 +08:00
Amos Kong
abf246f6e5 active the coredump directory mount during coredump setup
Currently we use a systemd mount (var-lib-systemd-coredump.mount) to mount
default coredump directory (/var/lib/systemd/coredump) to
(/var/lib/scylla/coredump). The /var/lib/scylla had been mounted to a big
storage, so we will have enough space for coredump after the mount.

Currently in coredump_setup, we only enable var-lib-systemd-coredump.mount,
but do not start it. The directory won't be mounted after coredump_setup, so the
coredump will still be saved to the default coredump directory.
The mount will only take effect after a reboot.

Fixes #6566
2020-06-02 18:03:15 +08:00
Pekka Enberg
9d9d54c804 Revert "scylla_coredump_setup: Fix incorrect coredump directory mount"
This reverts commit e77dad3adf because it's
incorrect.

Amos explains:

"Quote from https://www.freedesktop.org/software/systemd/man/systemd.mount.html

 What=

   Takes an absolute path of a device node, file or other resource to
   mount. See mount(8) for details. If this refers to a device node, a
   dependency on the respective device unit is automatically created.

 Where=

   Takes an absolute path of a file or directory for the mount point; in
   particular, the destination cannot be a symbolic link. If the mount
   point does not exist at the time of mounting, it is created as
   directory.

 So the mount point is '/var/lib/systemd/coredump' and
 '/var/lib/scylla/coredump' is the file to mount, because /var/lib/scylla
 had mounted a second big storage, which has enough space for Huge
 coredumps.

 Bentsi and others hit a problem with an old scylla-master AMI: a coredump
 occurred but was not successfully saved to disk due to ENOSPC. The directory
 /var/lib/systemd/coredump wasn't mounted to /var/lib/scylla/coredump.
 They WRONGLY thought the wrong mount was caused by the config problem,
 so a fix was posted.

 Actually scylla-ami-setup / coredump wasn't executed on that AMI; the error
 'unit scylla-ami-setup.service not found' appeared because the
 'scylla-ami-setup.service' config file doesn't exist or is invalid.

 Details of my testing: https://github.com/scylladb/scylla/issues/6300#issuecomment-637324507

 So we need to revert Bentsi's patch; it changed the right config to a wrong one."
2020-06-02 11:41:31 +03:00
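Based on the What=/Where= semantics quoted above, the restored unit would look roughly like this (a sketch, not the shipped file: the Description and the Type=/Options= bind-mount settings are assumptions; note the unit file name must match the Where= path):

```ini
# var-lib-systemd-coredump.mount
[Unit]
Description=Bind mount the scylla coredump directory over the systemd default

[Mount]
What=/var/lib/scylla/coredump
Where=/var/lib/systemd/coredump
Type=none
Options=bind

[Install]
WantedBy=multi-user.target
```

Here What= is the big-storage directory being mounted and Where= is the mount point that systemd-coredump writes to, matching Amos's explanation.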
Avi Kivity
6f394e8e90 tombstone: use comparison operator instead of ad-hoc compare() function and with_relational_operators
The comparison operator (<=>) default implementation happens to exactly
match tombstone::compare(), so use the compiler-generated defaults. Also
default operator== and operator!= (these are not brought in by operator<=>).
These become slightly faster as they perform just an equality comparison,
not three-way compare.

shadowable_tombstone and row_tombstone depend on tombstone::compare(),
so convert them too in a similar way.

with_relational_operators.hh becomes unused, so delete it.

Tests: unit (dev)
Message-Id: <20200602055626.2874801-1-avi@scylladb.com>
2020-06-02 09:28:52 +03:00
Piotr Sarna
160e2b06f9 test: move random string helpers to .cc
... since there's no reason for them to reside in a header,
and .cc is our default destination.

Message-Id: <2509410f0f71df036a7829f1f799503c1a671404.1591078777.git.sarna@scylladb.com>
2020-06-02 09:27:59 +03:00
Avi Kivity
a4c44cab88 treewide: update concepts language from the Concepts TS to C++20
Seastar recently lost support for the experimental Concepts Technical
Specification (TS) and gained support for C++20 concepts. Re-enable
concepts in Scylla by updating our use of concepts to the C++20
standard.

This change:
 - peels off uses of the GCC6_CONCEPT macro
 - removes inclusions of <seastar/gcc6-concepts.hh>
 - replaces function-style concepts (no longer supported) with
   equation-style concepts
 - semicolons added and removed as needed
 - deprecated std::is_pod replaced by recommended replacement
 - updates return type constraints to use concepts instead of
   type names (either std::same_as or std::convertible_to, with
   std::same_as chosen when possible)

No attempt is made to improve the concepts; this is a specification
update only.
Message-Id: <20200531110254.2555854-1-avi@scylladb.com>
2020-06-02 09:12:21 +03:00
Nadav Har'El
c77bc5bf51 merge: big_decimal: migrate to open-coded implementation
Merged patch series by Piotr Sarna:

This series migrates the regex-based implementation of big decimal
parsing to a more efficient one, based on string views.
The series originated as a single patch, but was later
extended by more tests and a microbenchmark.
Perf results, comparing the old implementation, the new one,
and the experimental one from v2 of this series are here:

test             iterations      median         mad         min         max

Regex:                88895    11.228us    25.891ns    11.202us    11.510us
String view:         232334     4.303us    21.660ns     4.282us     4.736us
State machine (experimental, ditched):
                     148318     6.723us    51.896ns     6.672us     6.877us
Tests: unit(dev)

Piotr Sarna (4):
  big_decimal: migrate to string views
  test: add test cases to big_decimal_test
  test/lib: add generating random numeric string
  test: add big_decimal perf test

 configure.py                   |  1 +
 test/boost/big_decimal_test.cc | 29 +++++++++++++++++++
 test/lib/make_random_string.hh | 11 +++++++
 test/perf/perf_big_decimal.cc  | 52 ++++++++++++++++++++++++++++++++++
 utils/big_decimal.cc           | 51 ++++++++++++++++++++++-----------
 5 files changed, 127 insertions(+), 17 deletions(-)
2020-06-02 09:12:21 +03:00
Takuya ASADA
6b19479ce5 dist/offline_installer/debian: support latest distributions
Added Ubuntu 18.04 and Debian 9/10.
2020-06-02 09:12:21 +03:00
Piotr Sarna
d1f5d42a25 test: add big_decimal perf test
In order to be able to measure the impact of rewriting
the parsing mechanism from std::regex to a hand-written
state machine.
2020-06-01 16:11:49 +02:00
Piotr Sarna
91e02ed3ad test/lib: add generating random numeric string
Useful for testing random numeric inputs, e.g. big decimals.
2020-06-01 16:11:49 +02:00
Piotr Sarna
ecc4a87a24 test: add test cases to big_decimal_test
Test cases for big decimals were quite complete, but since the
implementation was recently changed, some corner cases are added:
 - incorrect strings
 - numbers not fitting into uint64_t
 - numbers less than uint64_t::max themselves, but with the unscaled
   value exceeding the maximum
2020-06-01 16:11:49 +02:00
Piotr Sarna
7b5db478ed big_decimal: migrate to string views
Big decimals are, among other use cases, used as a main number
type for alternator, and as such can appear on the fast path.
Parsing big decimals was performed via std::regex, which is not
precisely famous for its speed, and also enforces unnecessary
string copying. Therefore, the implementation is replaced
with an open-coded version based on string_views.
One previous iteration of this series also included
a hand-coded state machine implementation, but it proved
to be slower than the slightly naive string_view one.
Overall, execution time is reduced by 61.6% according to
microbenchmarks, which sounds like a promising improvement.
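A minimal sketch of the idea (not Scylla's actual parser): consume a digit run from a string_view in a single pass, with no allocation and no regex machinery:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Parse a leading run of decimal digits, advancing the view past them.
// Overflow checking is omitted for brevity; real code must handle it.
std::optional<uint64_t> parse_leading_digits(std::string_view& s) {
    uint64_t value = 0;
    std::size_t i = 0;
    while (i < s.size() && s[i] >= '0' && s[i] <= '9') {
        value = value * 10 + static_cast<uint64_t>(s[i] - '0');
        ++i;
    }
    if (i == 0) {
        return std::nullopt;  // no digits at the front
    }
    s.remove_prefix(i);
    return value;
}
```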

Perf results:
test                                      iterations      median         mad         min         max

Regex (original):
big_decimal_test.from_string                   88895    11.228us    25.891ns    11.202us    11.510us

String view (new):
big_decimal_test.from_string                  232334     4.303us    21.660ns     4.282us     4.736us

State machine (experimental, ditched):
big_decimal_test.from_string                  148318     6.723us    51.896ns     6.672us     6.877us

Tests: unit(dev + release(big_decimal_test))
2020-06-01 16:11:49 +02:00
Gleb Natapov
9848328844 lwt: do not go over the replica list in case a quorum is already reached
Also add a comment that clarifies why doing prune before learning on all
replicas is safe.

Message-Id: <20200531143523.GN337013@scylladb.com>
2020-06-01 12:57:37 +02:00
Asias He
6c89cedf0a repair: Do not pass table names to repair_info
Get the table names from the table ids instead, which prevents users
of the repair_info class from providing inconsistent table names and table ids.

Refs: #5942
2020-06-01 17:44:05 +08:00
Asias He
12d929a5ae repair: Add table_id to row_level_repair
Now that repair_info has table ids for the tables we want to repair, use
table_id instead of table_name in row level repair to find a table. It
guarantees we repair the same table even if a table is dropped and a new
table is created with the same name.

Refs: #5942
2020-06-01 17:34:25 +08:00
Asias He
7ea8bf648d repair: Use table id to find a table in get_sharder_for_tables
We are moving to use the table id instead of table name to get a table
in repair. It guarantees the same table is repaired.

Refs: #5942
2020-06-01 17:34:25 +08:00
Asias He
378e31b409 repair: Add table_ids to repair_info
A helper get_table_ids is added to convert the table names to table ids.
We convert it once and use the same table ids for the whole repair
operations. This guarantees we repair the same table during the same
repair request.

Refs: #5942
2020-06-01 17:34:25 +08:00
Asias He
ad878a56eb repair: Make func in tracker::run run inside a thread
It simplifies the code in func and makes it easier to write loops that
do not stall.

Refs: #5942
2020-06-01 17:34:16 +08:00
Avi Kivity
cb17baea77 Merge "Remove storage service from various places" from Pavel E
"
This is a combined set of tiny cleanups that has been
collected for the past few monthes. Mostly about removing
storage_service.hh inclusions here and there.

tests: unit(dev), headers compilation
"

* 'br-storage-service-cleanups-a' of https://github.com/xemul/scylla:
  storage_service: Remove some inclusions of its header
  storage_service: Move get_generation_number to util/
  streaming: Get local db with own helper
  streaming: Fix indentation after previous patch
  streaming: Do not explicitly switch sched group
2020-06-01 10:44:12 +03:00
Israel Fruchter
cd96202dcb fix(scylla_prepare): missing platform import
as part of eabcb31503 `import platform`
was removed from scylla_utils.py

it seems we missed its usage in the scylla_prepare script
2020-06-01 10:33:18 +03:00
Pavel Emelyanov
67d5fad65f storage_service: Remove some inclusions of its header
GC pass over .cc files. Some really do not need it; some need it for features/gossiper

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Pavel Emelyanov
ee31191e21 storage_service: Move get_generation_number to util/
This is purely utility helper routine. As a nice side effect the
inclusion of storage_service.hh is removed from several unrelated
places.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Pavel Emelyanov
07add9767b streaming: Get local db with own helper
There's a static global instance of the needed services and helpers
in the streaming code. Using them is not great, but at least this
change unifies different pieces of streaming code and removes the
storage_service.hh inclusion from streaming_session.cc (the
streaming_session.hh doesn't include it either).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Pavel Emelyanov
428ef9c9ac streaming: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Pavel Emelyanov
5db04fcf30 streaming: Do not explicitly switch sched group
This is a continuation of ac998e95 -- the sched group is
switched by messaging service for a verb, no need to do
it by hands.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Pavel Solodovnikov
022d5f6498 cql3: use unique_ptr's for cql3::operation::raw_update
These are not shared anywhere and so can be easily changed to
be stored in std::unique_ptr instead of shared_ptr's.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200531201051.946432-1-pa.solodovnikov@scylladb.com>
2020-05-31 23:45:19 +03:00
Botond Dénes
7c56e79355 test/multishard_mutation_query_test: eliminate another unsafely used boost test macro
Boost test macros are not thread safe, using them from multiple threads
results in garbled XML test report output.
3f1823a4f0 replaced most of the
thread-unsafe boost test macros in multishard_mutation_query_test, but
one still managed to slip through the cracks. This patch removes that as
well.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529130706.149603-3-bdenes@scylladb.com>
2020-05-31 16:08:02 +03:00
Botond Dénes
c5b0e8a45a test: move thread-safe test macro alternatives to lib/test_utils.hh
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529130706.149603-2-bdenes@scylladb.com>
2020-05-31 16:08:02 +03:00
Israel Fruchter
eabcb31503 scylla_util.py: replace platform.dist() with distro package
since dbuild was updated to fedora-32, hence to python3.8
`platform.dist()` is deprecated, and need to be replaced

Fixes: #6501

[avi: folded patch with install-dependencies.sh change]
[avi: regenerated toolchain]
2020-05-31 13:42:34 +03:00
Avi Kivity
e63fd76a04 Update seastar submodule
* seastar c97b05b238...9066edd512 (2):
  > Merge "Delete c++14 support code" from Rafael
  > coroutines: add support for forwarding returns
2020-05-31 13:12:16 +03:00
Botond Dénes
7ea64b1838 test: mutation_reader_test: use <ranges>
Replace all the ranges stuff we use from boost with the std equivalents.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529141407.158960-3-bdenes@scylladb.com>
2020-05-31 12:58:59 +03:00
Botond Dénes
a9e6fe4071 utils: introduce ranges::to()
Sadly, std::ranges is missing an equivalent of boost::copy_range(), so
we introduce a replacement: ranges::to(). There is an existing proposal
to introduce something similar to the standard library:
std::ranges::to() (https://github.com/cplusplus/papers/issues/145). We
name our own version similarly, so if said proposal makes it in we can
just prepend std:: and be good.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529141407.158960-2-bdenes@scylladb.com>
2020-05-31 12:58:59 +03:00
Pavel Solodovnikov
c4bbeb80db cql3: pass column_specification by ref to cql3::assignment_testable functions
This patch changes the signatures of `test_assignment` and
`test_all` functions to accept `cql3::column_specification` by
const reference instead of shared pointer.

Mostly a cosmetic change reducing overall shared_ptr bloat in
cql3 code.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200529195249.767346-1-pa.solodovnikov@scylladb.com>
2020-05-30 09:49:29 +03:00
Raphael S. Carvalho
d6b4a9a237 compaction: increase the frequency at which we check for abortion requests
Compaction is checking for abortion whenever it's consuming a new partition.
The problem with this approach is that the abortion can take too long if
compaction is working with really large partitions. If the current partition
takes minutes to be compacted, it means that abortion may be delayed
by minutes as well.

Truncate, for example, relies on this abortion mechanism, so the
operation could take much longer than expected due to this inefficiency,
likely resulting in timeouts on the user side.
To fix this, it's clear that we need to increase the frequency at which
we check for abortion requests. More precisely, we need to do it not only
on partition granularity, but also on row granularity.
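The shape of the fix can be sketched as follows (types simplified; this is not the actual compaction code):

```cpp
#include <atomic>
#include <stdexcept>
#include <vector>

struct compaction_aborted : std::runtime_error {
    compaction_aborted() : std::runtime_error("compaction aborted") {}
};

// Check the abort flag per row, not only when a new partition starts,
// so cancellation latency stays bounded even for huge partitions.
void compact_rows(const std::atomic<bool>& abort_requested,
                  const std::vector<int>& rows, std::vector<int>& out) {
    for (int row : rows) {
        if (abort_requested.load(std::memory_order_relaxed)) {
            throw compaction_aborted();
        }
        out.push_back(row);  // stand-in for compacting/writing the row
    }
}
```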

Fixes #6309.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200529172847.44444-1-raphaelsc@scylladb.com>
2020-05-29 21:23:49 +02:00
Pavel Emelyanov
878f8d856a logalloc: Report reclamation timing with rate
The timer.stop() call, which reports not only the time taken but also
the reclamation rate, was unintentionally dropped while expanding its
scope (c70ebc7c).

Take it back (and mark the compact_and_evict_locked as private while
at it).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185331.10537-1-xemul@scylladb.com>
2020-05-29 14:50:43 +02:00
Botond Dénes
94e00186b6 test.py: centralize the determining whether stdout is a tty
Currently test.py has three different places it checks whether stdout is
a tty. This patch centralizes these into a single global variable. This
ensures consistency and makes it easier to override it later with a
command-line switch, should we want to.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529101124.123925-1-bdenes@scylladb.com>
2020-05-29 14:50:43 +02:00
Pavel Emelyanov
7696ed1343 shard_tracker: Configure it in one go
Instead of doing 3 smp::invoke_on_all-s and duplicating
tracker::impl API for the tracker itself, introduce the
tracker::configure, simplify the tracker configuration
and narrow down the public tracker API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185442.10682-1-xemul@scylladb.com>
2020-05-29 14:50:43 +02:00
Kamil Braun
a304f774f8 redis: don't include storage_proxy.hh unnecessarily
Use a forward declaration instead.
2020-05-29 13:34:42 +02:00
Juliusz Stasiewicz
f2cedbc228 cdc: Remove assert that bootstrap_tokens is nonempty 2020-05-29 12:23:08 +02:00
Juliusz Stasiewicz
aadd2ffa6a api: Added command /storage_service/cdc_streams_check_and_repair
This commit introduces a placeholder for HTTP POST request at
`/storage_service/cdc_streams_check_and_repair`.
2020-05-29 12:23:08 +02:00
Avi Kivity
0c6bbc84cd Merge "Classify queries based on their initiator, rather than their target" from Botond
"
Currently we classify queries as "system" or "user" based on the table
they target. The class of a query determines how the query is treated,
currently: timeout, limits for reverse queries and the concurrency
semaphore. The catch is that users are also allowed to query system
tables and when doing so they will bypass the limits intended for user
queries. This has caused performance problems in the past, yet the
reason we decided to finally address this is that we want to introduce a
memory limit for unpaged queries. Internal (system) queries are all
unpaged and we don't want to impose the same limit on them.

This series uses scheduling groups to distinguish user and system
workloads, based on the assumption that user workloads will run in the
statement scheduling group, while system workloads will run in the main
(or default) scheduling group, or perhaps something else, but in any
case not in the statement one. Currently the scheduling group of reads
and writes is lost when going through the messaging service, so to be
able to use scheduling groups to distinguish user and system reads this
series refactors the messaging service to retain this distinction across
verb calls. Furthermore, we execute some system reads/writes as part of
user reads/writes, such as auth and schema sync. These processes are
tagged to run in the main group.
This series also centralises query classification on the replica and
moves it to a higher level. More specifically, queries are now
classified -- the scheduling group they run in is translated to the
appropriate query class specific configuration -- on the database level
and the configuration is propagated down to the lower layers.
Currently this query class specific configuration consists of the reader
concurrency semaphore and the max memory limit for otherwise unlimited
queries. A corollary of the semaphore being selected on the database
level is that the read permit is now created before the read starts. A
valid permit is now available during all stages of the read, enabling
tracking the memory consumption of e.g. the memtable and cache readers.
This change aligns nicely with the needs of more accurate reader memory
tracking, which also wants a valid permit that is available in every layer.

The series can be divided roughly into the following distinct patch
groups:
* 01-02: Give system read concurrency a boost during startup.
* 03-06: Introduce user/system statement isolation to messaging service.
* 07-13: Various infrastructure changes to prepare for using read
  permits in all stages of reads.
* 14-19: Propagate the semaphore and the permit from database to the
  various table methods that currently create the permit.
* 20-23: Migrate away from using the reader concurrency semaphore for
  waiting for admission, use the permit instead.
* 24: Introduce `database::make_query_config()` and switch the database
  methods needing such a config to use it.
* 25-31: Get rid of all uses of `no_reader_permit()`.
* 32-33: Ban empty permits for good.
* 34: querier_cache: use the queriers' permits to obtain the semaphore.

Fixes: #5919

Tests: unit(dev, release, debug),
dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual
testing with a 2 node mixed cluster with extra logging.
"
* 'query-class/v6' of https://github.com/denesb/scylla: (34 commits)
  querier_cache: get semaphore from querier
  reader_permit: forbid empty permits
  reader_permit: fix reader_resources::operator bool
  treewide: remove all uses of no_reader_permit()
  database: make_multishard_streaming_reader: pass valid permit to multi range reader
  sstables: pass valid permits to all internal reads
  compaction: pass a valid permit to sstable reads
  database: add compaction read concurrency semaphore
  view: use valid permits for reads from the base table
  database: use valid permit for counter read-before-write
  database: introduce make_query_class_config()
  reader_concurrency_semaphore: remove wait_admission and consume_resources()
  test: move away from reader_concurrency_semaphore::wait_admission()
  reader_permit: resource_units: introduce add()
  mutation_reader: restricted_reader: work in terms of reader_permit
  row_cache: pass a valid permit to underlying read
  memtable: pass a valid permit to the delegate reader
  table: require a valid permit to be passed to most read methods
  multishard_mutation_query: pass a valid permit to shard mutation sources
  querier: add reader_permit parameter and forward it to the mutation_source
  ...
2020-05-29 10:11:44 +03:00
Raphael S. Carvalho
097a5e9e07 compaction: Disable garbage collected writer if interposer consumer is used
GC writer, used for incremental compaction, cannot be currently used if interposer
consumer is used. That's because compaction assumes that GC writer will be operated
only by a single compaction writer at a given point in time.
With interposer consumer, multiple writers will concurrently operate on the same
GC writer, leading to a race condition which could result in use-after-free.

Let's disable GC writer if interposer consumer is enabled. We're not losing anything
because GC writer is currently only needed on strategies which don't implement an
interposer consumer. Resharding will always disable GC writer, which is the expected
behavior because it doesn't support incremental compaction yet.
The proper fix, which allows GC writer and interposer consumer to work together,
will require more time to implement and test, and for that reason, I am postponing
it as #6472 is a showstopper for the current release.

Fixes #6472.

tests: mode(dev).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200526195428.230472-1-raphaelsc@scylladb.com>
2020-05-29 08:26:43 +02:00
Nadav Har'El
17649ad0b5 alternator: error when unimplemented ConditionalOperator is used
The ScanFilter and QueryFilter features are only partially implemented.
Most of their unimplemented features cause clear errors telling the user
of the unimplemented feature, but one exception is the ConditionalOperator
parameter, which can be used to "OR" several conditions together instead
of the default "AND". Before this patch, we simply ignored this parameter -
causing wrong results to be returned instead of an error.

In this patch, ScanFilter and QueryFilter parse, instead of ignoring, the
ConditionalOperator. The common implementation, get_filtering_restrictions(),
still does not implement the OR case, but returns an error if we reach
this case instead of just ignoring it.

There is no new test. The existing test_query_filter.py::test_query_filter_or
xfailed before this patch, and continues to xfail after it, but the failure
is different (you can see it by running the test with "--runxfail"):
Before this patch, the failure was because of different results. After this
patch, the failure is because of an "unimplemented" error message.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200528214721.230587-2-nyh@scylladb.com>
2020-05-29 08:26:43 +02:00
Nadav Har'El
d200cde9d6 alternator: extract function for parsing ConditionalOperator
The code for parsing the ConditionalOperator attribute was used once
for the "Expected" case, but we will also need it for the "QueryFilter" and
"ScanFilter" cases, so let's extract it into a function,
get_conditional_operator().

While doing this extraction, I also noticed a bug: when Expected is missing,
ConditionalOperator should not be allowed. We correctly checked the case
of an empty Expected, but forgot to also check the case of a missing
Expected. So the new code also fixes this corner case, and we include
a new test case for it (which passes on DynamoDB and used to fail in
Alternator but passes after this patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200528214721.230587-1-nyh@scylladb.com>
2020-05-29 08:26:43 +02:00
Rafael Ávila de Espíndola
aa778ec152 configure: Reduce the dynamic linker path size
gdb has a SO_NAME_MAX_PATH_SIZE of 512, so we use that as the path
size.

Fixes: #6494

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200528202741.398695-2-espindola@scylladb.com>
2020-05-29 08:26:43 +02:00
Rafael Ávila de Espíndola
078c680690 configure: Implement get-dynamic-linker.sh directly in python
For now it produces exactly the same output.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200528202741.398695-1-espindola@scylladb.com>
2020-05-29 08:26:43 +02:00
Rafael Ávila de Espíndola
33e1ee024f configure: Delete old seastar option
The seastar we use doesn't have
-DSeastar_STD_OPTIONAL_VARIANT_STRINGVIEW=ON anymore.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200528203324.400141-1-espindola@scylladb.com>
2020-05-29 08:26:43 +02:00
Glauber Costa
44a0e40cb2 compaction: move compaction_strategy_type to its own header
I just hit a circularity in header inclusion that I traced back to the
fact that schema.hh includes compaction_strategy.hh. schema.hh is in
turn included in lots of places, so a circularity is not hard to come
by.

The schema header really only needs to know about the compaction_type,
so it can inform schema users about it. Following the trend in header
cleanups, I am moving that to a separate header, which will both break
the circularity and make sure we include less stuff that is not
needed.

With this change, Scylla fails to compile due to a new missing forward
declaration at index/secondary_index_manager.hh, so this is fixed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200527172203.915936-1-glauber@scylladb.com>
2020-05-29 08:14:27 +03:00
Piotr Sarna
77e943e9a3 db,views: unify time points used for update generation
Until now, view updates were generated with a bunch of random
time points, because the interface was not adjusted for passing
a single time point. The time points were used to determine
whether cells were alive (e.g. because of TTL), so it's better
to unify the process:
1. when generating view updates from user writes, a single time point
   is used for the whole operation
2. when generating view updates via the view building process,
   a single time point is used for each build step

NOTE: I don't see any reliable and deterministic way of writing
      test scenarios which trigger problems with the old code.
      After #6488 is resolved and error injection is integrated
      into view.cc, tests can be added.

Fixes #6429
Tests: unit(dev)
Message-Id: <f864e965eb2e27ffc13d50359ad1e228894f7121.1590070130.git.sarna@scylladb.com>
2020-05-28 12:56:09 +03:00
Alejo Sanchez
bb08b5ad5a utils: error injections provide error exceptions
Provide non-timeout error exception
to facilitate control flow in injected errors.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-05-28 11:13:55 +02:00
Pavel Solodovnikov
014883d560 failure_injector: implement CQL API for failure injector class
The following UDFs are defined to control failure injector API usage:
 * enable_injection(name, args)
 * disable_injection(name)

All arguments have string type.

As currently function(terminal) is not supported by the parser,
the arguments must come from selected rows.

Added boost test for CQL API.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-05-28 11:13:55 +02:00
Alejo Sanchez
2c7e01a3b6 lwt: fix disabled error injection templates
Fix disabled injection templates to match enabled ones.
Fix corresponding test to not be a continuation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-05-28 11:13:55 +02:00
Botond Dénes
e678f06a5e querier_cache: get semaphore from querier
Currently the `querier_cache` is passed a semaphore during its
construction and it uses this semaphore to do all the inactive reader
registering/unregistering. This is inaccurate as in theory cached reads
could belong to different semaphores (although currently this is not yet
the case). As all queriers store a valid permit now, use this
permit to obtain the semaphore the querier is associated with, and
register the inactive read with this semaphore.
2020-05-28 11:34:35 +03:00
Botond Dénes
3cd2598ab3 reader_permit: forbid empty permits
Remove `no_reader_permit()` and all ways to create empty (invalid)
permits. All permits are guaranteed to be valid now and are only
obtainable from a semaphore.

`reader_permit::semaphore()` now returns a reference, as it is
guaranteed to always have a valid semaphore reference.
2020-05-28 11:34:35 +03:00
Botond Dénes
e40b1fc3c8 reader_permit: fix reader_resources::operator bool 2020-05-28 11:34:35 +03:00
Botond Dénes
d68ac8bf18 treewide: remove all uses of no_reader_permit() 2020-05-28 11:34:35 +03:00
Botond Dénes
0f55e8e30f database: make_multishard_streaming_reader: pass valid permit to multi range reader
The permit is not used by the mutation source passed to it, but empty
permits will soon be forbidden, so we have to pass a valid one.
2020-05-28 11:34:35 +03:00
Botond Dénes
b5aa08ed77 sstables: pass valid permits to all internal reads
We will soon require a valid permit for all reads, including low level
index reads. The sstable layer has several internal reads which can not
be associated with either the user or the system read semaphores or it
would be very hard to obtain the correct semaphore, for limited/no gain.
To be able to pass a valid permit still, we either expose a permit
parameter so upper layers can pass down one, or create a local semaphore
for these reads and use that to obtain a permit.
The following methods now require a permit to be passed to them:
* `sstables::sstable::read_data()`: only used in tests.

The following methods use internal semaphores:
* `sstables::sstable::generate_summary()` used when loading an sstable.
* `sstables::sstable::has_partition_key()`: used by a REST API method.
2020-05-28 11:34:35 +03:00
Botond Dénes
0952a43a9e compaction: pass a valid permit to sstable reads
Use the newly created compaction read concurrency semaphore to create
and pass valid permits to all sstable reads done on behalf of
compaction.
2020-05-28 11:34:35 +03:00
Botond Dénes
734e995639 database: add compaction read concurrency semaphore
All reads will soon require a valid permit, including those done during
compaction. To allow creating valid permits for these reads create a
compaction specific semaphore. This semaphore is unlimited, as compaction
concurrency is managed by a higher-level layer; we use it just for
resource usage accounting.
2020-05-28 11:34:35 +03:00
Botond Dénes
992e697dd5 view: use valid permits for reads from the base table
View update generation involves reading existing values from the base
table, which will soon require a valid permit to be passed to it, so
make sure we create and pass a valid permit to these reads.
We use `database::make_query_class_config()` to obtain the semaphore for
the read which selects the appropriate user/system semaphore based on
the scheduling group the base table write is running in.
2020-05-28 11:34:35 +03:00
Botond Dénes
639bbefcd3 database: use valid permit for counter read-before-write
Counter writes involve a read-before-write, which will soon require a
valid permit to be passed to it, so make sure we create and pass a valid
permit to this read. We use `database::make_query_class_config()` to
obtain the semaphore for the read which selects the appropriate
user/system semaphore based on the scheduling group the counter write is
running in.
2020-05-28 11:34:35 +03:00
Botond Dénes
e4c591aa67 database: introduce make_query_class_config()
And use it to obtain any query-class specific configuration that was
obtained from `table::config` before, such as the read concurrency
semaphore and the max memory limit for unlimited queries. As all users
of these items get these from the query class config now, we can remove
them from `table::config`.
2020-05-28 11:34:35 +03:00
Botond Dénes
f417b9a3ea reader_concurrency_semaphore: remove wait_admission and consume_resources()
Permits are now created with `make_permit()` and code is using the
permit to do all resource consumption tracking and admission waiting, so
we can remove these from the semaphore. This allows us to remove some
now unused code from the permit as well, namely the `base_cost` which
was used to track the resource amount the permit was created with. Now
this amount is also tracked with a `resource_units` RAII object, returned
from `reader_permit::wait_admission()`, so it can be removed. Curiously,
this reduces the reader permit to a glorified semaphore pointer. Still,
the permit abstraction is worth keeping, because it allows us to make
changes to how the resource tracking part of the semaphore works,
without having to change the huge amount of code sites passing around
the permit.
2020-05-28 11:34:35 +03:00
Botond Dénes
a08467da29 test: move away from reader_concurrency_semaphore::wait_admission()
And use the reader_permit for this instead. This refactoring has
revealed a pre-existing bug in the `test_lifecycle_policy`, which is
also addressed in this patch. The bug is that said policy executes
reader destructions in the background, and these are not waited for. For
some reason, the semaphore -> permit transition pushes these races over
the edge and we start seeing some of these destruction fibers still
being unfinished when test scopes are exited, causing all sorts of
trouble. The solution is to introduce a special gate that tests can use
to wait for all background work to finish, before the test scope is
exited.
2020-05-28 11:34:35 +03:00
Botond Dénes
bf4ade8917 reader_permit: resource_units: introduce add()
Allows merging two resource_units into one.
2020-05-28 11:34:35 +03:00
Botond Dénes
4409579352 mutation_reader: restricted_reader: work in terms of reader_permit
We want to refactor all read resource tracking code to work through the
reader_permit, so refactor the restricted reader to also do so.
2020-05-28 11:34:35 +03:00
Botond Dénes
fe024cecdc row_cache: pass a valid permit to underlying read
All readers are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the underlying reader when
creating it. This means `row_cache::make_reader()` now also requires
a permit to be passed to it.
2020-05-28 11:34:35 +03:00
Botond Dénes
9ede82ebf8 memtable: pass a valid permit to the delegate reader
All readers are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
2020-05-28 11:34:35 +03:00
Botond Dénes
cc5137ffe3 table: require a valid permit to be passed to most read methods
Now that the most prevalent users (range scan and single partition
reads) all pass valid permits we require all users to do so and
propagate the permit down towards `make_sstable_reader()`. The plan is
to use this permit for restricting the sstable readers, instead of the
semaphore the table is configured with. The various
`make_streaming_*reader()` overloads keep using the internal semaphores
as before, but they also create the permit before the read starts and pass it to
`make_sstable_reader()`.
2020-05-28 11:34:35 +03:00
Botond Dénes
d5ebd763ff multishard_mutation_query: pass a valid permit to shard mutation sources
In preparation of a valid permit being required to be passed to all
mutation sources, create a permit before creating the shard readers and
pass it to the mutation source when doing so. The permit is also
persisted in the `shard_mutation_querier` object when saving the reader,
which is another forward looking change, to allow the querier-cache to
use it to obtain the semaphore the read is actually registered with.
2020-05-28 11:34:35 +03:00
Botond Dénes
bad53c4245 querier: add reader_permit parameter and forward it to the mutation_source
In preparation of a valid permit being required to be passed to all
mutation sources, also add a permit to the querier object, which is then
passed to the source when it is used to create a reader.
2020-05-28 11:34:35 +03:00
Botond Dénes
14743c4412 data_query, mutation_query: use query_class_config
We want to move away from the current practice of selecting the relevant
read concurrency semaphore inside `table` and instead want to pass it
down from `database` so that we can pass down a semaphore that is
appropriate for the class of the query. Use the recently created
`query_class_config` struct for this. This is added as a parameter to
`data_query`, `mutation_query` and propagated down to the point where we
create the `querier` to execute the read. We are already propagating
a parameter down the same route -- max_memory_reverse_query --
which also happens to be part of `query_class_config`, so simply replace
this parameter with a `query_class_config` one. As the lower layers are
not prepared for a semaphore passed from above, make sure this semaphore
is the same one that is selected inside `table`. After the lower layers are
prepared for a semaphore arriving from above, we will switch it to be
the appropriate one for the class of the query.
2020-05-28 11:34:35 +03:00
Botond Dénes
0ee58d1d47 test: lib/reader_permit.hh: add make_query_class_config()
To be used by tests to obtain a query_class_config to pass to APIs that
require one. The class config contains the test semaphore.
2020-05-28 11:34:35 +03:00
Botond Dénes
308a162247 Introduce query_class_config
This struct will serve as a container of all the query-class
dependent configuration such as the semaphore to be used and the memory
limit for unlimited queries. As there is no good place to put this, we
create a separate header for it.
2020-05-28 11:34:35 +03:00
Botond Dénes
0b4ec62332 flat_mutation_reader: flat_multi_range_reader: add reader_permit parameter
Mutation sources will soon require a valid permit so make sure we have
one and pass it to the mutation sources when creating the underlying
readers.
For now, pass no_reader_permit() on call sites, deferring the obtaining
of a valid permit to later patches.
2020-05-28 11:34:35 +03:00
Botond Dénes
97af2d98d2 test: lib: introduce reader_permit.{hh,cc}
This contains a reader concurrency semaphore for the tests, that they
can use to obtain a valid permit for reads. Soon we are going to start
working towards a point where all APIs taking a permit will require a
valid one. Before we start this work we must ensure test code is able to
obtain a valid permit.
2020-05-28 11:34:35 +03:00
Botond Dénes
4d7250d12b reader_permit: add wait_admission
We want to make `reader_permit` the single interface through which reads
interact with the concurrency limiting mechanism. So far it was only
usable to track memory consumption. Add the missing `wait_admission()`
and `consume_resources()` to the permit API. As opposed to
`reader_concurrency_semaphore::` equivalents which returned a
permit, the `reader_permit::` variants jut return
`reader_permit::resource_units` which is an RAII holder for the acquired
units. This also allows for the permit to be created earlier, before the
reader is admitted, allowing for tracking pre-admission memory usage as
well. In fact this is what we are going to do in the next patches.

This patch also introduces a `broken()` method on the reader concurrency
semaphore which resolves waiters with an exception. This method is also
called internally from the semaphore's destructor. This is needed
because the semaphore can now have external waiters, which have to be
resolved before the semaphore itself is destroyed.
2020-05-28 11:34:35 +03:00
Botond Dénes
bd793d6e19 reader_permit: resource_units: work in terms of reader_resources
Refactor resource_units semantically as well to work in terms of
reader_resources, instead of just memory.
2020-05-28 11:34:35 +03:00
Botond Dénes
0f9c24631a reader_permit: s/memory_units/resource_units/
We want to refactor reader_permit::memory_units to work in terms of
reader_resources, as we are planning to use it for guarding count
resources as well. This patch makes the first step: renames it from
memory_units to resource_units. Since this is a very noisy change, we
do it in a separate patch, the semantic change is in the next patch.
2020-05-28 11:34:35 +03:00
Botond Dénes
16d8cdadc9 messaging_service: introduce the tenant concept
Tenants get their own connections for statement verbs and are further
isolated from each other by different scheduling groups. A tenant is
identified by a scheduling group and a name. When selecting the client
index for a statement verb, we look up the tenant whose scheduling group
matches the current one. This scheduling group is persisted across the
RPC call, using the name to identify the tenant on the remote end, where
a reverse lookup (name -> scheduling group) happens.

Instead of a single scheduling group to be used for all statement verbs,
messaging_service::scheduling_config now contains a list of tenants. The
first among these is the default tenant, the one we use when the current
scheduling group doesn't match that of any configured tenant.
To make this mapping easier, we reshuffle the client index assignment,
such that the statement and statement-ack verbs have indexes 2 and 3
respectively, instead of 0 and 3.

The tenant configuration is configured at message service construction
time and cannot be changed after. Adding such capability should be easy
but is not needed for query classification, the current user of the
tenant concept.

Currently two tenants are configured: $user (default tenant) and
$system.
2020-05-28 11:34:32 +03:00
Avi Kivity
db8974fef3 messaging_service: de-static-ify _scheduling_info_for_connection_index
Per-user SLA means we have connection classifications determined dynamically,
as SLAs are added or removed. This means the classification information cannot
be static.

Fix by making it a non-static vector (instead of a static array), allowing it
to be extended. The scheduling group member pointer is replaced by a scheduling
group as a member pointer won't work anymore - we won't have a member to refer
to.
2020-05-28 10:40:08 +03:00
Avi Kivity
10dd08c9b0 messaging_service: supply and interpret rpc isolation_cookies
On the client side, we supply an isolation cookie based on the connection index
On the server side, we convert an isolation cookie back to a scheduling_group.

This has two advantages:
 - rpc processes the entire connection using the scheduling group, so that code
   is also isolated and accounted for
 - we can later add per-user connections; the previous approach of looking at the
   verb to decide the scheduling_group doesn't help because we don't have a set of
   verbs per user

With this, the main group sees <0.1% usage under simple read and write loads.
2020-05-28 10:40:08 +03:00
Avi Kivity
dbce57fa3c messaging_service: extract connection_index -> scheduling_group translation
Move it from a function-local static to a class static variable. We will want
to extend it in two ways:
 - add more information per connection index (like the rpc isolation cookie)
 - support adding more connections for per-user SLA

As a first step, make it an array of structures and make it accessible to all
of messaging_service.
2020-05-28 10:40:08 +03:00
Botond Dénes
e0b98ba921 database: give system reads a concurrency boost during startup
In the next patches we will match reads to the appropriate reader
concurrency semaphore based on the scheduling group they run in. This
will result in a lot of system reads that are executed during startup
and that were up to now (incorrectly) using the user read semaphore to
switch to the system read semaphore. This latter has a much more
constrained concurrency, which was observed to cause system reads to
saturate and block on the semaphore, slowing down startup.
To solve this, boost the concurrency of the system read semaphore during
startup to match that of the user semaphore. This is ok, as during
startup there are no user reads to compete with. After startup, before
we start serving user reads the concurrency is reverted back to the
normal value.
2020-05-28 10:40:08 +03:00
Botond Dénes
521342f827 reader_concurrency_semaphore: expose signal/consume
To allow the amount of available resource to be adjusted after creation.
2020-05-28 10:40:08 +03:00
Pavel Solodovnikov
d7fb51a094 cql3: remove unused functions get_stored_prepared_statement*
These functions are not used anywhere, so no reason to keep them
around.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200527172723.409019-1-pa.solodovnikov@scylladb.com>
2020-05-28 09:09:48 +02:00
Israel Fruchter
e5da79c211 scylla_current_repo: support different $PRODUCT
Support pointing to the correct download URL for
different Scylla products
2020-05-28 09:03:16 +03:00
Avi Kivity
9c26bdf944 Update seastar submodule
* seastar 37774aa78...c97b05b23 (13):
  > test: futures: test async with throw_on_move arg
  > Merge 'fstream: close file if construction fails' from Botond
  > util: tmp_file: include <seastar/core/thread.hh>
  > test: file_io: test_file_stat_method: convert to use tmp_dir
  > reactor: don't mlock all memory at once
  > future: specify uninitialized_wrapper_base default constructors as noexcept
  > test: tls: ignore gate_closed_exception
  > rpc: recv_helper: ignore gate_closed_exception when replying to oversized requests
  > sharded: support passing arbitrary shard-dependent parameters to service constructors
  > Update circleci configuration for C++20
  > treewide: deprecate seastar::apply()
  > Update README.md about c++ versions
  > cmake: Remove Seastar_STD_OPTIONAL_VARIANT_STRINGVIEW
2020-05-28 06:34:02 +03:00
Rafael Ávila de Espíndola
f274148be9 configure: Use seastar's api v2
No change right now as that is the current api version on the seastar
we have, but being explicit will let us upgrade seastar and change the
api independently.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200527235211.301654-1-espindola@scylladb.com>
2020-05-28 06:33:51 +03:00
Nadav Har'El
cd0fbb8d38 alternator test: add comprehensive tests for QueryFilter feature
The QueryFilter parameter of Query is only partially implemented (issue
#5028), and until now we had no comprehensive
tests for it.

In this patch, we add comprehensive tests for this feature and all its
various operators, types, and corner cases. The tests cover both the
parts we already implemented, and the parts we did not yet.

As usual, all tests succeed on DynamoDB, but many still xfail on Alternator
pending the complete implementation.

Refs #5028.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200525141242.133710-1-nyh@scylladb.com>
2020-05-27 15:29:27 +02:00
Avi Kivity
829e2508d0 logalloc: fix entropy depletion in test_compaction_with_multiple_regions()
test_compaction_with_multiple_regions() has two calls to std::shuffle(),
one using std::default_random_engine() as the PRNG, but the other, later
on, using the std::random_device directly. This can cause failures due to
entropy pool exhaustion.

Fix by making the `random` variable refer to the PRNG, not the random_device,
and adjust the first std::shuffle() call. This hides the random_device so
it can't be used more than once.

Message-Id: <20200527124247.2187364-1-avi@scylladb.com>
2020-05-27 15:51:16 +03:00
Botond Dénes
3f1823a4f0 multishard_mutation_query_test: don't use boost test macros in multiple shards
Boost test macros are not safe to use in multiple shards (threads).
Doing so will result in their output being interwoven, making it
unreadable and generating invalid XML test reports. There was a lot of
back-and-forth on how to solve this, including introducing thread-safe
wrappers of the boost test macros, that use locks. This patch does
something much simpler: it defines a bunch of replacement utility
functions for the used macros. These functions use the thread safe
seastar logger to log messages and throw exceptions when the
test has to be failed, which is pretty much what boost test does too.
With this the previously seen complaint about invalid XML is gone.

Example log messages from the utility functions:
DEBUG 2020-05-27 13:32:54,248 [shard 1] testlog - check_equal(): OK @ validate_result() test/boost/multishard_mutation_query_test.cc:863: ckp{0004fe57c8d2} == ckp{0004fe57c8d2}
DEBUG 2020-05-27 13:32:54,248 [shard 1] testlog - require(): OK @ validate_result() test/boost/multishard_mutation_query_test.cc:855

Fixes: #4774

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200527104426.176342-1-bdenes@scylladb.com>
2020-05-27 15:50:05 +03:00
Botond Dénes
caf21d7db9 test.py: disable boost test's colored output when stdout is not a tty
Boost test uses colored output by default, even when the output of the
test is redirected to a file. This makes the output quite hard to read
for example in Jenkins. This patch fixes this by disabling the colored
output when stdout is not a tty. This is in line with the colored output
of configure.py itself, which is also enabled only if stdout is a tty.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200526112857.76131-1-bdenes@scylladb.com>
2020-05-27 14:20:12 +03:00
Avi Kivity
bdb5b11d19 treewide: stop using deprecated seastar::apply()
seastar::apply() is deprecated in recent versions of seastar in favor
of std::apply(), so stop including its header. Calls to unqualified
apply(..., std::tuple<>) are resolved to std::apply() by argument
dependent lookup, so no changes to call sites are necessary.

This avoids a huge number of deprecation warnings with latest seastar.
Message-Id: <20200526090552.1969633-1-avi@scylladb.com>
2020-05-27 14:07:35 +03:00
Nadav Har'El
51adaea499 alternator: use C++20 std::string_view::starts_with()
We had to wait many years for it, but finally we have a starts_with()
method in C++20. Let's use it instead of ugly substr()-based code.

This is probably not a performance gain - substr() for a string_view
was already efficient. But it makes the code easier to understand,
and it allows us to rejoice in our decision to switch to C++20.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200526185812.165038-2-nyh@scylladb.com>
2020-05-27 08:14:12 +02:00
Nadav Har'El
b2ca7f6fc0 alternator: another check base64 begins_with without decoding
In commit cb7d3c6b55 we started to check
if two base64-encoded strings begin with each other without decoding
the strings first.

However, we missed the check_BEGINS_WITH function which does the same
thing. So this patch fixes this function as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200526185812.165038-1-nyh@scylladb.com>
2020-05-27 08:13:59 +02:00
Pekka Enberg
8721534dfb Merge "tests: avoid exhausting random_device entropy" from Avi
"
In several tests we were calling random_device::operator() in a tight
loop. This is a slow operation, and in gcc 10 can fail if called too
frequently due to a bug [1].

Change to use a random_engine instead, seeded once from the
random_device.

Tests: unit (dev)

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
"

* 'entropy' of git://github.com/avikivity/scylla:
  tests: lsa_sync_eviction_test: don't exhaust random number entropy
  tests: querier_cache_test: don't exhaust random number entropy
  tests: loading_cache_test: don't exhaust random number entropy
  tests: dynamic_bitset_test: don't exhaust random number entropy
2020-05-27 08:40:06 +03:00
Botond Dénes
838b92f4b0 idl-compiler.py: don't use 'is not' for string comparison
In Python, `is` and `is not` check object identity, not value
equivalence, yet in `idl-compiler.py` it is used to compare strings.
Newer Python versions (like the one shipped in Fedora 32) complain about this
misuse, so this patch fixes it.

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200526091811.50229-1-bdenes@scylladb.com>
2020-05-27 08:40:05 +03:00
Avi Kivity
427398641a build: switch C++ dialect to C++20
This gives us access to std::ranges, the spaceship operator, and more.

Note coroutines are not yet enabled (these require g++ -fcoroutines) as
we are still working on our problem with address sanitizer support.

Tests: unit (dev, debug, release)
Message-Id: <20200521092157.1460983-1-avi@scylladb.com>
2020-05-27 08:40:05 +03:00
Nadav Har'El
c3da9f2bd4 alternator: add mandatory configurable write isolation mode
Alternator supports four ways in which write operations can use quorum
writes or LWT or both, which we called "write isolation policies".

Until this patch, Alternator defaulted to the most generally safe policy,
"always_use_lwt". This default could have been overriden for each table
separately, but there was no way to change this default for all tables.
This patch adds a "--alternator-write-isolation" configuration option which
allows changing the default.

Moreover, @dorlaor asked that users must *explicitly* choose this default
mode, and not get "always_use_lwt" without noticing. The previous default,
"always_use_lwt" supports any workload correctly but because it uses LWT
for all writes it may be disappointingly slow for users who run write-only
workloads (including most benchmarks) - such users might find the slow
writes so disappointing that they will drop Scylla. Conversely, a default
of "forbid_rmw" will be faster and still correct, but will fail on workloads
which need read-modify-write operations - and suprise users that need these
operations. So Dor asked that that *none* of the write modes be made the
default, and users must make an informed choice between the different write
modes, rather than being disappointed by a default choice they weren't
aware of.

So after this patch, Scylla refuses to boot if Alternator is enabled but
a "--alternator-write-isolation" option is missing.

The patch also modifies the relevant documentation, adds the same option to
our docker image, and modifies the test-running script
test/alternator/run to run Scylla with the old default mode (always_use_lwt),
which we need because we want to test RMW operations as well.

Fixes #6452

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524160338.108417-1-nyh@scylladb.com>
2020-05-27 08:40:05 +03:00
Kamil Braun
7a98db2ab3 cdc: set ttl column in log rows which update only collections 2020-05-27 08:40:05 +03:00
Tomasz Grabiec
1424543e11 Merge "Move sstables_format on sstable_manager" from Pavel Emelyanov
The format is currently sitting in storage_service, but the
previous set patched all the users not to call it, instead
they use sstables_manager to get the highest supported format.
So this set finalizes this effort and places the format on
sstables_manager(s).

The set introduces the db::sstables_format_selector, that

 - starts with the lowest format (ka)
 - reads one on start from system tables
 - subscribes on sstables-related features and bumps
   up the selection if the respective feature is enabled

During its lifetime the selector holds a reference to the
sharded<database> and updates the format on it; the database,
in turn, propagates it further to sstables_managers. The
managers start with the highest known format (mc), which is
done for the benefit of tests.

* https://github.com/xemul/scylla br-move-sstables-format-4:
  storage_service: Get rid of one-line helpers
  system_keyspace: Cleanup setup() from storage_service
  format_selector: Log which format is being selected
  sstables_manager: Keep format on
  format_selector: Make it standalone
  format_selector: Move the code into db/
  format_selector: Select format locally
  storage_service: Introduce format_selector
  storage_service: Split feature_enabled_listener::on_enabled
  storage_service: Tossing bits around
  features: Introduce and use masked features
  features: Get rid of per-features booleans
2020-05-27 08:40:05 +03:00
Gleb Natapov
e3ff88e674 lwt: prune system.paxos table when quorum of replicas learned the value
Instead of waiting for all replicas to reply execute prune after quorum
of replicas. This will keep system.paxos smaller in the case where one
node is down.

Fixes #6330

Message-Id: <20200525110822.GC233208@scylladb.com>
2020-05-27 08:40:05 +03:00
Piotr Sarna
ca2b96661d Update seastar submodule
* seastar ee516b1c...37774aa7 (12):
  > task: specify the default constructor as noexcept
  > scheduling: scheduling_group: specify explicit constructor as noexcept
  > net: tcp: use var after std::move()ed
  > future: implement make_exception_future_with_backtrace
  > future: Add noexcept to a few functions
  > scheduling: Add noexcept to a couple of functions
  > future: Move current_exception_as_future out of internal
  > future: Avoid a call to std::current_exception
  > seastar.hh: fix typo in doxygen main page text
  > future: Replace a call to futurize_apply with futurize_invoke
  > rpc: document how isolation work
  > future: Optimize any::move_it
2020-05-27 08:40:05 +03:00
Raphael S. Carvalho
9ebf7b442e timestamp_based_splitting_writer: fix use-after-move look-alike
rt is moved before rt.tomb.timestamp is retrieved, so there's
something that looks like use-after-move here (but really isn't).

Found it while auditing the code.

[avi: adjusted changelog to note that it's not really a use-after-move]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200525141047.168968-1-raphaelsc@scylladb.com>
2020-05-27 08:40:05 +03:00
Nadav Har'El
2eb929e89b merge: Allow our users to shoot themselves in their feet
Merge pull request https://github.com/scylladb/scylla/pull/6484 by
Kamil Braun:

Allow a node to join without bootstrapping, even if it couldn't contact
other nodes.

Print a BIG WARNING saying that you should never join nodes without
bootstrapping (by marking it as a seed or using auto_bootstrap=off).

Only the very first node should (must) be joined as a seed.

If you want to have more seeds, first join them using the only supported
way (i.e. bootstrap them), and only AFTER they have bootstrapped, change
their configuration to include them in the seed list.

Does not fix, but closes #6005. Read the discussion: it's enlightening.
See scylladb/scylla-docs#2647 for the correct procedure of joining a node.

Reverts 7cb6ac3.
2020-05-27 08:40:05 +03:00
Nadav Har'El
b12265c2d5 alternator test: improve FilterExpression tests for "contains()"
The tests for the contains() operator of FilterExpression were based on
an incorrect understanding of what this operator does. Because the tests
were (as usual) run against DynamoDB and passed, there was nothing wrong
in the test per se - but it contains comments based on the wrong
understanding, and also various corner cases which aren't as interesting
as I thought (and vice versa - missed interesting corner cases).

All these tests continue to pass on DynamoDB, and xfail on Alternator
(because we didn't implement FilterExpression yet).

Refs #5038.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200525123812.131209-1-nyh@scylladb.com>
2020-05-27 08:40:05 +03:00
Avi Kivity
8d27e1b4a9 Merge 'Propagate tracing to materialized view update path' from Piotr S
In order to improve materialized views' debuggability, tracing points are added to view update generation path.

Example trace:
```
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                                                               Execute CQL3 query | 2020-04-27 13:13:46.834000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                    Parsing a statement [shard 0] | 2020-04-27 13:13:46.834346 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                                                                 Processing a statement [shard 0] | 2020-04-27 13:13:46.834426 | 127.0.0.1 |             80 | 127.0.0.1
                                                                     Creating write handler for token: -3248873570005575792 natural: {127.0.0.1, 127.0.0.3} pending: {} [shard 0] | 2020-04-27 13:13:46.834494 | 127.0.0.1 |            148 | 127.0.0.1
                                                                                                      Creating write handler with live: {127.0.0.3, 127.0.0.1} dead: {} [shard 0] | 2020-04-27 13:13:46.834507 | 127.0.0.1 |            161 | 127.0.0.1
                                                                                                                                       Sending a mutation to /127.0.0.3 [shard 0] | 2020-04-27 13:13:46.834519 | 127.0.0.1 |            173 | 127.0.0.1
                                                                                                                                           Executing a mutation locally [shard 0] | 2020-04-27 13:13:46.834532 | 127.0.0.1 |            186 | 127.0.0.1
                                                                                         View updates for ks.t require read-before-write - base table reader is created [shard 0] | 2020-04-27 13:13:46.834570 | 127.0.0.1 |            224 | 127.0.0.1
        Reading key {{-3248873570005575792, pk{000400000002}}} from sstable /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db [shard 0] | 2020-04-27 13:13:46.834608 | 127.0.0.1 |            262 | 127.0.0.1
                           /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Index.db: scheduling bulk DMA read of size 8 at offset 0 [shard 0] | 2020-04-27 13:13:46.834635 | 127.0.0.1 |            289 | 127.0.0.1
  /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Index.db: finished bulk DMA read of size 8 at offset 0, successfully read 8 bytes [shard 0] | 2020-04-27 13:13:46.834975 | 127.0.0.1 |            629 | 127.0.0.1
                                                                                                                                       Message received from /127.0.0.1 [shard 0] | 2020-04-27 13:13:46.834988 | 127.0.0.3 |             11 | 127.0.0.1
                           /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db: scheduling bulk DMA read of size 41 at offset 0 [shard 0] | 2020-04-27 13:13:46.835015 | 127.0.0.1 |            669 | 127.0.0.1
                                                                                         View updates for ks.t require read-before-write - base table reader is created [shard 0] | 2020-04-27 13:13:46.835020 | 127.0.0.3 |             44 | 127.0.0.1
                                                                                                                                      Generated 1 view update mutations [shard 0] | 2020-04-27 13:13:46.835080 | 127.0.0.3 |            104 | 127.0.0.1
               Sending view update for ks.t_v2_idx_index to 127.0.0.2, with pending endpoints = {}; base token = -3248873570005575792; view token = 3728482343045213994 [shard 0] | 2020-04-27 13:13:46.835095 | 127.0.0.3 |            119 | 127.0.0.1
                                                                                                                                       Sending a mutation to /127.0.0.2 [shard 0] | 2020-04-27 13:13:46.835105 | 127.0.0.3 |            129 | 127.0.0.1
                                                                                                                    View updates for ks.t were generated and propagated [shard 0] | 2020-04-27 13:13:46.835117 | 127.0.0.3 |            141 | 127.0.0.1
 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db: finished bulk DMA read of size 41 at offset 0, successfully read 41 bytes [shard 0] | 2020-04-27 13:13:46.835160 | 127.0.0.1 |            813 | 127.0.0.1
                                                                                                                                    Sending mutation_done to /127.0.0.1 [shard 0] | 2020-04-27 13:13:46.835164 | 127.0.0.3 |            188 | 127.0.0.1
                                                                                                                                              Mutation handling is done [shard 0] | 2020-04-27 13:13:46.835177 | 127.0.0.3 |            201 | 127.0.0.1
                                                                                                                                      Generated 1 view update mutations [shard 0] | 2020-04-27 13:13:46.835215 | 127.0.0.1 |            869 | 127.0.0.1
                                                Locally applying view update for ks.t_v2_idx_index; base token = -3248873570005575792; view token = 3728482343045213994 [shard 0] | 2020-04-27 13:13:46.835226 | 127.0.0.1 |            880 | 127.0.0.1
                                                                                            Successfully applied local view update for 127.0.0.1 and 0 remote endpoints [shard 0] | 2020-04-27 13:13:46.835253 | 127.0.0.1 |            907 | 127.0.0.1
                                                                                                                    View updates for ks.t were generated and propagated [shard 0] | 2020-04-27 13:13:46.835256 | 127.0.0.1 |            910 | 127.0.0.1
                                                                                                                                         Got a response from /127.0.0.1 [shard 0] | 2020-04-27 13:13:46.835274 | 127.0.0.1 |            928 | 127.0.0.1
                                                                                                           Delay decision due to throttling: do not delay, resuming now [shard 0] | 2020-04-27 13:13:46.835276 | 127.0.0.1 |            930 | 127.0.0.1
                                                                                                                                        Mutation successfully completed [shard 0] | 2020-04-27 13:13:46.835279 | 127.0.0.1 |            933 | 127.0.0.1
                                                                                                                                   Done processing - preparing a result [shard 0] | 2020-04-27 13:13:46.835286 | 127.0.0.1 |            941 | 127.0.0.1
                                                                                                                                       Message received from /127.0.0.3 [shard 0] | 2020-04-27 13:13:46.835331 | 127.0.0.2 |             14 | 127.0.0.1
                                                                                                                                    Sending mutation_done to /127.0.0.3 [shard 0] | 2020-04-27 13:13:46.835399 | 127.0.0.2 |             82 | 127.0.0.1
                                                                                                                                              Mutation handling is done [shard 0] | 2020-04-27 13:13:46.835413 | 127.0.0.2 |             96 | 127.0.0.1
                                                                                                                                         Got a response from /127.0.0.2 [shard 0] | 2020-04-27 13:13:46.835639 | 127.0.0.3 |            662 | 127.0.0.1
                                                                                                           Delay decision due to throttling: do not delay, resuming now [shard 0] | 2020-04-27 13:13:46.835640 | 127.0.0.3 |            664 | 127.0.0.1
                                                                                                  Successfully applied view update for 127.0.0.2 and 1 remote endpoints [shard 0] | 2020-04-27 13:13:46.835649 | 127.0.0.3 |            673 | 127.0.0.1
                                                                                                                                         Got a response from /127.0.0.3 [shard 0] | 2020-04-27 13:13:46.835841 | 127.0.0.1 |           1495 | 127.0.0.1
                                                                                                                                                                 Request complete | 2020-04-27 13:13:46.834944 | 127.0.0.1 |            944 | 127.0.0.1
```

Fixes #6175
Tests: unit(dev), manual

* psarna-propagate_tracing_to_more_write_paths:
  db,view: add tracing to view update generation path
  treewide: propagate trace state to write path
2020-05-27 08:40:05 +03:00
Takuya ASADA
287d6e5ece dist/debian: drop dependency on pystache
Same as 9d91ac345a, drop dependency on pystache
since it is no longer present in Fedora 32.

To implement this, the Debian package build process was simplified.
The debian/ directory is now generated when building the relocatable
package; we just need to run debuild using that package.

To generate the debian/ directory, this commit adds debian_files_gen.py,
which constructs the whole directory, including the control and changelog
files, from template files.
Since we need to stop using pystache, these template files switched to
the string.Template class, which is included in the Python 3 standard library.

see: https://github.com/scylladb/scylla/pull/6313
2020-05-27 08:40:05 +03:00
Amnon Heiman
3e5beba403 estimated_histogram: clean if0 and FIXME
This patch cleans the estimated histogram implementation.
It removes the FIXMEs that were left in the code from the migration
period, as well as the #if 0 commented-out code.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-05-27 08:40:05 +03:00
Avi Kivity
3ead6feaf0 tests: lsa_sync_eviction_test: don't exhaust random number entropy
We call shuffle() with a random_device, extracting a true random
number in each of the many calls shuffle() will invoke.
Change it to use a random_engine seeded by a random_device.

This avoids exhausting entropy, see [1] for details.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
2020-05-26 20:51:38 +03:00
Avi Kivity
11698aafc1 tests: querier_cache_test: don't exhaust random number entropy
rand_int() re-creates a random device each time it is called.
Change it to use a static random_device, and get random numbers
from a random_engine instead of from the device directly.

This avoids exhausting entropy, see [1] for details.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
2020-05-26 20:51:16 +03:00
Avi Kivity
e2f4c689b1 tests: loading_cache_test: don't exhaust random number entropy
rand_int() re-creates a random device each time it is called.
Change it to use a static random_device, and get random numbers
from a random_engine instead of from the device directly.

This avoids exhausting entropy, see [1] for details.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
2020-05-26 20:49:58 +03:00
Avi Kivity
85da266cf4 tests: dynamic_bitset_test: don't exhaust random number entropy
tests_random_ops() extracts a real random number from a random_device.
Change it to use a random number engine.

This avoids exhausting entropy, see [1] for details.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087
2020-05-26 20:46:45 +03:00
Pavel Emelyanov
ccdee822e1 storage_service: Get rid of one-line helpers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 14:17:31 +03:00
Pavel Emelyanov
3c2066bd78 system_keyspace: Cleanup setup() from storage_service
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 14:17:31 +03:00
Pavel Emelyanov
0598b3a858 format_selector: Log which format is being selected
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 14:17:31 +03:00
Pavel Emelyanov
89a1b09214 sstables_manager: Keep format on
Make the database be the format_selector target, so that
when the format is selected it is set on the database, which
in turn just forwards the selection to the sstables
managers. All users of the format are already patched
to read it from those managers.

The initial value for the format is the highest one, which
is needed by tests. When Scylla starts, the format is
updated by format_selector, first after reading from
system tables, then by selecting it from features.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 14:17:28 +03:00
Pavel Emelyanov
a61f18ed64 format_selector: Make it standalone
Remove the selector from storage_service and introduce
an instance in main.cc that starts soon after the gossiper
and feature_service, starts listening for features and
sets the selected format on storage_service.

This change includes

- Removal of for_testing bit from format_selector constructor,
  now tests just do not use it
- Adding a gate to selection routine to make sure on exit all
  the selection stuff is done. Although before the cluster join
  the selector waits for the feature listeners to finish (the
  .sync() method) this gate is still required to handle aborted
  start cases and wait for gossiper announcement from selector
  to complete.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 14:15:04 +03:00
Pavel Emelyanov
1692d94c9a format_selector: Move the code into db/
This is just move, no changes in code logic.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 14:09:24 +03:00
Pavel Emelyanov
f13078ce80 format_selector: Select format locally
Now format_selector uses storage_service as the place to
keep the selected format. Change this by keeping the
selected format on the selector itself and, after selection,
updating it on the target.

The selector starts with the lowest format and may bump
it up later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:43:47 +03:00
Pavel Emelyanov
5eb37c3743 storage_service: Introduce format_selector
The final goal is to have an entity that will

- read the saved sstables format (if any)
- listen for sstables format related features enabling
- select the top-most format
- put the selected format onto a "target"
- spread the word about it (via gossiper)

The target is the service from which the selected format is
read (so the selector can be removed once features agreement
is reached). Today it's the storage_service, but at the end
of this series it will be sstables_manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:27:34 +03:00
Pavel Emelyanov
833aa91f77 storage_service: Split feature_enabled_listener::on_enabled
The split is into two parts; the goal is to move the second one (the
selection logic itself) into another class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:24:10 +03:00
Pavel Emelyanov
70391feb8e storage_service: Tossing bits around
The goal is to have main.cc add code between prepare_to_join
and join_token_ring. As a side effect this drives us closer
to a proper split of storage_service into the sharded service itself
vs. the start/boot/join code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:21:08 +03:00
Pavel Emelyanov
d53d2bb664 features: Introduce and use masked features
Nowadays the knowledge about known/supported features is
scattered between feature_service and storage_service. The
latter uses knowledge about the selected _sstables_format
to alter the "supported" set.

Encapsulate this knowledge inside the feature_service with
the help of "masked_features" -- those that shouldn't be
advertised to other nodes. The only maskable feature
today is UNBOUNDED_RANGE_TOMBSTONES. Nowadays it's
reported as supported only if the sstables format is MC.
With this patch it starts as masked and gets unmasked when
the sstables format is selected to be MC, so the change is
correct.

This will make it possible to move sstables_format from
storage service to anywhere else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:21:07 +03:00
Pavel Emelyanov
bb3a71529a features: Get rid of per-features booleans
The set of boolean enable_something-s on feature_config duplicates
the disabled_features set on it, so remove the former and make
full use of the latter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:09:12 +03:00
Nadav Har'El
bf7b5a0a0d alternator test: add tests for Query's KeyConditions
We had a very limited set of tests for the KeyConditions feature of
Query, which missed some error cases as well as important use cases (such as
bytes keys), leading to bugs #6490 and #6495 remaining undiscovered.

This patch adds a comprehensive test for the KeyConditions and (hopefully)
all its different combinations of operators, types, and many cases of errors.

We already had a comprehensive test suite for the newer
KeyConditionsExpression syntax, and this patch brings a similar level of
coverage for the older KeyConditions syntax.

Refs #6490
Refs #6495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-3-nyh@scylladb.com>
2020-05-25 09:59:06 +02:00
Nadav Har'El
f2eab853a5 alternator: improve Query's KeyConditions error message
Improve error messages coming from Query's KeyCondition parameter when
wrong ComparisonOperators were used (issue discovered by @Orenef11).

At one point the error message was missing a parameter, so it resulted in an
internal error, while in another place the message mentioned an unhelpful
number (the enum value) for the operator instead of its name. This patch
fixes these error messages.

Fixes #6490

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-2-nyh@scylladb.com>
2020-05-25 09:59:00 +02:00
Nadav Har'El
6b38126a8f alternator: fix support for bytes type in Query's KeyConditions
Our parsing of values in a KeyConditions paramter of Query was done naively.
As a result, we got bizarre error messages "condition not met: false" when
these values had incorrect type (this is issue #6490). Worse - the naive
conversion did not decode base64-encoded bytes value as needed, so
KeyConditions on bytes-typed keys did not work at all.

This patch fixes these bugs by using our existing utility function
get_key_from_typed_value(), which takes care of throwing sensible errors
when types don't match, and decoding base64 as needed.

Unfortunately, we didn't have test coverage for many of the KeyConditions
features including bytes keys, which is why this issue escaped detection.
A patch will follow with much more comprehensive tests for KeyConditions,
which also reproduce this issue and verify that it is fixed.

Refs #6490
Fixes #6495

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524141800.104950-1-nyh@scylladb.com>
2020-05-25 09:58:37 +02:00
Nadav Har'El
5ef9854e86 alternator: better error messages when 'forbid_rmw' mode is on
When the 'forbid_rmw' write isolation policy is selected, read-modify-write
operations are intentionally forbidden. The error message in this case used to say:

	"Read-modify-write operations not supported"

Which can lead users to believe that this operation isn't supported by this
version of Alternator - instead of realizing that this is in fact a
configurable choice.

So in this patch we just change the error message to say:

	"Read-modify-write operations are disabled by 'forbid_rmw' write isolation policy. Refer to https://github.com/scylladb/scylla/blob/master/docs/alternator/alternator.md#write-isolation-policies for more information."

Fixes #6421.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200518125538.8347-1-nyh@scylladb.com>
2020-05-24 16:31:38 +02:00
Asias He
c02fea5f04 repair: Ignore table removed in sync_data_using_repair
Commit 75cf255c67 (repair: Ignore keyspace
that is removed in sync_data_using_repair) is not enough to fix the
issue because when the repair master checks if the table is dropped, the
table might not be dropped yet on the repair master.

To fix, the repair master should check whether the follower failed the
repair because the table was dropped, by checking the error returned from
the follower.

With this patch, we would see

WARN  2020-04-14 11:19:00,417 [shard 0] repair - repair id 1 on shard 0
completed successfully, keyspace=ks, ignoring dropped tables={cf}

when the table is dropped during bootstrap.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test

Fixes: #5942
2020-05-24 13:39:59 +03:00
Avi Kivity
5864bbcb52 cql3: untyped_result_set.hh: add missing include for column_specification
Fails dev-headers build without it.

Message-Id: <20200523061519.71855-1-avi@scylladb.com>
2020-05-24 12:28:03 +03:00
Avi Kivity
52e875430e Merge "Pass --create-cc to seastar-json2code.py" from Rafael
"
This small series instructs seastar-json2code.py to also create a .cc
file. This reduces header bloat and fixes the current stack usage
warning in a dev build.
"

* 'espindola/json2code-cc' of https://github.com/espindola/scylla:
  configure.py: Pass --create-cc to seastar-json2code.py
  configure.py: Add a Source base class
  configure.py: Fix indentation
2020-05-24 11:27:41 +03:00
Piotr Sarna
629a965cbb alternator-test: fix a test for large requests
With required headers fixed by the previous commit,
large requests test now returns a different error code (ClientError)
when run with `--aws`.
Message-Id: <d56142d1936164d22f457e30e37fd3e58cd52519.1590052823.git.sarna@scylladb.com>
2020-05-24 10:36:59 +03:00
Piotr Sarna
2adb17245b alternator-test: add missing Content-Type header
DynamoDB seems to have started refusing requests unless
they include a Content-Type header set to the following value:
 application/x-amz-json-1.0

In order to make sure that manual tests work correctly,
let's add this header.
Message-Id: <ae0edafa311bce27b27e9e72aa51bb9717c360f2.1590052823.git.sarna@scylladb.com>
2020-05-24 10:29:39 +03:00
Nadav Har'El
49fd0cc42f docs/protocols.md: mention wireshark
In docs/protocols.md, describing the protocols used by Scylla's (both
inter-node protocols and client-facing protocols), add a paragraph about
the ability to inspect most of these protocols, including Scylla's internal
inter-node protocol, using wireshark. Link to Piotr Sarna's recent blog post
about how to do this.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200524065248.76898-1-nyh@scylladb.com>
2020-05-24 09:54:49 +03:00
Avi Kivity
076c8317c7 streaming_histogram: add missing include for uint64_t
Fails dev-headers build without it.

Message-Id: <20200523061555.72087-1-avi@scylladb.com>
2020-05-23 11:09:10 +03:00
Raphael S. Carvalho
2f0f72025e compaction: delete move ctor and assignment
Compaction cannot be moved because its address is forwarded to
members like garbage_collected_sstable_writer::data.

Refs #6472.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200521193657.20782-1-raphaelsc@scylladb.com>
2020-05-22 17:07:43 +02:00
Kamil Braun
290d226034 storage_service: print a warning when joining a node improperly
Print a BIG WARNING saying that you should never join nodes without
bootstrapping (by marking it as a seed or using auto_bootstrap=off).

Only the very first node should (must) be joined as a seed.

If you want to have more seeds, first join them using the only supported
way (i.e. bootstrap them), and only AFTER they have bootstrapped, change
their configuration to include them in the seed list.
2020-05-22 16:46:39 +02:00
Kamil Braun
838f912ebf storage_service: allow a node to join without bootstrapping
... even if it couldn't contact other nodes.

This reverts 7cb6ac33f5.
2020-05-22 16:46:30 +02:00
Tomasz Grabiec
5cbf0c5748 sstables: index_reader: Add trace-level logging to the index parser
Tested against performance regression using:

  build/release/test/perf/perf_fast_forward --run-test=small-partition-skips -c1

I get similar results before and after the patch.
Message-Id: <20200521213032.15286-1-tgrabiec@scylladb.com>
2020-05-22 13:54:47 +02:00
Avi Kivity
1c2f538eb3 tools: toolchain: dbuild: allow customization of docker arguments
Introduce ~/.config/scylladb/dbuild configuration file, and
SCYLLADB_DBUILD environment variables, that inject options into
the docker run command. This allows adding bind mounts for ccache
and distcc directories, as well as any local scripts and PATH
or other environment configuration to suit the user's needs.

Message-Id: <20200521133529.25880-1-avi@scylladb.com>
2020-05-22 13:52:21 +03:00
Asias He
81f0260816 range_streamer: Handle table of RF 1 in get_range_fetch_map
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:

```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is the system_auth keyspace has default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.

In the past, the replace operation with keyspace of RF 1 passes, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```

Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.

After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```

Since 127.0.0.5 was dead, the source node check failed, and so did the
bootstrap operation.

To fix this, we ignore keyspaces with an RF of 1 when it is impossible to
find a source node to stream from.

Fixes #6351
2020-05-22 09:30:52 +08:00
Asias He
fa9ee234a0 streaming: Use separate streaming reason for replace operation
Currently, replace and bootstrap share the same streaming reason,
stream_reason::bootstrap, because they share most of the code
in boot_strapper.

In order to distinguish the two, we need to introduce a new stream
reason, stream_reason::replace. It is safe to do so in a mixed cluster
because the current code only checks whether the stream_reason is
stream_reason::repair.

Refs: #6351
2020-05-22 09:30:52 +08:00
Tomasz Grabiec
a6c87a7b9e sstables: index_reader: Fix overflow when calculating promoted index end
When the index file is larger than 4 GB, the offset calculation will
overflow uint32_t and _promoted_index_end will be too small.

As a result, the promoted_index_size calculation will underflow and the
rest of the page will be interpreted as a promoted index.
The partitions which are in the remainder of the index page will not
be found by single-partition queries.

Data is not lost.

Introduced in 6c5f8e0eda.

Fixes #6040
Message-Id: <20200521174822.8350-1-tgrabiec@scylladb.com>
2020-05-21 21:24:05 +03:00
Pavel Emelyanov
2ac24d38fa row-cache: Remove variadic future from range_populating_reader
Replace it with std::tuple, and introduce the range_populating_reader::read_result
type alias for fewer keystrokes.

This makes row_cache.o compilation warn-less.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200518160511.26984-1-xemul@scylladb.com>
2020-05-21 19:29:39 +02:00
Avi Kivity
a61b3f2d78 tools: toolchain: rebase on Fedora 32
- base image changed from Fedora 31 to Fedora 32
 - disambiguate base image to use docker.io registry
 - pystache and python-cassandra-driver are no longer available,
   so use pip3 to install them. Add pip3 to packages.
 - since pip3 installs commands to /usr/local/bin, update checks
   in build_deb to check for those too

Fedora 32 packages gcc 10, which has support for coroutines.
Message-Id: <20200521063138.1426400-1-avi@scylladb.com>
2020-05-21 18:27:50 +03:00
Piotr Sarna
032a531ea6 test: add unit tests for alternator base64 conversions
The test cases verify that base64 operations encode
and decode their data properly.

Tests: unit(dev)
2020-05-21 18:26:59 +03:00
Piotr Sarna
e503075aac alternator: apply the string_view helper function
Explicit transformation from a JSON value to a string view can be
replaced with a shorter helper function from rjson.hh.
2020-05-21 18:26:59 +03:00
Piotr Sarna
cb7d3c6b55 alternator: compute begins_with on base64 without decoding
In order to remove a FIXME, code which checks a BEGINS_WITH
relation between base64-encoded strings is computed in a way
which does not involve decoding the whole string.
In case of padding, the remainders are still decoded, but their
size is bounded by 3, which means they will be eligible for the
small string optimization.
2020-05-21 18:26:59 +03:00
Piotr Sarna
511ce82bd2 alternator: extract base64-decoding code to a helper function
In the future, a routine that decodes directly to std::string
will be useful, so it's extracted out of a bigger function.
2020-05-21 18:26:59 +03:00
Piotr Sarna
3148571834 alternator: compute decoded base64 size without actually decoding
In order to get rid of a FIXME, the code which computes the size
of decoded base64 string based only on encoded size + padding is added.
The result is an O(1) function with just a couple of ops
(15 when checking with godbolt and gcc9), so it's a general improvement
over having to allocate a string and get its size.
2020-05-21 18:26:59 +03:00
Botond Dénes
06dd3d9077 queue_reader: push(): eliminate unneeded continuation on full buffer case
Currently, push() attaches a continuation to the _not_full future, if
push() is called when the buffer is already full. This is not needed as
we can safely push the fragment even if the buffer is already full.
Furthermore we can eliminate the possibility of push() being called when
the buffer is full, by checking whether it is full *after* pushing the
fragment, not before.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200521055840.376019-1-bdenes@scylladb.com>
2020-05-21 09:34:44 +03:00
Pekka Enberg
ed0d00f51e Revert "Revert "schema: Default dc_local_read_repair_chance to zero""
This reverts commit 43b488a7bc. The commit
was originally reverted because a dtest was sensitive to the value. The
dtest is fixed now, so let's revert the revert as requested by Glauber.
2020-05-21 08:05:13 +03:00
Botond Dénes
c29ccdea7e repair: switch from queue+generating_reader to queue_reader
The queue_reader was invented exactly to replace this construct and is
more efficient.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200520155618.369873-1-bdenes@scylladb.com>
2020-05-20 19:33:28 +03:00
Calle Wilund
7ce4a8b458 token_metadata: Prune empty racks on endpoint change
Fixes #6459

When moving or removing endpoints, we should ensure
that the set of available racks reflects the known
nodes, i.e. matches what would be the result of a
reboot + creating the sets initially.
Message-Id: <20200519153300.15391-1-calle@scylladb.com>
2020-05-20 13:35:08 +02:00
Nadav Har'El
0673e44fc1 alternator test: small fix for Python 2
Although Python 2 is deprecated, some systems today still have "python"
and "pytest" pointing to Python 2, so it would be convenient for the
Alternator tests to work on both Python 2 and 3 if it's not too much
of an effort.

And it really isn't too much of an effort - they all work on both versions
except for one problem introduced in the previous test patch: The syntax b''
for an empty byte array works correctly on Python 3 but incorrectly on
Python 2: In Python 2, b'' is just a normal empty string, not byte array,
which confuses Boto3 which refuses to accept a string as a value for a
byte-array key.

The trivial fix is to replace b'' by bytearray('', 'utf-8').
Uglier, but works as expected on both Python 2 and 3.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200519214321.25152-1-nyh@scylladb.com>
2020-05-20 07:56:16 +02:00
Nadav Har'El
3ff4be966d merge: alternator: allow empty strings
Merged patch series from Piotr Sarna:

Given the new update from DynamoDB:
https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/
... empty strings are now allowed, so alternator and its tests
are updated accordingly.

Key values still cannot be empty, and the definition also expands
to columns which act as keys in global or local secondary indexes.

Fixes #6480
Tests: alternator(local, remote)
2020-05-20 00:10:12 +03:00
Avi Kivity
ecae7a7920 Update seastar submodule
* seastar 92365e7b8...ee516b1cc (17):
  > build: use -fcommon compiler flag for dpdk
  > coroutines: reduce template bloat
  > thread: make async noexcept
  > file: specify methods noexcept
  > doc: drop grace period for old C++ standard revisions
  > semaphore: specify consume_units as noexcept
  > doc/tutorial.md: add short intro to seastar::sharded<>
  > future: Move promise_base move constructor out of line
  > coroutines: enable for C++20
  > tutorial: adjust evaluation order warning to note it is C++14-only
  > rpc_test: Fix test_stream_connection_error with valgrind
  > file: Remove unused lambda capture
  > install-dependencies: add valgrind to arch
  > coroutines_test: Don't access a destroyed lambda
  > tutorial: warn about evaluation order pitfall
  > merge: apps: improvements in httpd and seawreck
  > file: Move functions out of line
2020-05-19 21:25:24 +03:00
Rafael Ávila de Espíndola
79117e1473 configure.py: Pass --create-cc to seastar-json2code.py
This adds a Json2Code class now that both a .cc and a .hh are
produced.

Creating a .cc file reduces header bloat and fixes the current stack
too large warning in a dev build.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-05-19 10:22:38 -07:00
Rafael Ávila de Espíndola
8238c9c9f1 configure.py: Add a Source base class
This reduces a bit of code duplication among the Thrift and
Antlr3Grammar classes.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-05-19 10:21:38 -07:00
Rafael Ávila de Espíndola
caf82755fc configure.py: Fix indentation
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-05-19 10:20:21 -07:00
Botond Dénes
54a0d8536e restricting_mutation_reader: include own buffer in buffer size calculation
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200519102902.231042-1-bdenes@scylladb.com>
2020-05-19 18:23:15 +03:00
Nadav Har'El
16b0680c40 alternator test: run Scylla with a different executable name
The Alternator test (test/alternator/run) runs the real Scylla executable
to test it. Users sometimes want to run Scylla manually in parallel (on
different IP addresses, of course) and sometimes use commands like
"killall scylla" to stop it, may be surprised that this command will also
unintentionally kill a running test.

So what this patch does is to name the Scylla process used for the test
with the name "test_scylla". It will be visible as "test_scylla" in top,
and a "killall scylla" will not touch it. You can, of course, kill it with
a "killall test_scylla" if you wish.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200519071604.19161-1-nyh@scylladb.com>
2020-05-19 18:23:15 +03:00
Piotr Sarna
d3e70356c5 alternator-test: add tests for empty indexed string values
According to DynamoDB, string/binary blob keys cannot be empty
and this definition affects secondary indexes as well.
As a result, only nonempty strings/binary blobs are accepted
as values for columns which form a GSI or LSI key.
2020-05-19 11:32:18 +02:00
Piotr Sarna
ada137b543 alternator-test: add tests for empty strings in keys
Empty string/binary blob values are not accepted by DynamoDB,
and we should follow suit.
2020-05-19 11:32:18 +02:00
Piotr Sarna
7006389f69 alternator: refuse empty strings/binary blobs in keys
In order to be compatible with DynamoDB, we should refuse
items which keys contain empty strings or byte blobs.
2020-05-19 11:32:18 +02:00
Piotr Sarna
0d25427470 alternator-test: add a table with string sort key
String sort key will be needed to ensure that empty string keys
are not accepted.
2020-05-19 11:32:18 +02:00
Piotr Sarna
9f8202806a alternator: allow empty strings in values
Given the new update from DynamoDB:
https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/
... empty strings are now allowed for non-key attributes,
so alternator and its tests are updated accordingly.

Fixes #6480
Tests: alternator(local, remote)
2020-05-19 11:32:18 +02:00
Nadav Har'El
cac9bcbbba dist/docker: instructions how to build a docker image with your own executable
Clarify in README.md that the instructions there will build a Docker image
containing a Scylla executable downloaded from downloads.scylla.com - NOT
the one you built yourself. The image is also CentOS based - not Fedora-based
as claimed.

In addition, a new dist/docker/redhat/README.md explains the steps
needed to actually build a Docker image with the Scylla executable
that you built. In the future, these steps should be automated (e.g.,
"ninja docker") but until then, let's at least document the process.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200518151123.11313-1-nyh@scylladb.com>
2020-05-19 09:23:37 +03:00
Piotr Sarna
1906e35d12 test: add --skip option
The --skip option allows providing a pattern for tests which
will not be run. Example usage:
./test.py --mode dev --skip alternator

Tests: unit(dev), with `--skip alternator` and with no parameters
Message-Id: <6970134d2bc15314f0e4944f3b167d0e105ea69b.1589811943.git.sarna@scylladb.com>
2020-05-19 08:14:32 +03:00
Tomasz Grabiec
3efef39e7e Merge "lwt: fix batch validation crash and exception message case" from Alejo
Fix a metadata crash and exception message casing consistency.

Fixes #6332

* alejo/fix_issue_6332:
  lwt: validate before constructing metadata
  lwt: consistent exception message case
2020-05-19 08:14:32 +03:00
Avi Kivity
4d15aba7c0 commitlog: capture "this" explicitly in lambda
C++20 deprecates capturing this in default-copy lambdas ([=]), with
good reason. Move to explicit captures to avoid any ambiguity and
reduce warning spew.
Message-Id: <20200517150834.753463-1-avi@scylladb.com>
2020-05-19 08:14:32 +03:00
Piotr Sarna
18a37d0cb1 db,view: add tracing to view update generation path
In order to improve materialized views' debuggability,
tracing points are added to view update generation path.

Sample info of an insert statement which resulted in producing
local view updates which require read-before-write:

 activity                                                                                                                           | timestamp                  | source    | source_elapsed | client
------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                 Execute CQL3 query | 2020-04-19 12:02:48.420000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                      Parsing a statement [shard 0] | 2020-04-19 12:02:48.420674 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                   Processing a statement [shard 0] | 2020-04-19 12:02:48.420753 | 127.0.0.1 |             79 | 127.0.0.1
                                  Creating write handler for token: -6715243485458697746 natural: {127.0.0.1} pending: {} [shard 0] | 2020-04-19 12:02:48.420815 | 127.0.0.1 |            141 | 127.0.0.1
                                                                   Creating write handler with live: {127.0.0.1} dead: {} [shard 0] | 2020-04-19 12:02:48.420824 | 127.0.0.1 |            149 | 127.0.0.1
                                                                                             Executing a mutation locally [shard 0] | 2020-04-19 12:02:48.420830 | 127.0.0.1 |            155 | 127.0.0.1
                                          View updates for ks.t1 require read-before-write - base table reader is created [shard 0] | 2020-04-19 12:02:48.420862 | 127.0.0.1 |            188 | 127.0.0.1
                                                                                        Generated 2 view update mutations [shard 0] | 2020-04-19 12:02:48.420910 | 127.0.0.1 |            235 | 127.0.0.1
 Locally applying view update for ks.t1_v_idx_index; base token = -6715243485458697746; view token = -4156302194539278891 [shard 0] | 2020-04-19 12:02:48.420918 | 127.0.0.1 |            243 | 127.0.0.1
                                              Successfully applied local view update for 127.0.0.1 and 0 remote endpoints [shard 0] | 2020-04-19 12:02:48.420971 | 127.0.0.1 |            297 | 127.0.0.1
                                                                     View updates for ks.t1 were generated and propagated [shard 0] | 2020-04-19 12:02:48.420973 | 127.0.0.1 |            299 | 127.0.0.1
                                                                                           Got a response from /127.0.0.1 [shard 0] | 2020-04-19 12:02:48.420988 | 127.0.0.1 |            314 | 127.0.0.1
                                                             Delay decision due to throttling: do not delay, resuming now [shard 0] | 2020-04-19 12:02:48.420990 | 127.0.0.1 |            315 | 127.0.0.1
                                                                                          Mutation successfully completed [shard 0] | 2020-04-19 12:02:48.420994 | 127.0.0.1 |            320 | 127.0.0.1
                                                                                     Done processing - preparing a result [shard 0] | 2020-04-19 12:02:48.421000 | 127.0.0.1 |            326 | 127.0.0.1
                                                                                                                   Request complete | 2020-04-19 12:02:48.420330 | 127.0.0.1 |            330 | 127.0.0.1

Sample info for remote updates:

 activity                                                                                                                                                           | timestamp                  | source    | source_elapsed | client
--------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                                                 Execute CQL3 query | 2020-04-26 16:19:47.691000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                      Parsing a statement [shard 1] | 2020-04-26 16:19:47.691590 | 127.0.0.1 |              6 | 127.0.0.1
                                                                                                                                   Processing a statement [shard 1] | 2020-04-26 16:19:47.692368 | 127.0.0.1 |            783 | 127.0.0.1
                                                       Creating write handler for token: -3248873570005575792 natural: {127.0.0.3, 127.0.0.2} pending: {} [shard 1] | 2020-04-26 16:19:47.694186 | 127.0.0.1 |           2598 | 127.0.0.1
                                                                                        Creating write handler with live: {127.0.0.2, 127.0.0.3} dead: {} [shard 1] | 2020-04-26 16:19:47.694283 | 127.0.0.1 |           2699 | 127.0.0.1
                                                                                                                         Sending a mutation to /127.0.0.2 [shard 1] | 2020-04-26 16:19:47.694591 | 127.0.0.1 |           3006 | 127.0.0.1
                                                                                                                         Sending a mutation to /127.0.0.3 [shard 1] | 2020-04-26 16:19:47.694862 | 127.0.0.1 |           3277 | 127.0.0.1
                                                                                                                         Message received from /127.0.0.1 [shard 1] | 2020-04-26 16:19:47.696358 | 127.0.0.3 |             40 | 127.0.0.1
                                                                                                                         Message received from /127.0.0.1 [shard 1] | 2020-04-26 16:19:47.696442 | 127.0.0.2 |             32 | 127.0.0.1
                                                                           View updates for ks.t require read-before-write - base table reader is created [shard 1] | 2020-04-26 16:19:47.697762 | 127.0.0.3 |           1444 | 127.0.0.1
                                                                           View updates for ks.t require read-before-write - base table reader is created [shard 1] | 2020-04-26 16:19:47.698120 | 127.0.0.2 |           1710 | 127.0.0.1
                                                                                                                        Generated 1 view update mutations [shard 1] | 2020-04-26 16:19:47.699107 | 127.0.0.3 |           2789 | 127.0.0.1
 Sending view update for ks.t_v2_idx_index to 127.0.0.4, with pending endpoints = {}; base token = -3248873570005575792; view token = 1634052884888577606 [shard 1] | 2020-04-26 16:19:47.699345 | 127.0.0.3 |           3027 | 127.0.0.1
                                                                                                                         Sending a mutation to /127.0.0.4 [shard 1] | 2020-04-26 16:19:47.699614 | 127.0.0.3 |           3296 | 127.0.0.1
                                                                                                                        Generated 1 view update mutations [shard 1] | 2020-04-26 16:19:47.699824 | 127.0.0.2 |           3414 | 127.0.0.1
                                  Locally applying view update for ks.t_v2_idx_index; base token = -3248873570005575792; view token = 1634052884888577606 [shard 1] | 2020-04-26 16:19:47.700012 | 127.0.0.2 |           3603 | 127.0.0.1
                                                                                                      View updates for ks.t were generated and propagated [shard 1] | 2020-04-26 16:19:47.700059 | 127.0.0.3 |           3741 | 127.0.0.1
                                                                                                                         Message received from /127.0.0.3 [shard 1] | 2020-04-26 16:19:47.700958 | 127.0.0.4 |             37 | 127.0.0.1
                                                                              Successfully applied local view update for 127.0.0.2 and 0 remote endpoints [shard 1] | 2020-04-26 16:19:47.701522 | 127.0.0.2 |           5112 | 127.0.0.1
                                                                                                      View updates for ks.t were generated and propagated [shard 1] | 2020-04-26 16:19:47.701615 | 127.0.0.2 |           5206 | 127.0.0.1
                                                                                                                      Sending mutation_done to /127.0.0.1 [shard 1] | 2020-04-26 16:19:47.701913 | 127.0.0.3 |           5595 | 127.0.0.1
                                                                                                                                Mutation handling is done [shard 1] | 2020-04-26 16:19:47.702489 | 127.0.0.3 |           6171 | 127.0.0.1
                                                                                                                           Got a response from /127.0.0.3 [shard 1] | 2020-04-26 16:19:47.702667 | 127.0.0.1 |          11082 | 127.0.0.1
                                                                                             Delay decision due to throttling: do not delay, resuming now [shard 1] | 2020-04-26 16:19:47.702689 | 127.0.0.1 |          11105 | 127.0.0.1
                                                                                                                          Mutation successfully completed [shard 1] | 2020-04-26 16:19:47.702784 | 127.0.0.1 |          11200 | 127.0.0.1
                                                                                                                      Sending mutation_done to /127.0.0.1 [shard 1] | 2020-04-26 16:19:47.703016 | 127.0.0.2 |           6606 | 127.0.0.1
                                                                                                                     Done processing - preparing a result [shard 1] | 2020-04-26 16:19:47.703054 | 127.0.0.1 |          11470 | 127.0.0.1
                                                                                                                      Sending mutation_done to /127.0.0.3 [shard 1] | 2020-04-26 16:19:47.703720 | 127.0.0.4 |           2800 | 127.0.0.1
                                                                                                                                Mutation handling is done [shard 1] | 2020-04-26 16:19:47.704527 | 127.0.0.4 |           3607 | 127.0.0.1
                                                                                                                           Got a response from /127.0.0.4 [shard 1] | 2020-04-26 16:19:47.704580 | 127.0.0.3 |           8262 | 127.0.0.1
                                                                                             Delay decision due to throttling: do not delay, resuming now [shard 1] | 2020-04-26 16:19:47.704606 | 127.0.0.3 |           8288 | 127.0.0.1
                                                                                    Successfully applied view update for 127.0.0.4 and 1 remote endpoints [shard 1] | 2020-04-26 16:19:47.704853 | 127.0.0.3 |           8535 | 127.0.0.1
                                                                                                                                Mutation handling is done [shard 1] | 2020-04-26 16:19:47.706092 | 127.0.0.2 |           9682 | 127.0.0.1
                                                                                                                           Got a response from /127.0.0.2 [shard 1] | 2020-04-26 16:19:47.709933 | 127.0.0.1 |          18348 | 127.0.0.1
                                                                                                                                                   Request complete | 2020-04-26 16:19:47.702582 | 127.0.0.1 |          11582 | 127.0.0.1

Tests: unit(dev, debug)
2020-05-18 16:05:23 +02:00
Piotr Sarna
92aadb94e5 treewide: propagate trace state to write path
In order to add tracing to places where it can be useful,
e.g. materialized view updates and hinted handoff, tracing state
is propagated to all applicable call sites.
2020-05-18 16:05:23 +02:00
Piotr Jastrzebski
cd33b9f406 cdc: Tune expired sstables check frequency
CDC Log is a time series which uses time window compaction with some
time window. Data is TTLed with the same value. This means that an
sstable won't become fully expired more often than once per time window
duration.

This patch sets expired_sstable_check_frequency_seconds compaction
strategy parameter to half of the time window. Default value of this
parameter is 10 minutes which in most cases won't be a good fit.
By default, we set TTL to 24h and time window to 1h. This means that
with a default value of the parameter we would be checking every 10
minutes but a new expired sstable would appear only every 60 minutes.

The parameter is set to half of the time window duration because it's
the expected time we have to wait for sstable to become fully expired.
Half of the time we will wait longer and half of the time we will wait
shorter.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-18 16:49:19 +03:00
Alejo Sanchez
d1521e6721 lwt: validate before constructing metadata
LWT batch conditions can't span multiple tables.
This was detected in batch_statement::validate() called in ::prepare().
But ::cas_result_set_metadata() was built in the constructor,
causing a bitset assert/crash in a reported scenario.
This patch moves validate() to the constructor before building metadata.

Closes #6332

Tested with https://github.com/scylladb/scylla-dtest/pull/1465

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-05-18 10:40:21 +02:00
Alejo Sanchez
74edb3f20b lwt: consistent exception message case
Fix case Batch -> BATCH to match similar exception in same file

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-05-18 10:40:06 +02:00
Avi Kivity
61a8c8c989 storage_proxy: capture "this" explicitly in lambda
C++20 deprecates capturing this in default-copy lambdas ([=]), with
good reason. Move to explicit captures to avoid any ambiguity and
reduce warning spew.
Message-Id: <20200517150921.754073-1-avi@scylladb.com>
2020-05-18 10:30:10 +03:00
Avi Kivity
2d933c62ec thrift: capture "this" explicitly in lambda
C++20 deprecates capturing this in default-copy lambdas ([=]), with
good reason. Move to explicit captures to avoid any ambiguity and
reduce warning spew.
Message-Id: <20200517151023.754906-1-avi@scylladb.com>
2020-05-18 10:24:00 +03:00
Rafael Ávila de Espíndola
311fbe2f0a repair: Make sure sinks are always closed
In a recent next failure I got the following backtrace

#3  0x00007efd71251a66 in __GI___assert_fail (assertion=assertion@entry=0x2d0c00 "this->_con->get()->sink_closed()", file=file@entry=0x32c9d0 "./seastar/include/seastar/rpc/rpc_impl.hh", line=line@entry=795,
    function=function@entry=0x270360 "seastar::rpc::sink_impl<Serializer, Out>::~sink_impl() [with Serializer = netw::serializer; Out = {repair_row_on_wire_with_cmd}]") at assert.c:101
#4  0x0000000001f5d2c3 in seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd>::~sink_impl (this=<optimized out>, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:312
#5  0x0000000001f5d2f4 in seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>)
    at ./seastar/include/seastar/core/shared_ptr.hh:463
#6  seastar::shared_ptr_count_for<seastar::rpc::sink_impl<netw::serializer, repair_row_on_wire_with_cmd> >::~shared_ptr_count_for (this=0x60100075b680, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/shared_ptr.hh:463
#7  0x000000000240f2e6 in seastar::shared_ptr<seastar::rpc::sink<repair_row_on_wire_with_cmd>::impl>::~shared_ptr (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/core/future.hh:427
#8  seastar::rpc::sink<repair_row_on_wire_with_cmd>::~sink (this=0x601003118590, __in_chrg=<optimized out>) at ./seastar/include/seastar/rpc/rpc_types.hh:270
#9  <lambda(auto:134&)>::<lambda(const seastar::rpc::client_info&, uint64_t, seastar::rpc::source<repair_hash_with_cmd>)>::<lambda(std::__exception_ptr::exception_ptr)>::~<lambda> (this=0x601003118570, __in_chrg=<optimized out>)
    at repair/row_level.cc:2059

This patch changes a few functions to use finally to make sure the sink
is always closed.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200515202803.60020-1-espindola@scylladb.com>
2020-05-18 08:13:42 +03:00
Avi Kivity
beaeda5234 database: remove variadic future from query() and query_mutations()
Variadic futures are deprecated; replace with future<std::tuple<...>>.

Tests: unit (dev)
2020-05-17 18:45:38 +02:00
Nadav Har'El
4cf44ddbdf docs: update alternator.md
Some statements made in docs/alternator/alternator.md on having a single
keyspace, or recommending a DNS setup, are not up-to-date. So fix them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200517132444.9422-1-nyh@scylladb.com>
2020-05-17 18:38:13 +02:00
Nadav Har'El
1b807a5018 alternator test: better recognition that Alternator failed to boot
The test/alternator/run script starts Scylla to be tested. It waits until
CQL is responsive and if Scylla dies earlier, recognizes the failure
immediately. This is useful so we see boot errors immediately instead of
waiting for the first test to timeout and fail.

However, Scylla starts the Alternator service after CQL. So it is possible
that after the "run" script found CQL to be up, Alternator couldn't start
(e.g., bad configuration parameters) and Scylla shut down; instead
of recognizing this situation, we start the actual test.

The fix is simple: don't start the tests until verifying that Alternator
is up. We verify this using the trivial healthcheck request (which is
nothing more than an HTTP GET request).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200517125851.8484-1-nyh@scylladb.com>
2020-05-17 18:33:27 +02:00
Nadav Har'El
2b9437076f README.md: update instructions for building docker image
The instructions in README.md about building a docker image start with
"cd dist/docker", but it actually needs to be "cd dist/docker/redhat".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200517152815.15346-1-nyh@scylladb.com>
2020-05-17 18:29:55 +03:00
Tzach Livyatan
82dfab0a54 Fix a link to contributor-agreement in the CONTRIBUTING page 2020-05-17 14:15:49 +03:00
Avi Kivity
513faa5c71 Merge 'Use http Stream for describe ring' from Amnon
"
This series changes the describe_ring API to use an HTTP stream instead of serializing the results and sending them as a single buffer.

While testing the change I hit a 4-year-old issue inside service/storage_proxy.cc that causes a use after free, so I fixed it along the way.

Fixes #6297
"

* amnonh-stream_describe_ring:
  api/storage_service.cc: stream result of token_range
  storage_service: get_range_to_address_map prevent use after free
2020-05-17 14:05:26 +03:00
Amnon Heiman
7c4562d532 api/storage_service.cc: stream result of token_range
The get token range API response can become big, which can cause large
allocations and stalls.

This patch replaces the implementation so that it streams the results
using the HTTP streaming capabilities instead of serializing and
sending one big buffer.

Fixes #6297

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-05-17 13:56:05 +03:00
Amnon Heiman
69a46d4179 storage_service: get_range_to_address_map prevent use after free
The implementation of get_range_to_address_map has a default behaviour:
when given an empty keyspace, it uses the first non-system keyspace
("first" here is basically just some keyspace).

The current implementation has two issues. First, it uses a reference to
a string that is held on the stack of another function. In other words,
there's a use-after-free; it's not clear why we never hit it.

Second, it calls get_non_system_keyspaces twice. Though this is not
a bug, it's redundant (get_non_system_keyspaces uses a loop, so calling
that function does have a cost).

This patch solves both issues: first, by changing the implementation to
hold a string instead of a reference to a string.

Second, it stores the results from get_non_system_keyspaces and reuses
them; this is more efficient and keeps the returned values on the local
stack.

Fixes #6465

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-05-17 13:53:13 +03:00
Dejan Mircevski
8db7e4cc96 cql: Add test for invalid unbounded DELETE
In add40d4e59, we relaxed the prohibition of unbounded DELETE and
stopped testing the failure message.  But there are still scenarios
when unbounded DELETE is prohibited, so add a test to ensure we
continue to catch it where appropriate.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-05-17 12:28:36 +03:00
Avi Kivity
b155eef726 Merge "allow early aborts through abort sources." from Glauber
"
The shutdown process of compaction manager starts with an explicit call
from the database object. However, that can only happen after everything
is already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.

One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.

This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.

We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is returned from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.

The abort source is passed to the database, which is responsible for
constructing the compaction manager.

Tests: unit (debug), manual start+stop, manual drain + stop, previously
       failing dtests.
"
2020-05-17 11:49:00 +03:00
Avi Kivity
777d5e88c3 types: support altering fixed-size integer types to varint
Fixed-size integer types are legal varints - both are serialized as
two's complement in network byte order. So tinyint, shortint, int,
and bigint can all be interpreted as varints.

Change is_compatible_with() to reflect that.
Message-Id: <20200516115143.28690-2-avi@scylladb.com>
2020-05-17 11:31:00 +03:00
Avi Kivity
ff57e4d9a5 types: make short and byte types value-compatible with varint
The short and byte types are two's complement network byte order,
just like varint (except fixed size) and so varint can read them
just fine.

Mark them as value compatible like int32_type and long_type.

A unit test is added.
Message-Id: <20200516115143.28690-1-avi@scylladb.com>
2020-05-17 11:31:00 +03:00
Benny Halevy
a96087165a hints: get_device_id: use seastar file_stat
This avoids a potential use-after-move, since undefined C++ sequencing order
may std::move(f) in the lambda capture before evaluating f.stat().

Also, this makes use of a more generic library function that doesn't
require opening and holding on to the file in the application.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200514152054.162168-1-bhalevy@scylladb.com>
2020-05-15 10:11:45 +02:00
Asias He
b2c4d9fdbc repair: Fix race between write_end_of_stream and apply_rows
Consider: n1, n2, n1 is the repair master, n2 is the repair follower.

=== Case 1 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to
   partition 1, r2 belongs to partition 2. It yields after row r1 is
   written.
   data: partition_start, r1
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream()
   data: partition_start, r1, partition_end
5) Step 2 resumes to apply the rows.
   data: partition_start, r1, partition_end, partition_end, partition_start, r2

=== Case 2 ===
1) n1 sends missing rows {r1, r2} to n2
2) n2 runs apply_rows_on_follower to apply rows, e.g., {r1, r2}, r1
   is written to sstable, r2 is not written yet, r1 belongs to partition
   1, r2 belongs to partition 2. It yields after partition_start for r2
   is written but before _partition_opened is set to true.
   data: partition_start, r1, partition_end, partition_start
3) n1 sends repair_row_level_stop to n2 because error has happened on n1
4) n2 calls wait_for_writer_done() which in turn calls write_end_of_stream().
   Since _partition_opened[node_idx] is false, partition_end is skipped,
   end_of_stream is written.
   data: partition_start, r1, partition_end, partition_start, end_of_stream

This causes unbalanced partition_start and partition_end in the stream
written to sstables.

To fix, serialize the write_end_of_stream and apply_rows with a semaphore.

Fixes: #6394
Fixes: #6296
Fixes: #6414
2020-05-14 18:15:01 +03:00
Pekka Enberg
96e35f841c docs/redis: API reference documentation
The Redis API in Scylla only supports a small subset of the Redis
commands. Let's document what we support so people have the right
expectations when they try it out.
2020-05-14 17:33:39 +03:00
Benny Halevy
0d4b93b11d sstable: fix potential use-after-move sites
Avoid `f(s).then([s = std::move(s)] {})` patterns,
where the move into the lambda capture may potentially be
sequenced by the compiler before passing `s` to function `f`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200514131701.140046-1-bhalevy@scylladb.com>
2020-05-14 16:06:07 +02:00
Nadav Har'El
f3fd976120 docs, alternator: improve description of status of global tables support
The existing text did not explain what happens if additional DCs are added
to the cluster, so this patch improves the explanation of the status of
our support for global tables, including that issue.

Fixes #6353

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200513175908.21642-1-nyh@scylladb.com>
2020-05-14 08:03:16 +02:00
Glauber Costa
7423ccc318 compaction_manager: allow early aborts through abort sources.
The shutdown process of compaction manager starts with an explicit call
from the database object. However, that can only happen after everything
is already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.

One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.

This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.

We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is returned from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.

The abort source is passed to the database, which is responsible for
constructing the compaction manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-05-13 16:51:25 -04:00
Glauber Costa
45dc9cc6e5 compaction_manager: carve out a drain method
We want stop() to be callable just once. Having the compaction manager
stopped twice is a potential indication that something is wrong.

Still there are places where we want to stop all ongoing compactions
and prevent new ones from running - like the drain operation. Today the
only operation that allows for cancellation of all existing compactions
is stop(). To unweave this, we will split those two things.

A drain operation is carved out, and it should be safe to be called many
times. The compaction manager is usable after this, and new compactions
can even be sent if it happens to be enabled again (we currently don't).

A stop operation, which includes a drain, will only be allowed once. After
a stop() the compaction_manager object is no longer usable.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-05-13 16:51:25 -04:00
Glauber Costa
e29701ca1c compaction_manager: expand state to be able to differentiate between enabled and stopped
We are having many issues with the stop code in the compaction_manager.
Part of the reason is that the "stopped" state has its meaning overloaded
to indicate both "compaction manager is not accepting compactions" and
"compaction manager is not ready or destructed".

In a later step we could default to enabled-at-start, but right now we
maintain current behavior to minimize noise.

It is only possible to stop the compaction manager once.
It is possible to enable / disable the compaction manager many times.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-05-13 16:51:25 -04:00
Nadav Har'El
62c00a3f17 merge: Use time window compaction strategy for CDC Log table
Merged pull request https://github.com/scylladb/scylla/pull/6427
by Piotr Jastrzębski:

CDC Log is a time series so it makes sense to use time window compaction
strategy for it.
Our support for time series is limited so we make sure that we don't create
more than 24 sstables.
If TTL is configured to 0, meaning data does not expire, we don't use time
window compaction strategy.

This PR also sets gc_grace_seconds to 0 when TTL is not set to 0.
2020-05-13 14:36:43 +03:00
Benny Halevy
94a558e9a8 test.py: print test command line and env to log
Print the test command line and the UBSAN and ASAN env settings to the log
so the run can be easily reproduced (optionally with providing --random-seed=XXX
that is printed by scylla unit tests when they start).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200513110959.32015-1-bhalevy@scylladb.com>
2020-05-13 14:27:15 +03:00
Raphael S. Carvalho
c06cdcdb3c table: Don't allow a shared SSTable to be selected for regular compaction
After commit 88d2486fca, removal of shared SSTables is not atomic anymore.
They can be first removed from the list of shared SSTables and only later be
removed from the SSTable set. That list is used to filter out shared SSTables
from regular compaction candidates.
So it can happen that regular compaction pick up a shared SSTable as candidate
after it was removed from that list but before it was removed from the set.
To fix this, let's only remove a shared SSTable from that aforementioned list
after it was successfully removed from the SSTable set, so that a shared
SSTable cannot be selected for regular compaction anymore.

Fixes #6439.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512175224.114487-1-raphaelsc@scylladb.com>
2020-05-13 10:43:48 +03:00
Avi Kivity
fc5568167b tests: like_matcher_test: adjust for C++20 char8_t
C++20 makes string literals defined with u8"my string" as using
a new type char8_t. This is sensible, as plain char might not
have 8 bits, but conflicts with our bytes type.

Adjust by having overloads that cast back to char*. This limits
us to environments where char is 8 bits, but this is already a
restriction we have.

Reviewed-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20200512101646.127688-1-avi@scylladb.com>
2020-05-13 09:37:39 +03:00
Avi Kivity
33fda05388 counters: change deprecated std::is_pod<> to replacement
C++20 deprecates std::is_pod<> in favor of the easier-to-type
std::is_starndard_layout<> && std::is_trivial<>. Change to the
recommendation in order to avoid a flood of warnings.

Reviewed-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200512092200.115351-1-avi@scylladb.com>
2020-05-13 09:36:52 +03:00
Avi Kivity
2afd40fe6f tracing: use correct std::memory_order_* scoping
std::memory_order is an unscoped enum, and so does not need its
members to be prefixed with std::memory_order::, just std::.

This used to work, but in C++20 it no longer does. Use the
standard way to name these constants, which works in both C++17
and C++20.

Reviewed-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200512092408.115649-1-avi@scylladb.com>
2020-05-13 09:36:23 +03:00
Avi Kivity
8d4bdc49f1 tests: sstable_run_based_compaction_strategy_for_tests: adjust for C++20 pass-by-value in std::accumulate
C++20 changed the parameter to the binary operation function in std::accumulate()
to be passed by value (quite sensibly). Adjust the code to be compatible by
using a #if. This will be removed once we switch over to C++20.
Message-Id: <20200512105427.142423-1-avi@scylladb.com>
2020-05-12 20:41:16 +02:00
Avi Kivity
74c1db7f59 tests: like_matcher_test: add casts for utf8 string literals
C++20 makes string literals defined with u8"foo" return a new char8_t.
This is sensible but is noisy for us. Cast them to plain const char.
Message-Id: <20200512104751.137816-1-avi@scylladb.com>
2020-05-12 20:41:02 +02:00
Avi Kivity
07061f9a00 duration: adjust for C++20 char8_t type
C++20 makes string literals defined with u8"blah" return a new
char8_t type, which is sensible but noisy here.

Adjust for it by dropping an unneeded u8 in one place, and adding a
cast in another.
Message-Id: <20200512104515.137459-1-avi@scylladb.com>
2020-05-12 20:40:30 +02:00
Avi Kivity
89ea879ba9 storage_proxy: adjust for C++20 std::accumulate() pass-by-value
C++20 passes the input to the binary operation by value (which is
sensible), but is not compatible with C++17. Add some #if logic
to support both methods. We can remove the logic when we fully
transition to C++20.
Message-Id: <20200512101355.127333-1-avi@scylladb.com>
2020-05-12 20:39:21 +02:00
Tomasz Grabiec
df4b698309 Merge "Add more defenses against empty keys" from Botond
In theory we shouldn't have empty keys in the database, as we validate
all keys that enter the database via CQL with
`validation::validate_cql_keys()`, which will reject empty keys. In this
context, empty means a single-component key, with its only component
being empty.

Yet recently we've seen empty keys appear in a cluster and wreak havoc
on it, as they will cause the memtable flush to fail due to the sstable
summary rejecting the empty key. This will cause an infinite loop, where
Scylla keeps retrying to flush the memtable and failing. The intermediate
consequence of this is that the node cannot be shut down gracefully. The
indirect consequence is possible data loss, as commitlog files cannot be
replayed as they just re-insert the empty key into the memtable and the
infinite flush retry circle starts all over again. A workaround is to
move problematic commitlog files away, allowing the node to start up.
This can however lead to data loss, if multiple replicas had to move
away commitlogs that contain the same data.

To prevent the node getting into an unusable state and subsequent data
loss, extend the existing defenses against invalid (empty) keys to the
commitlog replay, which will now ignore them during replay.

Fixes: #6106

* denesb/empty-keys/v5:
  commitlog_replayer: ignore entries with invalid keys
  test: lib/sstable_utils: add make_keys_for_shard
  validation: add is_cql_key_invalid()
  validation: validate_cql_key(): make key parameter a `partition_key_view`
  partition_key_view: add validate method
2020-05-12 20:36:40 +02:00
Avi Kivity
72172effc8 transport: stop using boost::bimap<>
We use boost::bimap for bi-directional conversion from protocol type
encodings to type objects.

Unfortunately, boost::bimap isn't C++20-ready.

Fortunately, we only used one direction of the bimap.

Replace with plain old std::unordered_map<>.
Message-Id: <20200512103726.134124-1-avi@scylladb.com>
2020-05-12 18:55:26 +03:00
Botond Dénes
74b020ad05 main: run redis service in the statement scheduling group
Like all the other API services (CQL, thrift and alternator).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200512145631.104051-1-bdenes@scylladb.com>
2020-05-12 18:01:27 +03:00
Piotr Dulikowski
0c5ac0da98 hinted handoff: remove discarded hint positions from rps_set
Related commit: 85d5c3d

When attempting to send a hint, an exception might occur that results in
that hint being discarded (e.g. keyspace or table of the hint was
removed).

When such an exception is thrown, position of the hint will already be
stored in rps_set. We are only allowed to retain positions of hints that
failed to be sent and needed to be retried later. Dropping a hint is not
an error, therefore its position should be removed from rps_set - but
current logic does not do that.

Because of that bug, hint files with many discardable hints might cause
rps_set to grow large when the file is replayed. Furthermore, leaving
positions of such hints in rps_set might cause more hints than necessary
to be re-sent if some non-discarded hints fail to be sent.

This commit fixes the problem by removing positions of discarded hints
from rps_set.

Fixes #6433
2020-05-12 15:13:59 +02:00
Avi Kivity
05e19078f6 storage_proxy: replace removed std::not1() by replacement std::not_fn()
C++17 deprecated std::not1() and C++20 removed it; replace with its
successor.
Message-Id: <20200512101205.127046-1-avi@scylladb.com>
2020-05-12 14:05:03 +03:00
Avi Kivity
e774ee06ed Update seastar submodule
* seastar e708d1df3a...92365e7b87 (11):
  > tests: distributed_test: convert to SEASTAR_TEST_CASE
  > Merge "Avoid undefined behavior on future self move assignments" from Rafael
  > Merge "C++20 support" from Avi
  > optimized_optional: don't use experimental C++ features
  > tests: scheduling_group_test: verify that later() doesn't modify the current group
  > tests: demos: coroutine_demo: add missing include for open_file_dma()
  > rpc: minor documentation improvements
  > rpc: Assert that sinks are closed
  > Merge "Fix most tests under valgrind" from Rafael
  > distributed_test: Fix it on slow machines
  > rpc_test: Make sure we always flush and close the sink

loading_shard_values.hh: added missing include for gcc6-concepts.hh,
exposed by the submodule update.

Frozen toolchain updated for the new valgrind dependency.
2020-05-12 14:04:16 +03:00
Botond Dénes
6083ed668b commitlog_replayer: ignore entries with invalid keys
When replaying the commitlog, pass keys to
`validation::validate_cql_key()`. Discard entries which fail validation
and warn about it in the logs.
This prevents invalid keys from getting into the system, possibly
failing the commitlog replay and the successful boot of the node,
preventing the node from recovering data.
2020-05-12 12:07:21 +03:00
Botond Dénes
e0f5ef5ef0 test: lib/sstable_utils: add make_keys_for_shard
A variant of make_keys() which creates keys for the requested shard. As
this version is more generic than the existing local_shards_only
variant, the former is reimplemented on top of the latter.
2020-05-12 12:07:21 +03:00
Botond Dénes
dd76e8c8de validation: add is_cql_key_invalid() 2020-05-12 12:07:00 +03:00
Botond Dénes
95bf3a75de validation: validate_cql_key(): make key parameter a partition_key_view
This is more general than the previous `const partition_key&` and allows
for passing keys obtained from the likes of `frozen_mutation` that only
have a view of the key.

While at it also change the schema parameter from schema_ptr to const
schema&. No need to pass a shared pointer.
2020-05-12 12:07:00 +03:00
Botond Dénes
84c47c4228 partition_key_view: add validate method
We want to be able to pass `partition_key_view` to
`validation::validate_cql_key()`. As the latter wants to call
`validate()` on the key, replicate `partition_key::validate()` in
`partition_key_view`.
2020-05-12 12:07:00 +03:00
Asias He
b744dba75a repair: Abort the queue in write_end_of_stream in case of error
In write_end_of_stream, it does:

1) Write write_partition_end
2) Write empty mutation_fragment_opt

If 1) fails, 2) will be skipped, the consumer of the queue will wait for
the empty mutation_fragment_opt forever.

Found this issue when injecting random exceptions between 1) and 2).

Refs #6272
Refs #6248
2020-05-12 10:50:52 +02:00
Avi Kivity
f1fde537a9 Merge 'Support Snapshot of multiple tables' from Amnon
This series adds support for taking a snapshot of multiple tables.

Fixes #6333

* amnonh-snapshot_keyspace_table:
  api/storage_service.cc: Snapshot, support multiple tables
  service/storage_service: Take snapshot of multiple tables
2020-05-12 11:34:09 +03:00
Piotr Jastrzebski
49b6010cb4 cdc: Use time window compaction strategy for CDC Log table
CDC Log is a time series with data TTLed by default to 24 hours so
it makes sense to use for it a time window compaction.

A window size is adjusted to the TTL configured for CDC Log so that
no more than 24 sstables will be created.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-12 07:53:40 +02:00
Glauber Costa
70a89ab4ab compaction: do not assume I/O priority class
We shouldn't assume the I/O priority class for compactions.  For
instance, if we are dealing with offstrategy compactions we may want to
use the maintenance group priority for them.

For now, all compactions are put in the compaction class.  rewrite
compactions (scrub, cleanup) could be maintenance, but we don't have
clear access to the database object at this time to derive the
equivalent CPU priority. This is planned to be changed in the future,
and when we do change it, we'll adjust.

Same goes for resharding: while we could at this point change it we'd
risking memory pressure since resharding is run online and sstables are
shared until resharding is done. When we move it to offline execution
we'll do it with maintenance priority.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200512002233.306538-3-glauber@scylladb.com>
2020-05-12 08:23:19 +03:00
Glauber Costa
4234538292 compaction: pass descriptor all the way down to compaction object.
To do that - and still avoid a copy - we need to add some fields
to the compaction object that are exclusive to regular_compaction.
Still, not only this simplifies the code, resharding and regular
compaction look more and more alike.

This is done now in preparation for another patch that will add
more information to the descriptor.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200512002233.306538-2-glauber@scylladb.com>
2020-05-12 08:23:19 +03:00
Piotr Sarna
5f2eadce09 alternator: wait for schema agreement after table creation
In order to be sure that all nodes acknowledged that a table was
created, the CreateTable request will now only return after
seeing that schema agreement was reached.
Rationale: alternator users check if the table was created by issuing
a DescribeTable request, and assume that the table was correctly
created if it returns nonempty results. However, our current
implementation of DescribeTable returns local results, which is
not enough to judge if all the other nodes acknowledge the new table.
CQL drivers are reported to always wait for schema agreement after
issuing DDL-changing requests, so there should be no harm in waiting
a little longer for alternator's CreateTable as well.

Fixes #6361
Tests: alternator(local)
2020-05-11 21:51:12 +03:00
Piotr Jastrzebski
0cd0775a27 cdc: Set CDC Log gc_grace_seconds to 0
Data in CDC Log is TTLed and we want to remove it as soon as it expires.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-11 17:59:52 +02:00
Piotr Sarna
517f2c0490 alternator: unify error messages for existing tables/keyspaces
Since alternator is based on Scylla, two "already exists" error types
can appear when trying to create a table - that a table itself exists,
or that its keyspace does. That's however an implementation detail,
since alternator does not have a notion of keyspaces at all.
This patch unifies the error message to simply mention that a table
already exists, and comes with a more robust test case.
If the keyspace already exists, table creation will still be attempted.

Fixes #6340
Tests: alternator(local, remote)
2020-05-11 18:30:02 +03:00
Gleb Natapov
d555fb60d7 lwt: add counters for background and foreground paxos operations
Paxos may leave an operation in a background after returning result to a
caller. Lest add a counter for background/foreground paxos handlers so
that it will be easier to detect memory related issues.

Message-Id: <20200510092942.GA24506@scylladb.com>
2020-05-11 14:37:00 +02:00
Avi Kivity
f4a703fc66 Merge "tools/scylla-types: add compound_type and validation support" from Botond
"
A good portion of the values that one would want to be examine with
scylla-tools will be partition or clustering keys. While examining them
was possible before too, especially for single component keys, it
required manually extracting the components from it, so they can be
individually examined.
This series adds support for working with keys directly, by adding
prefixable and full compound type support.

When passing --prefix-compound or --full-compound, multiple types can be
passed, which will form the compound type.

Example:

$ scylla_types --print --prefix-compound -t TimeUUIDType -t Int32Type 0010d00819896f6b11ea00000000001c571b000400000010
(d0081989-6f6b-11ea-0000-0000001c571b, 16)

Another feature added in this series is validation. For this,
`compound_type::validate()` had to be implemented first. We already use
this in our code, but currently has a no-op body.

Example:

$ scylla-types --validate --full-compound -t TimeUUIDType -t Int32Type 0010d00819896f6b11ea00000000001c571b0004000000
0010d00819896f6b11ea00000000001c571b0004000000:  INVALID - seastar::internal::backtraced<marshal_exception> (marshaling error: compound_type iterator - not enough bytes, expected 4, got 3 Backtrace:   0x1b2e30f
  0x85c9d5
  0x85cb07
  0x85cc7b
  0x85cd7c
  0x85d2d7
  0x844e03
  0x84241b
  0x84490b
  0x844ae5
  0x19c0362
  0x19c0741
  0x19c13d1
  0x19c4b44
  0x8aeb7a
  0x8aeca7
  0x19ebc90
  0x19fb8d5
  0x1a12b49
  0x19c4376
  0x19c47a6
  0x19c4900
  0x843373
  /lib64/libc.so.6+0x271a2
  0x84202d
)

Tests: unit(dev)
"

* 'tools-scylla-types-compound-support/v1' of https://github.com/denesb/scylla:
  tools/scylla_types: add validation action
  tools/scylla_types: add compound_type support
  tools/scylla_types: single source of truth for actions
  compound_type: implement validate()
  compound_type: fix const correctness
  tools: mv scylla_types scylla-types
2020-05-11 15:28:33 +03:00
Juan Ramon Martin
9d0198140b dist/docker: Add "--reserve-memory" command line option
Fixes #6311
2020-05-11 13:34:42 +03:00
Piotr Dulikowski
85d5c3d5ee hinted handoff: don't keep positions of old hints in rps_set
When sending hints from one file, rps_set field in send_one_file_ctx
keeps track of commitlog positions of hints that are being currently
sent, or have failed to be sent. At the end of the operation, if sending
of some hints failed, we will choose position of the earliest hint that
failed to be sent, and will retry sending that file later, starting from
that position. This position is stored in _last_not_complete_rp.

Usually, this set has a bounded size, because we impose a limit of at
most 128 hints being sent concurrently. Because we do not attempt to
send any more hints after a failure is detected, rps_set should not have
more than 128 elements at a time.

Due to a bug, commitlog positions of old hints (older than
gc_grace_seconds of the destination table) were inserted into rps_set
but not removed after checking their age. This could cause rps_set to
grow very large when replaying a file with old hints.

Moreover, if the file mixed expired and non-expired hints (which could
happen if it had hints to two tables with different gc_grace_seconds),
and sending of some non-expired hints failed, then positions of expired
hints could influence calculation _last_not_complete_rp, and more hints
than necessary would be resent on the next retry.

This simple patch removes commitlog position of a hint from rps_set when
it is detected to be too old.

Fixes #6422
2020-05-11 11:33:31 +02:00
Avi Kivity
76d21a0c22 Merge 'Make it possible to turn caching off per table and stop caching CDC Log' from Piotr J.
"
We inherited from Origin a `caching` table parameter. It's a map of named caching parameters. Before this PR two caching parameters were expected: `keys` and `rows_per_partition`. So far we have been ignoring them. This PR adds a new caching parameter called `enabled` which can be set to `true` or `false` and controls the usage of the cache for the table. By default, it's set to `true` which reflects Scylla behavior before this PR.

This new capability is used to disable caching for CDC Log table. It is desirable because CDC Log entries are not expected to be read often. They also put much more pressure on memory than entries in Base Table. This is caused by the fact that some writes to Base Table can override previous writes. Every write to CDC Log is unique and does not invalidate any previous entry.

Fixes #6098
Fixes #6146

Tests: unit(dev, release), manual
"

* haaawk-dont_cache_cdc:
  cdc: Don't cache CDC Log table
  table: invalidate disabled cache on memtable flush
  table: Add cache_enabled member function
  cf_prop_defs: persist caching_options in schema
  property_definitions: add get that returns variant
  feature: add PER_TABLE_CACHING feature
  caching_options: add enabled parameter
2020-05-10 15:39:42 +03:00
Avi Kivity
9d91ac345a dist: redhat: drop dependency on pystache
We use pystache to parametrize our scylla.spec, but pystache is not
present in Fedora 32. Fortunately rpm provides its own template mechanism,
and this patch switches to using it:

 - no longer install pystache
 - pass parameters via rpm "-D" options
 - use 0/1 for conditionals instead of true/false as per rpm conventions
 - sanitize the "product" variable to not contain dashes
 - change the .spec file to use rpm templating: %{...} and %if ... %endif
   instead of mustache templating
2020-05-10 14:42:31 +03:00
Avi Kivity
5b971397aa Revert "compaction_manager: allow early aborts through abort sources."
This reverts commit e8213fb5c3. It results
in an assertion failure in remove_index_file_test.

Fixes #6413.
2020-05-10 12:32:18 +03:00
Raphael S. Carvalho
88d2486fca sstables: Synchronize deletion of SSTables in resharding with other operations
Input SSTables of resharding is deleted at the coordinator shard, not at the
shards they belong to.
We're not acquiring deletion semaphore before removing those input SSTables
from the SSTable set, so it could happen that resharding deletes those
SSTables while another operation like snapshot, which acquires the semaphore,
find them deleted.

Let's acquire the deletion semaphore so that the input SSTables will only
be removed from the set, when we're certain that nobody is relying on their
existence anymore.
Now resharding will only delete input SStables after they're safely removed
from the SSTable set of all shards they belong to.

unit: test(dev).

Fixes #6328.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200507233636.92104-1-raphaelsc@scylladb.com>
2020-05-10 10:50:32 +03:00
Takuya ASADA
4d957eeda7 dist/redhat/python3: drop dependency on pystache
Same as dist/redhat, stop using mustache since pystache is no longer available
on Fedora 32.

see: https://github.com/scylladb/scylla/pull/6313
2020-05-09 23:35:33 +03:00
Botond Dénes
4e83cc4413 tools/scylla_types: add validation action
Allow validating values according to their declared type.
2020-05-07 16:35:23 +03:00
Botond Dénes
4662ad111c tools/scylla_types: add compound_type support
Allow examining partition and clustering keys, by adding support for
full and prefix compound types. The members of the compound type are
specified by passing several types with --type on the command line.
2020-05-07 16:35:21 +03:00
Botond Dénes
70331bad6f tools/scylla_types: single source of truth for actions
Currently the available actions are documented in several different
places:
* code implementing them
* description
* documentation for --action
* error message that validates value for --action

This is guaranteed to result in incorrect, possibly self-contradicting
documentation. Resolve by generating all documentation from the handler
registry, which now also contains the description of the action.
Also have a separate flag for each action, instead of --action=$ACTION.
2020-05-07 16:20:18 +03:00
Botond Dénes
84e38ae358 compound_type: implement validate()
Validate the number of present components, then validate each of them.
A unit test for both the prefix and full instances is also added.
2020-05-07 16:19:56 +03:00
Botond Dénes
3e400cf54e compound_type: fix const correctness
Make all methods that don't mutate members const.
2020-05-07 16:15:11 +03:00
Botond Dénes
7176660e12 tools: mv scylla_types scylla-types
Using hypen seems to be the standard among executables.
2020-05-07 15:14:59 +03:00
Piotr Jastrzebski
e3dd78b68f cdc: Don't cache CDC Log table
CDC writes are not expected to be read multiple times so it makes little sense
to cache them. Moreover, CDC Log puts much bigger pressure on memory usage than
Base Table because some updates to the Base Table override existing data while
related CDC Log updates are always a new entry in a memtable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:39:01 +02:00
Piotr Jastrzebski
38ede62a02 table: invalidate disabled cache on memtable flush
table::update_cache has two branches of its logic.
One when caching is enabled and the other when it's
disabled. This patch adds unconditional cache invalidation
to the second (disabled caching) branch.

This is done for two purposes. First and foremost, it gives
the guarantee that when we enable the cache later it will be in
the right state and will be ready for usage. This is because
any memtable flush that would logically invalidate the cache,
actually physically does that too now. An additional benefit of this
change is that disabled cache will be cleared during the next
memtable flush that will happen after turning the switch off.
Previously, the cache would also be emptied but it would take
more time before all its elements are removed by eviction.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:39:01 +02:00
Piotr Jastrzebski
1a43849cd2 table: Add cache_enabled member function
This function determines cache usage based both on table _config
and dynamic schema information.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:39:01 +02:00
Piotr Jastrzebski
546dbf1fcc cf_prop_defs: persist caching_options in schema
Previously 'WITH CACHING =' was ignored both in
CREATE TABLE and in ALTER TABLE statements.
Now it will be persisted in schema so that
it can be used later to control caching per table.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:38:37 +02:00
Piotr Jastrzebski
812dfd22bd property_definitions: add get that returns variant
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-06 18:38:04 +02:00
Amnon Heiman
ee7b40e31b api/storage_service.cc: Snapshot, support multiple tables
It is sometimes useful to take a snapshot of multiple tables inside a
keyspace.

This patch add support for multiple tables names when taking a snapshot.

The change consist of splitting the table (column family) name and use
the array of table instead of just one.

After this patch this will be supported:
curl -X POST 'http://localhost:10000/storage_service/snapshots?tag=snapshottag&kn=system&cf=range_xfers,large_partitions'

Fixes #6333

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-05-05 12:55:36 +03:00
Amnon Heiman
75e2a3b0e7 service/storage_service: Take snapshot of multiple tables
This patch change the table snapshot implementation to support multiple
tables.

The method for taking a snapshot using a single table was modified to
use the new implementation.

To support multiple tables, the method now takes a vector of tables and
it loops over it.

Relates to #6333

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-05-05 12:55:16 +03:00
Piotr Jastrzebski
0475dab359 feature: add PER_TABLE_CACHING feature
This feature will ensure that caching can be switched
off per table only after the whole cluster supports it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-05 08:14:49 +02:00
Piotr Jastrzebski
2d727114ed caching_options: add enabled parameter
Scylla inherits from Origin two caching parameters
(keys and rows_per_partition) that are ignored.

This patch adds a new parameter called "enabled"
which is true by default and controls whether cache
is used for a selected table or not.

If the parameter is missing in the map then it
has the default value of true. To minimize the impact
of this change, enabled == true is represented as an
absence of this parameter.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-05-05 08:14:49 +02:00
463 changed files with 18249 additions and 9355 deletions

9
.gitmodules vendored
View File

@@ -9,9 +9,12 @@
[submodule "libdeflate"]
path = libdeflate
url = ../libdeflate
[submodule "zstd"]
path = zstd
url = ../zstd
[submodule "abseil"]
path = abseil
url = ../abseil-cpp
[submodule "scylla-jmx"]
path = scylla-jmx
url = ../scylla-jmx
[submodule "scylla-tools"]
path = scylla-tools
url = ../scylla-tools-java

View File

@@ -134,15 +134,11 @@ add_executable(scylla
${SEASTAR_SOURCE_FILES}
${SCYLLA_SOURCE_FILES})
# Note that since CLion does not undestand GCC6 concepts, we always disable them (even if users configure otherwise).
# CLion seems to have trouble with `-U` (macro undefinition), so we do it this way instead.
list(REMOVE_ITEM SEASTAR_CFLAGS "-DHAVE_GCC6_CONCEPTS")
# If the Seastar pkg-config information is available, append to the default flags.
#
# For ease of browsing the source code, we always pretend that DPDK is enabled.
target_compile_options(scylla PUBLIC
-std=gnu++1z
-std=gnu++20
-DHAVE_DPDK
-DHAVE_HWLOC
"${SEASTAR_CFLAGS}")

View File

@@ -8,4 +8,4 @@ Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to re
# Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.

View File

@@ -18,23 +18,35 @@ $ git submodule update --init --recursive
### Dependencies
Scylla depends on the system package manager for its development dependencies.
Scylla is fairly fussy about its build environment, requiring a very recent
version of the C++20 compiler and numerous tools and libraries to build.
Running `./install-dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
Run `./install-dependencies.sh` (as root) to use your Linux distributions's
package manager to install the appropriate packages on your build machine.
However, this will only work on very recent distributions. For example,
currently Fedora users must upgrade to Fedora 32 otherwise the C++ compiler
will be too old, and not support the new C++20 standard that Scylla uses.
On Ubuntu and Debian based Linux distributions, some packages
required to build Scylla are missing in the official upstream:
Alternatively, to avoid having to upgrade your build machine or install
various packages on it, we provide another option - the **frozen toolchain**.
This is a script, `./tools/toolchain/dbuild`, that can execute build or run
commands inside a Docker image that contains exactly the right build tools and
libraries. The `dbuild` technique is useful for beginners, but is also the way
in which ScyllaDB produces official releases, so it is highly recommended.
- libthrift-dev and libthrift
- antlr3-c++-dev
To use `dbuild`, you simply prefix any build or run command with it. Building
and running Scylla becomes as easy as:
Try running ```sudo ./scripts/scylla_current_repo``` to add Scylla upstream,
and get the missing packages from it.
```bash
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla
$ ./tools/toolchain/dbuild ./build/release/scylla --developer-mode 1
```
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native
thread, and up to 3 GB per native thread while linking. GCC >= 8.1.1. is
thread, and up to 3 GB per native thread while linking. GCC >= 10 is
required.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.

View File

@@ -2,22 +2,24 @@
## Quick-start
To get the build going quickly, Scylla offers a [frozen toolchain](tools/toolchain/README.md)
which would build and run Scylla using a pre-configured Docker image.
Using the frozen toolchain will also isolate all of the installed
dependencies in a Docker container.
Assuming you have met the toolchain prerequisites (running Docker in user
mode), building and running is as easy as:
Scylla is fairly fussy about its build environment, requiring very recent
versions of the C++20 compiler and of many libraries to build. The document
[HACKING.md](HACKING.md) includes detailed information on building and
developing Scylla, but to get Scylla building quickly on (almost) any build
machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md).
This is a pre-configured Docker image which includes recent versions of all
the required compilers, libraries and build tools. Using the frozen toolchain
allows you to avoid changing anything in your build machine to meet Scylla's
requirements - you just need to meet the frozen toolchain's prerequisites
(mostly, Docker or Podman being available).
Building and running Scylla with the frozen toolchain is as easy as:
```bash
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla
$ ./tools/toolchain/dbuild ./build/release/scylla --developer-mode 1
```
Please see [HACKING.md](HACKING.md) for detailed information on building and developing Scylla.
**Note**: GCC >= 8.1.1 is required to compile Scylla.
## Running Scylla
@@ -67,15 +69,20 @@ The courses are free, self-paced and include hands-on examples. They cover a var
administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions,
multi-datacenters and how Scylla integrates with third-party applications.
## Building Fedora-based Docker image
## Building a CentOS-based Docker image
Build a Docker image with:
```
cd dist/docker
cd dist/docker/redhat
docker build -t <image-name> .
```
This build is based on executables downloaded from downloads.scylladb.com,
**not** on the executables built in this source directory. See further
instructions in dist/docker/redhat/README.md to build a Docker image from
your own executables.
Run the image with:
```


@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=4.1.11
VERSION=4.2.4
if test -f version
then

absl-flat_hash_map.cc Normal file

@@ -0,0 +1,26 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "absl-flat_hash_map.hh"
size_t sstring_hash::operator()(std::string_view v) const noexcept {
return absl::Hash<std::string_view>{}(v);
}

absl-flat_hash_map.hh Normal file

@@ -0,0 +1,47 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <absl/container/flat_hash_map.h>
#include <seastar/core/sstring.hh>
using namespace seastar;
struct sstring_hash {
using is_transparent = void;
size_t operator()(std::string_view v) const noexcept;
};
struct sstring_eq {
using is_transparent = void;
bool operator()(std::string_view a, std::string_view b) const noexcept {
return a == b;
}
};
template <typename K, typename V, typename... Ts>
struct flat_hash_map : public absl::flat_hash_map<K, V, Ts...> {
};
template <typename V>
struct flat_hash_map<sstring, V>
: public absl::flat_hash_map<sstring, V, sstring_hash, sstring_eq> {};


@@ -77,7 +77,7 @@ std::string base64_encode(bytes_view in) {
return ret;
}
bytes base64_decode(std::string_view in) {
static std::string base64_decode_string(std::string_view in) {
int i = 0;
int8_t chunk4[4]; // chunk of input, each byte converted to 0..63;
std::string ret;
@@ -104,8 +104,42 @@ bytes base64_decode(std::string_view in) {
if (i==3)
ret += ((chunk4[1] & 0xf) << 4) + ((chunk4[2] & 0x3c) >> 2);
}
return ret;
}
bytes base64_decode(std::string_view in) {
// FIXME: This copy is sad. The problem is we need to return "bytes",
// but "bytes" doesn't have an efficient append like std::string does.
// To fix this we need to use bytes' "uninitialized" feature.
std::string ret = base64_decode_string(in);
return bytes(ret.begin(), ret.end());
}
static size_t base64_padding_len(std::string_view str) {
size_t padding = 0;
padding += (!str.empty() && str.back() == '=');
padding += (str.size() > 1 && *(str.end() - 2) == '=');
return padding;
}
size_t base64_decoded_len(std::string_view str) {
return str.size() / 4 * 3 - base64_padding_len(str);
}
bool base64_begins_with(std::string_view base, std::string_view operand) {
if (base.size() < operand.size() || base.size() % 4 != 0 || operand.size() % 4 != 0) {
return false;
}
if (base64_padding_len(operand) == 0) {
return base.starts_with(operand);
}
const std::string_view unpadded_base_prefix = base.substr(0, operand.size() - 4);
const std::string_view unpadded_operand = operand.substr(0, operand.size() - 4);
if (unpadded_base_prefix != unpadded_operand) {
return false;
}
// Decode and compare last 4 bytes of base64-encoded strings
const std::string base_remainder = base64_decode_string(base.substr(operand.size() - 4, operand.size()));
const std::string operand_remainder = base64_decode_string(operand.substr(operand.size() - 4));
return base_remainder.starts_with(operand_remainder);
}


@@ -32,3 +32,7 @@ bytes base64_decode(std::string_view);
inline bytes base64_decode(const rjson::value& v) {
return base64_decode(std::string_view(v.GetString(), v.GetStringLength()));
}
size_t base64_decoded_len(std::string_view str);
bool base64_begins_with(std::string_view base, std::string_view operand);


@@ -34,7 +34,7 @@
#include <boost/algorithm/cxx11/any_of.hpp>
#include "utils/overloaded_functor.hh"
#include "expressions_eval.hh"
#include "expressions.hh"
namespace alternator {
@@ -67,49 +67,6 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_
return it->second;
}
static ::shared_ptr<cql3::restrictions::single_column_restriction::contains> make_map_element_restriction(const column_definition& cdef, std::string_view key, const rjson::value& value) {
bytes raw_key = utf8_type->from_string(sstring_view(key.data(), key.size()));
auto key_value = ::make_shared<cql3::constants::value>(cql3::raw_value::make_value(std::move(raw_key)));
bytes raw_value = serialize_item(value);
auto entry_value = ::make_shared<cql3::constants::value>(cql3::raw_value::make_value(std::move(raw_value)));
return make_shared<cql3::restrictions::single_column_restriction::contains>(cdef, std::move(key_value), std::move(entry_value));
}
static ::shared_ptr<cql3::restrictions::single_column_restriction::EQ> make_key_eq_restriction(const column_definition& cdef, const rjson::value& value) {
bytes raw_value = get_key_from_typed_value(value, cdef);
auto restriction_value = ::make_shared<cql3::constants::value>(cql3::raw_value::make_value(std::move(raw_value)));
return make_shared<cql3::restrictions::single_column_restriction::EQ>(cdef, std::move(restriction_value));
}
::shared_ptr<cql3::restrictions::statement_restrictions> get_filtering_restrictions(schema_ptr schema, const column_definition& attrs_col, const rjson::value& query_filter) {
clogger.trace("Getting filtering restrictions for: {}", rjson::print(query_filter));
auto filtering_restrictions = ::make_shared<cql3::restrictions::statement_restrictions>(schema, true);
for (auto it = query_filter.MemberBegin(); it != query_filter.MemberEnd(); ++it) {
std::string_view column_name(it->name.GetString(), it->name.GetStringLength());
const rjson::value& condition = it->value;
const rjson::value& comp_definition = rjson::get(condition, "ComparisonOperator");
const rjson::value& attr_list = rjson::get(condition, "AttributeValueList");
comparison_operator_type op = get_comparison_operator(comp_definition);
if (op != comparison_operator_type::EQ) {
throw api_error("ValidationException", "Filtering is currently implemented for EQ operator only");
}
if (attr_list.Size() != 1) {
throw api_error("ValidationException", format("EQ restriction needs exactly 1 attribute value: {}", rjson::print(attr_list)));
}
if (const column_definition* cdef = schema->get_column_definition(to_bytes(column_name.data()))) {
// Primary key restriction
filtering_restrictions->add_restriction(make_key_eq_restriction(*cdef, attr_list[0]), false, true);
} else {
// Regular column restriction
filtering_restrictions->add_restriction(make_map_element_restriction(attrs_col, column_name, attr_list[0]), false, true);
}
}
return filtering_restrictions;
}
namespace {
struct size_check {
@@ -202,36 +159,47 @@ static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException", format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error("ValidationException", format("BEGINS_WITH operator requires String or Binary in AttributeValue, got {}", it2->name));
}
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error("ValidationException", "begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error("ValidationException", format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error("ValidationException", "begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error("ValidationException", format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
}
if (bad) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
if (it2->name == "S") {
std::string_view val1(it1->value.GetString(), it1->value.GetStringLength());
std::string_view val2(it2->value.GetString(), it2->value.GetStringLength());
return val1.substr(0, val2.size()) == val2;
return rjson::to_string_view(it1->value).starts_with(rjson::to_string_view(it2->value));
} else /* it2->name == "B" */ {
// TODO (optimization): Check the begins_with condition directly on
// the base64-encoded string, without making a decoded copy.
bytes val1 = base64_decode(it1->value);
bytes val2 = base64_decode(it2->value);
return val1.substr(0, val2.size()) == val2;
return base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
@@ -246,11 +214,6 @@ bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2) {
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error("ValidationException",
format("CONTAINS operator requires a single AttributeValue of type String, Number, or Binary, "
"got {} instead", kv2.name));
}
if (kv1.name == "S" && kv2.name == "S") {
return rjson::to_string_view(kv1.value).find(rjson::to_string_view(kv2.value)) != std::string_view::npos;
} else if (kv1.name == "B" && kv2.name == "B") {
@@ -333,24 +296,38 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparison operators - lt, le, gt, ge, and between.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error("ValidationException", format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error("ValidationException", format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (bad) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -364,7 +341,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
}
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
return false;
}
@@ -395,57 +373,71 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
template <typename T>
bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
if (cmp_lt()(ub, lb)) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
if (bounds_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error("ValidationException", "between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
throw api_error(
"ValidationException",
if (bounds_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
}
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
if (v_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -492,19 +484,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -516,56 +508,87 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
case comparison_operator_type::CONTAINS:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_CONTAINS(got, (*attribute_value_list)[0]);
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
// Expected's "CONTAINS" has this artificial limitation.
// ConditionExpression's "contains()" does not...
const rjson::value& arg = (*attribute_value_list)[0];
const auto& argtype = (*arg.MemberBegin()).name;
if (argtype != "S" && argtype != "N" && argtype != "B") {
throw api_error("ValidationException",
format("CONTAINS operator requires a single AttributeValue of type String, Number, or Binary, "
"got {} instead", argtype));
}
return check_CONTAINS(got, arg);
}
case comparison_operator_type::NOT_CONTAINS:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_NOT_CONTAINS(got, (*attribute_value_list)[0]);
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
// Expected's "NOT_CONTAINS" has this artificial limitation.
// ConditionExpression's "contains()" does not...
const rjson::value& arg = (*attribute_value_list)[0];
const auto& argtype = (*arg.MemberBegin()).name;
if (argtype != "S" && argtype != "N" && argtype != "B") {
throw api_error("ValidationException",
format("CONTAINS operator requires a single AttributeValue of type String, Number, or Binary, "
"got {} instead", argtype));
}
return check_NOT_CONTAINS(got, arg);
}
}
throw std::logic_error(format("Internal error: corrupted operator enum: {}", int(op)));
}
}
conditional_operator_type get_conditional_operator(const rjson::value& req) {
const rjson::value* conditional_operator = rjson::find(req, "ConditionalOperator");
if (!conditional_operator) {
return conditional_operator_type::MISSING;
}
if (!conditional_operator->IsString()) {
throw api_error("ValidationException", "'ConditionalOperator' parameter, if given, must be a string");
}
auto s = rjson::to_string_view(*conditional_operator);
if (s == "AND") {
return conditional_operator_type::AND;
} else if (s == "OR") {
return conditional_operator_type::OR;
} else {
throw api_error("ValidationException",
format("'ConditionalOperator' parameter must be AND, OR or missing. Found {}.", s));
}
}
// Check if the existing values of the item (previous_item) match the
// conditions given by the Expected and ConditionalOperator parameters
// (if they exist) in the request (an UpdateItem, PutItem or DeleteItem).
// This function can throw an ValidationException API error if there
// are errors in the format of the condition itself.
bool verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value>& previous_item) {
bool verify_expected(const rjson::value& req, const rjson::value* previous_item) {
const rjson::value* expected = rjson::find(req, "Expected");
auto conditional_operator = get_conditional_operator(req);
if (conditional_operator != conditional_operator_type::MISSING &&
(!expected || (expected->IsObject() && expected->GetObject().ObjectEmpty()))) {
throw api_error("ValidationException", "'ConditionalOperator' parameter cannot be specified for missing or empty Expression");
}
if (!expected) {
return true;
}
if (!expected->IsObject()) {
throw api_error("ValidationException", "'Expected' parameter, if given, must be an object");
}
// ConditionalOperator can be "AND" for requiring all conditions, or
// "OR" for requiring one condition, and defaults to "AND" if missing.
const rjson::value* conditional_operator = rjson::find(req, "ConditionalOperator");
bool require_all = true;
if (conditional_operator) {
if (!conditional_operator->IsString()) {
throw api_error("ValidationException", "'ConditionalOperator' parameter, if given, must be a string");
}
std::string_view s(conditional_operator->GetString(), conditional_operator->GetStringLength());
if (s == "AND") {
// require_all is already true
} else if (s == "OR") {
require_all = false;
} else {
throw api_error("ValidationException", "'ConditionalOperator' parameter must be AND, OR or missing");
}
if (expected->GetObject().ObjectEmpty()) {
throw api_error("ValidationException", "'ConditionalOperator' parameter cannot be specified for empty Expression");
}
}
bool require_all = conditional_operator != conditional_operator_type::OR;
return verify_condition(*expected, require_all, previous_item);
}
for (auto it = expected->MemberBegin(); it != expected->MemberEnd(); ++it) {
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item) {
for (auto it = condition.MemberBegin(); it != condition.MemberEnd(); ++it) {
const rjson::value* got = nullptr;
if (previous_item && previous_item->IsObject() && previous_item->HasMember("Item")) {
got = rjson::find((*previous_item)["Item"], rjson::to_string_view(it->name));
if (previous_item) {
got = rjson::find(*previous_item, rjson::to_string_view(it->name));
}
bool success = verify_expected_one(it->value, got);
if (success && !require_all) {
@@ -581,12 +604,8 @@ bool verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value
return require_all;
}
bool calculate_primitive_condition(const parsed::primitive_condition& cond,
std::unordered_set<std::string>& used_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
const rjson::value& req,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item) {
static bool calculate_primitive_condition(const parsed::primitive_condition& cond,
const rjson::value* previous_item) {
std::vector<rjson::value> calculated_values;
calculated_values.reserve(cond._values.size());
for (const parsed::value& v : cond._values) {
@@ -594,9 +613,7 @@ bool calculate_primitive_condition(const parsed::primitive_condition& cond,
cond._op == parsed::primitive_condition::type::VALUE ?
calculate_value_caller::ConditionExpressionAlone :
calculate_value_caller::ConditionExpression,
rjson::find(req, "ExpressionAttributeValues"),
used_attribute_names, used_attribute_values,
req, schema, previous_item));
previous_item));
}
switch (cond._op) {
case parsed::primitive_condition::type::BETWEEN:
@@ -604,7 +621,8 @@ bool calculate_primitive_condition(const parsed::primitive_condition& cond,
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -635,13 +653,17 @@ bool calculate_primitive_condition(const parsed::primitive_condition& cond,
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));
@@ -652,23 +674,17 @@ bool calculate_primitive_condition(const parsed::primitive_condition& cond,
// conditions given by the given parsed ConditionExpression.
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,
std::unordered_set<std::string>& used_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
const rjson::value& req,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item) {
const rjson::value* previous_item) {
if (condition_expression.empty()) {
return true;
}
bool ret = std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) -> bool {
return calculate_primitive_condition(cond, used_attribute_values,
used_attribute_names, req, schema, previous_item);
return calculate_primitive_condition(cond, previous_item);
},
[&] (const parsed::condition_expression::condition_list& list) -> bool {
auto verify_condition = [&] (const parsed::condition_expression& e) {
return verify_condition_expression(e, used_attribute_values,
used_attribute_names, req, schema, previous_item);
return verify_condition_expression(e, previous_item);
};
switch (list.op) {
case '&':


@@ -33,6 +33,7 @@
#include "cql3/restrictions/statement_restrictions.hh"
#include "serialization.hh"
#include "expressions_types.hh"
namespace alternator {
@@ -42,8 +43,19 @@ enum class comparison_operator_type {
comparison_operator_type get_comparison_operator(const rjson::value& comparison_operator);
::shared_ptr<cql3::restrictions::statement_restrictions> get_filtering_restrictions(schema_ptr schema, const column_definition& attrs_col, const rjson::value& query_filter);
enum class conditional_operator_type {
AND, OR, MISSING
};
conditional_operator_type get_conditional_operator(const rjson::value& req);
bool verify_expected(const rjson::value& req, const std::unique_ptr<rjson::value>& previous_item);
bool verify_expected(const rjson::value& req, const rjson::value* previous_item);
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,
const rjson::value* previous_item);
}

File diff suppressed because it is too large


@@ -20,16 +20,24 @@
*/
#include "expressions.hh"
#include "serialization.hh"
#include "base64.hh"
#include "conditions.hh"
#include "alternator/expressionsLexer.hpp"
#include "alternator/expressionsParser.hpp"
#include "utils/overloaded_functor.hh"
#include "error.hh"
#include <seastarx.hh>
#include "seastarx.hh"
#include <seastar/core/print.hh>
#include <seastar/util/log.hh>
#include <boost/algorithm/cxx11/any_of.hpp>
#include <boost/algorithm/cxx11/all_of.hpp>
#include <functional>
#include <unordered_map>
namespace alternator {
@@ -122,6 +130,555 @@ void condition_expression::append(condition_expression&& a, char op) {
}, _expression);
}
} // namespace parsed
// The following resolve_*() functions resolve references in parsed
// expressions of different types. Resolving a parsed expression means
// replacing:
// 1. In parsed::path objects, replace references like "#name" with the
// attribute name from ExpressionAttributeNames,
// 2. In parsed::constant objects, replace references like ":value" with
// the value from ExpressionAttributeValues.
// These functions also track which name and value references were used, to
// allow complaining if some remain unused.
// Note that the resolve_*() functions modify the expressions in-place,
// so if we ever intend to cache parsed expressions, we need to pass a copy
// into this function.
//
// Doing the "resolving" stage before the evaluation stage has two benefits.
// First, it allows us to be compatible with DynamoDB in catching unused
// names and values (see issue #6572). Second, in the FilterExpression case,
// we need to resolve the expression just once but then use it many times
// (once for each item to be filtered).
static void resolve_path(parsed::path& p,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
const std::string& column_name = p.root();
if (column_name.size() > 0 && column_name.front() == '#') {
if (!expression_attribute_names) {
throw api_error("ValidationException",
format("ExpressionAttributeNames missing, entry '{}' required by expression", column_name));
}
const rjson::value* value = rjson::find(*expression_attribute_names, column_name);
if (!value || !value->IsString()) {
throw api_error("ValidationException",
format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
p.set_root(std::string(rjson::to_string_view(*value)));
}
}
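The "#name" resolution rule above can be sketched in isolation. This is a minimal, hypothetical version (plain std containers instead of the rjson encoding, made-up function name) of what resolve_path() does for its root component:

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Sketch: a root beginning with '#' must have a matching entry in
// ExpressionAttributeNames; each use is recorded in 'used' so that
// unused entries can be reported later.
std::string resolve_name(const std::string& root,
                         const std::map<std::string, std::string>& names,
                         std::set<std::string>& used) {
    if (!root.empty() && root.front() == '#') {
        auto it = names.find(root);
        if (it == names.end()) {
            throw std::runtime_error("ExpressionAttributeNames missing entry " + root);
        }
        used.insert(root);
        return it->second;   // replace the reference with the real attribute name
    }
    return root;             // not a reference; used verbatim
}
```

The ":valref" handling in resolve_constant() follows the same pattern, except the replacement is a JSON value rather than a name.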
static void resolve_constant(parsed::constant& c,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_values) {
std::visit(overloaded_functor {
[&] (const std::string& valref) {
if (!expression_attribute_values) {
throw api_error("ValidationException",
format("ExpressionAttributeValues missing, entry '{}' required by expression", valref));
}
const rjson::value* value = rjson::find(*expression_attribute_values, valref);
if (!value) {
throw api_error("ValidationException",
format("ExpressionAttributeValues missing entry '{}' required by expression", valref));
}
if (value->IsNull()) {
throw api_error("ValidationException",
format("ExpressionAttributeValues null value for entry '{}' required by expression", valref));
}
validate_value(*value, "ExpressionAttributeValues");
used_attribute_values.emplace(valref);
c.set(*value);
},
[&] (const parsed::constant::literal& lit) {
// Nothing to do, already resolved
}
}, c._value);
}
void resolve_value(parsed::value& rhs,
const rjson::value* expression_attribute_names,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values) {
std::visit(overloaded_functor {
[&] (parsed::constant& c) {
resolve_constant(c, expression_attribute_values, used_attribute_values);
},
[&] (parsed::value::function_call& f) {
for (parsed::value& value : f._parameters) {
resolve_value(value, expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
}
},
[&] (parsed::path& p) {
resolve_path(p, expression_attribute_names, used_attribute_names);
}
}, rhs._value);
}
void resolve_set_rhs(parsed::set_rhs& rhs,
const rjson::value* expression_attribute_names,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values) {
resolve_value(rhs._v1, expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
if (rhs._op != 'v') {
resolve_value(rhs._v2, expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
}
}
void resolve_update_expression(parsed::update_expression& ue,
const rjson::value* expression_attribute_names,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values) {
for (parsed::update_expression::action& action : ue.actions()) {
resolve_path(action._path, expression_attribute_names, used_attribute_names);
std::visit(overloaded_functor {
[&] (parsed::update_expression::action::set& a) {
resolve_set_rhs(a._rhs, expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
},
[&] (parsed::update_expression::action::remove& a) {
// nothing to do
},
[&] (parsed::update_expression::action::add& a) {
resolve_constant(a._valref, expression_attribute_values, used_attribute_values);
},
[&] (parsed::update_expression::action::del& a) {
resolve_constant(a._valref, expression_attribute_values, used_attribute_values);
}
}, action._action);
}
}
static void resolve_primitive_condition(parsed::primitive_condition& pc,
const rjson::value* expression_attribute_names,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values) {
for (parsed::value& value : pc._values) {
resolve_value(value,
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
}
}
void resolve_condition_expression(parsed::condition_expression& ce,
const rjson::value* expression_attribute_names,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values) {
std::visit(overloaded_functor {
[&] (parsed::primitive_condition& cond) {
resolve_primitive_condition(cond,
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
},
[&] (parsed::condition_expression::condition_list& list) {
for (parsed::condition_expression& cond : list.conditions) {
resolve_condition_expression(cond,
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
}
}
}, ce._expression);
}
void resolve_projection_expression(std::vector<parsed::path>& pe,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
for (parsed::path& p : pe) {
resolve_path(p, expression_attribute_names, used_attribute_names);
}
}
// condition_expression_on() checks whether a condition_expression places any
// condition on the given attribute. It can be useful, for example, for
// checking whether the condition tries to restrict a key column.
static bool value_on(const parsed::value& v, std::string_view attribute) {
return std::visit(overloaded_functor {
[&] (const parsed::constant& c) {
return false;
},
[&] (const parsed::value::function_call& f) {
for (const parsed::value& value : f._parameters) {
if (value_on(value, attribute)) {
return true;
}
}
return false;
},
[&] (const parsed::path& p) {
return p.root() == attribute;
}
}, v._value);
}
static bool primitive_condition_on(const parsed::primitive_condition& pc, std::string_view attribute) {
for (const parsed::value& value : pc._values) {
if (value_on(value, attribute)) {
return true;
}
}
return false;
}
bool condition_expression_on(const parsed::condition_expression& ce, std::string_view attribute) {
return std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) {
return primitive_condition_on(cond, attribute);
},
[&] (const parsed::condition_expression::condition_list& list) {
for (const parsed::condition_expression& cond : list.conditions) {
if (condition_expression_on(cond, attribute)) {
return true;
}
}
return false;
}
}, ce._expression);
}
// for_condition_expression_on() runs a given function over all the attributes
// mentioned in the expression. If the same attribute is mentioned more than
// once, the function will be called more than once for the same attribute.
static void for_value_on(const parsed::value& v, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::constant& c) { },
[&] (const parsed::value::function_call& f) {
for (const parsed::value& value : f._parameters) {
for_value_on(value, func);
}
},
[&] (const parsed::path& p) {
func(p.root());
}
}, v._value);
}
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) {
for (const parsed::value& value : cond._values) {
for_value_on(value, func);
}
},
[&] (const parsed::condition_expression::condition_list& list) {
for (const parsed::condition_expression& cond : list.conditions) {
for_condition_expression_on(cond, func);
}
}
}, ce._expression);
}
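The recursive shape shared by condition_expression_on() and for_condition_expression_on() can be sketched with a self-contained variant type (hypothetical, not the real parsed:: classes): a condition is either a leaf naming an attribute or a list of sub-conditions, and std::visit with an overloaded lambda set dispatches on the alternative, as overloaded_functor does in the code above:

```cpp
#include <string>
#include <variant>
#include <vector>

// The usual "overloaded" helper for std::visit with several lambdas.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// A condition is a leaf attribute name, or a list of sub-conditions
// (std::vector may hold an incomplete type since C++17).
struct cond {
    std::variant<std::string, std::vector<cond>> e;
};

// True if the condition mentions 'attr' anywhere in the tree.
bool condition_on(const cond& c, const std::string& attr) {
    return std::visit(overloaded {
        [&] (const std::string& leaf) { return leaf == attr; },
        [&] (const std::vector<cond>& list) {
            for (const cond& sub : list) {
                if (condition_on(sub, attr)) {
                    return true;
                }
            }
            return false;
        }
    }, c.e);
}
```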
// The following calculate_value() functions calculate, or evaluate, a parsed
// expression. The parsed expression is assumed to have been "resolved", with
// the matching resolve_* function.
// Take two JSON-encoded list values (remember that a list value is
// {"L": [...the actual list]}) and return the concatenation, again as
// a list value.
static rjson::value list_concatenate(const rjson::value& v1, const rjson::value& v2) {
const rjson::value* list1 = unwrap_list(v1);
const rjson::value* list2 = unwrap_list(v2);
if (!list1 || !list2) {
throw api_error("ValidationException", "UpdateExpression: list_append() given a non-list");
}
rjson::value cat = rjson::copy(*list1);
for (const auto& a : list2->GetArray()) {
rjson::push_back(cat, rjson::copy(a));
}
rjson::value ret = rjson::empty_object();
rjson::set(ret, "L", std::move(cat));
return ret;
}
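The list_append() rule above, sketched with std::vector instead of the {"L": [...]} JSON encoding (hypothetical helper name): the result is a copy of the first list followed by a copy of every element of the second.

```cpp
#include <string>
#include <vector>

std::vector<std::string> list_concat(const std::vector<std::string>& l1,
                                     const std::vector<std::string>& l2) {
    std::vector<std::string> cat = l1;            // like rjson::copy(*list1)
    cat.insert(cat.end(), l2.begin(), l2.end());  // push_back each element of list2
    return cat;
}
```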
// calculate_size() is ConditionExpression's size() function, i.e., it takes
// a JSON-encoded value and returns its "size" as defined differently for the
// different types - also as a JSON-encoded number.
// It returns a JSON-encoded "null" value if this value's type has no size
// defined. Comparisons against this non-numeric value will later fail.
static rjson::value calculate_size(const rjson::value& v) {
// NOTE: If v is improperly formatted for our JSON value encoding, it
// must come from the request itself, not from the database, so it makes
// sense to throw a ValidationException if we see such a problem.
if (!v.IsObject() || v.MemberCount() != 1) {
throw api_error("ValidationException", format("invalid object: {}", v));
}
auto it = v.MemberBegin();
int ret;
if (it->name == "S") {
if (!it->value.IsString()) {
throw api_error("ValidationException", format("invalid string: {}", v));
}
ret = it->value.GetStringLength();
} else if (it->name == "NS" || it->name == "SS" || it->name == "BS" || it->name == "L") {
if (!it->value.IsArray()) {
throw api_error("ValidationException", format("invalid set: {}", v));
}
ret = it->value.Size();
} else if (it->name == "M") {
if (!it->value.IsObject()) {
throw api_error("ValidationException", format("invalid map: {}", v));
}
ret = it->value.MemberCount();
} else if (it->name == "B") {
if (!it->value.IsString()) {
throw api_error("ValidationException", format("invalid byte string: {}", v));
}
ret = base64_decoded_len(rjson::to_string_view(it->value));
} else {
rjson::value json_ret = rjson::empty_object();
rjson::set(json_ret, "null", rjson::value(true));
return json_ret;
}
rjson::value json_ret = rjson::empty_object();
rjson::set(json_ret, "N", rjson::from_string(std::to_string(ret)));
return json_ret;
}
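The per-type rules of calculate_size() can be sketched on plain std types (a hypothetical encoding, not the real rjson one): strings report their length, lists and sets their element count, maps their member count, and a type with no defined size yields nullopt, mirroring the "null" result above that makes later comparisons fail.

```cpp
#include <map>
#include <optional>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

using val = std::variant<std::string,                 // "S" / "B"
                         std::vector<int>,            // "L" / "NS" / "SS" / "BS"
                         std::map<std::string, int>,  // "M"
                         bool>;                       // "BOOL": no size defined

std::optional<int> size_of(const val& v) {
    return std::visit([] (const auto& x) -> std::optional<int> {
        if constexpr (std::is_same_v<std::decay_t<decltype(x)>, bool>) {
            return std::nullopt;                      // no size for this type
        } else {
            return static_cast<int>(x.size());
        }
    }, v);
}
```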
static const rjson::value& calculate_value(const parsed::constant& c) {
return std::visit(overloaded_functor {
[&] (const parsed::constant::literal& v) -> const rjson::value& {
return *v;
},
[&] (const std::string& valref) -> const rjson::value& {
// Shouldn't happen, we should have called resolve_value() earlier
// and replaced the value reference by the literal constant.
throw std::logic_error("calculate_value() called before resolve_value()");
}
}, c._value);
}
static rjson::value to_bool_json(bool b) {
rjson::value json_ret = rjson::empty_object();
rjson::set(json_ret, "BOOL", rjson::value(b));
return json_ret;
}
static bool known_type(std::string_view type) {
static thread_local const std::unordered_set<std::string_view> types = {
"N", "S", "B", "NS", "SS", "BS", "L", "M", "NULL", "BOOL"
};
return types.contains(type);
}
using function_handler_type = rjson::value(calculate_value_caller, const rjson::value*, const parsed::value::function_call&);
static const
std::unordered_map<std::string_view, function_handler_type*> function_handlers {
{"list_append", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::UpdateExpression) {
throw api_error("ValidationException",
format("{}: list_append() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
format("{}: list_append() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
return list_concatenate(v1, v2);
}
},
{"if_not_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::UpdateExpression) {
throw api_error("ValidationException",
format("{}: if_not_exists() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
format("{}: if_not_exists() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
if (!std::holds_alternative<parsed::path>(f._parameters[0]._value)) {
throw api_error("ValidationException",
format("{}: if_not_exists() must include path as its first argument", caller));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
return v1.IsNull() ? std::move(v2) : std::move(v1);
}
},
{"size", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpression) {
throw api_error("ValidationException",
format("{}: size() not allowed here", caller));
}
if (f._parameters.size() != 1) {
throw api_error("ValidationException",
format("{}: size() accepts 1 parameter, got {}", caller, f._parameters.size()));
}
rjson::value v = calculate_value(f._parameters[0], caller, previous_item);
return calculate_size(v);
}
},
{"attribute_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
format("{}: attribute_exists() not allowed here", caller));
}
if (f._parameters.size() != 1) {
throw api_error("ValidationException",
format("{}: attribute_exists() accepts 1 parameter, got {}", caller, f._parameters.size()));
}
if (!std::holds_alternative<parsed::path>(f._parameters[0]._value)) {
throw api_error("ValidationException",
format("{}: attribute_exists()'s parameter must be a path", caller));
}
rjson::value v = calculate_value(f._parameters[0], caller, previous_item);
return to_bool_json(!v.IsNull());
}
},
{"attribute_not_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
format("{}: attribute_not_exists() not allowed here", caller));
}
if (f._parameters.size() != 1) {
throw api_error("ValidationException",
format("{}: attribute_not_exists() accepts 1 parameter, got {}", caller, f._parameters.size()));
}
if (!std::holds_alternative<parsed::path>(f._parameters[0]._value)) {
throw api_error("ValidationException",
format("{}: attribute_not_exists()'s parameter must be a path", caller));
}
rjson::value v = calculate_value(f._parameters[0], caller, previous_item);
return to_bool_json(v.IsNull());
}
},
{"attribute_type", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
format("{}: attribute_type() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
format("{}: attribute_type() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
// There is no real reason for the following check (not
// allowing the type to come from a document attribute), but
// DynamoDB does this check, so we do too...
if (!f._parameters[1].is_constant()) {
throw api_error("ValidationException",
format("{}: attribute_type()'s second parameter must be an expression attribute value", caller));
}
rjson::value v0 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v1 = calculate_value(f._parameters[1], caller, previous_item);
if (v1.IsObject() && v1.MemberCount() == 1 && v1.MemberBegin()->name == "S") {
// If the type parameter is not one of the legal types
// we should generate an error, not a failed condition:
if (!known_type(rjson::to_string_view(v1.MemberBegin()->value))) {
throw api_error("ValidationException",
format("{}: attribute_type()'s second parameter, {}, is not a known type",
caller, v1.MemberBegin()->value));
}
if (v0.IsObject() && v0.MemberCount() == 1) {
return to_bool_json(v1.MemberBegin()->value == v0.MemberBegin()->name);
} else {
return to_bool_json(false);
}
} else {
throw api_error("ValidationException",
format("{}: attribute_type() second parameter must refer to a string, got {}", caller, v1));
}
}
},
{"begins_with", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
format("{}: begins_with() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
format("{}: begins_with() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
format("{}: contains() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
format("{}: contains() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
return to_bool_json(check_CONTAINS(v1.IsNull() ? nullptr : &v1, v2));
}
},
};
// Given a parsed::value, which can refer either to a constant value from
// ExpressionAttributeValues, to the value of some attribute, or to a function
// of other values, this function calculates the resulting value.
// "caller" determines which expression - ConditionExpression or
// UpdateExpression - is asking for this value. We need to know this because
// DynamoDB allows a different choice of functions for different expressions.
rjson::value calculate_value(const parsed::value& v,
calculate_value_caller caller,
const rjson::value* previous_item) {
return std::visit(overloaded_functor {
[&] (const parsed::constant& c) -> rjson::value {
return rjson::copy(calculate_value(c));
},
[&] (const parsed::value::function_call& f) -> rjson::value {
auto function_it = function_handlers.find(std::string_view(f._function_name));
if (function_it == function_handlers.end()) {
throw api_error("ValidationException",
format("UpdateExpression: unknown function '{}' called.", f._function_name));
}
return function_it->second(caller, previous_item, f);
},
[&] (const parsed::path& p) -> rjson::value {
if (!previous_item) {
return rjson::null_value();
}
std::string update_path = p.root();
if (p.has_operators()) {
// FIXME: support this
throw api_error("ValidationException", "Reading attribute paths not yet implemented");
}
const rjson::value* previous_value = rjson::find(*previous_item, update_path);
return previous_value ? rjson::copy(*previous_value) : rjson::null_value();
}
}, v._value);
}
// Same as calculate_value() above, except takes a set_rhs, which may be
// either a single value, or v1+v2 or v1-v2.
rjson::value calculate_value(const parsed::set_rhs& rhs,
const rjson::value* previous_item) {
switch(rhs._op) {
case 'v':
return calculate_value(rhs._v1, calculate_value_caller::UpdateExpression, previous_item);
case '+': {
rjson::value v1 = calculate_value(rhs._v1, calculate_value_caller::UpdateExpression, previous_item);
rjson::value v2 = calculate_value(rhs._v2, calculate_value_caller::UpdateExpression, previous_item);
return number_add(v1, v2);
}
case '-': {
rjson::value v1 = calculate_value(rhs._v1, calculate_value_caller::UpdateExpression, previous_item);
rjson::value v2 = calculate_value(rhs._v2, calculate_value_caller::UpdateExpression, previous_item);
return number_subtract(v1, v2);
}
}
// Can't happen
return rjson::null_value();
}
} // namespace alternator

View File

@@ -24,8 +24,13 @@
#include <string>
#include <stdexcept>
#include <vector>
#include <unordered_set>
#include <string_view>
#include <seastar/util/noncopyable_function.hh>
#include "expressions_types.hh"
#include "rjson.hh"
namespace alternator {
@@ -38,4 +43,60 @@ parsed::update_expression parse_update_expression(std::string query);
std::vector<parsed::path> parse_projection_expression(std::string query);
parsed::condition_expression parse_condition_expression(std::string query);
void resolve_update_expression(parsed::update_expression& ue,
const rjson::value* expression_attribute_names,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values);
void resolve_projection_expression(std::vector<parsed::path>& pe,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names);
void resolve_condition_expression(parsed::condition_expression& ce,
const rjson::value* expression_attribute_names,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values);
void validate_value(const rjson::value& v, const char* caller);
bool condition_expression_on(const parsed::condition_expression& ce, std::string_view attribute);
// for_condition_expression_on() runs the given function on the attributes
// that the expression uses. It may run for the same attribute more than once
// if the same attribute is used more than once in the expression.
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func);
// calculate_value() behaves slightly different (especially, different
// functions supported) when used in different types of expressions, as
// enumerated in this enum:
enum class calculate_value_caller {
UpdateExpression, ConditionExpression, ConditionExpressionAlone
};
inline std::ostream& operator<<(std::ostream& out, calculate_value_caller caller) {
switch (caller) {
case calculate_value_caller::UpdateExpression:
out << "UpdateExpression";
break;
case calculate_value_caller::ConditionExpression:
out << "ConditionExpression";
break;
case calculate_value_caller::ConditionExpressionAlone:
out << "ConditionExpression";
break;
default:
out << "unknown type of expression";
break;
}
return out;
}
rjson::value calculate_value(const parsed::value& v,
calculate_value_caller caller,
const rjson::value* previous_item);
rjson::value calculate_value(const parsed::set_rhs& rhs,
const rjson::value* previous_item);
} /* namespace alternator */

View File

@@ -1,78 +0,0 @@
/*
* Copyright 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <string>
#include <unordered_set>
#include "rjson.hh"
#include "schema_fwd.hh"
#include "expressions_types.hh"
namespace alternator {
// calculate_value() behaves slightly different (especially, different
// functions supported) when used in different types of expressions, as
// enumerated in this enum:
enum class calculate_value_caller {
UpdateExpression, ConditionExpression, ConditionExpressionAlone
};
inline std::ostream& operator<<(std::ostream& out, calculate_value_caller caller) {
switch (caller) {
case calculate_value_caller::UpdateExpression:
out << "UpdateExpression";
break;
case calculate_value_caller::ConditionExpression:
out << "ConditionExpression";
break;
case calculate_value_caller::ConditionExpressionAlone:
out << "ConditionExpression";
break;
default:
out << "unknown type of expression";
break;
}
return out;
}
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
rjson::value calculate_value(const parsed::value& v,
calculate_value_caller caller,
const rjson::value* expression_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
std::unordered_set<std::string>& used_attribute_values,
const rjson::value& update_info,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,
std::unordered_set<std::string>& used_attribute_values,
std::unordered_set<std::string>& used_attribute_names,
const rjson::value& req,
schema_ptr schema,
const std::unique_ptr<rjson::value>& previous_item);
} /* namespace alternator */

View File

@@ -25,6 +25,10 @@
#include <string>
#include <variant>
#include <seastar/core/shared_ptr.hh>
#include "rjson.hh"
/*
* Parsed representation of expressions and their components.
*
@@ -63,10 +67,27 @@ public:
}
};
// When an expression is first parsed, all constants are references, like
// ":val1", into ExpressionAttributeValues. This uses std::string() variant.
// The resolve_value() function replaces these constants by the JSON item
// extracted from the ExpressionAttributeValues.
struct constant {
// We use lw_shared_ptr<rjson::value> just to make rjson::value copyable,
// to make this entire object copyable as ANTLR needs.
using literal = lw_shared_ptr<rjson::value>;
std::variant<std::string, literal> _value;
void set(const rjson::value& v) {
_value = make_lw_shared<rjson::value>(rjson::copy(v));
}
void set(std::string& s) {
_value = s;
}
};
// "value" is a value used in the right hand side of an assignment
// expression, "SET a = ...". It can be a reference to a value included in
// the request (":val"), a path to an attribute from the existing item
// (e.g., "a.b[3].c"), or a function of other such values.
// expression, "SET a = ...". It can be a constant (a reference to a value
// included in the request, e.g., ":val"), a path to an attribute from the
// existing item (e.g., "a.b[3].c"), or a function of other such values.
// Note that the real right-hand-side of an assignment is actually a bit
// more general - it allows either a value, or a value+value or value-value -
// see class set_rhs below.
@@ -75,9 +96,12 @@ struct value {
std::string _function_name;
std::vector<value> _parameters;
};
std::variant<std::string, path, function_call> _value;
std::variant<constant, path, function_call> _value;
void set_constant(constant c) {
_value = std::move(c);
}
void set_valref(std::string s) {
_value = std::move(s);
_value = constant { std::move(s) };
}
void set_path(path p) {
_value = std::move(p);
@@ -88,8 +112,8 @@ struct value {
void add_func_parameter(value v) {
std::get<function_call>(_value)._parameters.emplace_back(std::move(v));
}
bool is_valref() const {
return std::holds_alternative<std::string>(_value);
bool is_constant() const {
return std::holds_alternative<constant>(_value);
}
bool is_path() const {
return std::holds_alternative<path>(_value);
@@ -130,10 +154,10 @@ public:
struct remove {
};
struct add {
std::string _valref;
constant _valref;
};
struct del {
std::string _valref;
constant _valref;
};
std::variant<set, remove, add, del> _action;
@@ -147,11 +171,11 @@ public:
}
void assign_add(path p, std::string v) {
_path = std::move(p);
_action = add { std::move(v) };
_action = add { constant { std::move(v) } };
}
void assign_del(path p, std::string v) {
_path = std::move(p);
_action = del { std::move(v) };
_action = del { constant { std::move(v) } };
}
};
private:
@@ -169,6 +193,9 @@ public:
const std::vector<action>& actions() const {
return _actions;
}
std::vector<action>& actions() {
return _actions;
}
};
// A primitive_condition is a condition expression involving one condition,

View File

@@ -21,9 +21,9 @@
#pragma once
#include <seastarx.hh>
#include <service/storage_proxy.hh>
#include <service/storage_proxy.hh>
#include "seastarx.hh"
#include "service/storage_proxy.hh"
#include "service/storage_proxy.hh"
#include "rjson.hh"
#include "executor.hh"

View File

@@ -31,8 +31,8 @@ static logging::logger slogger("alternator-serialization");
namespace alternator {
type_info type_info_from_string(std::string type) {
static thread_local const std::unordered_map<std::string, type_info> type_infos = {
type_info type_info_from_string(std::string_view type) {
static thread_local const std::unordered_map<std::string_view, type_info> type_infos = {
{"S", {alternator_type::S, utf8_type}},
{"B", {alternator_type::B, bytes_type}},
{"BOOL", {alternator_type::BOOL, boolean_type}},
@@ -87,7 +87,7 @@ bytes serialize_item(const rjson::value& item) {
throw api_error("ValidationException", format("An item can contain only one attribute definition: {}", item));
}
auto it = item.MemberBegin();
type_info type_info = type_info_from_string(it->name.GetString()); // JSON keys are guaranteed to be strings
type_info type_info = type_info_from_string(rjson::to_string_view(it->name)); // JSON keys are guaranteed to be strings
if (type_info.atype == alternator_type::NOT_SUPPORTED_YET) {
slogger.trace("Non-optimal serialization of type {}", it->name.GetString());
@@ -186,6 +186,11 @@ bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column
format("Type mismatch: expected type {} for key column {}, got type {}",
type_to_string(column.type), column.name_as_text(), it->name.GetString()));
}
std::string_view value_view = rjson::to_string_view(it->value);
if (value_view.empty()) {
throw api_error("ValidationException",
format("The AttributeValue for a key attribute cannot contain an empty string value. Key: {}", column.name_as_text()));
}
if (column.type == bytes_type) {
return base64_decode(it->value);
} else {
@@ -270,4 +275,93 @@ const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value&
return std::make_pair(it_key, &(it->value));
}
const rjson::value* unwrap_list(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return nullptr;
}
auto it = v.MemberBegin();
if (it->name != std::string("L")) {
return nullptr;
}
return &(it->value);
}
// Take two JSON-encoded numeric values ({"N": "thenumber"}) and return the
// sum, again as a JSON-encoded number.
rjson::value number_add(const rjson::value& v1, const rjson::value& v2) {
auto n1 = unwrap_number(v1, "UpdateExpression");
auto n2 = unwrap_number(v2, "UpdateExpression");
rjson::value ret = rjson::empty_object();
std::string str_ret = std::string((n1 + n2).to_string());
rjson::set(ret, "N", rjson::from_string(str_ret));
return ret;
}
rjson::value number_subtract(const rjson::value& v1, const rjson::value& v2) {
auto n1 = unwrap_number(v1, "UpdateExpression");
auto n2 = unwrap_number(v2, "UpdateExpression");
rjson::value ret = rjson::empty_object();
std::string str_ret = std::string((n1 - n2).to_string());
rjson::set(ret, "N", rjson::from_string(str_ret));
return ret;
}
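`number_add` and `number_subtract` keep the values as decimal strings, round-tripping through `big_decimal`, so no floating-point precision is lost. A minimal standalone sketch of why string-based arithmetic matters here (restricted to non-negative integers for brevity; `add_decimal_strings` is an illustrative helper, not the real API, which handles arbitrary decimals via `big_decimal`):

```cpp
#include <algorithm>
#include <string>

// Add two non-negative integers given as decimal strings, returning the
// exact sum as a decimal string (no precision loss, unlike double).
std::string add_decimal_strings(std::string a, std::string b) {
    std::reverse(a.begin(), a.end());
    std::reverse(b.begin(), b.end());
    std::string out;
    int carry = 0;
    for (size_t i = 0; i < std::max(a.size(), b.size()) || carry; ++i) {
        int d = carry;
        if (i < a.size()) d += a[i] - '0';
        if (i < b.size()) d += b[i] - '0';
        out.push_back(char('0' + d % 10));
        carry = d / 10;
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```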
// Take two JSON-encoded set values (e.g. {"SS": [...the actual set]}) and
// return the sum of both sets, again as a set value.
rjson::value set_sum(const rjson::value& v1, const rjson::value& v2) {
auto [set1_type, set1] = unwrap_set(v1);
auto [set2_type, set2] = unwrap_set(v2);
if (set1_type != set2_type) {
throw api_error("ValidationException", format("Mismatched set types: {} and {}", set1_type, set2_type));
}
if (!set1 || !set2) {
throw api_error("ValidationException", "UpdateExpression: ADD operation for sets must be given sets as arguments");
}
rjson::value sum = rjson::copy(*set1);
std::set<rjson::value, rjson::single_value_comp> set1_raw;
for (auto it = sum.Begin(); it != sum.End(); ++it) {
set1_raw.insert(rjson::copy(*it));
}
for (const auto& a : set2->GetArray()) {
if (set1_raw.count(a) == 0) {
rjson::push_back(sum, rjson::copy(a));
}
}
rjson::value ret = rjson::empty_object();
rjson::set_with_string_name(ret, set1_type, std::move(sum));
return ret;
}
// Take two JSON-encoded set values (e.g. {"SS": [...the actual set]}) and
// return the difference s1 - s2, again as a set value.
// DynamoDB does not allow empty sets, so if the resulting set is empty,
// return an unset optional instead.
std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value& v2) {
auto [set1_type, set1] = unwrap_set(v1);
auto [set2_type, set2] = unwrap_set(v2);
if (set1_type != set2_type) {
throw api_error("ValidationException", format("Mismatched set types: {} and {}", set1_type, set2_type));
}
if (!set1 || !set2) {
throw api_error("ValidationException", "UpdateExpression: DELETE operation can only be performed on a set");
}
std::set<rjson::value, rjson::single_value_comp> set1_raw;
for (auto it = set1->Begin(); it != set1->End(); ++it) {
set1_raw.insert(rjson::copy(*it));
}
for (const auto& a : set2->GetArray()) {
set1_raw.erase(a);
}
if (set1_raw.empty()) {
return std::nullopt;
}
rjson::value ret = rjson::empty_object();
rjson::set_with_string_name(ret, set1_type, rjson::empty_array());
rjson::value& result_set = ret[set1_type];
for (const auto& a : set1_raw) {
rjson::push_back(result_set, rjson::copy(a));
}
return ret;
}
}
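The ADD/DELETE semantics implemented above can be sketched independently of rjson. A minimal model over `std::set<std::string>` (the `value_set` alias and these plain-set signatures are illustrative; the real functions operate on JSON-wrapped values and throw `api_error` on type mismatches):

```cpp
#include <optional>
#include <set>
#include <string>

using value_set = std::set<std::string>;

// Union of two sets (the ADD operation on a set attribute).
value_set set_sum(const value_set& a, const value_set& b) {
    value_set r = a;
    r.insert(b.begin(), b.end());
    return r;
}

// Difference a - b (the DELETE operation). DynamoDB forbids empty sets,
// so an empty result is reported as "remove the attribute" via nullopt.
std::optional<value_set> set_diff(const value_set& a, const value_set& b) {
    value_set r;
    for (const auto& e : a) {
        if (!b.count(e)) {
            r.insert(e);
        }
    }
    if (r.empty()) {
        return std::nullopt;
    }
    return r;
}
```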

View File

@@ -45,7 +45,7 @@ struct type_representation {
data_type dtype;
};
type_info type_info_from_string(std::string type);
type_info type_info_from_string(std::string_view type);
type_representation represent_type(alternator_type atype);
bytes serialize_item(const rjson::value& item);
@@ -69,4 +69,21 @@ big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);
// returned value is {"", nullptr}
const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value& v);
// Check if a given JSON object encodes a list (i.e., it is a {"L": [...]})
// and return a pointer to that list, or nullptr otherwise.
const rjson::value* unwrap_list(const rjson::value& v);
// Take two JSON-encoded numeric values ({"N": "thenumber"}) and return the
// sum, again as a JSON-encoded number.
rjson::value number_add(const rjson::value& v1, const rjson::value& v2);
rjson::value number_subtract(const rjson::value& v1, const rjson::value& v2);
// Take two JSON-encoded set values (e.g. {"SS": [...the actual set]}) and
// return the sum of both sets, again as a set value.
rjson::value set_sum(const rjson::value& v1, const rjson::value& v2);
// Take two JSON-encoded set values (e.g. {"SS": [...the actual set]}) and
// return the difference s1 - s2, again as a set value.
// DynamoDB does not allow empty sets, so if the resulting set is empty,
// return an unset optional instead.
std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value& v2);
}

View File

@@ -23,7 +23,7 @@
#include "log.hh"
#include <seastar/http/function_handlers.hh>
#include <seastar/json/json_elements.hh>
#include <seastarx.hh>
#include "seastarx.hh"
#include "error.hh"
#include "rjson.hh"
#include "auth.hh"

View File

@@ -26,8 +26,8 @@
#include <seastar/http/httpd.hh>
#include <seastar/net/tls.hh>
#include <optional>
#include <alternator/auth.hh>
#include <utils/small_vector.hh>
#include "alternator/auth.hh"
#include "utils/small_vector.hh"
#include <seastar/core/units.hh>
namespace alternator {

View File

@@ -511,6 +511,21 @@
}
]
},
{
"path":"/storage_service/cdc_streams_check_and_repair",
"operations":[
{
"method":"POST",
"summary":"Checks that CDC streams reflect current cluster topology and regenerates them if not.",
"type":"void",
"nickname":"cdc_streams_check_and_repair",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/snapshots",
"operations":[

View File

@@ -93,6 +93,22 @@ static future<> register_api(http_context& ctx, const sstring& api_name,
});
}
future<> set_transport_controller(http_context& ctx, cql_transport::controller& ctl) {
return ctx.http_server.set_routes([&ctx, &ctl] (routes& r) { set_transport_controller(ctx, r, ctl); });
}
future<> unset_transport_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_transport_controller(ctx, r); });
}
future<> set_rpc_controller(http_context& ctx, thrift_controller& ctl) {
return ctx.http_server.set_routes([&ctx, &ctl] (routes& r) { set_rpc_controller(ctx, r, ctl); });
}
future<> unset_rpc_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_rpc_controller(ctx, r); });
}
future<> set_server_storage_service(http_context& ctx) {
return register_api(ctx, "storage_service", "The storage service API", set_storage_service);
}

View File

@@ -25,6 +25,8 @@
namespace service { class load_meter; }
namespace locator { class token_metadata; }
namespace cql_transport { class controller; }
class thrift_controller;
namespace api {
@@ -48,6 +50,10 @@ future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx);
future<> set_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx);
future<> set_transport_controller(http_context& ctx, cql_transport::controller& ctl);
future<> unset_transport_controller(http_context& ctx);
future<> set_rpc_controller(http_context& ctx, thrift_controller& ctl);
future<> unset_rpc_controller(http_context& ctx);
future<> set_server_snapshot(http_context& ctx);
future<> set_server_gossip(http_context& ctx);
future<> set_server_load_sstable(http_context& ctx);

View File

@@ -650,7 +650,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -658,7 +658,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -666,7 +666,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -674,7 +674,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -682,7 +682,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
@@ -690,7 +690,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
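The fixes in these hunks address a classic `std::accumulate` pitfall: the binary op ignored the running total `s`, so the fold effectively returned only the last element's size. A minimal standalone sketch of the corrected pattern (`total_filter_size` is an illustrative helper over plain integers, standing in for the per-sstable `filter_size()` calls):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Sum per-sstable sizes. The binary op must fold the running total `s`
// into its result; returning just `sz` would discard everything summed
// so far and yield only the last element.
uint64_t total_filter_size(const std::vector<uint64_t>& sizes) {
    return std::accumulate(sizes.begin(), sizes.end(), uint64_t(0),
                           [](uint64_t s, uint64_t sz) { return s + sz; });
}
```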

View File

@@ -20,7 +20,7 @@
*/
#include "commitlog.hh"
#include <db/commitlog/commitlog.hh>
#include "db/commitlog/commitlog.hh"
#include "api/api-doc/commitlog.json.hh"
#include "database.hh"
#include <vector>

View File

@@ -21,7 +21,7 @@
#include "gossiper.hh"
#include "api/api-doc/gossiper.json.hh"
#include <gms/gossiper.hh>
#include "gms/gossiper.hh"
namespace api {
using namespace json;

View File

@@ -116,6 +116,23 @@ static future<json::json_return_type> sum_timed_rate_as_long(distributed<proxy>
});
}
utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimated_histogram& val) {
utils_json::estimated_histogram res;
for (size_t i = 0; i < val.size(); i++) {
res.buckets.push(val.get(i));
res.bucket_offsets.push(val.get_bucket_lower_limit(i));
}
return res;
}
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::time_estimated_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, f, utils::time_estimated_histogram_merge,
utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {
return make_ready_future<json::json_return_type>(time_to_json_histogram(val));
});
}
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::estimated_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, f, utils::estimated_histogram_merge,

View File

@@ -41,6 +41,8 @@
#include "sstables/sstables.hh"
#include "database.hh"
#include "db/extensions.hh"
#include "transport/controller.hh"
#include "thrift/controller.hh"
namespace api {
@@ -85,21 +87,66 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
};
}
future<> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
return ctx.db.invoke_on_all([keyspace, tables, enabled] (database& db) {
return parallel_for_each(tables, [&db, keyspace, enabled](const sstring& table) mutable {
column_family& cf = db.find_column_family(keyspace, table);
if (enabled) {
cf.enable_auto_compaction();
} else {
cf.disable_auto_compaction();
}
return make_ready_future<>();
return service::get_local_storage_service().set_tables_autocompaction(keyspace, tables, enabled).then([]{
return make_ready_future<json::json_return_type>(json_void());
});
}
void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl) {
ss::start_native_transport.set(r, [&ctl](std::unique_ptr<request> req) {
return ctl.start_server().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::stop_native_transport.set(r, [&ctl](std::unique_ptr<request> req) {
return ctl.stop_server().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::is_native_transport_running.set(r, [&ctl] (std::unique_ptr<request> req) {
return ctl.is_server_running().then([] (bool running) {
return make_ready_future<json::json_return_type>(running);
});
});
}
void unset_transport_controller(http_context& ctx, routes& r) {
ss::start_native_transport.unset(r);
ss::stop_native_transport.unset(r);
ss::is_native_transport_running.unset(r);
}
void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl) {
ss::stop_rpc_server.set(r, [&ctl](std::unique_ptr<request> req) {
return ctl.stop_server().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_rpc_server.set(r, [&ctl](std::unique_ptr<request> req) {
return ctl.start_server().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::is_rpc_server_running.set(r, [&ctl] (std::unique_ptr<request> req) {
return ctl.is_server_running().then([] (bool running) {
return make_ready_future<json::json_return_type>(running);
});
});
}
void unset_rpc_controller(http_context& ctx, routes& r) {
ss::stop_rpc_server.unset(r);
ss::start_rpc_server.unset(r);
ss::is_rpc_server_running.unset(r);
}
void set_storage_service(http_context& ctx, routes& r) {
@@ -232,6 +279,12 @@ void set_storage_service(http_context& ctx, routes& r) {
req.get_query_param("key")));
});
ss::cdc_streams_check_and_repair.set(r, [&ctx] (std::unique_ptr<request> req) {
return service::get_local_storage_service().check_and_repair_cdc_streams().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = split_cf(req->get_query_param("cf"));
@@ -496,42 +549,6 @@ void set_storage_service(http_context& ctx, routes& r) {
});
});
ss::stop_rpc_server.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().stop_rpc_server().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_rpc_server.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().start_rpc_server().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::is_rpc_server_running.set(r, [] (std::unique_ptr<request> req) {
return service::get_local_storage_service().is_rpc_server_running().then([] (bool running) {
return make_ready_future<json::json_return_type>(running);
});
});
ss::start_native_transport.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().start_native_transport().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::stop_native_transport.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().stop_native_transport().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::is_native_transport_running.set(r, [] (std::unique_ptr<request> req) {
return service::get_local_storage_service().is_native_transport_running().then([] (bool running) {
return make_ready_future<json::json_return_type>(running);
});
});
ss::join_ring.set(r, [](std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -718,17 +735,15 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = split_cf(req->get_query_param("cf"));
return set_tables_autocompaction(ctx, keyspace, tables, true).then([]{
return make_ready_future<json::json_return_type>(json_void());
});
return set_tables_autocompaction(ctx, keyspace, tables, true);
});
ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = split_cf(req->get_query_param("cf"));
return set_tables_autocompaction(ctx, keyspace, tables, false).then([]{
return make_ready_future<json::json_return_type>(json_void());
});
return set_tables_autocompaction(ctx, keyspace, tables, false);
});
ss::deliver_hints.set(r, [](std::unique_ptr<request> req) {
@@ -1005,12 +1020,12 @@ void set_snapshot(http_context& ctx, routes& r) {
ss::take_snapshot.set(r, [](std::unique_ptr<request> req) {
auto tag = req->get_query_param("tag");
auto column_family = req->get_query_param("cf");
auto column_families = split(req->get_query_param("cf"), ",");
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
auto resp = make_ready_future<>();
if (column_family.empty()) {
if (column_families.empty()) {
resp = service::get_local_storage_service().take_snapshot(tag, keynames);
} else {
if (keynames.empty()) {
@@ -1019,7 +1034,7 @@ void set_snapshot(http_context& ctx, routes& r) {
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
resp = service::get_local_storage_service().take_column_family_snapshot(keynames[0], column_family, tag);
resp = service::get_local_storage_service().take_column_family_snapshot(keynames[0], column_families, tag);
}
return resp.then([] {
return make_ready_future<json::json_return_type>(json_void());

View File

@@ -23,9 +23,16 @@
#include "api.hh"
namespace cql_transport { class controller; }
class thrift_controller;
namespace api {
void set_storage_service(http_context& ctx, routes& r);
void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl);
void unset_transport_controller(http_context& ctx, routes& r);
void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl);
void unset_rpc_controller(http_context& ctx, routes& r);
void set_snapshot(http_context& ctx, routes& r);
}

View File

@@ -29,7 +29,6 @@
#include <seastar/net//byteorder.hh>
#include <cstdint>
#include <iosfwd>
#include <seastar/util/gcc6-concepts.hh>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"

View File

@@ -178,7 +178,7 @@ future<> service::start(::service::migration_manager& mm) {
return create_keyspace_if_missing(mm);
}).then([this] {
return _role_manager->start().then([this] {
return when_all_succeed(_authorizer->start(), _authenticator->start());
return when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
});
}).then([this] {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
@@ -199,7 +199,7 @@ future<> service::stop() {
}
return make_ready_future<>();
}).then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop());
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
});
}
@@ -458,7 +458,9 @@ future<> drop_role(const service& ser, std::string_view name) {
return when_all_succeed(
a.revoke_all(name),
a.revoke_all(r)).handle_exception_type([](const unsupported_authorization_operation&) {
a.revoke_all(r))
.discard_result()
.handle_exception_type([](const unsupported_authorization_operation&) {
// Nothing.
});
}).then([&ser, name] {
@@ -471,7 +473,7 @@ future<> drop_role(const service& ser, std::string_view name) {
future<bool> has_role(const service& ser, std::string_view grantee, std::string_view name) {
return when_all_succeed(
validate_role_exists(ser, name),
ser.get_roles(grantee)).then([name](role_set all_roles) {
ser.get_roles(grantee)).then_unpack([name](role_set all_roles) {
return make_ready_future<bool>(all_roles.count(sstring(name)) != 0);
});
}

View File

@@ -161,7 +161,7 @@ future<> standard_role_manager::create_metadata_tables_if_missing() const {
meta::role_members_table::name,
_qp,
create_role_members_query,
_migration_manager));
_migration_manager)).discard_result();
}
future<> standard_role_manager::create_default_role_if_missing() const {
@@ -367,7 +367,7 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
{sstring(role_name)}).discard_result();
};
return when_all_succeed(revoke_from_members(), revoke_members_of()).then([delete_role = std::move(delete_role)] {
return when_all_succeed(revoke_from_members(), revoke_members_of()).then_unpack([delete_role = std::move(delete_role)] {
return delete_role();
});
});
@@ -416,7 +416,7 @@ standard_role_manager::modify_membership(
return make_ready_future<>();
};
return when_all_succeed(modify_roles(), modify_role_members());
return when_all_succeed(modify_roles(), modify_role_members()).discard_result();
}
future<>
@@ -445,7 +445,7 @@ standard_role_manager::grant(std::string_view grantee_name, std::string_view rol
});
};
return when_all_succeed(check_redundant(), check_cycle()).then([this, role_name, grantee_name] {
return when_all_succeed(check_redundant(), check_cycle()).then_unpack([this, role_name, grantee_name] {
return this->modify_membership(grantee_name, role_name, membership_change::add);
});
}

View File

@@ -39,7 +39,10 @@ class caching_options {
sstring _key_cache;
sstring _row_cache;
caching_options(sstring k, sstring r) : _key_cache(k), _row_cache(r) {
bool _enabled = true;
caching_options(sstring k, sstring r, bool enabled)
: _key_cache(k), _row_cache(r), _enabled(enabled)
{
if ((k != "ALL") && (k != "NONE")) {
throw exceptions::configuration_exception("Invalid key value: " + k);
}
@@ -59,36 +62,53 @@ class caching_options {
caching_options() : _key_cache(default_key), _row_cache(default_row) {}
public:
bool enabled() const {
return _enabled;
}
std::map<sstring, sstring> to_map() const {
return {{ "keys", _key_cache }, { "rows_per_partition", _row_cache }};
std::map<sstring, sstring> res = {{ "keys", _key_cache },
{ "rows_per_partition", _row_cache }};
if (!_enabled) {
res.insert({"enabled", "false"});
}
return res;
}
sstring to_sstring() const {
return json::to_json(to_map());
}
static caching_options get_disabled_caching_options() {
return caching_options("NONE", "NONE", false);
}
template<typename Map>
static caching_options from_map(const Map & map) {
sstring k = default_key;
sstring r = default_row;
bool e = true;
for (auto& p : map) {
if (p.first == "keys") {
k = p.second;
} else if (p.first == "rows_per_partition") {
r = p.second;
} else if (p.first == "enabled") {
e = p.second == "true";
} else {
throw exceptions::configuration_exception("Invalid caching option: " + p.first);
}
}
return caching_options(k, r);
return caching_options(k, r, e);
}
static caching_options from_sstring(const sstring& str) {
return from_map(json::to_map(str));
}
bool operator==(const caching_options& other) const {
return _key_cache == other._key_cache && _row_cache == other._row_cache;
return _key_cache == other._key_cache && _row_cache == other._row_cache
&& _enabled == other._enabled;
}
bool operator!=(const caching_options& other) const {
return !(*this == other);
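The `from_map` parsing with the new `enabled` flag can be sketched as follows (a simplified model: the `caching_opts` struct, its "ALL" defaults, and `parse_caching` are illustrative; the real class also validates the key/row values and round-trips through JSON):

```cpp
#include <map>
#include <stdexcept>
#include <string>

struct caching_opts {
    std::string keys = "ALL";          // assumed default for this sketch
    std::string rows = "ALL";          // assumed default for this sketch
    bool enabled = true;               // new flag: caching on by default
};

// Parse a caching-options map, leaving unset fields at their defaults;
// unknown keys are rejected, mirroring the configuration_exception path.
caching_opts parse_caching(const std::map<std::string, std::string>& m) {
    caching_opts o;
    for (const auto& [k, v] : m) {
        if (k == "keys") {
            o.keys = v;
        } else if (k == "rows_per_partition") {
            o.rows = v;
        } else if (k == "enabled") {
            o.enabled = (v == "true");
        } else {
            throw std::invalid_argument("Invalid caching option: " + k);
        }
    }
    return o;
}
```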

View File

@@ -190,12 +190,7 @@ public:
, _bootstrap_tokens(bootstrap_tokens)
, _token_metadata(token_metadata)
, _gossiper(gossiper)
{
if (_bootstrap_tokens.empty()) {
throw std::runtime_error(
"cdc: bootstrap tokens is empty in generate_topology_description");
}
}
{}
/*
* Generate a set of CDC stream identifiers such that for each shard
@@ -257,8 +252,6 @@ db_clock::time_point make_new_cdc_generation(
db::system_distributed_keyspace& sys_dist_ks,
std::chrono::milliseconds ring_delay,
bool for_testing) {
assert(!bootstrap_tokens.empty());
auto gen = topology_description_generator(cfg, bootstrap_tokens, tm, g).generate();
// Begin the race.

View File

@@ -51,6 +51,7 @@
#include "types/listlike_partial_deserializing_iterator.hh"
#include "tracing/trace_state.hh"
#include "stats.hh"
#include "compaction_strategy.hh"
namespace std {
@@ -173,6 +174,7 @@ public:
auto& db = _ctxt._proxy.get_db().local();
auto logname = log_name(schema.cf_name());
check_that_cdc_log_table_does_not_exist(db, schema, logname);
ensure_that_table_has_no_counter_columns(schema);
// in seastar thread
auto log_schema = create_log_schema(schema);
@@ -199,6 +201,7 @@ public:
}
if (is_cdc) {
check_for_attempt_to_create_nested_cdc_log(new_schema);
ensure_that_table_has_no_counter_columns(new_schema);
}
auto logname = log_name(old_schema.cf_name());
@@ -263,6 +266,13 @@ private:
schema.ks_name(), logname));
}
}
static void ensure_that_table_has_no_counter_columns(const schema& schema) {
if (schema.is_counter()) {
throw exceptions::invalid_request_exception(format("Cannot create CDC log for table {}.{}. Counter support not implemented",
schema.ks_name(), schema.cf_name()));
}
}
};
cdc::cdc_service::cdc_service(service::storage_proxy& proxy)
@@ -276,6 +286,7 @@ cdc::cdc_service::cdc_service(db_context ctxt)
}
future<> cdc::cdc_service::stop() {
_impl->_ctxt._proxy.set_cdc_service(nullptr);
return _impl->stop();
}
@@ -392,12 +403,37 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner("com.scylladb.dht.CDCPartitioner");
b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
b.set_comment(sprint("CDC log for %s.%s", s.ks_name(), s.cf_name()));
auto ttl_seconds = s.cdc_options().ttl();
if (ttl_seconds > 0) {
b.set_gc_grace_seconds(0);
auto ceil = [] (int dividend, int divisor) {
return dividend / divisor + (dividend % divisor == 0 ? 0 : 1);
};
auto seconds_to_minutes = [] (int seconds_value) {
using namespace std::chrono;
return std::chrono::ceil<minutes>(seconds(seconds_value)).count();
};
// Pick the minimum window size that creates at most 24 sstables over the TTL.
auto window_seconds = ceil(ttl_seconds, 24);
auto window_minutes = seconds_to_minutes(window_seconds);
b.set_compaction_strategy_options({
{"compaction_window_unit", "MINUTES"},
{"compaction_window_size", std::to_string(window_minutes)},
// A new SSTable will become fully expired every
// `window_seconds` seconds so we shouldn't check for expired
// sstables too often.
{"expired_sstable_check_frequency_seconds",
std::to_string(std::max(1, window_seconds / 2))},
});
}
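The TTL-to-window derivation in the block above can be checked with a standalone sketch (the `twcs_for_ttl` helper and its result struct are illustrative, not part of the schema builder API):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

struct twcs_params {
    std::int64_t window_minutes;       // compaction_window_size (MINUTES)
    int expired_check_seconds;         // expired_sstable_check_frequency_seconds
};

// Derive TWCS options from the CDC log TTL: the smallest window such
// that at most 24 windows cover the TTL, rounded up to whole minutes,
// with expired-sstable checks running about twice per window.
twcs_params twcs_for_ttl(int ttl_seconds) {
    auto ceil_div = [](int a, int b) { return a / b + (a % b ? 1 : 0); };
    int window_seconds = ceil_div(ttl_seconds, 24);
    auto window_minutes = std::chrono::ceil<std::chrono::minutes>(
        std::chrono::seconds(window_seconds)).count();
    return {static_cast<std::int64_t>(window_minutes),
            std::max(1, window_seconds / 2)};
}
```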
b.with_column(log_meta_column_name_bytes("stream_id"), bytes_type, column_kind::partition_key);
b.with_column(log_meta_column_name_bytes("time"), timeuuid_type, column_kind::clustering_key);
b.with_column(log_meta_column_name_bytes("batch_seq_no"), int32_type, column_kind::clustering_key);
b.with_column(log_meta_column_name_bytes("operation"), data_type_for<operation_native_type>());
b.with_column(log_meta_column_name_bytes("ttl"), long_type);
b.set_caching_options(caching_options::get_disabled_caching_options());
auto add_columns = [&] (const schema::const_iterator_range_type& columns, bool is_data_col = false) {
for (const auto& column : columns) {
auto type = column.type;
@@ -443,7 +479,7 @@ static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID>
if (uuid) {
b.set_uuid(*uuid);
}
return b.build();
}
@@ -521,6 +557,12 @@ api::timestamp_type find_timestamp(const schema& s, const mutation& m) {
[&] (collection_mutation_view_description mview) {
t = mview.tomb.timestamp;
if (t != api::missing_timestamp) {
// A collection tombstone with timestamp T can be created with:
// UPDATE ks.t USING TIMESTAMP T + 1 SET X = null WHERE ...
// where X is a non-atomic column.
// This is, among others, the reason why we show it in the CDC log
// with cdc$time using timestamp T + 1 instead of T.
t += 1;
return stop_iteration::yes;
}
@@ -716,17 +758,79 @@ private:
const column_definition& _op_col;
const column_definition& _ttl_col;
ttl_opt _cdc_ttl_opt;
/**
* #6070
* When mutation splitting was added, non-atomic column assignments were broken
* into two invocation of transform. This means the second (actual data assignment)
* does not know about the tombstone in first one -> postimage is created as if
* we were _adding_ to the collection, not replacing it.
* #6070, #6084
* Non-atomic column assignments which use a TTL are broken into two invocations
* of `transform`, such as in the following example:
* CREATE TABLE t (a int PRIMARY KEY, b map<int, int>) WITH cdc = {'enabled':true};
* UPDATE t USING TTL 5 SET b = {0:0} WHERE a = 0;
*
* The above UPDATE creates a tombstone and a (0, 0) cell; because tombstones don't have the notion
* of a TTL, we split the UPDATE into two separate changes (represented as two separate delta rows in the log,
* resulting in two invocations of `transform`): one change for the deletion with no TTL,
* and one change for adding cells with TTL = 5.
*
* In other words, we use the fact that
* UPDATE t USING TTL 5 SET b = {0:0} WHERE a = 0;
* is equivalent to
* BEGIN UNLOGGED BATCH
* UPDATE t SET b = null WHERE a = 0;
* UPDATE t USING TTL 5 SET b = b + {0:0} WHERE a = 0;
* APPLY BATCH;
* (the mutations are the same in both cases),
* and perform a separate `transform` call for each statement in the batch.
*
* An assignment also happens when an INSERT statement is used as follows:
* INSERT INTO t (a, b) VALUES (0, {0:0}) USING TTL 5;
*
* Not pretty, but to handle this we use the knowledge that we always get
* invoked in timestamp order -> tombstone first, then assign.
* So we simply keep track of non-atomic columns deleted across calls
* and filter out preimage data post this.
* This will be split into three separate changes (three invocations of `transform`):
* 1. One with TTL = 5 for the row marker (introduced by the INSERT), indicating that a row was inserted.
* 2. One without a TTL for the tombstone, indicating that the collection was cleared.
* 3. One with TTL = 5 for the addition of cell (0, 0), indicating that the collection
* was extended by a new key/value.
*
* Why do we need three changes and not two, like in the UPDATE case?
* The tombstone needs to be a separate change because it doesn't have a TTL,
* so only the row marker change could potentially be merged with the cell change (1 and 3 above).
* However, we cannot do that: the row marker change is of INSERT type (cdc$operation == cdc::operation::insert),
* but there is no way to create a statement that
* - has a row marker,
* - adds cells to a collection,
* - but *doesn't* add a tombstone for this collection.
* INSERT statements that modify collections *always* add tombstones.
*
* Merging the row marker with the cell addition would result in such an impossible statement.
*
* Instead, we observe that
* INSERT INTO t (a, b) VALUES (0, {0:0}) USING TTL 5;
* is equivalent to
* BEGIN UNLOGGED BATCH
* INSERT INTO t (a) VALUES (0) USING TTL 5;
* UPDATE t SET b = null WHERE a = 0;
* UPDATE t USING TTL 5 SET b = b + {0:0} WHERE a = 0;
* APPLY BATCH;
* and perform a separate `transform` call for each statement in the batch.
*
* Unfortunately, due to splitting, the cell addition call (b = b + {0:0}) does not know about the tombstone.
* If it were performed independently of the tombstone call, it would create a wrong post-image:
* the post-image would look as if the previous cells still existed.
* For example, suppose that b was equal to {1:1} before the above statement was performed.
* Then the final post-image for b for the above statement/batch would be {0:0, 1:1}, when instead it should be {0:0}.
*
* To handle this we use the fact that
* 1. changes without a TTL are treated as if TTL = 0,
* 2. `transform` is invoked in order of increasing TTLs,
* and we maintain state between `transform` invocations (`_non_atomic_column_deletes`).
*
* Thus, the tombstone call will happen *before* the cell addition call,
* so the cell addition call will know that there previously was a tombstone
* and create a correct post-image.
*
* Furthermore, `transform` calls for INSERT changes (i.e. with a row marker)
* happen before `transform` calls for UPDATE changes, so in the case of an INSERT
* which modifies a collection column as above, the row marker call will happen first;
* its post-image will still show {1:1} for the collection column. Good.
*/
std::unordered_set<const column_definition*> _non_atomic_column_deletes;
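The ordering argument in the comment above can be sketched as a small simplified model (hypothetical Python, not the actual cdc/log.cc API; change tuples and names are illustrative): an INSERT of {0:0} with TTL 5 into a collection previously holding {1:1} is split into three changes, INSERT changes are transformed before UPDATE changes, and UPDATE changes run in increasing-TTL order, so the tombstone (no TTL, treated as TTL 0) precedes the cell addition.

```python
# Hypothetical model of the transform ordering described above.
PRE_IMAGE = {1: 1}

changes = [
    ('UPDATE', 5, 'cell_addition', {0: 0}),
    ('INSERT', 5, 'row_marker', None),
    ('UPDATE', 0, 'collection_tombstone', None),  # no TTL -> treated as TTL 0
]

def transform_order(change):
    stmt, ttl, _, _ = change
    # INSERT changes are transformed before UPDATE changes;
    # among UPDATE changes, lower TTL comes first.
    return (0 if stmt == 'INSERT' else 1, ttl)

def post_images(changes, pre_image):
    current = dict(pre_image)
    images = []
    for stmt, ttl, kind, cells in sorted(changes, key=transform_order):
        if kind == 'collection_tombstone':
            current = {}            # plays the role of _non_atomic_column_deletes
        elif kind == 'cell_addition':
            current.update(cells)   # tombstone call already ran: no stale cells
        images.append((kind, dict(current)))
    return images

imgs = post_images(changes, PRE_IMAGE)
assert imgs[0] == ('row_marker', {1: 1})       # marker call still sees {1:1}
assert imgs[-1] == ('cell_addition', {0: 0})   # correct post-image, not {0:0, 1:1}
```

Because the tombstone call happens before the cell addition call, the last post-image comes out as {0:0} rather than wrongly resurrecting {1:1}.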
@@ -929,6 +1033,9 @@ public:
: value.value().first_fragment()
;
value_callback(key, val, live);
if (value.is_live_and_has_ttl()) {
ttl = value.ttl();
}
}
};
@@ -1382,7 +1489,7 @@ cdc::cdc_service::impl::augment_mutation_call(lowres_clock::time_point timeout,
tracing::trace(tr_state, "CDC: Preimage not enabled for the table, not querying current value of {}", m.decorated_key());
}
return f.then([trans = std::move(trans), &mutations, idx, tr_state = std::move(tr_state), &details] (lw_shared_ptr<cql3::untyped_result_set> rs) mutable {
return f.then([trans = std::move(trans), &mutations, idx, tr_state, &details] (lw_shared_ptr<cql3::untyped_result_set> rs) mutable {
auto& m = mutations[idx];
auto& s = m.schema();
details.had_preimage |= s->cdc_options().preimage();

View File

@@ -75,7 +75,7 @@ class metadata;
/// CDC service will listen for schema changes and iff CDC is enabled/changed
/// create/modify/delete corresponding log tables etc as part of the schema change.
///
class cdc_service {
class cdc_service final : public async_sharded_service<cdc::cdc_service> {
class impl;
std::unique_ptr<impl> _impl;
public:

View File

@@ -30,23 +30,16 @@ struct atomic_column_update {
atomic_cell cell;
};
// see the comment inside `clustered_row_insert` for motivation for separating
// nonatomic deletions from nonatomic updates
struct nonatomic_column_deletion {
column_id id;
tombstone t;
};
struct nonatomic_column_update {
column_id id;
tombstone t; // optional
utils::chunked_vector<std::pair<bytes, atomic_cell>> cells;
};
struct static_row_update {
gc_clock::duration ttl;
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
std::vector<nonatomic_column_update> nonatomic_updates;
std::vector<nonatomic_column_update> nonatomic_entries;
};
struct clustered_row_insert {
@@ -54,19 +47,14 @@ struct clustered_row_insert {
clustering_key key;
row_marker marker;
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
// INSERTs can't express updates of individual cells inside a non-atomic column
// (without deleting the entire field first), so there is no `nonatomic_updates` field
// overwriting a nonatomic column inside an INSERT will be split into two changes:
// one with a nonatomic deletion, and one with a nonatomic update
std::vector<nonatomic_column_update> nonatomic_entries;
};
struct clustered_row_update {
gc_clock::duration ttl;
clustering_key key;
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
std::vector<nonatomic_column_update> nonatomic_updates;
std::vector<nonatomic_column_update> nonatomic_entries;
};
struct clustered_row_deletion {
@@ -95,8 +83,7 @@ using set_of_changes = std::map<api::timestamp_type, batch>;
struct row_update {
std::vector<atomic_column_update> atomic_entries;
std::vector<nonatomic_column_deletion> nonatomic_deletions;
std::vector<nonatomic_column_update> nonatomic_updates;
std::vector<nonatomic_column_update> nonatomic_entries;
};
static
@@ -122,7 +109,7 @@ extract_row_updates(const row& r, column_kind ckind, const schema& schema) {
v.timestamp(),
v.is_live_and_has_ttl() ? v.ttl() : gc_clock::duration(0)
);
auto& updates = result[timestamp_and_ttl].nonatomic_updates;
auto& updates = result[timestamp_and_ttl].nonatomic_entries;
if (updates.empty() || updates.back().id != id) {
updates.push_back({id, {}});
}
@@ -130,8 +117,12 @@ extract_row_updates(const row& r, column_kind ckind, const schema& schema) {
}
if (desc.tomb) {
auto timestamp_and_ttl = std::pair(desc.tomb.timestamp, gc_clock::duration(0));
result[timestamp_and_ttl].nonatomic_deletions.push_back({id, desc.tomb});
auto timestamp_and_ttl = std::pair(desc.tomb.timestamp + 1, gc_clock::duration(0));
auto& updates = result[timestamp_and_ttl].nonatomic_entries;
if (updates.empty() || updates.back().id != id) {
updates.push_back({id, {}});
}
updates.back().t = std::move(desc.tomb);
}
});
});
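The `desc.tomb.timestamp + 1` adjustment above can be illustrated with a sketch (hypothetical Python, not the actual extraction code): a collection write at timestamp T implies a tombstone at T-1, so keying the tombstone at `(T-1) + 1 = T` with TTL 0 places it in the same timestamp batch as the cells, ahead of them in TTL order.

```python
from collections import defaultdict

def extract(cell_writes, tombstones):
    """Group changes by (timestamp, ttl); tombstones keyed at ts + 1 with TTL 0."""
    result = defaultdict(list)
    for ts, ttl, cell in cell_writes:
        result[(ts, ttl)].append(('cell', cell))
    for ts in tombstones:
        # An INSERT at timestamp T writes the collection tombstone at T - 1;
        # keying it at (T - 1) + 1 = T groups it with the cells written at T.
        result[(ts + 1, 0)].append(('tombstone', ts))
    return dict(result)

# INSERT INTO t (a, b) VALUES (0, {0:0}) USING TIMESTAMP 100 AND TTL 5
grouped = extract(cell_writes=[(100, 5, (0, 0))], tombstones=[99])
assert set(grouped) == {(100, 5), (100, 0)}
assert grouped[(100, 0)] == [('tombstone', 99)]
```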
@@ -148,8 +139,7 @@ set_of_changes extract_changes(const mutation& base_mutation, const schema& base
res[timestamp].static_updates.push_back({
ttl,
std::move(up.atomic_entries),
std::move(up.nonatomic_deletions),
std::move(up.nonatomic_updates)
std::move(up.nonatomic_entries)
});
}
@@ -173,6 +163,9 @@ set_of_changes extract_changes(const mutation& base_mutation, const schema& base
};
for (auto& [k, up]: cr_updates) {
// It is important that changes in the resulting `set_of_changes` are listed
// in increasing TTL order. The reason is explained in a comment in cdc/log.cc,
// search for "#6070".
auto [timestamp, ttl] = k;
if (is_insert(timestamp, ttl)) {
@@ -181,25 +174,70 @@ set_of_changes extract_changes(const mutation& base_mutation, const schema& base
cr.key(),
marker,
std::move(up.atomic_entries),
std::move(up.nonatomic_deletions)
{}
});
if (!up.nonatomic_updates.empty()) {
// nonatomic updates cannot be expressed with an INSERT.
res[timestamp].clustered_updates.push_back({
ttl,
cr.key(),
{},
{},
std::move(up.nonatomic_updates)
});
auto& cr_insert = res[timestamp].clustered_inserts.back();
bool clustered_update_exists = false;
for (auto& nonatomic_up: up.nonatomic_entries) {
// Updating a collection column with an INSERT statement implies inserting a tombstone.
//
// For example, suppose that we have:
// CREATE TABLE t (a int primary key, b map<int, int>);
// Then the following statement:
// INSERT INTO t (a, b) VALUES (0, {0:0}) USING TIMESTAMP T;
// creates a tombstone in column b with timestamp T-1.
// It also creates a cell (0, 0) with timestamp T.
//
// There is no way to create just the cell using an INSERT statement.
// This can only be done using an UPDATE, as follows:
// UPDATE t USING TIMESTAMP T SET b = b + {0:0} WHERE a = 0;
// note that this is different than
// UPDATE t USING TIMESTAMP T SET b = {0:0} WHERE a = 0;
// which also creates a tombstone with timestamp T-1.
//
// It follows that:
// - if `nonatomic_up` has a tombstone, it can be merged with our `cr_insert`,
// which represents an INSERT change.
// - but if `nonatomic_up` only has cells, we must create a separate UPDATE change
// for the cells alone.
if (nonatomic_up.t) {
cr_insert.nonatomic_entries.push_back(std::move(nonatomic_up));
} else {
if (!clustered_update_exists) {
res[timestamp].clustered_updates.push_back({
ttl,
cr.key(),
{},
{}
});
// Multiple iterations of this `for` loop (for different collection columns)
// might want to put their `nonatomic_up`s into an UPDATE change;
// we don't want to create a separate change for each of them, so we reuse one.
//
// Example:
// CREATE TABLE t (a int primary key, b map<int, int>, c map <int, int>) with cdc = {'enabled':true};
// insert into t (a, b, c) values (0, {1:1}, {2:2}) USING TTL 5;
//
// this should create 3 delta rows:
// 1. one for the row marker (indicating an INSERT), with TTL 5
// 2. one for the b and c tombstones, without TTL (cdc$ttl = null)
// 3. one for the b and c cells, with TTL 5
// This logic takes care that b cells and c cells are put into a single change (3. above).
clustered_update_exists = true;
}
auto& cr_update = res[timestamp].clustered_updates.back();
cr_update.nonatomic_entries.push_back(std::move(nonatomic_up));
}
}
} else {
res[timestamp].clustered_updates.push_back({
ttl,
cr.key(),
std::move(up.atomic_entries),
std::move(up.nonatomic_deletions),
std::move(up.nonatomic_updates)
std::move(up.nonatomic_entries)
});
}
}
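The partitioning performed by the loop above can be sketched as follows (an illustrative Python model, not the real C++ structures): tombstone-bearing entries stay in the INSERT change, while cells-only entries share one extra UPDATE change.

```python
def split_nonatomic_entries(entries):
    """Model of the split above: entries with a tombstone merge into the INSERT
    change; cells-only entries share a single extra UPDATE change."""
    insert_entries, update_entries = [], []
    for e in entries:
        (insert_entries if e.get('tombstone') else update_entries).append(e)
    changes = [('INSERT', insert_entries)]
    if update_entries:   # one shared UPDATE change, not one per column
        changes.append(('UPDATE', update_entries))
    return changes

entries = [
    {'column': 'b', 'tombstone': 99, 'cells': {}},   # deletion: stays in INSERT
    {'column': 'b', 'cells': {0: 0}},                # cells only: goes to UPDATE
    {'column': 'c', 'cells': {2: 2}},                # reuses the same UPDATE
]
changes = split_nonatomic_entries(entries)
assert len(changes) == 2
assert [e['column'] for e in changes[1][1]] == ['b', 'c']
```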
@@ -271,7 +309,7 @@ bool should_split(const mutation& base_mutation, const schema& base_schema) {
}
if (desc.tomb) {
if (check_or_set(desc.tomb.timestamp, gc_clock::duration(0))) {
if (check_or_set(desc.tomb.timestamp + 1, gc_clock::duration(0))) {
should_split = true;
return;
}
@@ -326,7 +364,7 @@ bool should_split(const mutation& base_mutation, const schema& base_schema) {
}
if (mview.tomb) {
if (check_or_set(mview.tomb.timestamp, gc_clock::duration(0))) {
if (check_or_set(mview.tomb.timestamp + 1, gc_clock::duration(0))) {
should_split = true;
return;
}
@@ -392,13 +430,9 @@ void for_each_change(const mutation& base_mutation, const schema_ptr& base_schem
auto& cdef = base_schema->column_at(column_kind::static_column, atomic_update.id);
m.set_static_cell(cdef, std::move(atomic_update.cell));
}
for (auto& nonatomic_delete : sr_update.nonatomic_deletions) {
auto& cdef = base_schema->column_at(column_kind::static_column, nonatomic_delete.id);
m.set_static_cell(cdef, collection_mutation_description{nonatomic_delete.t, {}}.serialize(*cdef.type));
}
for (auto& nonatomic_update : sr_update.nonatomic_updates) {
for (auto& nonatomic_update : sr_update.nonatomic_entries) {
auto& cdef = base_schema->column_at(column_kind::static_column, nonatomic_update.id);
m.set_static_cell(cdef, collection_mutation_description{{}, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
m.set_static_cell(cdef, collection_mutation_description{nonatomic_update.t, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
}
f(std::move(m), change_ts, tuuid, batch_no);
}
@@ -411,9 +445,9 @@ void for_each_change(const mutation& base_mutation, const schema_ptr& base_schem
auto& cdef = base_schema->column_at(column_kind::regular_column, atomic_update.id);
row.cells().apply(cdef, std::move(atomic_update.cell));
}
for (auto& nonatomic_delete : cr_insert.nonatomic_deletions) {
auto& cdef = base_schema->column_at(column_kind::regular_column, nonatomic_delete.id);
row.cells().apply(cdef, collection_mutation_description{nonatomic_delete.t, {}}.serialize(*cdef.type));
for (auto& nonatomic_update : cr_insert.nonatomic_entries) {
auto& cdef = base_schema->column_at(column_kind::regular_column, nonatomic_update.id);
row.cells().apply(cdef, collection_mutation_description{nonatomic_update.t, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
}
row.apply(cr_insert.marker);
@@ -428,13 +462,9 @@ void for_each_change(const mutation& base_mutation, const schema_ptr& base_schem
auto& cdef = base_schema->column_at(column_kind::regular_column, atomic_update.id);
row.apply(cdef, std::move(atomic_update.cell));
}
for (auto& nonatomic_delete : cr_update.nonatomic_deletions) {
auto& cdef = base_schema->column_at(column_kind::regular_column, nonatomic_delete.id);
row.apply(cdef, collection_mutation_description{nonatomic_delete.t, {}}.serialize(*cdef.type));
}
for (auto& nonatomic_update : cr_update.nonatomic_updates) {
for (auto& nonatomic_update : cr_update.nonatomic_entries) {
auto& cdef = base_schema->column_at(column_kind::regular_column, nonatomic_update.id);
row.apply(cdef, collection_mutation_description{{}, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
row.apply(cdef, collection_mutation_description{nonatomic_update.t, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
}
f(std::move(m), change_ts, tuuid, batch_no);

View File

@@ -122,26 +122,26 @@ public:
return {_empty_prefix, bound_kind::incl_end};
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
requires Range<R, clustering_key_prefix_view>
static bound_view from_range_start(const R<clustering_key_prefix>& range) {
return range.start()
? bound_view(range.start()->value(), range.start()->is_inclusive() ? bound_kind::incl_start : bound_kind::excl_start)
: bottom();
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )
requires Range<R, clustering_key_prefix>
static bound_view from_range_end(const R<clustering_key_prefix>& range) {
return range.end()
? bound_view(range.end()->value(), range.end()->is_inclusive() ? bound_kind::incl_end : bound_kind::excl_end)
: top();
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )
requires Range<R, clustering_key_prefix>
static std::pair<bound_view, bound_view> from_range(const R<clustering_key_prefix>& range) {
return {from_range_start(range), from_range_end(range)};
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
requires Range<R, clustering_key_prefix_view>
static std::optional<typename R<clustering_key_prefix_view>::bound> to_range_bound(const bound_view& bv) {
if (&bv._prefix.get() == &_empty_prefix) {
return {};

View File

@@ -61,7 +61,7 @@ bool collection_mutation_view::is_empty() const {
}
template <typename F>
GCC6_CONCEPT(requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>)
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_clock::time_point now, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
auto has_tomb = in.read_trivial<bool>();
@@ -108,7 +108,7 @@ bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone
}
template <typename F>
GCC6_CONCEPT(requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>)
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static api::timestamp_type last_update(const atomic_cell_value_view& data, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
api::timestamp_type max = api::missing_timestamp;
@@ -313,7 +313,7 @@ collection_mutation collection_mutation_view_description::serialize(const abstra
}
template <typename C>
GCC6_CONCEPT(requires std::is_base_of_v<abstract_type, std::remove_reference_t<C>>)
requires std::is_base_of_v<abstract_type, std::remove_reference_t<C>>
static collection_mutation_view_description
merge(collection_mutation_view_description a, collection_mutation_view_description b, C&& key_type) {
using element_type = std::pair<bytes_view, atomic_cell_view>;
@@ -375,7 +375,7 @@ collection_mutation merge(const abstract_type& type, collection_mutation_view a,
}
template <typename C>
GCC6_CONCEPT(requires std::is_base_of_v<abstract_type, std::remove_reference_t<C>>)
requires std::is_base_of_v<abstract_type, std::remove_reference_t<C>>
static collection_mutation_view_description
difference(collection_mutation_view_description a, collection_mutation_view_description b, C&& key_type)
{
@@ -421,7 +421,7 @@ collection_mutation difference(const abstract_type& type, collection_mutation_vi
}
template <typename F>
GCC6_CONCEPT(requires std::is_invocable_r_v<std::pair<bytes_view, atomic_cell_view>, F, collection_mutation_input_stream&>)
requires std::is_invocable_r_v<std::pair<bytes_view, atomic_cell_view>, F, collection_mutation_input_stream&>
static collection_mutation_view_description
deserialize_collection_mutation(collection_mutation_input_stream& in, F&& read_kv) {
collection_mutation_view_description ret;

View File

@@ -23,11 +23,13 @@
#include <seastar/core/future.hh>
#include <seastar/util/noncopyable_function.hh>
#include <seastar/core/file.hh>
#include "schema_fwd.hh"
#include "sstables/shared_sstable.hh"
#include "exceptions/exceptions.hh"
#include "sstables/compaction_backlog_manager.hh"
#include "compaction_strategy_type.hh"
class table;
using column_family = table;
@@ -37,15 +39,6 @@ struct mutation_source_metadata;
namespace sstables {
enum class compaction_strategy_type {
null,
major,
size_tiered,
leveled,
date_tiered,
time_window,
};
class compaction_strategy_impl;
class sstable;
class sstable_set;
@@ -70,8 +63,6 @@ public:
compaction_descriptor get_major_compaction_job(column_family& cf, std::vector<shared_sstable> candidates);
std::vector<resharding_descriptor> get_resharding_jobs(column_family& cf, std::vector<shared_sstable> candidates);
// Some strategies may look at the compacted and resulting sstables to
// get some useful information for subsequent compactions.
void notify_completion(const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added);
@@ -143,6 +134,20 @@ public:
// Returns whether or not interposer consumer is used by a given strategy.
bool use_interposer_consumer() const;
// Informs the caller (usually the compaction manager) about what it would take to bring this set of
// SSTables closer to being in-strategy. If this returns an empty compaction descriptor, it
// means that the sstable set is already in-strategy.
//
// The caller can specify one of two modes: strict or relaxed. In relaxed mode the tolerance for
// what is considered off-strategy is higher. It can be used, for instance, when the system
// is restarting and previous compactions were likely in-flight. In strict mode, we are less
// tolerant of invariant breakages.
//
// The caller should also pass the maximum number of SSTables that can be added into a single job.
compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, const ::io_priority_class& iop, reshape_mode mode);
};
// Creates a compaction_strategy object from one of the strategies available.

View File

@@ -0,0 +1,36 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
namespace sstables {
enum class compaction_strategy_type {
null,
major,
size_tiered,
leveled,
date_tiered,
time_window,
};
enum class reshape_mode { strict, relaxed };
}

View File

@@ -29,7 +29,6 @@
#include <boost/range/adaptor/transformed.hpp>
#include "utils/serialization.hh"
#include <seastar/util/backtrace.hh>
#include "unimplemented.hh"
enum class allow_prefixes { no, yes };
@@ -91,7 +90,7 @@ private:
return len;
}
public:
bytes serialize_single(bytes&& v) {
bytes serialize_single(bytes&& v) const {
return serialize_value({std::move(v)});
}
template<typename RangeOfSerializedComponents>
@@ -109,7 +108,7 @@ public:
static bytes serialize_value(std::initializer_list<T> values) {
return serialize_value(boost::make_iterator_range(values.begin(), values.end()));
}
bytes serialize_optionals(const std::vector<bytes_opt>& values) {
bytes serialize_optionals(const std::vector<bytes_opt>& values) const {
return serialize_value(values | boost::adaptors::transformed([] (const bytes_opt& bo) -> bytes_view {
if (!bo) {
throw std::logic_error("attempted to create key component from empty optional");
@@ -117,7 +116,7 @@ public:
return *bo;
}));
}
bytes serialize_value_deep(const std::vector<data_value>& values) {
bytes serialize_value_deep(const std::vector<data_value>& values) const {
// TODO: Optimize
std::vector<bytes> partial;
partial.reserve(values.size());
@@ -128,7 +127,7 @@ public:
}
return serialize_value(partial);
}
bytes decompose_value(const value_type& values) {
bytes decompose_value(const value_type& values) const {
return serialize_value(values);
}
class iterator : public std::iterator<std::input_iterator_tag, const bytes_view> {
@@ -180,7 +179,7 @@ public:
static boost::iterator_range<iterator> components(const bytes_view& v) {
return { begin(v), end(v) };
}
value_type deserialize_value(bytes_view v) {
value_type deserialize_value(bytes_view v) const {
std::vector<bytes> result;
result.reserve(_types.size());
std::transform(begin(v), end(v), std::back_inserter(result), [] (auto&& v) {
@@ -188,10 +187,10 @@ public:
});
return result;
}
bool less(bytes_view b1, bytes_view b2) {
bool less(bytes_view b1, bytes_view b2) const {
return compare(b1, b2) < 0;
}
size_t hash(bytes_view v) {
size_t hash(bytes_view v) const {
if (_byte_order_equal) {
return std::hash<bytes_view>()(v);
}
@@ -203,7 +202,7 @@ public:
}
return h;
}
int compare(bytes_view b1, bytes_view b2) {
int compare(bytes_view b1, bytes_view b2) const {
if (_byte_order_comparable) {
if (_is_reversed) {
return compare_unsigned(b2, b1);
@@ -224,11 +223,21 @@ public:
bool is_empty(bytes_view v) const {
return begin(v) == end(v);
}
void validate(bytes_view v) {
// FIXME: implement
warn(unimplemented::cause::VALIDATION);
void validate(bytes_view v) const {
std::vector<bytes_view> values(begin(v), end(v));
if (AllowPrefixes == allow_prefixes::no && values.size() < _types.size()) {
throw marshal_exception(fmt::format("compound::validate(): non-prefixable compound cannot be a prefix"));
}
if (values.size() > _types.size()) {
throw marshal_exception(fmt::format("compound::validate(): cannot have more values than types, have {} values but only {} types",
values.size(), _types.size()));
}
for (size_t i = 0; i != values.size(); ++i) {
//FIXME: is it safe to assume the internal serialization format?
_types[i]->validate(values[i], cql_serialization_format::internal());
}
}
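The rules implemented by the new `validate` above can be sketched in a few lines (hypothetical Python model; `types` here are plain callables standing in for the per-component type validators):

```python
def validate_compound(values, types, allow_prefixes):
    """Sketch of compound::validate: prefix rule, arity rule, per-component check."""
    if not allow_prefixes and len(values) < len(types):
        raise ValueError("non-prefixable compound cannot be a prefix")
    if len(values) > len(types):
        raise ValueError("cannot have more values than types: "
                         f"{len(values)} values but only {len(types)} types")
    for value, typ in zip(values, types):
        typ(value)   # per-component validation, e.g. int() for an int component

validate_compound(['1', '2'], [int, int], allow_prefixes=False)  # ok: full key
validate_compound(['1'], [int, int], allow_prefixes=True)        # ok: prefix allowed
try:
    validate_compound(['1'], [int, int], allow_prefixes=False)   # rejected
except ValueError as e:
    assert 'prefix' in str(e)
```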
bool equal(bytes_view v1, bytes_view v2) {
bool equal(bytes_view v1, bytes_view v2) const {
if (_byte_order_equal) {
return compare_unsigned(v1, v2) == 0;
}

View File

@@ -213,6 +213,8 @@ public:
, _is_compound(true)
{ }
explicit composite(const composite_view& v);
composite()
: _bytes()
, _is_compound(true)
@@ -503,6 +505,7 @@ public:
};
class composite_view final {
friend class composite;
bytes_view _bytes;
bool _is_compound;
public:
@@ -602,6 +605,11 @@ public:
}
};
inline
composite::composite(const composite_view& v)
: composite(bytes(v._bytes), v._is_compound)
{ }
inline
std::ostream& operator<<(std::ostream& os, const composite& v) {
return os << composite_view(v);

View File

@@ -152,41 +152,39 @@ struct uuid_type_impl final : public concrete_type<utils::UUID> {
template <typename Func> using visit_ret_type = std::invoke_result_t<Func, const ascii_type_impl&>;
GCC6_CONCEPT(
template <typename Func> concept bool CanHandleAllTypes = requires(Func f) {
{ f(*static_cast<const ascii_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const boolean_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const byte_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const bytes_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const counter_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const date_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const decimal_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const double_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const duration_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const empty_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const float_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const inet_addr_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const int32_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const list_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const long_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const map_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const reversed_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const set_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const short_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const simple_date_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const time_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const timestamp_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const timeuuid_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const tuple_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const user_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const utf8_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const uuid_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
{ f(*static_cast<const varint_type_impl*>(nullptr)) } -> visit_ret_type<Func>;
template <typename Func> concept CanHandleAllTypes = requires(Func f) {
{ f(*static_cast<const ascii_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const boolean_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const byte_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const bytes_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const counter_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const date_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const decimal_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const double_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const duration_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const empty_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const float_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const inet_addr_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const int32_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const list_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const long_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const map_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const reversed_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const set_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const short_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const simple_date_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const time_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const timestamp_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const timeuuid_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const tuple_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const user_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const utf8_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const uuid_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
{ f(*static_cast<const varint_type_impl*>(nullptr)) } -> std::same_as<visit_ret_type<Func>>;
};
)
template<typename Func>
GCC6_CONCEPT(requires CanHandleAllTypes<Func>)
requires CanHandleAllTypes<Func>
static inline visit_ret_type<Func> visit(const abstract_type& t, Func&& f) {
switch (t.get_kind()) {
case abstract_type::kind::ascii:

View File

@@ -32,6 +32,8 @@ import tempfile
import textwrap
from distutils.spawn import find_executable
curdir = os.getcwd()
tempfile.tempdir = "./build/tmp"
configure_args = str.join(' ', [shlex.quote(x) for x in sys.argv[1:]])
@@ -166,9 +168,27 @@ def maybe_static(flag, libs):
return libs
class Thrift(object):
def __init__(self, source, service):
class Source(object):
def __init__(self, source, hh_prefix, cc_prefix):
self.source = source
self.hh_prefix = hh_prefix
self.cc_prefix = cc_prefix
def headers(self, gen_dir):
return [x for x in self.generated(gen_dir) if x.endswith(self.hh_prefix)]
def sources(self, gen_dir):
return [x for x in self.generated(gen_dir) if x.endswith(self.cc_prefix)]
def objects(self, gen_dir):
return [x.replace(self.cc_prefix, '.o') for x in self.sources(gen_dir)]
def endswith(self, end):
return self.source.endswith(end)
class Thrift(Source):
def __init__(self, source, service):
Source.__init__(self, source, '.h', '.cpp')
self.service = service
def generated(self, gen_dir):
@@ -179,19 +199,6 @@ class Thrift(object):
for ext in ['.cpp', '.h']]
return [os.path.join(gen_dir, file) for file in files]
def headers(self, gen_dir):
return [x for x in self.generated(gen_dir) if x.endswith('.h')]
def sources(self, gen_dir):
return [x for x in self.generated(gen_dir) if x.endswith('.cpp')]
def objects(self, gen_dir):
return [x.replace('.cpp', '.o') for x in self.sources(gen_dir)]
def endswith(self, end):
return self.source.endswith(end)
def default_target_arch():
if platform.machine() in ['i386', 'i686', 'x86_64']:
return 'westmere' # support PCLMUL
@@ -201,9 +208,9 @@ def default_target_arch():
return ''
class Antlr3Grammar(object):
class Antlr3Grammar(Source):
def __init__(self, source):
self.source = source
Source.__init__(self, source, '.hpp', '.cpp')
def generated(self, gen_dir):
basename = os.path.splitext(self.source)[0]
@@ -211,18 +218,12 @@ class Antlr3Grammar(object):
for ext in ['Lexer.cpp', 'Lexer.hpp', 'Parser.cpp', 'Parser.hpp']]
return [os.path.join(gen_dir, file) for file in files]
def headers(self, gen_dir):
return [x for x in self.generated(gen_dir) if x.endswith('.hpp')]
def sources(self, gen_dir):
return [x for x in self.generated(gen_dir) if x.endswith('.cpp')]
def objects(self, gen_dir):
return [x.replace('.cpp', '.o') for x in self.sources(gen_dir)]
def endswith(self, end):
return self.source.endswith(end)
class Json2Code(Source):
def __init__(self, source):
Source.__init__(self, source, '.hh', '.cc')
def generated(self, gen_dir):
return [os.path.join(gen_dir, self.source + '.hh'), os.path.join(gen_dir, self.source + '.cc')]
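The lifted `Source` helpers can be sanity-checked in isolation (the class bodies follow the configure.py diff; the input path is made up for illustration):

```python
import os

class Source(object):
    # Common generated-file bookkeeping shared by Thrift/Antlr3Grammar/Json2Code.
    def __init__(self, source, hh_prefix, cc_prefix):
        self.source = source
        self.hh_prefix = hh_prefix
        self.cc_prefix = cc_prefix
    def headers(self, gen_dir):
        return [x for x in self.generated(gen_dir) if x.endswith(self.hh_prefix)]
    def sources(self, gen_dir):
        return [x for x in self.generated(gen_dir) if x.endswith(self.cc_prefix)]
    def objects(self, gen_dir):
        return [x.replace(self.cc_prefix, '.o') for x in self.sources(gen_dir)]
    def endswith(self, end):
        return self.source.endswith(end)

class Json2Code(Source):
    def __init__(self, source):
        Source.__init__(self, source, '.hh', '.cc')
    def generated(self, gen_dir):
        return [os.path.join(gen_dir, self.source + '.hh'),
                os.path.join(gen_dir, self.source + '.cc')]

j = Json2Code('api/api-doc/storage_service.json')  # illustrative path
assert j.headers('build/gen') == ['build/gen/api/api-doc/storage_service.json.hh']
assert j.objects('build/gen') == ['build/gen/api/api-doc/storage_service.json.o']
```

Each subclass now only supplies `generated()` and its header/source suffixes; the filtering and object-file mapping live in one place.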
def find_headers(repodir, excluded_dirs):
walker = os.walk(repodir)
@@ -248,7 +249,7 @@ def find_headers(repodir, excluded_dirs):
modes = {
'debug': {
'cxxflags': '-DDEBUG -DDEBUG_LSA_SANITIZER -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxxflags': '-DDEBUG -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '-Wstack-usage=%s' % (1024*40),
},
'release': {
@@ -269,6 +270,7 @@ scylla_tests = set([
'test/boost/UUID_test',
'test/boost/aggregate_fcts_test',
'test/boost/allocation_strategy_test',
'test/boost/alternator_base64_test',
'test/boost/anchorless_list_test',
'test/boost/auth_passwords_test',
'test/boost/auth_resource_test',
@@ -278,6 +280,7 @@ scylla_tests = set([
'test/boost/broken_sstable_test',
'test/boost/bytes_ostream_test',
'test/boost/cache_flat_mutation_reader_test',
'test/boost/cached_file_test',
'test/boost/caching_options_test',
'test/boost/canonical_mutation_test',
'test/boost/cartesian_product_test',
@@ -326,6 +329,7 @@ scylla_tests = set([
'test/boost/linearizing_input_stream_test',
'test/boost/loading_cache_test',
'test/boost/log_heap_test',
'test/boost/estimated_histogram_test',
'test/boost/logalloc_test',
'test/boost/managed_vector_test',
'test/boost/map_difference_test',
@@ -365,6 +369,7 @@ scylla_tests = set([
'test/boost/schema_changes_test',
'test/boost/sstable_conforms_to_mutation_source_test',
'test/boost/sstable_resharding_test',
'test/boost/sstable_directory_test',
'test/boost/sstable_test',
'test/boost/storage_proxy_test',
'test/boost/top_k_test',
@@ -414,12 +419,13 @@ perf_tests = set([
'test/perf/perf_mutation_fragment',
'test/perf/perf_idl',
'test/perf/perf_vint',
'test/perf/perf_big_decimal',
])
apps = set([
'scylla',
'test/tools/cql_repl',
'tools/scylla_types',
'tools/scylla-types',
])
tests = scylla_tests | perf_tests
@@ -453,8 +459,8 @@ arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='gcc'
help='C compiler path')
arg_parser.add_argument('--with-osv', action='store', dest='with_osv', default='',
help='Shortcut for compile for OSv')
arg_parser.add_argument('--enable-dpdk', action='store_true', dest='dpdk', default=False,
help='Enable dpdk (from seastar dpdk sources)')
add_tristate(arg_parser, name='dpdk', dest='dpdk',
help='Use dpdk (from seastar dpdk sources) (default=True for release builds)')
arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', default='',
help='Path to DPDK SDK target location (e.g. <DPDK SDK dir>/x86_64-native-linuxapp-gcc)')
arg_parser.add_argument('--debuginfo', action='store', dest='debuginfo', type=int, default=1,
@@ -473,8 +479,6 @@ arg_parser.add_argument('--python', action='store', dest='python', default='pyth
help='Python3 path')
arg_parser.add_argument('--split-dwarf', dest='split_dwarf', action='store_true', default=False,
help='use of split dwarf (https://gcc.gnu.org/wiki/DebugFission) to speed up linking')
arg_parser.add_argument('--enable-gcc6-concepts', dest='gcc6_concepts', action='store_true', default=False,
help='enable experimental support for C++ Concepts as implemented in GCC 6')
arg_parser.add_argument('--enable-alloc-failure-injector', dest='alloc_failure_injector', action='store_true', default=False,
help='enable allocation failure injection')
arg_parser.add_argument('--with-antlr3', dest='antlr3_exec', action='store', default=None,
@@ -493,6 +497,7 @@ extra_cxxflags = {}
cassandra_interface = Thrift(source='interface/cassandra.thrift', service='Cassandra')
scylla_core = (['database.cc',
'absl-flat_hash_map.cc',
'table.cc',
'atomic_cell.cc',
'collection_mutation.cc',
@@ -511,13 +516,13 @@ scylla_core = (['database.cc',
'frozen_mutation.cc',
'memtable.cc',
'schema_mutations.cc',
'supervisor.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'utils/buffer_input_stream.cc',
'utils/limiting_data_source.cc',
'utils/updateable_value.cc',
'utils/directories.cc',
'utils/generation-number.cc',
'mutation_partition.cc',
'mutation_partition_view.cc',
'mutation_partition_serializer.cc',
@@ -546,9 +551,11 @@ scylla_core = (['database.cc',
'sstables/integrity_checked_file_impl.cc',
'sstables/prepended_input_stream.cc',
'sstables/m_format_read_helpers.cc',
'sstables/sstable_directory.cc',
'transport/event.cc',
'transport/event_notifier.cc',
'transport/server.cc',
'transport/controller.cc',
'transport/messages/result_message.cc',
'cdc/cdc_partitioner.cc',
'cdc/log.cc',
@@ -571,6 +578,7 @@ scylla_core = (['database.cc',
'cql3/functions/functions.cc',
'cql3/functions/aggregate_fcts.cc',
'cql3/functions/castas_fcts.cc',
'cql3/functions/error_injection_fcts.cc',
'cql3/statements/cf_prop_defs.cc',
'cql3/statements/cf_statement.cc',
'cql3/statements/authentication_statement.cc',
@@ -617,6 +625,7 @@ scylla_core = (['database.cc',
'cql3/role_name.cc',
'thrift/handler.cc',
'thrift/server.cc',
'thrift/controller.cc',
'thrift/thrift_validation.cc',
'utils/runtime.cc',
'utils/murmur_hash.cc',
@@ -674,6 +683,7 @@ scylla_core = (['database.cc',
'db/view/view.cc',
'db/view/view_update_generator.cc',
'db/view/row_locking.cc',
'db/sstables-format-selector.cc',
'index/secondary_index_manager.cc',
'index/secondary_index.cc',
'utils/UUID_gen.cc',
@@ -795,41 +805,41 @@ scylla_core = (['database.cc',
)
api = ['api/api.cc',
'api/api-doc/storage_service.json',
'api/api-doc/lsa.json',
Json2Code('api/api-doc/storage_service.json'),
Json2Code('api/api-doc/lsa.json'),
'api/storage_service.cc',
'api/api-doc/commitlog.json',
Json2Code('api/api-doc/commitlog.json'),
'api/commitlog.cc',
'api/api-doc/gossiper.json',
Json2Code('api/api-doc/gossiper.json'),
'api/gossiper.cc',
'api/api-doc/failure_detector.json',
Json2Code('api/api-doc/failure_detector.json'),
'api/failure_detector.cc',
'api/api-doc/column_family.json',
Json2Code('api/api-doc/column_family.json'),
'api/column_family.cc',
'api/messaging_service.cc',
'api/api-doc/messaging_service.json',
'api/api-doc/storage_proxy.json',
Json2Code('api/api-doc/messaging_service.json'),
Json2Code('api/api-doc/storage_proxy.json'),
'api/storage_proxy.cc',
'api/api-doc/cache_service.json',
Json2Code('api/api-doc/cache_service.json'),
'api/cache_service.cc',
'api/api-doc/collectd.json',
Json2Code('api/api-doc/collectd.json'),
'api/collectd.cc',
'api/api-doc/endpoint_snitch_info.json',
Json2Code('api/api-doc/endpoint_snitch_info.json'),
'api/endpoint_snitch.cc',
'api/api-doc/compaction_manager.json',
Json2Code('api/api-doc/compaction_manager.json'),
'api/compaction_manager.cc',
'api/api-doc/hinted_handoff.json',
Json2Code('api/api-doc/hinted_handoff.json'),
'api/hinted_handoff.cc',
'api/api-doc/utils.json',
Json2Code('api/api-doc/utils.json'),
'api/lsa.cc',
'api/api-doc/stream_manager.json',
Json2Code('api/api-doc/stream_manager.json'),
'api/stream_manager.cc',
'api/api-doc/system.json',
Json2Code('api/api-doc/system.json'),
'api/system.cc',
'api/config.cc',
'api/api-doc/config.json',
'api/error_injection.cc',
'api/api-doc/error_injection.json',
Json2Code('api/api-doc/config.json'),
'api/error_injection.cc',
Json2Code('api/api-doc/error_injection.json'),
]
alternator = [
@@ -895,6 +905,8 @@ scylla_tests_generic_dependencies = [
'test/lib/cql_test_env.cc',
'test/lib/test_services.cc',
'test/lib/log.cc',
'test/lib/reader_permit.cc',
'test/lib/test_utils.cc',
]
scylla_tests_dependencies = scylla_core + idls + scylla_tests_generic_dependencies + [
@@ -911,7 +923,7 @@ deps = {
'scylla': idls + ['main.cc', 'release.cc', 'build_id.cc'] + scylla_core + api + alternator + redis,
'test/tools/cql_repl': idls + ['test/tools/cql_repl.cc'] + scylla_core + scylla_tests_generic_dependencies,
#FIXME: we don't need all of scylla_core here, only the types module, need to modularize scylla_core.
'tools/scylla_types': idls + ['tools/scylla_types.cc'] + scylla_core,
'tools/scylla-types': idls + ['tools/scylla-types.cc'] + scylla_core,
}
pure_boost_tests = set([
@@ -950,6 +962,7 @@ pure_boost_tests = set([
])
tests_not_using_seastar_test_framework = set([
'test/boost/alternator_base64_test',
'test/boost/small_vector_test',
'test/manual/gossip',
'test/manual/message',
@@ -1000,6 +1013,7 @@ deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc',
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']
deps['test/boost/estimated_histogram_test'] = ['test/boost/estimated_histogram_test.cc']
deps['test/boost/anchorless_list_test'] = ['test/boost/anchorless_list_test.cc']
deps['test/perf/perf_fast_forward'] += ['release.cc']
deps['test/perf/perf_simple_query'] += ['release.cc']
@@ -1019,6 +1033,7 @@ deps['test/boost/linearizing_input_stream_test'] = [
]
deps['test/boost/duration_test'] += ['test/lib/exception_utils.cc']
deps['test/boost/alternator_base64_test'] += ['alternator/base64.cc']
deps['utils/gz/gen_crc_combine_table'] = ['utils/gz/gen_crc_combine_table.cc']
@@ -1081,34 +1096,14 @@ else:
# a list element means a list of alternative packages to consider
# the first element becomes the HAVE_pkg define
# a string element is a package name with no alternatives
optional_packages = [['libsystemd', 'libsystemd-daemon']]
optional_packages = [[]]
pkgs = []
# Lua can be provided by lua53 package on Debian-like
# systems and by Lua on others.
pkgs.append('lua53' if have_pkg('lua53') else 'lua')
def setup_first_pkg_of_list(pkglist):
# The HAVE_pkg symbol is taken from the first alternative
upkg = pkglist[0].upper().replace('-', '_')
for pkg in pkglist:
if have_pkg(pkg):
pkgs.append(pkg)
defines.append('HAVE_{}=1'.format(upkg))
return True
return False
for pkglist in optional_packages:
if isinstance(pkglist, str):
pkglist = [pkglist]
if not setup_first_pkg_of_list(pkglist):
if len(pkglist) == 1:
print('Missing optional package {pkglist[0]}'.format(**locals()))
else:
alternatives = ':'.join(pkglist[1:])
print('Missing optional package {pkglist[0]} (or alteratives {alternatives})'.format(**locals()))
pkgs.append('libsystemd')
compiler_test_src = '''
@@ -1181,8 +1176,24 @@ extra_cxxflags["release.cc"] = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\
for m in ['debug', 'release', 'sanitize']:
modes[m]['cxxflags'] += ' ' + dbgflag
get_dynamic_linker_output = subprocess.check_output(['./reloc/get-dynamic-linker.sh'], shell=True)
dynamic_linker = get_dynamic_linker_output.decode('utf-8').strip()
# The relocatable package includes its own dynamic linker. We don't
# know the path it will be installed to, so for now use a very long
# path so that patchelf doesn't need to edit the program headers. The
# kernel imposes a limit of 4096 bytes including the null. The other
# constraint is that the build-id has to be in the first page, so we
# can't use all 4096 bytes for the dynamic linker.
# In here we just guess that 2000 extra / should be enough to cover
# any path we get installed to but not so large that the build-id is
# pushed to the second page.
# At the end of the build we check that the build-id is indeed in the
# first page. At install time we check that patchelf doesn't modify
# the program headers.
gcc_linker_output = subprocess.check_output(['gcc', '-###', '/dev/null', '-o', 't'], stderr=subprocess.STDOUT).decode('utf-8')
original_dynamic_linker = re.search('-dynamic-linker ([^ ]*)', gcc_linker_output).groups()[0]
# gdb has a SO_NAME_MAX_PATH_SIZE of 512, so limit the path size to
# that. The 512 includes the null at the end, hence the 511 bellow.
dynamic_linker = '/' * (511 - len(original_dynamic_linker)) + original_dynamic_linker
forced_ldflags = '-Wl,'
@@ -1198,13 +1209,14 @@ args.user_ldflags = forced_ldflags + ' ' + args.user_ldflags
args.user_cflags += ' -Wno-error=stack-usage='
args.user_cflags += f"-ffile-prefix-map={curdir}=."
seastar_cflags = args.user_cflags
if args.target != '':
seastar_cflags += ' -march=' + args.target
seastar_ldflags = args.user_ldflags
libdeflate_cflags = seastar_cflags
zstd_cflags = seastar_cflags + ' -Wno-implicit-fallthrough'
MODE_TO_CMAKE_BUILD_TYPE = {'release' : 'RelWithDebInfo', 'debug' : 'Debug', 'dev' : 'Dev', 'sanitize' : 'Sanitize' }
@@ -1218,8 +1230,8 @@ def configure_seastar(build_dir, mode):
'-DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON',
'-DSeastar_CXX_FLAGS={}'.format((seastar_cflags + ' ' + modes[mode]['cxx_ld_flags']).replace(' ', ';')),
'-DSeastar_LD_FLAGS={}'.format(seastar_ldflags),
'-DSeastar_CXX_DIALECT=gnu++17',
'-DSeastar_STD_OPTIONAL_VARIANT_STRINGVIEW=ON',
'-DSeastar_CXX_DIALECT=gnu++20',
'-DSeastar_API_LEVEL=4',
'-DSeastar_UNUSED_RESULT_ERROR=ON',
]
@@ -1227,10 +1239,11 @@ def configure_seastar(build_dir, mode):
stack_guards = 'ON' if args.stack_guards else 'OFF'
seastar_cmake_args += ['-DSeastar_STACK_GUARDS={}'.format(stack_guards)]
if args.dpdk:
dpdk = args.dpdk
if dpdk is None:
dpdk = mode == 'release'
if dpdk:
seastar_cmake_args += ['-DSeastar_DPDK=ON', '-DSeastar_DPDK_MACHINE=wsm']
if args.gcc6_concepts:
seastar_cmake_args += ['-DSeastar_GCC6_CONCEPTS=ON']
if args.split_dwarf:
seastar_cmake_args += ['-DSeastar_SPLIT_DWARF=ON']
if args.alloc_failure_injector:
@@ -1238,7 +1251,7 @@ def configure_seastar(build_dir, mode):
seastar_cmd = ['cmake', '-G', 'Ninja', os.path.relpath(args.seastar_path, seastar_build_dir)] + seastar_cmake_args
cmake_dir = seastar_build_dir
if args.dpdk:
if dpdk:
# need to cook first
cmake_dir = args.seastar_path # required by cooking.sh
relative_seastar_build_dir = os.path.join('..', seastar_build_dir) # relative to seastar/
@@ -1271,25 +1284,6 @@ for mode in build_modes:
modes[mode]['seastar_cflags'] = seastar_pc_cflags
modes[mode]['seastar_libs'] = seastar_pc_libs
# We need to use experimental features of the zstd library (to use our own allocators for the (de)compression context),
# which are available only when the library is linked statically.
def configure_zstd(build_dir, mode):
zstd_build_dir = os.path.join(build_dir, mode, 'zstd')
zstd_cmake_args = [
'-DCMAKE_BUILD_TYPE={}'.format(MODE_TO_CMAKE_BUILD_TYPE[mode]),
'-DCMAKE_C_COMPILER={}'.format(args.cc),
'-DCMAKE_CXX_COMPILER={}'.format(args.cxx),
'-DCMAKE_C_FLAGS={}'.format(zstd_cflags),
'-DZSTD_BUILD_PROGRAMS=OFF'
]
zstd_cmd = ['cmake', '-G', 'Ninja', os.path.relpath('zstd/build/cmake', zstd_build_dir)] + zstd_cmake_args
print(zstd_cmd)
os.makedirs(zstd_build_dir, exist_ok=True)
subprocess.check_call(zstd_cmd, shell=False, cwd=zstd_build_dir)
def configure_abseil(build_dir, mode):
abseil_build_dir = os.path.join(build_dir, mode, 'abseil')
@@ -1334,6 +1328,9 @@ args.user_cflags += " " + pkg_config('jsoncpp', '--cflags')
args.user_cflags += ' -march=' + args.target
libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-latomic', '-llz4', '-lz', '-lsnappy', pkg_config('jsoncpp', '--libs'),
' -lstdc++fs', ' -lcrypt', ' -lcryptopp', ' -lpthread',
# Must link with static version of libzstd, since
# experimental APIs that we use are only present there.
maybe_static(True, '-lzstd'),
maybe_static(args.staticboost, '-lboost_date_time -lboost_regex -licuuc'), ])
pkgconfig_libs = [
@@ -1388,9 +1385,6 @@ if args.ragel_exec:
else:
ragel_exec = "ragel"
for mode in build_modes:
configure_zstd(outdir, mode)
for mode in build_modes:
configure_abseil(outdir, mode)
@@ -1417,7 +1411,7 @@ with open(buildfile_tmp, 'w') as f:
command = echo -e $text > $out
description = GEN $out
rule swagger
command = {args.seastar_path}/scripts/seastar-json2code.py -f $in -o $out
command = {args.seastar_path}/scripts/seastar-json2code.py --create-cc -f $in -o $out
description = SWAGGER $out
rule serializer
command = {python} ./idl-compiler.py --ns ser -f $in -o $out
@@ -1439,6 +1433,10 @@ with open(buildfile_tmp, 'w') as f:
description = COPY $out
rule package
command = scripts/create-relocatable-package.py --mode $mode $out
rule rpmbuild
command = reloc/build_rpm.sh --reloc-pkg $in --builddir $out
rule debbuild
command = reloc/build_deb.sh --reloc-pkg $in --builddir $out
''').format(**globals()))
for mode in build_modes:
modeval = modes[mode]
@@ -1446,7 +1444,7 @@ with open(buildfile_tmp, 'w') as f:
f.write(textwrap.dedent('''\
cxx_ld_flags_{mode} = {cxx_ld_flags}
ld_flags_{mode} = $cxx_ld_flags_{mode}
cxxflags_{mode} = $cxx_ld_flags_{mode} {cxxflags} -I. -I $builddir/{mode}/gen
cxxflags_{mode} = $cxx_ld_flags_{mode} {cxxflags} -iquote. -iquote $builddir/{mode}/gen
libs_{mode} = -l{fmt_lib}
seastar_libs_{mode} = {seastar_libs}
rule cxx.{mode}
@@ -1503,7 +1501,7 @@ with open(buildfile_tmp, 'w') as f:
)
)
compiles = {}
swaggers = {}
swaggers = set()
serializers = {}
thrifts = set()
ragels = {}
@@ -1525,12 +1523,13 @@ with open(buildfile_tmp, 'w') as f:
objs += dep.objects('$builddir/' + mode + '/gen')
if isinstance(dep, Antlr3Grammar):
objs += dep.objects('$builddir/' + mode + '/gen')
if isinstance(dep, Json2Code):
objs += dep.objects('$builddir/' + mode + '/gen')
if binary.endswith('.a'):
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
else:
objs.extend(['$builddir/' + mode + '/' + artifact for artifact in [
'libdeflate/libdeflate.a',
'zstd/lib/libzstd.a',
] + [
'abseil/' + x for x in abseil_libs
]])
@@ -1565,8 +1564,7 @@ with open(buildfile_tmp, 'w') as f:
hh = '$builddir/' + mode + '/gen/' + src.replace('.idl.hh', '.dist.hh')
serializers[hh] = src
elif src.endswith('.json'):
hh = '$builddir/' + mode + '/gen/' + src + '.hh'
swaggers[hh] = src
swaggers.add(src)
elif src.endswith('.rl'):
hh = '$builddir/' + mode + '/gen/' + src.replace('.rl', '.hh')
ragels[hh] = src
@@ -1608,12 +1606,14 @@ with open(buildfile_tmp, 'w') as f:
)
)
gen_dir = '$builddir/{}/gen'.format(mode)
gen_headers = []
for th in thrifts:
gen_headers += th.headers('$builddir/{}/gen'.format(mode))
for g in antlr3_grammars:
gen_headers += g.headers('$builddir/{}/gen'.format(mode))
gen_headers += list(swaggers.keys())
for g in swaggers:
gen_headers += g.headers('$builddir/{}/gen'.format(mode))
gen_headers += list(serializers.keys())
gen_headers += list(ragels.keys())
gen_headers_dep = ' '.join(gen_headers)
@@ -1623,9 +1623,13 @@ with open(buildfile_tmp, 'w') as f:
f.write('build {}: cxx.{} {} || {} {}\n'.format(obj, mode, src, seastar_dep, gen_headers_dep))
if src in extra_cxxflags:
f.write(' cxxflags = {seastar_cflags} $cxxflags $cxxflags_{mode} {extra_cxxflags}\n'.format(mode=mode, extra_cxxflags=extra_cxxflags[src], **modeval))
for hh in swaggers:
src = swaggers[hh]
f.write('build {}: swagger {} | {}/scripts/seastar-json2code.py\n'.format(hh, src, args.seastar_path))
for swagger in swaggers:
hh = swagger.headers(gen_dir)[0]
cc = swagger.sources(gen_dir)[0]
obj = swagger.objects(gen_dir)[0]
src = swagger.source
f.write('build {} | {} : swagger {} | {}/scripts/seastar-json2code.py\n'.format(hh, cc, src, args.seastar_path))
f.write('build {}: cxx.{} {}\n'.format(obj, mode, cc))
for hh in serializers:
src = serializers[hh]
f.write('build {}: serializer {} | idl-compiler.py\n'.format(hh, src))
@@ -1674,17 +1678,20 @@ with open(buildfile_tmp, 'w') as f:
f.write(textwrap.dedent('''\
build build/{mode}/iotune: copy build/{mode}/seastar/apps/iotune/iotune
''').format(**locals()))
f.write('build build/{mode}/scylla-package.tar.gz: package build/{mode}/scylla build/{mode}/iotune build/SCYLLA-RELEASE-FILE build/SCYLLA-VERSION-FILE | always\n'.format(**locals()))
f.write('build build/{mode}/scylla-package.tar.gz: package build/{mode}/scylla build/{mode}/iotune build/SCYLLA-RELEASE-FILE build/SCYLLA-VERSION-FILE build/debian/debian | always\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
f.write(' mode = {mode}\n'.format(**locals()))
f.write(f'build build/dist/{mode}/redhat: rpmbuild build/{mode}/scylla-package.tar.gz\n')
f.write(f' pool = submodule_pool\n')
f.write(f' mode = {mode}\n')
f.write(f'build build/dist/{mode}/debian: debbuild build/{mode}/scylla-package.tar.gz\n')
f.write(f' pool = submodule_pool\n')
f.write(f' mode = {mode}\n')
f.write(f'build dist-server-{mode}: phony build/dist/{mode}/redhat build/dist/{mode}/debian\n')
f.write('rule libdeflate.{mode}\n'.format(**locals()))
f.write(' command = make -C libdeflate BUILD_DIR=../build/{mode}/libdeflate/ CFLAGS="{libdeflate_cflags}" CC={args.cc} ../build/{mode}/libdeflate//libdeflate.a\n'.format(**locals()))
f.write('build build/{mode}/libdeflate/libdeflate.a: libdeflate.{mode}\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
f.write('build build/{mode}/zstd/lib/libzstd.a: ninja\n'.format(**locals()))
f.write(' pool = submodule_pool\n')
f.write(' subdir = build/{mode}/zstd\n'.format(**locals()))
f.write(' target = libzstd.a\n'.format(**locals()))
for lib in abseil_libs:
f.write('build build/{mode}/abseil/{lib}: ninja\n'.format(**locals()))
@@ -1702,6 +1709,65 @@ with open(buildfile_tmp, 'w') as f:
'build check: phony {}\n'.format(' '.join(['{mode}-check'.format(mode=mode) for mode in modes]))
)
f.write(textwrap.dedent(f'''\
build dist-server-deb: phony {' '.join(['build/dist/{mode}/debian'.format(mode=mode) for mode in build_modes])}
build dist-server-rpm: phony {' '.join(['build/dist/{mode}/redhat'.format(mode=mode) for mode in build_modes])}
build dist-server: phony dist-server-rpm dist-server-deb
rule build-submodule-reloc
command = cd $reloc_dir && ./reloc/build_reloc.sh
rule build-submodule-rpm
command = cd $dir && ./reloc/build_rpm.sh --reloc-pkg $artifact
rule build-submodule-deb
command = cd $dir && ./reloc/build_deb.sh --reloc-pkg $artifact
build scylla-jmx/build/scylla-jmx-package.tar.gz: build-submodule-reloc
reloc_dir = scylla-jmx
build dist-jmx-rpm: build-submodule-rpm scylla-jmx/build/scylla-jmx-package.tar.gz
dir = scylla-jmx
artifact = build/scylla-jmx-package.tar.gz
build dist-jmx-deb: build-submodule-deb scylla-jmx/build/scylla-jmx-package.tar.gz
dir = scylla-jmx
artifact = build/scylla-jmx-package.tar.gz
build dist-jmx: phony dist-jmx-rpm dist-jmx-deb
build scylla-tools/build/scylla-tools-package.tar.gz: build-submodule-reloc
reloc_dir = scylla-tools
build dist-tools-rpm: build-submodule-rpm scylla-tools/build/scylla-tools-package.tar.gz
dir = scylla-tools
artifact = build/scylla-tools-package.tar.gz
build dist-tools-deb: build-submodule-deb scylla-tools/build/scylla-tools-package.tar.gz
dir = scylla-tools
artifact = build/scylla-tools-package.tar.gz
build dist-tools: phony dist-tools-rpm dist-tools-deb
rule build-python-reloc
command = ./reloc/python3/build_reloc.sh
rule build-python-rpm
command = ./reloc/python3/build_rpm.sh
rule build-python-deb
command = ./reloc/python3/build_deb.sh
build build/release/scylla-python3-package.tar.gz: build-python-reloc
build dist-python-rpm: build-python-rpm build/release/scylla-python3-package.tar.gz
build dist-python-deb: build-python-deb build/release/scylla-python3-package.tar.gz
build dist-python: phony dist-python-rpm dist-python-deb
build dist-deb: phony dist-server-deb dist-python-deb dist-jmx-deb dist-tools-deb
build dist-rpm: phony dist-server-rpm dist-python-rpm dist-jmx-rpm dist-tools-rpm
build dist: phony dist-server dist-python dist-jmx dist-tools
'''))
f.write(textwrap.dedent(f'''\
build dist-check: phony {' '.join(['dist-check-{mode}'.format(mode=mode) for mode in build_modes])}
rule dist-check
command = ./tools/testing/dist-check/dist-check.sh --mode $mode
'''))
for mode in build_modes:
f.write(textwrap.dedent(f'''\
build dist-check-{mode}: dist-check
mode = {mode}
'''))
f.write(textwrap.dedent('''\
rule configure
command = {python} configure.py $configure_args
@@ -1726,6 +1792,9 @@ with open(buildfile_tmp, 'w') as f:
rule scylla_version_gen
command = ./SCYLLA-VERSION-GEN
build build/SCYLLA-RELEASE-FILE build/SCYLLA-VERSION-FILE: scylla_version_gen
rule debian_files_gen
command = ./dist/debian/debian_files_gen.py
build build/debian/debian: debian_files_gen | always
''').format(modes_list=' '.join(build_modes), **globals()))
os.rename(buildfile_tmp, buildfile)


@@ -73,7 +73,9 @@ public:
return counter_id(utils::make_random_uuid());
}
};
static_assert(std::is_pod<counter_id>::value, "counter_id should be a POD type");
static_assert(
std::is_standard_layout_v<counter_id> && std::is_trivial_v<counter_id>,
"counter_id should be a POD type");
std::ostream& operator<<(std::ostream& os, const counter_id& id);
@@ -154,10 +156,10 @@ private:
// Shared logic for applying counter_shards and counter_shard_views.
// T is either counter_shard or basic_counter_shard_view<U>.
template<typename T>
GCC6_CONCEPT(requires requires(T shard) {
{ shard.value() } -> int64_t;
{ shard.logical_clock() } -> int64_t;
})
requires requires(T shard) {
{ shard.value() } -> std::same_as<int64_t>;
{ shard.logical_clock() } -> std::same_as<int64_t>;
}
counter_shard& do_apply(T&& other) noexcept {
auto other_clock = other.logical_clock();
if (_logical_clock < other_clock) {


@@ -106,7 +106,7 @@ using namespace cql3::statements;
using namespace cql3::selection;
using cql3::cql3_type;
using conditions_type = std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>,lw_shared_ptr<cql3::column_condition::raw>>>;
using operations_type = std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>,::shared_ptr<cql3::operation::raw_update>>>;
using operations_type = std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>, std::unique_ptr<cql3::operation::raw_update>>>;
// ANTLR forces us to define a default-initialized return value
// for every rule (e.g. [returns ut_name name]), but not every type
@@ -255,8 +255,8 @@ struct uninitialized {
return to_lower(s) == "true";
}
void add_raw_update(std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>,::shared_ptr<cql3::operation::raw_update>>>& operations,
::shared_ptr<cql3::column_identifier::raw> key, ::shared_ptr<cql3::operation::raw_update> update)
void add_raw_update(std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>, std::unique_ptr<cql3::operation::raw_update>>>& operations,
::shared_ptr<cql3::column_identifier::raw> key, std::unique_ptr<cql3::operation::raw_update> update)
{
for (auto&& p : operations) {
if (*p.first == *key && !p.second->is_compatible_with(update)) {
@@ -532,7 +532,7 @@ updateStatement returns [std::unique_ptr<raw::update_statement> expr]
@init {
bool if_exists = false;
auto attrs = std::make_unique<cql3::attributes::raw>();
std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>, ::shared_ptr<cql3::operation::raw_update>>> operations;
std::vector<std::pair<::shared_ptr<cql3::column_identifier::raw>, std::unique_ptr<cql3::operation::raw_update>>> operations;
}
: K_UPDATE cf=columnFamilyName
( usingClause[attrs] )?
@@ -563,7 +563,7 @@ updateConditions returns [conditions_type conditions]
deleteStatement returns [std::unique_ptr<raw::delete_statement> expr]
@init {
auto attrs = std::make_unique<cql3::attributes::raw>();
std::vector<::shared_ptr<cql3::operation::raw_deletion>> column_deletions;
std::vector<std::unique_ptr<cql3::operation::raw_deletion>> column_deletions;
bool if_exists = false;
}
: K_DELETE ( dels=deleteSelection { column_deletions = std::move(dels); } )?
@@ -581,15 +581,15 @@ deleteStatement returns [std::unique_ptr<raw::delete_statement> expr]
}
;
deleteSelection returns [std::vector<::shared_ptr<cql3::operation::raw_deletion>> operations]
deleteSelection returns [std::vector<std::unique_ptr<cql3::operation::raw_deletion>> operations]
: t1=deleteOp { $operations.emplace_back(std::move(t1)); }
(',' tN=deleteOp { $operations.emplace_back(std::move(tN)); })*
;
deleteOp returns [::shared_ptr<cql3::operation::raw_deletion> op]
: c=cident { $op = ::make_shared<cql3::operation::column_deletion>(std::move(c)); }
| c=cident '[' t=term ']' { $op = ::make_shared<cql3::operation::element_deletion>(std::move(c), std::move(t)); }
| c=cident '.' field=ident { $op = ::make_shared<cql3::operation::field_deletion>(std::move(c), std::move(field)); }
deleteOp returns [std::unique_ptr<cql3::operation::raw_deletion> op]
: c=cident { $op = std::make_unique<cql3::operation::column_deletion>(std::move(c)); }
| c=cident '[' t=term ']' { $op = std::make_unique<cql3::operation::element_deletion>(std::move(c), std::move(t)); }
| c=cident '.' field=ident { $op = std::make_unique<cql3::operation::field_deletion>(std::move(c), std::move(field)); }
;
usingClauseDelete[std::unique_ptr<cql3::attributes::raw>& attrs]
@@ -1416,12 +1416,12 @@ normalColumnOperation[operations_type& operations, ::shared_ptr<cql3::column_ide
: t=term ('+' c=cident )?
{
if (!c) {
add_raw_update(operations, key, ::make_shared<cql3::operation::set_value>(t));
add_raw_update(operations, key, std::make_unique<cql3::operation::set_value>(t));
} else {
if (*key != *c) {
add_recognition_error("Only expressions of the form X = <value> + X are supported.");
}
add_raw_update(operations, key, ::make_shared<cql3::operation::prepend>(t));
add_raw_update(operations, key, std::make_unique<cql3::operation::prepend>(t));
}
}
| c=cident sig=('+' | '-') t=term
@@ -1429,11 +1429,11 @@ normalColumnOperation[operations_type& operations, ::shared_ptr<cql3::column_ide
if (*key != *c) {
add_recognition_error("Only expressions of the form X = X " + $sig.text + "<value> are supported.");
}
shared_ptr<cql3::operation::raw_update> op;
std::unique_ptr<cql3::operation::raw_update> op;
if ($sig.text == "+") {
op = make_shared<cql3::operation::addition>(t);
op = std::make_unique<cql3::operation::addition>(t);
} else {
op = make_shared<cql3::operation::subtraction>(t);
op = std::make_unique<cql3::operation::subtraction>(t);
}
add_raw_update(operations, key, std::move(op));
}
@@ -1444,11 +1444,11 @@ normalColumnOperation[operations_type& operations, ::shared_ptr<cql3::column_ide
// We don't yet allow a '+' in front of an integer, but we could in the future really, so let's be future-proof in our error message
add_recognition_error("Only expressions of the form X = X " + sstring($i.text[0] == '-' ? "-" : "+") + " <value> are supported.");
}
add_raw_update(operations, key, make_shared<cql3::operation::addition>(cql3::constants::literal::integer($i.text)));
add_raw_update(operations, key, std::make_unique<cql3::operation::addition>(cql3::constants::literal::integer($i.text)));
}
| K_SCYLLA_COUNTER_SHARD_LIST '(' t=term ')'
{
add_raw_update(operations, key, ::make_shared<cql3::operation::set_counter_value_from_tuple_list>(t));
add_raw_update(operations, key, std::make_unique<cql3::operation::set_counter_value_from_tuple_list>(t));
}
;
@@ -1458,7 +1458,7 @@ collectionColumnOperation[operations_type& operations,
bool by_uuid]
: '=' t=term
{
add_raw_update(operations, key, make_shared<cql3::operation::set_element>(k, t, by_uuid));
add_raw_update(operations, key, std::make_unique<cql3::operation::set_element>(k, t, by_uuid));
}
;
@@ -1467,7 +1467,7 @@ udtColumnOperation[operations_type& operations,
shared_ptr<cql3::column_identifier> field]
: '=' t=term
{
add_raw_update(operations, std::move(key), make_shared<cql3::operation::set_field>(std::move(field), std::move(t)));
add_raw_update(operations, std::move(key), std::make_unique<cql3::operation::set_field>(std::move(field), std::move(t)));
}
;


@@ -87,7 +87,7 @@ abstract_marker::raw::raw(int32_t bind_index)
return ::make_shared<constants::marker>(_bind_index, receiver);
}
assignment_testable::test_result abstract_marker::raw::test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const {
assignment_testable::test_result abstract_marker::raw::test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}


@@ -72,7 +72,7 @@ public:
virtual ::shared_ptr<term> prepare(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override;
virtual sstring to_string() const override;
};


@@ -70,7 +70,7 @@ public:
// Test all elements of toTest for assignment. If all are exact match, return exact match. If any is not assignable,
// return not assignable. Otherwise, return weakly assignable.
template <typename AssignmentTestablePtrRange>
static test_result test_all(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver,
static test_result test_all(database& db, const sstring& keyspace, const column_specification& receiver,
AssignmentTestablePtrRange&& to_test) {
test_result res = test_result::EXACT_MATCH;
for (auto&& rt : to_test) {
@@ -99,7 +99,7 @@ public:
* Most caller should just call the isAssignable() method on the result, though functions have a use for
* testing "strong" equality to decide the most precise overload to pick when multiple could match.
*/
virtual test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const = 0;
virtual test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const = 0;
// for error reporting
virtual sstring assignment_testable_source_context() const = 0;


@@ -139,16 +139,6 @@ static inline
return def.column_specification->name;
}
static inline
std::vector<::shared_ptr<column_identifier>> to_identifiers(const std::vector<const column_definition*>& defs) {
std::vector<::shared_ptr<column_identifier>> r;
r.reserve(defs.size());
for (auto&& def : defs) {
r.push_back(to_identifier(*def));
}
return r;
}
}
namespace std {


@@ -82,9 +82,9 @@ constants::literal::parsed_value(data_type validator) const
}
assignment_testable::test_result
constants::literal::test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const
constants::literal::test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const
{
auto receiver_type = receiver->type->as_cql3_type();
auto receiver_type = receiver.type->as_cql3_type();
if (receiver_type.is_collection() || receiver_type.is_user_type()) {
return test_result::NOT_ASSIGNABLE;
}
@@ -157,7 +157,7 @@ constants::literal::test_assignment(database& db, const sstring& keyspace, lw_sh
::shared_ptr<term>
constants::literal::prepare(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const
{
if (!is_assignable(test_assignment(db, keyspace, receiver))) {
if (!is_assignable(test_assignment(db, keyspace, *receiver))) {
throw exceptions::invalid_request_exception(format("Invalid {} constant ({}) for \"{}\" of type {}",
_type, _text, *receiver->name, receiver->type->as_cql3_type().to_string()));
}


@@ -88,7 +88,7 @@ public:
public:
static thread_local const ::shared_ptr<terminal> NULL_VALUE;
virtual ::shared_ptr<term> prepare(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override {
if (!is_assignable(test_assignment(db, keyspace, receiver))) {
if (!is_assignable(test_assignment(db, keyspace, *receiver))) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment/decrement");
}
return NULL_VALUE;
@@ -96,8 +96,8 @@ public:
virtual assignment_testable::test_result test_assignment(database& db,
const sstring& keyspace,
lw_shared_ptr<column_specification> receiver) const override {
return receiver->type->is_counter()
const column_specification& receiver) const override {
return receiver.type->is_counter()
? assignment_testable::test_result::NOT_ASSIGNABLE
: assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
@@ -161,7 +161,7 @@ public:
return _text;
}
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const;
virtual sstring to_string() const override {
return _type == type::STRING ? sstring(format("'{}'", _text)) : _text;


@@ -95,10 +95,6 @@ public:
return _name.keyspace == ks_name && _name.name == function_name;
}
virtual bool has_reference_to(function& f) const override {
return false;
}
virtual sstring column_name(const std::vector<sstring>& column_names) const override {
return format("{}({})", _name, join(", ", column_names));
}


@@ -144,10 +144,6 @@ public:
return false;
}
virtual bool has_reference_to(function& f) const override {
return false;
}
virtual sstring column_name(const std::vector<sstring>& column_names) const override {
return "[json]";
}


@@ -0,0 +1,122 @@
/*
* Copyright (C) 2019 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "error_injection_fcts.hh"
#include "utils/error_injection.hh"
#include "types/list.hh"
namespace cql3
{
namespace functions
{
namespace error_injection
{
namespace
{
template <typename Func, bool Pure>
class failure_injection_function_for : public failure_injection_function {
Func _func;
public:
failure_injection_function_for(sstring name,
data_type return_type,
const std::vector<data_type> arg_types,
Func&& func)
: failure_injection_function(std::move(name), std::move(return_type), std::move(arg_types))
, _func(std::forward<Func>(func)) {}
bool is_pure() const override {
return Pure;
}
bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
return _func(sf, parameters);
}
};
template <bool Pure, typename Func>
shared_ptr<function>
make_failure_injection_function(sstring name,
data_type return_type,
std::vector<data_type> args_type,
Func&& func) {
return ::make_shared<failure_injection_function_for<Func, Pure>>(std::move(name),
std::move(return_type),
std::move(args_type),
std::forward<Func>(func));
}
} // anonymous namespace
shared_ptr<function> make_enable_injection_function() {
return make_failure_injection_function<false>("enable_injection", empty_type, { ascii_type, ascii_type },
[] (cql_serialization_format, const std::vector<bytes_opt>& parameters) {
sstring injection_name = ascii_type->get_string(parameters[0].value());
const bool one_shot = ascii_type->get_string(parameters[1].value()) == "true";
smp::invoke_on_all([injection_name, one_shot] () mutable {
utils::get_local_injector().enable(injection_name, one_shot);
}).get0();
return std::nullopt;
});
}
shared_ptr<function> make_disable_injection_function() {
return make_failure_injection_function<false>("disable_injection", empty_type, { ascii_type },
[] (cql_serialization_format, const std::vector<bytes_opt>& parameters) {
sstring injection_name = ascii_type->get_string(parameters[0].value());
smp::invoke_on_all([injection_name] () mutable {
utils::get_local_injector().disable(injection_name);
}).get0();
return std::nullopt;
});
}
shared_ptr<function> make_enabled_injections_function() {
const auto list_type_inst = list_type_impl::get_instance(ascii_type, false);
return make_failure_injection_function<true>("enabled_injections", list_type_inst, {},
[list_type_inst] (cql_serialization_format, const std::vector<bytes_opt>&) -> bytes {
return seastar::map_reduce(smp::all_cpus(), [] (unsigned) {
return make_ready_future<std::vector<sstring>>(utils::get_local_injector().enabled_injections());
}, std::vector<data_value>(),
[](std::vector<data_value> a, std::vector<sstring>&& b) -> std::vector<data_value> {
for (auto&& x : b) {
if (a.end() == std::find(a.begin(), a.end(), x)) {
a.push_back(data_value(std::move(x)));
}
}
return a;
}).then([list_type_inst](std::vector<data_value> const& active_injections) {
auto list_val = make_list_value(list_type_inst, active_injections);
return list_type_inst->decompose(list_val);
}).get0();
});
}
} // namespace error_injection
} // namespace functions
} // namespace cql3


@@ -0,0 +1,56 @@
/*
* Copyright (C) 2019 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "native_scalar_function.hh"
namespace cql3
{
namespace functions
{
namespace error_injection
{
class failure_injection_function : public native_scalar_function {
protected:
failure_injection_function(sstring name, data_type return_type, std::vector<data_type> args_type)
: native_scalar_function(std::move(name), std::move(return_type), std::move(args_type)) {
}
bool requires_thread() const override {
return true;
}
};
shared_ptr<function> make_enable_injection_function();
shared_ptr<function> make_disable_injection_function();
shared_ptr<function> make_enabled_injections_function();
} // namespace error_injection
} // namespace functions
} // namespace cql3


@@ -82,7 +82,6 @@ public:
virtual void print(std::ostream& os) const = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const = 0;
virtual bool has_reference_to(function& f) const = 0;
/**
* Returns the name of the function to use within a ResultSet.


@@ -79,7 +79,7 @@ public:
// All parameters must be terminal
static bytes_opt execute(scalar_function& fun, std::vector<shared_ptr<term>> parameters);
public:
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override;
virtual sstring to_string() const override;
};
};


@@ -37,6 +37,8 @@
#include "concrete_types.hh"
#include "as_json_function.hh"
#include "error_injection_fcts.hh"
namespace std {
std::ostream& operator<<(std::ostream& os, const std::vector<data_type>& arg_types) {
for (size_t i = 0; i < arg_types.size(); ++i) {
@@ -107,6 +109,10 @@ functions::init() {
declare(make_blob_as_varchar_fct());
add_agg_functions(ret);
declare(error_injection::make_enable_injection_function());
declare(error_injection::make_disable_injection_function());
declare(error_injection::make_enabled_injections_function());
// also needed for smp:
#if 0
MigrationManager.instance.register(new FunctionsMigrationListener());
@@ -152,11 +158,6 @@ functions::make_arg_spec(const sstring& receiver_ks, const sstring& receiver_cf,
fun.arg_types()[i]);
}
int
functions::get_overload_count(const function_name& name) {
return _declared.count(name);
}
inline
shared_ptr<function>
make_to_json_function(data_type t) {
@@ -187,7 +188,7 @@ functions::get(database& db,
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf,
lw_shared_ptr<column_specification> receiver) {
const column_specification* receiver) {
static const function_name TOKEN_FUNCTION_NAME = function_name::native_function("token");
static const function_name TO_JSON_FUNCTION_NAME = function_name::native_function("tojson");
@@ -370,7 +371,7 @@ functions::validate_types(database& db,
}
auto&& expected = make_arg_spec(receiver_ks, receiver_cf, *fun, i);
if (!is_assignable(provided->test_assignment(db, keyspace, expected))) {
if (!is_assignable(provided->test_assignment(db, keyspace, *expected))) {
throw exceptions::invalid_request_exception(
format("Type error: {} cannot be passed as argument {:d} of function {} of type {}",
provided, i, fun->name(), expected->type->as_cql3_type()));
@@ -397,7 +398,7 @@ functions::match_arguments(database& db, const sstring& keyspace,
continue;
}
auto&& expected = make_arg_spec(receiver_ks, receiver_cf, *fun, i);
auto arg_res = provided->test_assignment(db, keyspace, expected);
auto arg_res = provided->test_assignment(db, keyspace, *expected);
if (arg_res == assignment_testable::test_result::NOT_ASSIGNABLE) {
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
@@ -514,7 +515,7 @@ function_call::raw::prepare(database& db, const sstring& keyspace, lw_shared_ptr
[] (auto&& x) -> shared_ptr<assignment_testable> {
return x;
});
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name, receiver);
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name, receiver.get());
if (!fun) {
throw exceptions::invalid_request_exception(format("Unknown function {} called", _name));
}
@@ -572,16 +573,16 @@ function_call::raw::execute(scalar_function& fun, std::vector<shared_ptr<term>>
}
assignment_testable::test_result
function_call::raw::test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const {
function_call::raw::test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const {
// Note: Functions.get() will return null if the function doesn't exist, or throw if no function matching
// the arguments can be found. We may get one of those if an undefined/wrong function is used as argument
// of another, existing, function. In that case, we return true here because we'll throw a proper exception
// later with a more helpful error message than if we were to return false here.
try {
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver->ks_name, receiver->cf_name, receiver);
if (fun && receiver->type == fun->return_type()) {
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver.ks_name, receiver.cf_name, &receiver);
if (fun && receiver.type == fun->return_type()) {
return assignment_testable::test_result::EXACT_MATCH;
} else if (!fun || receiver->type->is_value_compatible_with(*fun->return_type())) {
} else if (!fun || receiver.type->is_value_compatible_with(*fun->return_type())) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
} else {
return assignment_testable::test_result::NOT_ASSIGNABLE;


@@ -69,7 +69,6 @@ private:
public:
static lw_shared_ptr<column_specification> make_arg_spec(const sstring& receiver_ks, const sstring& receiver_cf,
const function& fun, size_t i);
static int get_overload_count(const function_name& name);
public:
static shared_ptr<function> get(database& db,
const sstring& keyspace,
@@ -77,7 +76,7 @@ public:
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf,
lw_shared_ptr<column_specification> receiver = nullptr);
const column_specification* receiver = nullptr);
template <typename AssignmentTestablePtrRange>
static shared_ptr<function> get(database& db,
const sstring& keyspace,
@@ -85,7 +84,7 @@ public:
AssignmentTestablePtrRange&& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf,
lw_shared_ptr<column_specification> receiver = nullptr) {
const column_specification* receiver = nullptr) {
const std::vector<shared_ptr<assignment_testable>> args(std::begin(provided_args), std::end(provided_args));
return get(db, keyspace, name, args, receiver_ks, receiver_cf, receiver);
}


@@ -93,7 +93,7 @@ lists::literal::validate_assignable_to(database& db, const sstring keyspace, con
}
auto&& value_spec = value_spec_of(receiver);
for (auto rt : _elements) {
if (!is_assignable(rt->test_assignment(db, keyspace, value_spec))) {
if (!is_assignable(rt->test_assignment(db, keyspace, *value_spec))) {
throw exceptions::invalid_request_exception(format("Invalid list literal for {}: value {} is not of type {}",
*receiver.name, *rt, value_spec->type->as_cql3_type()));
}
@@ -101,8 +101,8 @@ lists::literal::validate_assignable_to(database& db, const sstring keyspace, con
}
assignment_testable::test_result
lists::literal::test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const {
if (!dynamic_pointer_cast<const list_type_impl>(receiver->type)) {
lists::literal::test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const {
if (!dynamic_pointer_cast<const list_type_impl>(receiver.type)) {
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
@@ -111,11 +111,11 @@ lists::literal::test_assignment(database& db, const sstring& keyspace, lw_shared
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
auto&& value_spec = value_spec_of(*receiver);
auto&& value_spec = value_spec_of(receiver);
std::vector<shared_ptr<assignment_testable>> to_test;
to_test.reserve(_elements.size());
std::copy(_elements.begin(), _elements.end(), std::back_inserter(to_test));
return assignment_testable::test_all(db, keyspace, value_spec, to_test);
return assignment_testable::test_all(db, keyspace, *value_spec, to_test);
}
sstring


@@ -68,7 +68,7 @@ public:
private:
void validate_assignable_to(database& db, const sstring keyspace, const column_specification& receiver) const;
public:
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override;
virtual sstring to_string() const override;
};


@@ -104,31 +104,31 @@ maps::literal::validate_assignable_to(database& db, const sstring& keyspace, con
auto&& key_spec = maps::key_spec_of(receiver);
auto&& value_spec = maps::value_spec_of(receiver);
for (auto&& entry : entries) {
if (!is_assignable(entry.first->test_assignment(db, keyspace, key_spec))) {
if (!is_assignable(entry.first->test_assignment(db, keyspace, *key_spec))) {
throw exceptions::invalid_request_exception(format("Invalid map literal for {}: key {} is not of type {}", *receiver.name, *entry.first, key_spec->type->as_cql3_type()));
}
if (!is_assignable(entry.second->test_assignment(db, keyspace, value_spec))) {
if (!is_assignable(entry.second->test_assignment(db, keyspace, *value_spec))) {
throw exceptions::invalid_request_exception(format("Invalid map literal for {}: value {} is not of type {}", *receiver.name, *entry.second, value_spec->type->as_cql3_type()));
}
}
}
assignment_testable::test_result
maps::literal::test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const {
if (!dynamic_pointer_cast<const map_type_impl>(receiver->type)) {
maps::literal::test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const {
if (!dynamic_pointer_cast<const map_type_impl>(receiver.type)) {
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
// If there are no elements, we can't say it's an exact match (an empty map is fundamentally polymorphic).
if (entries.empty()) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
auto key_spec = maps::key_spec_of(*receiver);
auto value_spec = maps::value_spec_of(*receiver);
auto key_spec = maps::key_spec_of(receiver);
auto value_spec = maps::value_spec_of(receiver);
// It's an exact match if all are exact matches, but it is not assignable as soon as any is non-assignable.
auto res = assignment_testable::test_result::EXACT_MATCH;
for (auto entry : entries) {
auto t1 = entry.first->test_assignment(db, keyspace, key_spec);
auto t2 = entry.second->test_assignment(db, keyspace, value_spec);
auto t1 = entry.first->test_assignment(db, keyspace, *key_spec);
auto t2 = entry.second->test_assignment(db, keyspace, *value_spec);
if (t1 == assignment_testable::test_result::NOT_ASSIGNABLE || t2 == assignment_testable::test_result::NOT_ASSIGNABLE)
return assignment_testable::test_result::NOT_ASSIGNABLE;
if (t1 != assignment_testable::test_result::EXACT_MATCH || t2 != assignment_testable::test_result::EXACT_MATCH)


@@ -70,7 +70,7 @@ public:
private:
void validate_assignable_to(database& db, const sstring& keyspace, const column_specification& receiver) const;
public:
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override;
virtual sstring to_string() const override;
};


@@ -87,10 +87,10 @@ operation::set_element::prepare(database& db, const sstring& keyspace, const col
}
bool
operation::set_element::is_compatible_with(shared_ptr<raw_update> other) const {
operation::set_element::is_compatible_with(const std::unique_ptr<raw_update>& other) const {
// TODO: we could check that the other operation is not setting the same element
// too (but since the index/key set may be bind variables we can't always do it at this point)
return !dynamic_pointer_cast<set_value>(std::move(other));
return !dynamic_cast<const set_value*>(other.get());
}
sstring
@@ -120,13 +120,13 @@ operation::set_field::prepare(database& db, const sstring& keyspace, const colum
}
bool
operation::set_field::is_compatible_with(shared_ptr<raw_update> other) const {
auto x = dynamic_pointer_cast<set_field>(other);
operation::set_field::is_compatible_with(const std::unique_ptr<raw_update>& other) const {
auto x = dynamic_cast<const set_field*>(other.get());
if (x) {
return _field != x->_field;
}
return !dynamic_pointer_cast<set_value>(std::move(other));
return !dynamic_cast<const set_value*>(other.get());
}
const column_identifier::raw&
@@ -185,8 +185,8 @@ operation::addition::prepare(database& db, const sstring& keyspace, const column
}
bool
operation::addition::is_compatible_with(shared_ptr<raw_update> other) const {
return !dynamic_pointer_cast<set_value>(other);
operation::addition::is_compatible_with(const std::unique_ptr<raw_update>& other) const {
return !dynamic_cast<const set_value*>(other.get());
}
sstring
@@ -227,8 +227,8 @@ operation::subtraction::prepare(database& db, const sstring& keyspace, const col
}
bool
operation::subtraction::is_compatible_with(shared_ptr<raw_update> other) const {
return !dynamic_pointer_cast<set_value>(other);
operation::subtraction::is_compatible_with(const std::unique_ptr<raw_update>& other) const {
return !dynamic_cast<const set_value*>(other.get());
}
sstring
@@ -250,8 +250,8 @@ operation::prepend::prepare(database& db, const sstring& keyspace, const column_
}
bool
operation::prepend::is_compatible_with(shared_ptr<raw_update> other) const {
return !dynamic_pointer_cast<set_value>(other);
operation::prepend::is_compatible_with(const std::unique_ptr<raw_update>& other) const {
return !dynamic_cast<const set_value*>(other.get());
}
@@ -356,7 +356,7 @@ operation::set_counter_value_from_tuple_list::prepare(database& db, const sstrin
};
bool
operation::set_value::is_compatible_with(::shared_ptr <raw_update> other) const {
operation::set_value::is_compatible_with(const std::unique_ptr<raw_update>& other) const {
// We don't allow setting the same column multiple times, because 1)
// it's stupid and 2) the result would seem random to the user.
return false;


@@ -168,7 +168,7 @@ public:
* @return whether this operation can be applied alongside the {@code
* other} update (in the same UPDATE statement for the same column).
*/
virtual bool is_compatible_with(::shared_ptr<raw_update> other) const = 0;
virtual bool is_compatible_with(const std::unique_ptr<raw_update>& other) const = 0;
};
/**
@@ -181,7 +181,7 @@ public:
*/
class raw_deletion {
public:
~raw_deletion() {}
virtual ~raw_deletion() = default;
/**
* The name of the column affected by this delete operation.
@@ -218,7 +218,7 @@ public:
virtual shared_ptr<operation> prepare(database& db, const sstring& keyspace, const column_definition& receiver) const override;
virtual bool is_compatible_with(shared_ptr<raw_update> other) const override;
virtual bool is_compatible_with(const std::unique_ptr<raw_update>& other) const override;
};
// Set a single field inside a user-defined type.
@@ -234,7 +234,7 @@ public:
virtual shared_ptr<operation> prepare(database& db, const sstring& keyspace, const column_definition& receiver) const override;
virtual bool is_compatible_with(shared_ptr<raw_update> other) const override;
virtual bool is_compatible_with(const std::unique_ptr<raw_update>& other) const override;
};
// Delete a single field inside a user-defined type.
@@ -263,7 +263,7 @@ public:
virtual shared_ptr<operation> prepare(database& db, const sstring& keyspace, const column_definition& receiver) const override;
virtual bool is_compatible_with(shared_ptr<raw_update> other) const override;
virtual bool is_compatible_with(const std::unique_ptr<raw_update>& other) const override;
};
class subtraction : public raw_update {
@@ -277,7 +277,7 @@ public:
virtual shared_ptr<operation> prepare(database& db, const sstring& keyspace, const column_definition& receiver) const override;
virtual bool is_compatible_with(shared_ptr<raw_update> other) const override;
virtual bool is_compatible_with(const std::unique_ptr<raw_update>& other) const override;
};
class prepend : public raw_update {
@@ -291,7 +291,7 @@ public:
virtual shared_ptr<operation> prepare(database& db, const sstring& keyspace, const column_definition& receiver) const override;
virtual bool is_compatible_with(shared_ptr<raw_update> other) const override;
virtual bool is_compatible_with(const std::unique_ptr<raw_update>& other) const override;
};
class column_deletion;


@@ -65,7 +65,7 @@ public:
}
#endif
virtual bool is_compatible_with(::shared_ptr <raw_update> other) const override;
virtual bool is_compatible_with(const std::unique_ptr<raw_update>& other) const override;
};
class operation::set_counter_value_from_tuple_list : public set_value {


@@ -41,7 +41,7 @@
#pragma once
#include <seastar/util/gcc6-concepts.hh>
#include <concepts>
#include "timestamp.hh"
#include "bytes.hh"
#include "db/consistency_level_type.hh"
@@ -97,11 +97,11 @@ private:
* @param values_ranges a vector of values ranges for each statement in the batch.
*/
template<typename OneMutationDataRange>
GCC6_CONCEPT( requires requires (OneMutationDataRange range) {
requires requires (OneMutationDataRange range) {
std::begin(range);
std::end(range);
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> raw_value_view; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> raw_value; } ) )
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> std::convertible_to<raw_value_view>; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> std::convertible_to<raw_value>; } )
explicit query_options(query_options&& o, std::vector<OneMutationDataRange> values_ranges);
public:
@@ -145,11 +145,11 @@ public:
* @param values_ranges a vector of values ranges for each statement in the batch.
*/
template<typename OneMutationDataRange>
GCC6_CONCEPT( requires requires (OneMutationDataRange range) {
requires requires (OneMutationDataRange range) {
std::begin(range);
std::end(range);
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> raw_value_view; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> raw_value; } ) )
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> std::convertible_to<raw_value_view>; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> std::convertible_to<raw_value>; } )
static query_options make_batch_options(query_options&& o, std::vector<OneMutationDataRange> values_ranges) {
return query_options(std::move(o), std::move(values_ranges));
}
@@ -251,11 +251,11 @@ private:
};
template<typename OneMutationDataRange>
GCC6_CONCEPT( requires requires (OneMutationDataRange range) {
requires requires (OneMutationDataRange range) {
std::begin(range);
std::end(range);
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> raw_value_view; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> raw_value; } ) )
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> std::convertible_to<raw_value_view>; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> std::convertible_to<raw_value>; } )
query_options::query_options(query_options&& o, std::vector<OneMutationDataRange> values_ranges)
: query_options(std::move(o))
{


@@ -562,27 +562,6 @@ query_processor::prepare(sstring query_string, const service::client_state& clie
}
}
::shared_ptr<cql_transport::messages::result_message::prepared>
query_processor::get_stored_prepared_statement(
const std::string_view& query_string,
const sstring& keyspace,
bool for_thrift) {
using namespace cql_transport::messages;
if (for_thrift) {
return get_stored_prepared_statement_one<result_message::prepared::thrift>(
query_string,
keyspace,
compute_thrift_id,
prepared_cache_key_type::thrift_id);
} else {
return get_stored_prepared_statement_one<result_message::prepared::cql>(
query_string,
keyspace,
compute_id,
prepared_cache_key_type::cql_id);
}
}
static std::string hash_target(std::string_view query_string, std::string_view keyspace) {
std::string ret(keyspace);
ret += query_string;


@@ -414,28 +414,6 @@ private:
});
});
};
template <typename ResultMsgType, typename KeyGenerator, typename IdGetter>
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement_one(
const std::string_view& query_string,
const sstring& keyspace,
KeyGenerator&& key_gen,
IdGetter&& id_getter) {
auto cache_key = key_gen(query_string, keyspace);
auto it = _prepared_cache.find(cache_key);
if (it == _prepared_cache.end()) {
return ::shared_ptr<cql_transport::messages::result_message::prepared>();
}
return ::make_shared<ResultMsgType>(id_getter(cache_key), *it);
}
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement(
const std::string_view& query_string,
const sstring& keyspace,
bool for_thrift);
};
class query_processor::migration_subscriber : public service::migration_listener {


@@ -50,7 +50,6 @@
#include "result_generator.hh"
#include <seastar/util/gcc6-concepts.hh>
namespace cql3 {
@@ -150,17 +149,13 @@ public:
const std::vector<uint16_t>& partition_key_bind_indices() const;
};
GCC6_CONCEPT(
template<typename Visitor>
concept bool ResultVisitor = requires(Visitor& visitor) {
concept ResultVisitor = requires(Visitor& visitor) {
visitor.start_row();
visitor.accept_value(std::optional<query::result_bytes_view>());
visitor.end_row();
};
)
class result_set {
::shared_ptr<metadata> _metadata;
std::deque<std::vector<bytes_opt>> _rows;
@@ -199,7 +194,7 @@ public:
const std::deque<std::vector<bytes_opt>>& rows() const;
template<typename Visitor>
GCC6_CONCEPT(requires ResultVisitor<Visitor>)
requires ResultVisitor<Visitor>
void visit(Visitor&& visitor) const {
auto column_count = get_metadata().column_count();
for (auto& row : _rows) {
@@ -264,7 +259,7 @@ public:
}
template<typename Visitor>
GCC6_CONCEPT(requires ResultVisitor<Visitor>)
requires ResultVisitor<Visitor>
void visit(Visitor&& visitor) const {
if (_result_set) {
_result_set->visit(std::forward<Visitor>(visitor));


@@ -107,8 +107,8 @@ public:
*/
virtual void reset() = 0;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override {
auto t1 = receiver->type->underlying_type();
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override {
auto t1 = receiver.type->underlying_type();
auto t2 = get_type()->underlying_type();
// We want columns of `counter_type' to be served by underlying type's overloads
// (here: `counter_cell_view::total_value_type()') with an `EXACT_MATCH'.


@@ -98,17 +98,17 @@ sets::literal::validate_assignable_to(database& db, const sstring& keyspace, con
auto&& value_spec = value_spec_of(receiver);
for (shared_ptr<term::raw> rt : _elements) {
if (!is_assignable(rt->test_assignment(db, keyspace, value_spec))) {
if (!is_assignable(rt->test_assignment(db, keyspace, *value_spec))) {
throw exceptions::invalid_request_exception(format("Invalid set literal for {}: value {} is not of type {}", *receiver.name, *rt, value_spec->type->as_cql3_type()));
}
}
}
assignment_testable::test_result
sets::literal::test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const {
if (!dynamic_pointer_cast<const set_type_impl>(receiver->type)) {
sets::literal::test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const {
if (!dynamic_pointer_cast<const set_type_impl>(receiver.type)) {
// We've parsed empty maps as a set literal to break the ambiguity so handle that case now
if (dynamic_pointer_cast<const map_type_impl>(receiver->type) && _elements.empty()) {
if (dynamic_pointer_cast<const map_type_impl>(receiver.type) && _elements.empty()) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
@@ -120,10 +120,10 @@ sets::literal::test_assignment(database& db, const sstring& keyspace, lw_shared_
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
auto&& value_spec = value_spec_of(*receiver);
auto&& value_spec = value_spec_of(receiver);
// FIXME: make assignment_testable::test_all() accept ranges
std::vector<shared_ptr<assignment_testable>> to_test(_elements.begin(), _elements.end());
return assignment_testable::test_all(db, keyspace, value_spec, to_test);
return assignment_testable::test_all(db, keyspace, *value_spec, to_test);
}
sstring


@@ -67,7 +67,7 @@ public:
virtual shared_ptr<term> prepare(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override;
void validate_assignable_to(database& db, const sstring& keyspace, const column_specification& receiver) const;
assignment_testable::test_result
test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const;
test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const;
virtual sstring to_string() const override;
};


@@ -108,10 +108,6 @@ public:
return _entity;
}
::shared_ptr<term::raw> get_map_key() {
return _map_key;
}
::shared_ptr<term::raw> get_value() {
return _value;
}


@@ -294,6 +294,12 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
}
const auto& ks = db.find_keyspace(keyspace());
auto replication_type = ks.get_replication_strategy().get_type();
if (is_local_only && replication_type != locator::replication_strategy_type::local) {
throw std::logic_error(format("Internal queries should not try to alter table schema for non-local tables, because it leads to inconsistencies: {}.{}",
s->ks_name(), s->cf_name()));
}
auto cfm = schema_builder(s);
if (_properties->get_id()) {


@@ -161,7 +161,7 @@ void batch_statement::validate()
|| (boost::distance(_statements
| boost::adaptors::transformed([] (auto&& s) { return s.statement->column_family(); })
| boost::adaptors::uniqued) != 1))) {
throw exceptions::invalid_request_exception("Batch with conditions cannot span multiple tables");
throw exceptions::invalid_request_exception("BATCH with conditions cannot span multiple tables");
}
std::optional<bool> raw_counter;
for (auto& s : _statements) {


@@ -146,6 +146,10 @@ void cf_prop_defs::validate(const database& db, const schema::extensions_map& sc
cp.validate();
}
if (auto caching_options = get_caching_options(); caching_options && !caching_options->enabled() && !db.features().cluster_supports_per_table_caching()) {
throw exceptions::configuration_exception(KW_CACHING + " can't contain \"'enabled':false\" unless whole cluster supports it");
}
auto cdc_options = get_cdc_options(schema_extensions);
if (cdc_options && cdc_options->enabled() && !db.features().cluster_supports_cdc()) {
throw exceptions::configuration_exception("CDC not supported by the cluster");
@@ -200,6 +204,21 @@ std::optional<utils::UUID> cf_prop_defs::get_id() const {
return std::nullopt;
}
std::optional<caching_options> cf_prop_defs::get_caching_options() const {
auto value = get(KW_CACHING);
if (!value) {
return {};
}
return std::visit(make_visitor(
[] (const property_definitions::map_type& map) {
return map.empty() ? std::nullopt : std::optional<caching_options>(caching_options::from_map(map));
},
[] (const sstring& str) {
return std::optional<caching_options>(caching_options::from_sstring(str));
}
), *value);
}
const cdc::options* cf_prop_defs::get_cdc_options(const schema::extensions_map& schema_exts) const {
auto it = schema_exts.find(cdc::cdc_extension::NAME);
if (it == schema_exts.end()) {
@@ -286,11 +305,10 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_
builder.set_compressor_params(compression_parameters(*compression_options));
}
#if 0
CachingOptions cachingOptions = getCachingOptions();
if (cachingOptions != null)
cfm.caching(cachingOptions);
#endif
auto caching_options = get_caching_options();
if (caching_options) {
builder.set_caching_options(std::move(*caching_options));
}
// for extensions that are not altered, keep the old ones
auto& old_exts = builder.get_extensions();


@@ -95,6 +95,7 @@ public:
std::map<sstring, sstring> get_compaction_options() const;
std::optional<std::map<sstring, sstring>> get_compression_options() const;
const cdc::options* get_cdc_options(const schema::extensions_map&) const;
std::optional<caching_options> get_caching_options() const;
#if 0
public CachingOptions getCachingOptions() throws SyntaxException, ConfigurationException
{


@@ -122,7 +122,7 @@ delete_statement::prepare_internal(database& db, schema_ptr schema, variable_spe
delete_statement::delete_statement(::shared_ptr<cf_name> name,
std::unique_ptr<attributes::raw> attrs,
std::vector<::shared_ptr<operation::raw_deletion>> deletions,
std::vector<std::unique_ptr<operation::raw_deletion>> deletions,
std::vector<::shared_ptr<relation>> where_clause,
conditions_vector conditions,
bool if_exists)


@@ -90,7 +90,7 @@ cql3::statements::list_users_statement::execute(service::storage_proxy& proxy, s
return do_for_each(sorted_roles, [&as, &results](const sstring& role) {
return when_all_succeed(
as.has_superuser(role),
as.underlying_role_manager().can_login(role)).then([&results, &role](bool super, bool login) {
as.underlying_role_manager().can_login(role)).then_unpack([&results, &role](bool super, bool login) {
if (login) {
results->add_column_value(utf8_type->decompose(role));
results->add_column_value(boolean_type->decompose(super));


@@ -51,7 +51,6 @@
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/indirected.hpp>
#include "db/config.hh"
#include "service/storage_service.hh"
#include "transport/messages/result_message.hh"
#include "database.hh"
#include <seastar/core/execution_stage.hh>
@@ -266,7 +265,7 @@ dht::partition_range_vector
modification_statement::build_partition_keys(const query_options& options, const json_cache_opt& json_cache) const {
auto keys = _restrictions->get_partition_key_restrictions()->bounds_ranges(options);
for (auto const& k : keys) {
validation::validate_cql_key(s, *k.start()->value().key());
validation::validate_cql_key(*s, *k.start()->value().key());
}
return keys;
}


@@ -109,6 +109,13 @@ bool property_definitions::has_property(const sstring& name) const {
return _properties.find(name) != _properties.end();
}
std::optional<property_definitions::value_type> property_definitions::get(const sstring& name) const {
if (auto it = _properties.find(name); it != _properties.end()) {
return it->second;
}
return std::nullopt;
}
sstring property_definitions::get_string(sstring key, sstring default_value) const {
auto value = get_simple(key);
if (value) {


@@ -86,6 +86,8 @@ protected:
public:
bool has_property(const sstring& name) const;
std::optional<value_type> get(const sstring& name) const;
sstring get_string(sstring key, sstring default_value) const;
// Return a property value, typed as a Boolean


@@ -55,12 +55,12 @@ namespace raw {
class delete_statement : public modification_statement {
private:
std::vector<::shared_ptr<operation::raw_deletion>> _deletions;
std::vector<std::unique_ptr<operation::raw_deletion>> _deletions;
std::vector<::shared_ptr<relation>> _where_clause;
public:
delete_statement(::shared_ptr<cf_name> name,
std::unique_ptr<attributes::raw> attrs,
std::vector<::shared_ptr<operation::raw_deletion>> deletions,
std::vector<std::unique_ptr<operation::raw_deletion>> deletions,
std::vector<::shared_ptr<relation>> where_clause,
conditions_vector conditions,
bool if_exists);


@@ -62,7 +62,7 @@ namespace raw {
class update_statement : public raw::modification_statement {
private:
// Provided for an UPDATE
std::vector<std::pair<::shared_ptr<column_identifier::raw>, ::shared_ptr<operation::raw_update>>> _updates;
std::vector<std::pair<::shared_ptr<column_identifier::raw>, std::unique_ptr<operation::raw_update>>> _updates;
std::vector<relation_ptr> _where_clause;
public:
/**
@@ -76,7 +76,7 @@ public:
*/
update_statement(::shared_ptr<cf_name> name,
std::unique_ptr<attributes::raw> attrs,
std::vector<std::pair<::shared_ptr<column_identifier::raw>, ::shared_ptr<operation::raw_update>>> updates,
std::vector<std::pair<::shared_ptr<column_identifier::raw>, std::unique_ptr<operation::raw_update>>> updates,
std::vector<relation_ptr> where_clause,
conditions_vector conditions, bool if_exists);
protected:


@@ -375,7 +375,7 @@ list_roles_statement::execute(service::storage_proxy&, service::query_state& sta
return when_all_succeed(
rm.can_login(role),
rm.is_superuser(role),
a.query_custom_options(role)).then([&results, &role](
a.query_custom_options(role)).then_unpack([&results, &role](
bool login,
bool super,
auth::custom_options os) {


@@ -59,6 +59,7 @@
#include "db/timeout_clock.hh"
#include "db/consistency_level_validations.hh"
#include "database.hh"
#include "test/lib/select_statement_utils.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
bool is_system_keyspace(const sstring& name);
@@ -67,6 +68,8 @@ namespace cql3 {
namespace statements {
static constexpr int DEFAULT_INTERNAL_PAGING_SIZE = select_statement::DEFAULT_COUNT_PAGE_SIZE;
thread_local int internal_paging_size = DEFAULT_INTERNAL_PAGING_SIZE;
thread_local const lw_shared_ptr<const select_statement::parameters> select_statement::_default_parameters = make_lw_shared<select_statement::parameters>();
select_statement::parameters::parameters()
@@ -333,7 +336,7 @@ select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
const bool nonpaged_filtering = restrictions_need_filtering && page_size <= 0;
if (aggregate || nonpaged_filtering) {
page_size = DEFAULT_COUNT_PAGE_SIZE;
page_size = internal_paging_size;
}
auto key_ranges = _restrictions->get_partition_key_ranges(options);
@@ -360,7 +363,7 @@ select_statement::do_execute(service::storage_proxy& proxy,
command->slice.options.set<query::partition_slice::option::allow_short_read>();
auto timeout_duration = options.get_timeout_config().*get_timeout_config_selector();
auto p = service::pager::query_pagers::pager(_schema, _selection,
state, options, command, std::move(key_ranges), _stats, restrictions_need_filtering ? _restrictions : nullptr);
state, options, command, std::move(key_ranges), restrictions_need_filtering ? _restrictions : nullptr);
if (aggregate || nonpaged_filtering) {
return do_with(
@@ -372,10 +375,11 @@ select_statement::do_execute(service::storage_proxy& proxy,
auto timeout = db::timeout_clock::now() + timeout_duration;
return p->fetch_page(builder, page_size, now, timeout);
}
).then([this, &builder, restrictions_need_filtering] {
return builder.with_thread_if_needed([this, &builder, restrictions_need_filtering] {
).then([this, p, &builder, restrictions_need_filtering] {
return builder.with_thread_if_needed([this, p, &builder, restrictions_need_filtering] {
auto rs = builder.build();
if (restrictions_need_filtering) {
_stats.filtered_rows_read_total += p->stats().rows_read_total;
_stats.filtered_rows_matched_total += rs->size();
}
update_stats_rows_read(rs->size());
@@ -419,6 +423,7 @@ select_statement::do_execute(service::storage_proxy& proxy,
}
if (restrictions_need_filtering) {
_stats.filtered_rows_read_total += p->stats().rows_read_total;
_stats.filtered_rows_matched_total += rs->size();
}
update_stats_rows_read(rs->size());
@@ -428,9 +433,7 @@ select_statement::do_execute(service::storage_proxy& proxy,
}
template<typename KeyType>
GCC6_CONCEPT(
requires (std::is_same_v<KeyType, partition_key> || std::is_same_v<KeyType, clustering_key_prefix>)
)
requires (std::is_same_v<KeyType, partition_key> || std::is_same_v<KeyType, clustering_key_prefix>)
static KeyType
generate_base_key_from_index_pk(const partition_key& index_pk, const std::optional<clustering_key>& index_ck, const schema& base_schema, const schema& view_schema) {
const auto& base_columns = std::is_same_v<KeyType, partition_key> ? base_schema.partition_key_columns() : base_schema.clustering_key_columns();
@@ -530,13 +533,29 @@ indexed_table_select_statement::do_execute_base_query(
if (old_paging_state && concurrency == 1) {
auto base_pk = generate_base_key_from_index_pk<partition_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
auto row_ranges = command->slice.default_row_ranges();
if (old_paging_state->get_clustering_key() && _schema->clustering_key_size() > 0) {
auto base_ck = generate_base_key_from_index_pk<clustering_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
command->slice.set_range(*_schema, base_pk,
std::vector<query::clustering_range>{query::clustering_range::make_starting_with(range_bound<clustering_key>(base_ck, false))});
query::trim_clustering_row_ranges_to(*_schema, row_ranges, base_ck, false);
command->slice.set_range(*_schema, base_pk, row_ranges);
} else {
command->slice.set_range(*_schema, base_pk, std::vector<query::clustering_range>{query::clustering_range::make_open_ended_both_sides()});
// There is no clustering key in old_paging_state and/or no clustering key in
// _schema, therefore read an entire partition (whole clustering range).
//
// The only exception to applying no restrictions on the clustering key
// is the case where we have a secondary index on the first column
// of the clustering key. In such a case we should not read the
// entire clustering range - only the range in which the first column
// of the clustering key has the correct value.
//
// This means that we should not set an open_ended_both_sides
// clustering range on base_pk, but instead intersect it with
// _row_ranges (which contains the restrictions necessary for the
// case described above). The result of such an intersection is just
// _row_ranges, which we explicitly set on base_pk.
command->slice.set_range(*_schema, base_pk, row_ranges);
}
}
concurrency *= 2;
@@ -844,9 +863,7 @@ indexed_table_select_statement::indexed_table_select_statement(schema_ptr schema
}
template<typename KeyType>
GCC6_CONCEPT(
requires (std::is_same_v<KeyType, partition_key> || std::is_same_v<KeyType, clustering_key_prefix>)
)
requires (std::is_same_v<KeyType, partition_key> || std::is_same_v<KeyType, clustering_key_prefix>)
static void append_base_key_to_index_ck(std::vector<bytes_view>& exploded_index_ck, const KeyType& base_key, const column_definition& index_cdef) {
auto key_view = base_key.view();
auto begin = key_view.begin();
@@ -976,36 +993,41 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
if (aggregate) {
const bool restrictions_need_filtering = _restrictions->need_filtering();
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format()), std::make_unique<cql3::query_options>(cql3::query_options(options)),
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format(), *_group_by_cell_indices), std::make_unique<cql3::query_options>(cql3::query_options(options)),
[this, &options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] (cql3::selection::result_set_builder& builder, std::unique_ptr<cql3::query_options>& internal_options) {
// page size is set to the internal count page size, regardless of the user-provided value
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), DEFAULT_COUNT_PAGE_SIZE));
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), internal_paging_size));
return repeat([this, &builder, &options, &internal_options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] () {
auto consume_results = [this, &builder, &options, &internal_options, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
auto consume_results = [this, &builder, &options, &internal_options, &proxy, &state, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd, lw_shared_ptr<const service::pager::paging_state> paging_state) {
if (paging_state) {
paging_state = generate_view_paging_state_from_base_query_results(paging_state, results, proxy, state, options);
}
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
if (restrictions_need_filtering) {
_stats.filtered_rows_read_total += *results->row_count();
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection,
cql3::selection::result_set_builder::restrictions_filter(_restrictions, options, cmd->row_limit, _schema, cmd->slice.partition_row_limit())));
} else {
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection));
}
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
return stop_iteration(!has_more_pages);
};
if (whole_partitions || partition_slices) {
return find_index_partition_ranges(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (dht::partition_range_vector partition_ranges, lw_shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, paging_state)
.then([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
});
});
} else {
return find_index_clustering_rows(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (std::vector<primary_key> primary_keys, lw_shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, paging_state)
.then([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
});
});
}
@@ -1172,7 +1194,7 @@ indexed_table_select_statement::read_posting_list(service::storage_proxy& proxy,
}
auto p = service::pager::query_pagers::pager(_view_schema, selection,
state, options, cmd, std::move(partition_ranges), _stats, nullptr);
state, options, cmd, std::move(partition_ranges), nullptr);
return p->fetch_page(options.get_page_size(), now, timeout).then([p, &options, limit, now] (std::unique_ptr<cql3::result_set> rs) {
rs->get_metadata().set_paging_state(p->state());
return ::make_shared<cql_transport::messages::result_message::rows>(result(std::move(rs)));
@@ -1662,6 +1684,16 @@ std::vector<size_t> select_statement::prepare_group_by(const schema& schema, sel
}
future<> set_internal_paging_size(int paging_size) {
return seastar::smp::invoke_on_all([paging_size] {
internal_paging_size = paging_size;
});
}
future<> reset_internal_paging_size() {
return set_internal_paging_size(DEFAULT_INTERNAL_PAGING_SIZE);
}
}
namespace util {


@@ -379,7 +379,7 @@ insert_json_statement::prepare_internal(database& db, schema_ptr schema,
update_statement::update_statement(::shared_ptr<cf_name> name,
std::unique_ptr<attributes::raw> attrs,
std::vector<std::pair<::shared_ptr<column_identifier::raw>, ::shared_ptr<operation::raw_update>>> updates,
std::vector<std::pair<::shared_ptr<column_identifier::raw>, std::unique_ptr<operation::raw_update>>> updates,
std::vector<relation_ptr> where_clause,
conditions_vector conditions, bool if_exists)
: raw::modification_statement(std::move(name), std::move(attrs), std::move(conditions), false, if_exists)


@@ -82,15 +82,15 @@ public:
auto&& value = _elements[i];
auto&& spec = component_spec_of(receiver, i);
if (!assignment_testable::is_assignable(value->test_assignment(db, keyspace, spec))) {
if (!assignment_testable::is_assignable(value->test_assignment(db, keyspace, *spec))) {
throw exceptions::invalid_request_exception(format("Invalid tuple literal for {}: component {:d} is not of type {}", receiver.name, i, spec->type->as_cql3_type()));
}
}
}
public:
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override {
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override {
try {
validate_assignable_to(db, keyspace, *receiver);
validate_assignable_to(db, keyspace, receiver);
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
} catch (exceptions::invalid_request_exception& e) {
return assignment_testable::test_result::NOT_ASSIGNABLE;


@@ -53,10 +53,10 @@ public:
}
virtual shared_ptr<term> prepare(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override {
if (!is_assignable(_term->test_assignment(db, keyspace, casted_spec_of(db, keyspace, *receiver)))) {
if (!is_assignable(_term->test_assignment(db, keyspace, *casted_spec_of(db, keyspace, *receiver)))) {
throw exceptions::invalid_request_exception(format("Cannot cast value {} to type {}", _term, _type));
}
if (!is_assignable(test_assignment(db, keyspace, receiver))) {
if (!is_assignable(test_assignment(db, keyspace, *receiver))) {
throw exceptions::invalid_request_exception(format("Cannot assign value {} to {} of type {}", *this, receiver->name, receiver->type->as_cql3_type()));
}
return _term->prepare(db, keyspace, receiver);
@@ -67,12 +67,12 @@ private:
::make_shared<column_identifier>(to_string(), true), _type->prepare(db, keyspace).get_type());
}
public:
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override {
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override {
try {
auto&& casted_type = _type->prepare(db, keyspace).get_type();
if (receiver->type == casted_type) {
if (receiver.type == casted_type) {
return assignment_testable::test_result::EXACT_MATCH;
} else if (receiver->type->is_value_compatible_with(*casted_type)) {
} else if (receiver.type->is_value_compatible_with(*casted_type)) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
} else {
return assignment_testable::test_result::NOT_ASSIGNABLE;


@@ -47,14 +47,14 @@
#include "result_set.hh"
#include "transport/messages/result_message.hh"
cql3::untyped_result_set_row::untyped_result_set_row(const std::unordered_map<sstring, bytes_opt>& data)
cql3::untyped_result_set_row::untyped_result_set_row(const map_t& data)
: _data(data)
{}
cql3::untyped_result_set_row::untyped_result_set_row(const std::vector<lw_shared_ptr<column_specification>>& columns, std::vector<bytes_opt> data)
: _columns(columns)
, _data([&columns, data = std::move(data)] () mutable {
std::unordered_map<sstring, bytes_opt> tmp;
map_t tmp;
std::transform(columns.begin(), columns.end(), data.begin(), std::inserter(tmp, tmp.end()), [](lw_shared_ptr<column_specification> c, bytes_opt& d) {
return std::make_pair<sstring, bytes_opt>(c->name->to_string(), std::move(d));
});
@@ -62,7 +62,7 @@ cql3::untyped_result_set_row::untyped_result_set_row(const std::vector<lw_shared
}())
{}
bool cql3::untyped_result_set_row::has(const sstring& name) const {
bool cql3::untyped_result_set_row::has(std::string_view name) const {
auto i = _data.find(name);
return i != _data.end() && i->second;
}


@@ -47,6 +47,8 @@
#include "types/list.hh"
#include "types/set.hh"
#include "transport/messages/result_message_base.hh"
#include "column_specification.hh"
#include "absl-flat_hash_map.hh"
#pragma once
@@ -55,26 +57,27 @@ namespace cql3 {
class untyped_result_set_row {
private:
const std::vector<lw_shared_ptr<column_specification>> _columns;
const std::unordered_map<sstring, bytes_opt> _data;
using map_t = flat_hash_map<sstring, bytes_opt>;
const map_t _data;
public:
untyped_result_set_row(const std::unordered_map<sstring, bytes_opt>&);
untyped_result_set_row(const map_t&);
untyped_result_set_row(const std::vector<lw_shared_ptr<column_specification>>&, std::vector<bytes_opt>);
untyped_result_set_row(untyped_result_set_row&&) = default;
untyped_result_set_row(const untyped_result_set_row&) = delete;
bool has(const sstring&) const;
bytes_view get_view(const sstring& name) const {
bool has(std::string_view) const;
bytes_view get_view(std::string_view name) const {
return *_data.at(name);
}
bytes get_blob(const sstring& name) const {
bytes get_blob(std::string_view name) const {
return bytes(get_view(name));
}
template<typename T>
T get_as(const sstring& name) const {
T get_as(std::string_view name) const {
return value_cast<T>(data_type_for<T>()->deserialize(get_view(name)));
}
template<typename T>
std::optional<T> get_opt(const sstring& name) const {
std::optional<T> get_opt(std::string_view name) const {
return has(name) ? get_as<T>(name) : std::optional<T>{};
}
bytes_view_opt get_view_opt(const sstring& name) const {
@@ -84,13 +87,13 @@ public:
return std::nullopt;
}
template<typename T>
T get_or(const sstring& name, T t) const {
T get_or(std::string_view name, T t) const {
return has(name) ? get_as<T>(name) : t;
}
// this could maybe be done as an overload of get_as (or something), but that just
// muddles things for no real gain. Let user (us) attempt to know what he is doing instead.
template<typename K, typename V, typename Iter>
void get_map_data(const sstring& name, Iter out, data_type keytype =
void get_map_data(std::string_view name, Iter out, data_type keytype =
data_type_for<K>(), data_type valtype =
data_type_for<V>()) const {
auto vec =
@@ -103,7 +106,7 @@ public:
});
}
template<typename K, typename V, typename ... Rest>
std::unordered_map<K, V, Rest...> get_map(const sstring& name,
std::unordered_map<K, V, Rest...> get_map(std::string_view name,
data_type keytype = data_type_for<K>(), data_type valtype =
data_type_for<V>()) const {
std::unordered_map<K, V, Rest...> res;
@@ -111,7 +114,7 @@ public:
return res;
}
template<typename V, typename Iter>
void get_list_data(const sstring& name, Iter out, data_type valtype = data_type_for<V>()) const {
void get_list_data(std::string_view name, Iter out, data_type valtype = data_type_for<V>()) const {
auto vec =
value_cast<list_type_impl::native_type>(
list_type_impl::get_instance(valtype, false)->deserialize(
@@ -119,13 +122,13 @@ public:
std::transform(vec.begin(), vec.end(), out, [](auto& v) { return value_cast<V>(v); });
}
template<typename V, typename ... Rest>
std::vector<V, Rest...> get_list(const sstring& name, data_type valtype = data_type_for<V>()) const {
std::vector<V, Rest...> get_list(std::string_view name, data_type valtype = data_type_for<V>()) const {
std::vector<V, Rest...> res;
get_list_data<V>(name, std::back_inserter(res), valtype);
return res;
}
template<typename V, typename Iter>
void get_set_data(const sstring& name, Iter out, data_type valtype =
void get_set_data(std::string_view name, Iter out, data_type valtype =
data_type_for<V>()) const {
auto vec =
value_cast<set_type_impl::native_type>(
@@ -137,7 +140,7 @@ public:
});
}
template<typename V, typename ... Rest>
std::unordered_set<V, Rest...> get_set(const sstring& name,
std::unordered_set<V, Rest...> get_set(std::string_view name,
data_type valtype =
data_type_for<V>()) const {
std::unordered_set<V, Rest...> res;


@@ -122,15 +122,15 @@ void user_types::literal::validate_assignable_to(database& db, const sstring& ke
}
const shared_ptr<term::raw>& value = _entries.at(field);
auto&& field_spec = field_spec_of(receiver, i);
if (!assignment_testable::is_assignable(value->test_assignment(db, keyspace, field_spec))) {
if (!assignment_testable::is_assignable(value->test_assignment(db, keyspace, *field_spec))) {
throw exceptions::invalid_request_exception(format("Invalid user type literal for {}: field {} is not of type {}", receiver.name, field, field_spec->type->as_cql3_type()));
}
}
}
assignment_testable::test_result user_types::literal::test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const {
assignment_testable::test_result user_types::literal::test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const {
try {
validate_assignable_to(db, keyspace, *receiver);
validate_assignable_to(db, keyspace, receiver);
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
} catch (exceptions::invalid_request_exception& e) {
return assignment_testable::test_result::NOT_ASSIGNABLE;


@@ -67,7 +67,7 @@ public:
private:
void validate_assignable_to(database& db, const sstring& keyspace, const column_specification& receiver) const;
public:
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const override;
virtual assignment_testable::test_result test_assignment(database& db, const sstring& keyspace, const column_specification& receiver) const override;
virtual sstring assignment_testable_source_context() const override;
virtual sstring to_string() const override;
};


@@ -292,12 +292,10 @@ public:
     /// \arg data needs to remain valid as long as the writer is in use.
     /// \returns imr::WriterAllocator for cell::structure.
     template<typename FragmentRange, typename = std::enable_if_t<is_fragment_range_v<std::decay_t<FragmentRange>>>>
-    GCC6_CONCEPT(
-        requires std::is_nothrow_move_constructible_v<std::decay_t<FragmentRange>> &&
-                 std::is_nothrow_copy_constructible_v<std::decay_t<FragmentRange>> &&
-                 std::is_nothrow_copy_assignable_v<std::decay_t<FragmentRange>> &&
-                 std::is_nothrow_move_assignable_v<std::decay_t<FragmentRange>>
-    )
+    requires std::is_nothrow_move_constructible_v<std::decay_t<FragmentRange>> &&
+             std::is_nothrow_copy_constructible_v<std::decay_t<FragmentRange>> &&
+             std::is_nothrow_copy_assignable_v<std::decay_t<FragmentRange>> &&
+             std::is_nothrow_move_assignable_v<std::decay_t<FragmentRange>>
     static auto make_collection(FragmentRange data) noexcept {
         return [data = std::move(data)] (auto&& serializer, auto&& allocations) noexcept {
             return serializer

@@ -86,12 +86,10 @@ public:
     { }
     template<typename Serializer, typename Allocator>
-    GCC6_CONCEPT(
-        requires (imr::is_sizer_for_v<cell::variable_value::structure, Serializer>
-                  && std::is_same_v<Allocator, imr::alloc::object_allocator::sizer>)
-              || (imr::is_serializer_for_v<cell::variable_value::structure, Serializer>
-                  && std::is_same_v<Allocator, imr::alloc::object_allocator::serializer>)
-    )
+    requires (imr::is_sizer_for_v<cell::variable_value::structure, Serializer>
+              && std::is_same_v<Allocator, imr::alloc::object_allocator::sizer>)
+          || (imr::is_serializer_for_v<cell::variable_value::structure, Serializer>
+              && std::is_same_v<Allocator, imr::alloc::object_allocator::serializer>)
     auto operator()(Serializer serializer, Allocator allocations) {
         auto after_size = serializer.serialize(_value_size);
         if (_force_internal || _value_size <= cell::maximum_internal_storage_length) {
@@ -134,14 +134,14 @@ public:
 inline value_writer<empty_fragment_range> cell::variable_value::write(size_t value_size, bool force_internal) noexcept
 {
-    GCC6_CONCEPT(static_assert(imr::WriterAllocator<value_writer<empty_fragment_range>, structure>));
+    static_assert(imr::WriterAllocator<value_writer<empty_fragment_range>, structure>);
     return value_writer<empty_fragment_range>(empty_fragment_range(), value_size, force_internal);
 }
 template<typename FragmentRange>
 inline value_writer<std::decay_t<FragmentRange>> cell::variable_value::write(FragmentRange&& value, bool force_internal) noexcept
 {
-    GCC6_CONCEPT(static_assert(imr::WriterAllocator<value_writer<std::decay_t<FragmentRange>>, structure>));
+    static_assert(imr::WriterAllocator<value_writer<std::decay_t<FragmentRange>>, structure>);
     return value_writer<std::decay_t<FragmentRange>>(std::forward<FragmentRange>(value), value.size_bytes(), force_internal);
 }

Some files were not shown because too many files have changed in this diff.