Compare commits

...

156 Commits

Author SHA1 Message Date
Amos Kong
44ec73cfc4 schema.cc/describe: fix invalid compaction options in schema
There is a typo in schema.cql of snapshot, lack of comma after
compaction strategy. It will fail to restore schema by the file.

    AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}

map_as_cql_param() function has a `first` parameter to smartly add
comma, the compaction_strategy_options is always not the first.

Fixes #7741

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7734

(cherry picked from commit 6b1659ee80)
2021-03-24 12:58:11 +02:00
Tomasz Grabiec
df6f9a200f sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmeless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries wich violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods, slicing using clustering key restricitons, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!

They are wrong because the promoted index contians entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.

The explanation for why the test started to fail after dynamic
adjustements is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustements causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>

(cherry picked from commit 9272e74e8c)
2021-03-24 10:42:11 +02:00
Nadav Har'El
2f4a3c271c storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
2021-03-21 10:51:36 +02:00
Raphael S. Carvalho
6a11c20b4a LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
2021-03-18 19:20:10 +02:00
Raphael S. Carvalho
cccdd6aaae compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the patch aforementioned.

So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.

Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.

Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exact the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
2021-03-18 14:29:38 +02:00
Raphael S. Carvalho
92871a88c3 compaction: Prevent cleanup and regular from compacting the same sstable
Due to regression introduced by 463d0ab, regular can compact in parallel a sstable
being compacted by cleanup, scrub or upgrade.

This redundancy causes resources to be wasted, write amplification is increased
and so does the operation time, etc.

That's a potential source of data resurrection because the now-owned data from
a sstable being compacted by both cleanup and regular will still exist in the
node afterwards, so resurrection can happen if node regains ownership.

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
(cherry picked from commit 2cf0c4bbf1)

Includes fixup patch:

compaction_manager: Fix use-after-free in rewrite_sstables()

Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
(cherry picked from commit f7cc431477)
2021-03-11 08:24:56 +02:00
Benny Halevy
85bbf6751d repair: repair_writer: do not capture lw_shared_ptr cross-shard
The shared_from_this lw_shared_ptr must not be accessed
across shards.  Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that since the captured lw_shared_ptr
is copied on other shards, and ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out-of-sync when incremented/decremented in parallel
on other shards with no synchronization.

This was introduced in 289a08072a.

The writer is not needed in the body of this lambda anyways
so it doesn't need to capture it.  It is already held
by the continuations until the end of the chain.

Fixes #7535

Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
(cherry picked from commit f93fb55726)
2021-03-03 21:27:44 +02:00
Hagit Segev
0ac069fdcc release: prepare for 4.2.4 2021-03-02 14:52:31 +02:00
Avi Kivity
738f8eaccd Update seastar submodule
* seastar 1266e42c82...0fba7da929 (1):
  > io_queue: Fix "delay" metrics

Fixes #8166.
2021-03-01 13:59:02 +02:00
Avi Kivity
5d32e91e16 Update seastar submodule
* seastar f760efe0a0...1266e42c82 (1):
  > rpc: streaming sink: order outgoing messages

Fixes #7552.
2021-03-01 12:22:17 +02:00
Benny Halevy
6c5f6b3f69 large_data_handler: disable deletion of large data entries
Currently we decide whether to delete large data entries
based on the overall sstable data_size, since the entries
themselves are typically much smaller than the whole sstable
(especially cells and rows), this causes overzealous
deletions (#7668) and inefficiency in the rows cache
due to the large number of range tombstones created.

Refs #7575

Test: sstable_3_x_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

This patch is targetted for branch-4.3 or earlier.
In 4.4, the problem was fixed in #7669, but the fix
is out of scope for backporting.

Branch: 4.3
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201203130018.1920271-1-bhalevy@scylladb.com>
(cherry picked from commit bb99d7ced6)
2021-03-01 10:54:33 +02:00
Raphael S. Carvalho
fba26b78d2 sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, size of multi_window was incorrectly used and
it was always empty in this path, which means candidates was always
cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
(cherry picked from commit 21608bd677)
2021-02-28 16:43:02 +02:00
Pavel Solodovnikov
06e785994f large_data_handler: fix segmentation fault when constructing data_value from a nullptr
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:

In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.

These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.

This uses `const char*` overload which delegates to
`std::string_view` ctor overload. It is UB to pass `nullptr`
pointer to `std::string_view` ctor. Hence leading to
segmentation faults in the aforementioned large data reporting
code.

What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.

A regression test is provided for the issue (written in
`cql-pytest` framework).

Tests: test/cql-pytest/test_large_cells_rows.py

Fixes: #6780

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 219ac2bab5)
2021-02-23 12:14:12 +02:00
Takuya ASADA
5bc48673aa scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_paritions() reports / is /dev/root, aws_instance mistakenly
reports root partition is part of ephemeral disks, and RAID construction will
fail.
This prevents the error and reports correct free disks.

Fixes #8055

Closes #8040

(cherry picked from commit 32d4ec6b8a)
2021-02-21 16:23:45 +02:00
Nadav Har'El
59a01b2981 alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted it a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in once case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.

The old duplicated code for check_BEGINS_WITH() - which a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 653610f4bc)
2021-02-21 10:06:50 +02:00
Nadav Har'El
5dd49788c1 alternator: fix UpdateItem ADD for non-existent attribute
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).

We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.

Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(

Fixes #7763.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
(cherry picked from commit a8fdbf31cd)
2021-02-21 08:58:49 +02:00
Benny Halevy
56cbc9f3ed stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
(cherry picked from commit d01e7e7b58)
2021-02-14 13:11:43 +02:00
Avi Kivity
7469896017 table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
on_compaction_completion() updates _sstables_compacted_but_not_deleted
through a temporary to avoid an exception causing a partial update:

  1. copy _sstables_compacted_but_not_deleted to a temporary
  2. update temporary
  3. do dangerous stuff
  4. move temporary to _sstables_compacted_but_not_deleted

This is racy when we have parallel compactions, since step 3 yields.
We can have two invocations running in parallel, taking snapshots
of the same _sstables_compacted_but_not_deleted in step 1, each
modifying it in different ways, and only one of them winning the
race and assigning in step 4. With the right timing we can end
with extra sstables in _sstables_compacted_but_not_deleted.

Before a5369881b3, this was a benign race (only resulting in
deleted file space not being reclaimed until the service is shut
down), but afterwards, extra sstable references result in the service
refusing to shut down. This was observed in database_test in debug
mode, where the race more or less reliably happens for system.truncated.

Fix by using a different method to protect
_sstables_compacted_but_not_deleted. We unconditionally update it,
and also unconditionally fix it up (on success or failure) using
seastar::defer(). The fixup includes a call to rebuild_statistics()
which must happen every time we touch the sstable list.

Ref #7331.
Fixes #8038.

BACKPORT NOTES:
- Turns out this race prevented deletion of expired sstables because the leaked
deleted sstables would be accounted when checking if an expired sstable can
be purged.
- Switch to unordered_set<>::count() as it's not supported by older compilers.

(cherry picked from commit a43d5079f3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210212203832.45846-1-raphaelsc@scylladb.com>
2021-02-14 11:35:57 +02:00
Piotr Wojtczak
c7e2711dd4 Validate ascii values when creating from CQL
Although the code for it existed already, the validation function
hasn't been invoked properly. This change fixes that, adding
a validating check when converting from text to specific value
type and throwing a marshal exception if some characters
are not ASCII.

Fixes #5421

Closes #7532

(cherry picked from commit caa3c471c0)
2021-02-10 19:37:56 +02:00
Piotr Dulikowski
a2355a35db hinted handoff: use default timeout for sending orphaned hints
This patch causes orphaned hints (hints that were written towards a node
that is no longer their replica) to be sent with a default write
timeout. This is what is currently done for non-orphaned hints.

Previously, the timeout was hardcoded to one hour. This could cause a
long delay while shutting down, as hints manager waits until all ongoing
hint sending operation finish before stopping itself.

Fixes: #7051
(cherry picked from commit b111fa98ca)
2021-02-10 10:15:01 +02:00
Piotr Sarna
9e225ab447 Merge 'select_statement: Fix aggregate results on indexed selects (timeouts fixed) ' from Piotr Grabowski
Overview
Fixes #7355.

Before this changes, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).

Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves forward fixing of that issue, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.

It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page).

GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
 pk
----
  1
  1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.

Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).

The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
 count
-------
     2
```
The second commit fixes this issue by properly trimming `row_ranges`.

The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`.

The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query).

The second commit fixes the first two failing examples from issue #7355.

Tests (fourth commit)
Extensively tests queries on tables with secondary indices with  aggregates and `GROUP BY`s.

Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).

I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact of those fixed bugs, therefore the tests validate a large number of those scenarios.

Configurable internal_paging_size (third commit)
Before this change, internal `page_size` when doing aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000).  This change adds new internal_paging_size variable, which is configurable by `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.

Closes #7497

* github.com:scylladb/scylla:
  tests: Add secondary index aggregates tests
  select_statement: Introduce internal_paging_size
  select_statement: Fix paging on indexed selects
  select_statement: Fix GROUP BY on indexed select

(cherry picked from commit 8c645f74ce)
2021-02-08 20:32:36 +02:00
Amnon Heiman
e1205d1d5b API: Fix aggregation in column_familiy
Few method in column_familiy API were doing the aggregation wrong,
specifically, bloom filter disk size.

The issue is not always visible, it happens when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007

(cherry picked from commit 4498bb0a48)
2021-02-08 17:04:27 +02:00
Avi Kivity
a78402efae Merge 'Add waiting for flushes on table drops' from Piotr Sarna
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush  (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault

Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)

Fixes #7792

Closes #7798

* github.com:scylladb/scylla:
  database: add flushes to waiting for pending operations
  table: unify waiting for pending operations
  database: add a phaser for flush operations
  database: add waiting for pending streams on table drop

(cherry picked from commit 7636799b18)
2021-02-02 17:23:34 +02:00
Avi Kivity
9fcf790234 row_cache: linearize key in cache_entry::do_read()
do_read() does not linearize cache_entry::_key; this can cause a crash
with keys larger than 13k.

Fixes #7897.

Closes #7898

(cherry picked from commit d508a63d4b)
2021-01-17 09:30:44 +02:00
Hagit Segev
24346215c2 release: prepare for 4.2.3 2021-01-04 19:51:12 +02:00
Benny Halevy
918ec5ecb3 compaction: compaction_writer: destroy shared_sstable after the sstable_writer
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer we might hit use-after-free
as in the follwing case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
 (inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
 (inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
 (inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
 (inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
 (inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
 (inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
 (inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
 (inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
 (inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
 (inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
 (inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
 (inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
 (inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
 (inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
 (inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
 (inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
 (inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
 (inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```

What happens here is that:

    compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
            : _out(std::move(out))
            , _compression_metadata(cm)
            , _offsets(_compression_metadata->offsets.get_writer())
            , _compression(lc)
            , _full_checksum(ChecksumType::init_checksum())

_compression_metadata points to a buffer held by the sstable object.
and _compression_metadata->offsets.get_writer returns a writer that keeps
a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.

Fixes #7821

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
(cherry picked from commit 8a745a0ee0)
2021-01-04 15:04:34 +02:00
Avi Kivity
7457683328 Revert "Merge 'Move temporaries to value view' from Piotr S"
This reverts commit d1fa0adcbe. It causes
a regression when processing some bind variables.

Fixes #7761.
2020-12-24 12:40:46 +02:00
Gleb Natapov
567889d283 mutation_writer: pass exceptions through feed_writer
feed_writer() eats exception and transforms it into an end of stream
instead. Downstream validators hate when this happens.

Fixes #7482
Message-Id: <20201216090038.GB3244976@scylladb.com>

(cherry picked from commit 61520a33d6)
2020-12-16 17:20:11 +02:00
Aleksandr Bykov
c605ed73bf dist: scylla_util: fix aws_instance.ebs_disks method
aws_instance.ebs_disks() method should return ebs disk
instead of ephemeral

Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com>

Closes #7780

(cherry picked from commit e74dc311e7)
2020-12-16 11:58:47 +02:00
Takuya ASADA
d0530d8ac2 node_exporter_install: stop service before force installing
Stop node-exporter.service before re-install it, to avoid 'Text file busy' error.

Fixes #6782

(cherry picked from commit ef05ea8e91)
2020-12-15 16:28:25 +02:00
Hagit Segev
696ef24226 release: prepare for 4.2.2 2020-12-13 20:34:03 +02:00
Avi Kivity
b8fe144301 dist: rpm: uninstall tuned when installing scylla-kernel-conf
tuned 2.11.0-9 and later writes to kerned.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).

Not needed for .deb, since debian/ubunto do not install tuned by
default.

Fixes #7696

Closes #7776

(cherry picked from commit 615b8e8184)
2020-12-12 14:30:38 +02:00
Nadav Har'El
62f783be87 alternator: fix broken Scan/Query paging with bytes keys
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail - we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page), and this incorrect function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).

This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.

Fixes #7768

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
(cherry picked from commit 86779664f4)
2020-12-09 15:16:41 +02:00
Piotr Sarna
863e784951 db: fix getting local ranges for size estimates table
When getting local ranges, an assumption is made that
if a range does not contain an end or when its end is a maximum token,
then it must contain a start. This assumption proven not true
during manual tests, so it's now fortified with an additional check.

Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:

(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
            _data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
      _start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
            _kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}

Closes #7764

(cherry picked from commit 1cc4ed50c1)
2020-12-09 15:16:14 +02:00
Nadav Har'El
e5a6199b4d alternator, test: make test_fetch_from_system_tables faster
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.

This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.

As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.

Fixes #7706

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
(cherry picked from commit 220d6dde17)
2020-12-09 15:15:15 +02:00
Nadav Har'El
abaf6c192a alternator: fix query with both projection and filtering
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.

The solution in this patch is to add the generated JSON item also
the extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.

The two tests

 test_query_filter.py::test_query_filter_and_attributes_to_get
 test_filter_expression.py::test_filter_expression_and_projection_expression

Which failed before this patch now pass so we drop their "xfail" tag.

Fixes #6951.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 282742a469)
2020-12-09 14:39:17 +02:00
Eliran Sinvani
ef2f5ed434 consistency level: fix wrong quorum calculation whe RF = 0
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0, in this situation quorum should
also be 0.
This commit adds the missing corner case.

Tests: Unit Tests (dev)
Fixes #6905

Closes #7296

(cherry picked from commit 925cdc9ae1)
2020-11-29 16:45:14 +02:00
Raphael S. Carvalho
bac40e2512 sstable_directory: Fix 50% space requirement for resharding
This is a regression caused by aebd965f0.

After the sstable_directory changes, resharding now waits for all sstables
to be exhausted before releasing reference to them, which prevents their
resources like disk space and fd from being released. Let's restore the
old behavior of incrementally releasing resources, reducing the space
requirement significantly.

Fixes #7463.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com>
(cherry picked from commit 6f805bd123)
2020-11-29 15:26:14 +02:00
Asias He
681c4d77bb repair: Make repair_writer a shared pointer
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:

class repair_writer {
   _writer_done[node_idx] =
      mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
         ...
      }).handle_exception([this] {
         ...
      });
}

The fiber access repair_writer object in the error handling path. We
wait for the _writer_done to finish before we destroy repair_meta
object which contains the repair_writer object to avoid the fiber
accessing already freed repair_writer object.

To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.

Fixes #7406

Closes #7430

(cherry picked from commit 289a08072a)
2020-11-29 13:30:49 +02:00
Pavel Emelyanov
8572ee9da2 query_pager: Fix continuation handling for noop visitor
Before updating the _last_[cp]key (for subsequent .fetch_page())
the pager checks is 'if the pager is not exhausted OR the result
has data'.

The check seems broken: if the pager is not exhausted, but the
result is empty the call for keys will unconditionally try to
reference the last element from empty vector. The not exhausted
condition for empty result can happen if the short_read is set,
which, in turn, unconditionally happens upon meeting partition
end when visiting the partition with result builder.

The correct check should be 'if the pager is not exhausted AND
the result has data': the _last_[pc]key-s should be taken for
continuation (not exhausted), but can be taken if the result is
not empty (has data).

fixes: #7263
tests: unit(dev), but tests don't trigger this corner case

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921124329.21209-1-xemul@scylladb.com>
(cherry picked from commit 550fc734d9)
2020-11-29 12:01:37 +02:00
Takuya ASADA
95dbac56e5 install.sh: set PATH for relocatable CLI tools in python thunk
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but it doesn't work for perftune.py, since it's not part of
Scylla, does not use scylla_util module.
We can set PATH in python thunk instead, it can set PATH for all python scripts.

Fixes #7350

(cherry picked from commit 5867af4edd)
2020-11-29 11:54:42 +02:00
Bentsi Magidovich
eeadeff0dc scylla_util.py: fix exception handling in curl
Retry mechanism didn't work when URLError happend. For example:

  urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>

Let's catch URLError instead of HTTP since URLError is a base exception
for all exceptions in the urllib module.

Fixes: #7569

Closes #7567

(cherry picked from commit 956b97b2a8)
2020-11-29 11:48:30 +02:00
Takuya ASADA
62f3caab18 dist/redhat: packaging dependencies.conf as normal file, not ghost
When we introduced dependencies.conf, we mistakenly added it on rpm as %ghost,
but it should be normal file, should be installed normally on package installation.

Fixes #7703

Closes #7704

(cherry picked from commit ba4d54efa3)
2020-11-29 11:40:22 +02:00
Takuya ASADA
1a4869231a install.sh: apply sysctl.d files on non-packaging installation
We don't apply sysctl.d files on non-packaging installation, apply them
just like rpm/deb taking care of that.

Fixes #7702

Closes #7705

(cherry picked from commit 5f81f97773)
2020-11-29 11:35:37 +02:00
Avi Kivity
3568d0cbb6 dist: sysctl: configure more inotify instances
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.

This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.

Increase to 1200, which is enough for 6 instances * 200 shards.

Fixes #7700.

Closes #7701

(cherry picked from commit 390e07d591)
2020-11-29 11:04:45 +02:00
Raphael S. Carvalho
030c2e3270 compaction: Make sure a partition is filtered out only by producer
If interposer consumer is enabled, partition filtering will be done by the
consumer instead, but that's not possible because only the producer is able
to skip to the next partition if the current one is filtered out, so scylla
crashes when that happens with a bad function call in queue_reader.
This is a regression which started here: 55a8b6e3c9

To fix this problem, let's make sure that partition filtering will only
happen on the producer side.

Fixes #7590.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
(cherry picked from commit 13fa2bec4c)
2020-11-19 14:08:25 +02:00
Piotr Dulikowski
37a5e9ab15 hints: don't read hint files when it's not allowed to send
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:

- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
  target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
  this operation after sleeping 1 second.

This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.

This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.

First observed in #6964

Tests:
- unit(dev)
- hinted handoff dtests

Closes #7407

(cherry picked from commit 77a0f1a153)
2020-11-16 14:30:07 +02:00
Botond Dénes
a15b5d514d mutation_reader: queue_reader: don't set EOS flag on abort
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.

Additionally make sure the producer is aborted too on abort. In theory
this is not needed as they are the one initiating the abort, but better
to be safe then sorry.

Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
(cherry picked from commit f5323b29d9)
2020-11-15 11:07:38 +02:00
Botond Dénes
064f8f8bcf types: validate(): linearize values lazily
Instead of eagerly linearizing all values as they are passed to
validate(), defer linearization to those validators that actually need
linearized values. Linearizing large values puts pressure on the memory
allocator with large contiguous allocation requests. This is something
we are trying to actively avoid, especially if it is not really neaded.
Turns out the types, whose validators really want linearized values are
a minority, as most validators just look at the size of the value, and
some like bytes don't need validation at all, while usually having large
values.

This is achieved by templating the validator struct on the view and
using the FragmentedRange concept to treat all passed in views
(`bytes_view` and `fragmented_temporary_buffer_view`) uniformly.
This patch makes no attempt at converting existing validators to work
with fragmented buffers, only trivial cases are converted. The major
offenders still left are ascii/utf8 and collections.

Fixes: #7318

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201007054524.909420-1-bdenes@scylladb.com>
(cherry picked from commit db56ae695c)
2020-11-11 10:55:54 +02:00
Amnon Heiman
04fe0a7395 scyllatop/livedata.py: Safe iteration over metrics
This patch change the code that iterates over the metrics to use a copy
of the metrics names to make it safe to remove the metrics from the
metrics object.

Fixes #7488

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 52db99f25f)
2020-11-08 19:16:13 +02:00
Calle Wilund
ef26d90868 partition_version: Change range_tombstones() to return chunked_vector
Refs #7364

The number of tombstones can be large. As a stopgap measure to
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.

Closes #7433

(cherry picked from commit 4b65d67a1a)
2020-11-08 14:38:32 +02:00
Tomasz Grabiec
790f51c210 sstables: ka/la: Fix abort when next_partition() is called with certain reader state
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().

The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:

  scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.

This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.

The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.

Fixes #7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit fb9b5cae05)
2020-11-08 14:25:47 +02:00
Yaron Kaikov
4fb8ebccff release: prepare for 4.2.1 2020-11-08 12:41:06 +02:00
Avi Kivity
d1fa0adcbe Merge 'Move temporaries to value view' from Piotr S
"
Issue https://github.com/scylladb/scylla/issues/7019 describes a problem of an ever-growing map of temporary values stored in query_options. In order to mitigate this kind of problems, the storage for temporary values is moved from an external data structure to the value views itself. This way, the temporary lives only as long as it's accessible and is automatically destroyed once a request finishes. The downside is that each temporary is now allocated separately, while previously they were bundled in a single byte stream.

Tests: unit(dev)
Fixes https://github.com/scylladb/scylla/issues/7019
"

7055297649 ("cql3: remove query_options::linearize and _temporaries")
is reverted from this backport since linearize() is still used in
this branch.

* psarna-move_temporaries_to_value_view:
  cql3: remove query_options::linearize and _temporaries
  cql3: remove make_temporary helper function
  cql3: store temporaries in-place instead of in query_options
  cql3: add temporary_value to value view
  cql3: allow moving data out of raw_value
  cql3: split values.hh into a .cc file

(cherry picked from commit 2b308a973f)
2020-11-05 19:24:23 +02:00
Piotr Sarna
46b56d885e schema_tables: fix fixing old secondary index schemas
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.

Fixes #7515

Closes #7516

(cherry picked from commit b66c285f94)
2020-11-05 17:53:08 +02:00
Yaron Kaikov
94597e38e2 release: prepare for 4.2.0 2020-10-25 09:12:38 +02:00
Piotr Sarna
c74ba1bc36 Merge 'Backport PR #7469 to 4.2' from Eliran Sinvani
This is a backport of PR #7469 that did not apply cleanly to 4.2 with a trivial conflict, another commit that touched one of the files but in a completely different region.

Closes #7480

* github.com:scylladb/scylla:
  materialized views: add a base table reference if missing
  view info: support partial match between base and view for only reading from view.
  view info: guard against null dereference of the base info
2020-10-23 17:18:02 +02:00
Eliran Sinvani
06cfc63c59 materialized views: add a base table reference if missing
schema pointers can be obtained from two distinct entities,
one is the database, those schema are obtained from the table
objects and the other is from the schema registry.
When a schema or a new schema is attached to a table object that
represents a base table for views, all of the corresponding attached
view schemas are guarantied to have their base info in sync.
However if an older schema is inserted into the registry by the
migratrion manager i.e loaded from other node, it will be
missing this info.
This becomes a problem when this schema is published through the
schema registry as it can be obtained for an obsolete read command
for example and then eventually cause a segmentation fault by null
dereferencing the _base_info ptr.

Refs #7420
2020-10-23 18:09:45 +03:00
Eliran Sinvani
56d25930ec view info: support partial match between base and view for
only reading from view.

The current implementation of materialized views does
no keep the version to which a specific version of materialized
view schema corresponds to. This complicate things especially on
old views versions that the schema doesn't support anymore. However,
the views, being also an independent table should allow reading from
them as long as they exist even if the base table changed since then.
For the reading purpose, we don't need to know the exact composition
of view primary key columns that are not part of the base primary
key, we only need to know that there are any, and this is a much
looser constrain on the schema.
We can rely on a table invariants such as the fact that pk columns are
not going to disappear on newer version of the table.
This means that if we don't find a view column in the base table, it is
not a part of the base table primary key.
This information is enough for us to perform read on the view.
This commit adds support for being able to rely on such partial
information along with a validation that it is not going to be used for
writes. If it is, we simply abort since this means that our schema
integrity is compromised.
2020-10-23 18:08:56 +03:00
Eliran Sinvani
fa1cd048d7 view info: guard against null dereference of the base info
The change's purpose is to guard against segfault that is the
result of dereferencing the _base_info member when it is
uninitialized. We already know this can happen (#7420).
The only purpose of this change is to treat this condition as
an internal error, the reason is that it indicates a schema integrity
problem.
Besides this change, other measures should be taken to ensure that
the _base_table member is initialized before calling methods that
rely on it.
We call the internal_error as a last resort.
2020-10-23 18:08:56 +03:00
Nadav Har'El
94b754eee5 alternator: change name of Alternator's SSL options
When Alternator is enabled over HTTPS - by setting the
"alternator_https_port" option - it needs to know some SSL-related options,
most importantly where to pick up the certificate and key.

Before this patch, we used the "server_encryption_options" option for that.
However, this was a mistake: Although it sounds like these are the "server's
options", in fact prior to Alternator this option was only used when
communicating with other servers - i.e., connections between Scylla nodes.
For CQL connections with the client, we used a different option -
"client_encryption_options".

This patch introduces a third option "alternator_encryption_options", which
controls only Alternator's HTTPS server. Making it separate from the
existing CQL "client_encryption_options" allows both Alternator and CQL to
be active at the same time but with different certificates (if the user
so wishes).

For backward compatibility, we temporarily continue to allow
server_encryption_options to control the Alternator HTTPS server if
alternator_encryption_options is not specified. However, this generates
a warning in the log, urging the user to switch. This temporary workaround
should be removed in a future version.

This patch also:
1. fixes the test run code (which has an "--https" option to test over
   https) to use the new name of the option.
2. Adds documentation of the new option in alternator.md and protocols.md -
   previously the information on how to control the location of the
   certificate was missing from these documents.

Fixes #7204.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930123027.213587-1-nyh@scylladb.com>
(cherry picked from commit 509a41db04)
2020-10-18 18:05:04 +03:00
Takuya ASADA
e02defb3bf install.sh: set LC_ALL=en_US.UTF-8 on python3 thunk
scylla-python3 causes segfault when non-default locale specified.
As workaround for this, we need to set LC_ALL=en_US.UTF_8 on python3 thunk.

Fixes #7408

Closes #7414

(cherry picked from commit ff129ee030)
2020-10-18 15:02:32 +03:00
Botond Dénes
95e712e244 reader_permit: reader_resources: make true RAII class
Currently in all cases we first deduct the to-be-consumed resources,
then construct the `reader_resources` class to protect it (release it on
destruction). This is error prone as it relies on no exception being
thrown while constructing the `reader_resources`. Albeit the
`reader_resources` constructor is `noexcept` right now this might change
in the future and as the call sites relying on this are disconnected
from the declaration, the one modifying them might not notice.
To make this safe going forward, make the `reader_resources` a true RAII
class, consuming the units in its constructor and releasing them in its
destructor.

Fixes: #7256

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200922150625.1253798-1-bdenes@scylladb.com>
(cherry picked from commit a0107ba1c6)
2020-10-18 15:00:47 +03:00
Avi Kivity
a9109f068c Update seastar submodule
* seastar 61b88d1da4...f760efe0a0 (1):
  > append_challenged_posix_file_impl: allow destructing file with no queued work

Fixes #7285.
2020-10-12 15:11:46 +03:00
Piotr Sarna
2e71779970 Merge 'Fix view_builder lockup and crash on shutdown' from Pavel
The lockup:

When view_builder starts all shards at some point get to a
barrier waiting for each other to pass. If any shard misses
this checkpoint, all others stuck forever. As this barrier
lives inside the _started future, which in turn is waited
on stop, the stop stucks as well.

Reasons to miss the barrier -- exception in the middle of the
fun^w start or explicit abort request while waiting for the
schema agreement.

Fix the "exception" case by unlocking the barrier promise with
exception and fix the "abort request" case by turning it into
an exception.

The bug can be reproduced by hands if making one shard never
see the schema agreement and continue looping until the abort
request.

The crash:

If the background start up fails, then the _started future is
resolved into exception. The view_builder::stop then turns this
future into a real exception caught-and-rethrown by main.cc.

This seems wrong that a failure in a background fiber aborts
the regular shutdown that may proceed otherwise.

tests: unit(dev), manual start-stop
branch: https://github.com/xemul/scylla/tree/br-view-builder-shutdown-fix-3
fixes: #7077

Patch #5 leaves the seastar::async() in the 1-st phase of the
start() although can also be tuned not to produce a thread.
However, there's one more (painless) issue with the _sem usage,
so this change appears too large for the part of the bug-fix
and will come as a followup.

* 'br-view-builder-shutdown-fix-3' of git://github.com/xemul/scylla:
  view_builder: Add comment about builder instances life-times
  view_builder: Do sleep abortable
  view_builder: Wakeup barrier on exception
  view_builder: Always resolve started future to success
  view_builder: Re-futurize start
  view_builder: Split calculate_shard_build_step into two
  view_builder: Populate the view_builder_init_state
  view_builder: Fix indentation after previous patch
  view_builder: Introduce view_builder_init_state

(cherry picked from commit ca9422ca73)
2020-10-07 15:05:12 +03:00
Gleb Natapov
5ed7de81ad lwt: do not return unavailable exception from the 'learn' stage
Unavailable exception means that operation was not started and it can be
retried safely. If lwt fails in the learn stage though it most
certainly means that its effect will be observable already. The patch
returns timeout exception instead which means uncertainty.

Fixes #7258

Message-Id: <20201001130724.GA2283830@scylladb.com>
(cherry picked from commit 3e8dbb3c09)
2020-10-07 10:59:30 +02:00
Juliusz Stasiewicz
5e21c9bd8a tracing: Fix error on slow batches
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is example of failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```

In such case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```

Fix here is to leave query empty if not found. The users can still
retrieve the query contents from existing info.

Fixes #5843

Closes #7293

(cherry picked from commit 0afa738a8f)
2020-10-04 18:04:22 +03:00
Avi Kivity
8e22fddc9e Merge 'Fix ignoring cells after null in appending hash' from Piotr Sarna
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes #4567
Based on #4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for #4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible

(cherry picked from commit 0e03c979d2)
2020-10-01 23:23:00 +02:00
Avi Kivity
6cf2d998e3 Merge "Fix race in schema version recalculation leading to stale schema version in gossip" from Tomasz
"
Migration manager installs several cluster feature change listeners.
The listeners will call update_schema_version_and_announce() when cluster
features are enabled, which does this:

    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });

It first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.

The fix is to serialize schema digest calculation and publishing.

Refs #7200

This problem also brought my attention to initialization code, which could be
prone to the same problem.

The storage service computes gossiper states before it starts the
gossiper. Among them, node's schema version. There are two problems with that.

First is that computing the schema version and publishing it is not
atomic, so is not safe against concurrent schema changes or schema
version recalculations. It will not exclude with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.

Second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.

Maybe we're not doing concurrent schema changes/recalculations now,
but it is easy to imagine that this could change for whatever reason
in the future.

The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.

Tests:

  - unit (dev)
  - manual (3 node ccm)
"

Fixes #7291

* tag 'fix-schema-digest-calculation-race-v1' of github.com:tgrabiec/scylla:
  db, schema: Hide update_schema_version_and_announce()
  db, storage_service: Do not call into gossiper from the database layer
  db: Make schema version observable
  utils: updateable_value_source: Introduce as_observable()
  schema: Fix race in schema version recalculation leading to stale schema version in gossip

(cherry picked from commit dcaf4ea4dd)
2020-10-01 17:44:37 +02:00
Hagit Segev
5fcc1f205c release: prepare for 4.2.rc5 2020-09-30 20:40:44 +03:00
Avi Kivity
08c35c1aad Revert "Revert "config: Do not enable repair based node operations by default""
This reverts commit 71d0d58f8c. Repair-based
node operations stil have a significant regression (See #7249).
2020-09-30 14:18:37 +03:00
Tomasz Grabiec
54a913d452 Merge "evictable_reader: validate buffer on reader recreation" from Botond
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader.
The intent is to prevent corrupt data from getting into the system as
well as to help catch these bugs as close to the source as possible.

Fixes: #7208

Tests: unit(dev), mutation_reader_test:debug (v4)

* botond/evictable-reader-validate-buffer/v5:
  mutation_reader_test: add unit test for evictable reader self-validation
  evictable_reader: validate buffer after recreation the underlying
  evictable_reader: update_next_position(): only use peek'd position on partition boundary
  mutation_reader_test: add unit test for evictable reader range tombstone trimming
  evictable_reader: trim range tombstones to the read clustering range
  position_in_partition_view: add position_in_partition_view before_key() overload
  flat_mutation_reader: add buffer() accessor

(cherry picked from commit 97c99ea9f3)
2020-09-30 13:13:09 +02:00
Piotr Dulikowski
8f9cd98c45 hinted handoff: fix race - decomission vs. endpoint mgr init
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.

The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decomissioned or removed, it instead drains hints managers for all
endpoints.

In the case of decomission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.

End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.

To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.

Fixes #7257

Closes #7278

(cherry picked from commit 39771967bb)
2020-09-29 14:18:48 +03:00
Avi Kivity
2893f6e43b Update seastar submodule
* seastar 0c289412a9...61b88d1da4 (1):
  > lz4_fragmented_compressor: Fix buffer requirements

Fixes #6925.
2020-09-23 11:04:22 +03:00
Avi Kivity
18d6c27b05 Merge 'storage_proxy: add a separate smp_group for hints' from Eliran
Hints writes are handled by storage_proxy in the exact same way
regular writes are, which in turn means that the same smp service
group is used for both. The problem is that it can lead to a priority
inversion where writes of the lower priority  kind occupies a lot of
the semaphores units making the higher priority writes wait for an
empty slot.
This series adds a separate smp group for hints as well as a field
to pass the correct smp group to mutate_locally functions, and
then uses this field to properly classify the writes.

Fixes #7177

* eliransin-hint_priority_inversion:
  Storage proxy: use hints smp group in mutate locally
  Storage proxy: add a dedicated smp group for hints

(cherry picked from commit c075539fea)
2020-09-22 14:06:14 +03:00
Pavel Solodovnikov
97d7f6990c storage_proxy: un-hardcode force sync flag for mutate_locally(mutation) overload
Corresponding overload of `storage_proxy::mutate_locally`
was hardcoded to pass `db::commitlog::force_sync::no` to the
`database::apply`. Unhardcode it and substitute `force_sync::no`
to all existing call sites (as it were before).

`force_sync::yes` will be used later for paxos learn writes
when trying to apply mutations upgraded from an obsolete
schema version (similar to the current case when applying
locally a `frozen_mutation` stored in accepted proposal).

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200716124915.464789-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 5ff5df1afd)

Prerequisite for #7177.
2020-09-22 14:05:39 +03:00
Nadav Har'El
9855e18c0d alternator: fix corruption of PutItem operation in case of contention
This patch fixes a bug noted in issue #7218 - where PutItem operations
sometimes lose part of the item's data - some attributes were lost,
and the name of other attributes replaced by empty strings. The problem
happened when the write-isolation policy was LWT and there was contention
of writes to the same partition (not necessarily the same item).

To use CAS (a.k.a. LWT), Alternator builds an alternator::rmw_operation
object with an apply() function which takes the old contents of the item
(if needed) and a timestamp, and builds a mutation that the CAS should
apply. In the case of the PutItem operation, we wrongly assumed that apply()
will be called only once - so as an optimization the strings saved in the
put_item_operation were moved into the returned mutation. But this
optimization is wrong - when there is contention, apply() may be called
again when the changed proposed by the previous one was not accepted by
the Paxos protocol.

The fix is to change the one place where put_item_operation *moved* strings
out of the saved operations into the mutations, to be a copy. But to prevent
this sort of bug from reoccuring in future code, this patch enlists the
compiler to help us verify that it can't happen: The apply() function is
marked "const" - it can use the information in the operation to build the
mutation, but it can never modify this information or move things out of it,
so it will be fine to call this function twice.

The single output field that apply() does write (_return_attributes) is
marked "mutable" to allow the const apply() to write to it anyway. Because
apply() might be called twice, it is important that if some apply()
implementation sometimes sets _return_attributes, then it must always
set it (even if to the default, empty, value) on every call to apply().

The const apply() means that the compiler verfies for us that I didn't
forget to fix additional wrong std::move()s. Additionally, a test I wrote
to easily reproduce issue #7218 (which I will submit as a dtest later)
passes after this fix.

Fixes #7218.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200916064906.333420-1-nyh@scylladb.com>
(cherry picked from commit 5e8bdf6877)
2020-09-16 19:17:52 +03:00
Avi Kivity
5d5ddd3539 Merge "materialized views: Fix undefined behavior on base table schema changes" from Tomasz
"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.

The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).

Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.

Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may not longer match the current base table schema, which is used
to generate the view updates.

Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.

It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.

Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.

Fixes #7061.

Tests:

  - unit (dev)
  - [v1] manual (reproduced using scylla binary and cqlsh)
"

* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
  db: view: Refactor view_info::initialize_base_dependent_fields()
  tests: mv: Test dropping columns from base table
  db: view: Fix incorrect schema access during view building after base table schema changes
  schema: Call on_internal_error() when out of range id is passed to column_at()
  db: views: Fix undefined behavior on base table schema changes
  db: views: Introduce has_base_non_pk_columns_in_view_pk()

(cherry picked from commit 3daa49f098)
2020-09-16 16:42:02 +03:00
Benny Halevy
0a72893fef test: cql_query_test: test_cache_bypass: use table stats
test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.

Use the table's row_cache stats instead.

Fixes #6773

Test: cql_query_test.test_cache_bypass(dev, debug)

Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
(cherry picked from commit 6deba1d0b4)
2020-09-16 16:05:53 +03:00
Dejan Mircevski
1e45557d2a cql3: Fix NULL reference in get_column_defs_for_filtering
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing.  Add a test exposing the NULL
dereference and fix the typo.

Tests: unit (dev)

Fixes #7198.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 9d02f10c71)
2020-09-16 15:46:58 +03:00
Avi Kivity
96e1e95c1d reconcilable_result_builder: don't aggrevate out-of-memory condition during recovery
Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.

During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accomodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.

Fix by making _result the last member, so it is freed first.

Fixes #7240.

(cherry picked from commit 9421cfded4)
2020-09-16 15:40:40 +03:00
Raphael S. Carvalho
338196eab6 storage_service: Fix use-after-free when calculating effective ownership
Use-after-free happens because we take a ref to keyspace_name, which
is stack allocated, and ceases to exist after the next deferring
action.

Fixes #7209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200909210741.104397-1-raphaelsc@scylladb.com>
(cherry picked from commit 86b9ea6fb2)
2020-09-12 13:58:45 +03:00
Asias He
71cbec966b storage_service: Fix a TOKENS update race for replace operation
In commit 7d86a3b208 (storage_service:
Make replacing node take writes), application state of TOKENS of the
replacing node is added into gossip and propagated to the cluster after
the initial start of gossip service. This can cause a race below

1. The replacing node replaces the old dead node with the same ip address
2. The replacing node starts gossip without application state of the TOKENS
3. Other nodes in the cluster replace the application states of old dead node's
   version with the new replacing node's version
4. replacing node dies
5. replace operation is performed again, the TOKENS application state is
   not preset and replace operation fails.

To fix, we can always add TOKENS application state when the
gossip service starts.

Fixes: #7166
Backports: 4.1 and 4.2
(cherry picked from commit 3ba6e3d264)
2020-09-10 13:12:56 +03:00
Avi Kivity
067a065553 Merge "Fix repair stalls in get_sync_boundary and apply_rows_on_master_in_thread" from Asias
"
This path set fixes stalls in repair that are caused by std::list merge and clear operations during test_latency_read_with_nemesis test.

Fixes #6940
Fixes #6975
Fixes #6976
"

* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
  repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
  repair: Use clear_gently in get_sync_boundary to avoid stall
  utils: Add clear_gently
  repair: Use merge_to_gently to merge two lists
  utils: Add merge_to_gently

(cherry picked from commit 4547949420)
2020-09-10 13:12:53 +03:00
Avi Kivity
e00bdc4f57 repair: apply_rows_on_follower(): remove copy of repair_rows list
We copy a list, which was reported to generate a 15ms stall.

This is easily fixed by moving it instead, which is safe since this is
the last use of the variable.

Fixes #7115.

(cherry picked from commit 6ff12b7f79)
2020-09-10 11:53:05 +03:00
Juliusz Stasiewicz
ad40f9222c cdc: Retry generation fetching after read_failure_exception
While fetching CDC generations, various exceptions can occur. They
are divided into "fatal" and "nonfatal", where "fatal" ones prevent
retrying of the fetch operation.

This patch makes `read_failure_exception` "non-fatal", because such
error may appear during restart. In general this type of error can
mean a few different things (e.g. an error code in a response from
replica, but also a broken connection) so retrying seems reasonable.

Fixes #6804

(cherry picked from commit d1dec3fcd7)
2020-09-09 15:10:50 +03:00
Kamil Braun
5d90fa17d6 cdc: fix deadlock inside check_and_repair_cdc_streams
check_and_repair_cdc_streams, in case it decides to create a new CDC
generation, updates the STATUS application state so that other nodes
gossiped with pick up the generation change.

The node which runs check_and_repair_cdc_streams also learns about a
generation change: the STATUS update causes a notification change.
This happens during add_local_application_state call
which caused the STATUS update; it would lead to calling
handle_cdc_generation, which detects a generation change and calls
add_local_application_state with the new generation's timestamp.

Thus, we get a recursive add_local_application_state call. Unforunately,
the function takes a lock before doing on_change notifications, so we
get a deadlock.

This commit prevents the deadlock.
We update the local variable which stores the generation timestamp
before updating STATUS, so handle_cdc_generation won't consider
the observed generation to be new, hence it won't perform the recursive
add_local_application_state call.

(cherry picked from commit 42fb4fe37c)
2020-09-09 10:14:18 +03:00
Yaron Kaikov
bf0c493c28 release: prepare for 4.2.rc4 2020-09-07 14:56:32 +03:00
Raphael S. Carvalho
26cb0935f0 sstables/LCS: increase per-level overlapping tolerance in reshape
LCS can have its overlapping invariant broken after operations that can
proceed in parallel to regular compaction like cleanup. That's because
there could be two compactions in parallel placing data in overlapping
token ranges of a given level > 0.
After reshape, the whole table will be rewritten, on restart, if a
given level has more than (fan_out*2)=20 overlaps.
That may sound like enough, but that's not taking into account the
exponential growth in # of SSTables per level, so 20 overlaps may
sound like a lot for level 2 which can afford 100 sstables, but it's
only 2% of level 3, and 0.2% of level 4. So let's change the
overlapping tolerance from the constant of fan_out*2 to 10% of level
limit on # of SSTables, or fan_out, whichever is higher.

Refs #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200810154510.32794-1-raphaelsc@scylladb.com>
(cherry picked from commit 7d7f9e1c54)
2020-09-06 18:28:55 +03:00
Raphael S. Carvalho
4e97d562eb compaction: Prevent non-regular compaction from picking compacting SSTables
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but manager still need to filter out the
compacting SSTables from the returned set. So it's being renamed.

When the same SSTable is compacted in parallel, the strategy invariant
can be broken like overlapping being introduced in LCS, and also
some deletion failures as more than one compaction process would try
to delete the same files.

Let's fix scrub, cleanup and ugprade by calling the manager function
which gets the correct candidates for compaction.

Fixes #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
(cherry picked from commit 11df96718a)
2020-09-06 18:26:43 +03:00
Takuya ASADA
3f1b932c04 aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6991

(cherry picked from commit 7cccb018b8)
2020-09-06 18:21:12 +03:00
Avi Kivity
b9498ab947 Update seastar submodule
* seastar 7816796dd1...0c289412a9 (1):
  > TLS: Use "known" (precalculated) DH parameters if available

Fixes #6191.
2020-09-06 17:38:05 +03:00
Avi Kivity
e1b3d6d0a2 Update seastar submodule
* seastar adaabdfbc...7816796dd (1):
  > core/reactor: complete_timers(): restore previous scheduling group

Fixes #7117.
2020-09-03 23:47:22 +03:00
Avi Kivity
67378cda03 Merge "Fix TWCS compaction aggressiveness due to data segregation" from Raphael
"
After data segregation feature, anything that cause out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
future. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.

Fixes #6928.
"

* 'fix_twcs_agressiveness_after_data_segregation_v2' of github.com:raphaelsc/scylla:
  compaction/twcs: improve further debug messages
  compaction/twcs: Improve debug log which shows all windows
  test: Check that TWCS properly performs size-tiered compaction on past windows
  compaction/twcs: Make task estimation take into account the size-tiered behavior
  compaction/stcs: Export static function that estimates pending tasks
  compaction/stcs: Make get_buckets() static
  compact/twcs: Perform size-tiered compaction on past time windows
  compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
  compaction/twcs: Make newest_bucket() non-static
  compaction/twcs: Move TWCS implementation into source file

(cherry picked from commit 6f986df458)
2020-09-02 12:53:45 +03:00
Nadav Har'El
6ab3965465 redis: fix another use-after-free crash in "exists" command
Never trust Occam's Razor - it turns out that the use-after-free bug in the
"exists" command was caused by two separate bugs. We fixed one in commit
9636a33993, but there is a second one fixed in
this patch.

The problem fixed here was that a "service_permit" object, which is designed to
be copied around from place to place (it contains a shared pointer, so is cheap
to copy), was saved by reference, and the reference was to a function argument
and was destroyed prematurely.

This time I tested *many times* that that test_strings.py passes on both dev and
debug builds.

Note that test/run/redis still fails in a debug build, but due to a different
problem.

Fixes #6469

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200825183313.120331-1-nyh@scylladb.com>
(cherry picked from commit 868194cd17)
2020-08-27 12:16:19 +03:00
Nadav Har'El
ca22461a9b redis: fix use-after-free crash in "exists" command
A missing "&" caused the key stored in a long-living command to be copied
and the copy quickly freed - and then used after freed.
This caused the test test_strings.py::test_exists_multiple_existent_key for
this feature to frequently crash.

Fixes #6469

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200823190141.88816-1-nyh@scylladb.com>
(cherry picked from commit 9636a33993)
2020-08-27 12:16:19 +03:00
Asias He
b3d83ad073 compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:

scylla: Reactor stalled for 16262 ms on shard 4.

| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
|  (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
|  (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) operator() at ./sstables/compaction_manager.cc:691
|  (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158

To fix, we furturize the function to get local ranges and sstables.

In addition, this patch removes the dependency to global storage_service object.

Fixes #6662

(cherry picked from commit 07e253542d)
2020-08-27 12:16:19 +03:00
Raphael S. Carvalho
7e6f47fbce sstables: optimize procedure that checks if a sstable needs cleanup
needs_cleanup() returns true if a sstable needs cleanup.

Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
	O(num_sstables * local_ranges)

We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.

So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).

With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.

Fixes #6730.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
(cherry picked from commit cf352e7c14)
2020-08-27 12:16:16 +03:00
Asias He
9ca49cba6b abstract_replication_strategy: Add get_ranges_in_thread
Add a version that runs inside a seastar thread. The benefit is that
get_ranges can yield to avoid stalls.

Refs #6662

(cherry picked from commit 94995acedb)
2020-08-27 12:15:33 +03:00
Raphael S. Carvalho
6da8ba2d3f sstables: export needs_cleanup()
May be needed elsewhere, like in an unit test.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-1-raphaelsc@scylladb.com>
(cherry picked from commit a9eebdc778)
2020-08-27 12:15:29 +03:00
Asias He
366c1c2c59 gossip: Fix race between shutdown message handler and apply_state_locally
1. The node1 is shutdown
2. The node1 sends shutdown message to node2
3. The node2 receives gossip shutdown message but the handler yields
4. The node1 is restarted
5. The node1 sends new gossip endpoint_state to node2, node2 applies the state
   in apply_state_locally and calls gossiper::handle_major_state_change
   and then calls gossiper::mark_alive
6. The shutdown message handler in step 3 resumes and sets status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber in step 5 resumes and calls gossiper::real_mark_alive,
   node2 will skip to mark node1 as alive because the status of node1 is
   SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.

To fix, we serialize the two operations.

Fixes #7032

(cherry picked from commit e6ceec1685)
2020-08-27 11:15:48 +03:00
Nadav Har'El
05cdb173f3 Alternator: allow CreateTable with SSESpecification explicitly disabled
While Alternator doesn't yet support creating a table with a different
"server-side encryption" (a.k.a. encryption-at-rest) parameters, the
SSESpecification option with Enabled=false should still be allowed, as
it is just the default, and means exactly the same as would a missing
SSESpecification.

This patch also adds a test for this case, which failed on Alternator
before this patch.

Fixes #7031.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812205853.173846-1-nyh@scylladb.com>
(cherry picked from commit 4c73d43153)
2020-08-26 20:15:19 +03:00
Nadav Har'El
8c929a96cf alternator: CreateTable with bad Tags shouldn't create a table
Currently, if a user tries to CreateTable with a forbidden set of tags,
e.g., the Tags list is too long or contains an invalid value for
system:write_isolation, then the CreateTable request fails but the table
is still created. Without the tag of course.

This patch fixes this bug, and adds two test cases for it that fail
before this patch, and succeed with it. One of the test cases is
scylla_only because it checks the Scylla-specific system:write_isolation
tag, but the second test case works on DynamoDB as well.

What this patch does is to split the update_tags() function into two
parts - the first part just parses the Tags, validates them, and builds
a map. Only the second part actually writes the tags to the schema.
CreateTable now does the first part early, before creating the table,
so failure in parsing or validating the Tags will not leave a created
table behind.

Fixes #6809.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713120611.767736-1-nyh@scylladb.com>
(cherry picked from commit 35f7048228)
2020-08-26 19:52:49 +03:00
Avi Kivity
a9aa10e8de Merge "Unregister RPC verbs on stop" from Pavel E
"
There are 5 services, that register their RPC handlers in messaging
service, but quite a few of them unregister them on stop.

Unregistering is somewhat critical, not just because it makes the
code look clean, but also because unregistration does wait for the
message processing to complete, thus avoiding use-after-free's in
the handlers.

In particular, several handlers call service::get_schema_for_write()
which, in turn, may end up in service::maybe_sync() calling for
the local migration manager instance. All those handlers' processing
must be waited for before stopping the migration manager.

The set brings the RPC handlers unregistration in sync with the
registration part.

tests: unit (dev)
       dtest (dev: simple_boot_shutdown, repair)
       start-stop by hands (dev)
fixes: #6904
"

* 'br-rpc-unregister-verbs' of https://github.com/xemul/scylla:
  main: Add missing calls to unregister RPC hanlers
  messaging: Add missing per-service unregistering methods
  messaging: Add missing handlers unregistration helpers
  streaming: Do not use db->invoke_on_all in vain
  storage_proxy: Detach rpc unregistration from stop
  main: Shorten call to storage_proxy::init_messaging_service

(cherry picked from commit 01b838e291)
2020-08-26 14:41:04 +03:00
Raphael S. Carvalho
989d8fe636 cql3/statements: verify that counter column cannot be added into non-counter table
A check, to validate that counter column cannot be added into non-counter table,
is missing for alter table statement. Validation is performed when building new
schema, but it's limited to checking that a schema will not contain both counter
and non-counter columns.

Due to lack of validation, the added counter column could be incorrectly
persisted to the schema, but this results in a crash when setting the new
schema to its table. On restart, it can be confirmed that the schema change
was indeed persisted when describing the table.
This problem is fixed by doing proper validation for the alter table statement,
which consists of making sure a new counter column cannot be added to a
non-counter table.

The test cdc_disallow_cdc_for_counters_test is adjusted because one of its tests
was built on the assumption that counter column can be added into a non-counter
table.

Fixes #7065.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200824155709.34743-1-raphaelsc@scylladb.com>
(cherry picked from commit 1c29f0a43d)
2020-08-25 18:44:42 +03:00
Takuya ASADA
48d79a1d9f dist/debian: disable debuginfo compression on .deb
Since older binutils on some distribution does not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, debian packager force debuginfo compression since debian/compat = 9,
we have to uncompress them after compressed automatically.

Fixes #6982

(cherry picked from commit 75c2362c95)
2020-08-23 19:01:00 +03:00
Botond Dénes
4c65413413 scylla-gdb.py: find_db(): don't return current shard's database for shard=0
The `shard` parameter of `find_db()` is optional and is defaulted to
`None`. When missing, the current shard's database instance is returned.
The problem is that the if condition checking this uses `not shard`,
which also evaluates to `True` if `shard == 0`, resulting in returning
the current shard's database instance for shard 0. Change the condition
to `shard is None` to avoid this.

Fixes: #7016
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812091546.1704016-1-bdenes@scylladb.com>
(cherry picked from commit 4cfab59eb1)
2020-08-23 18:56:19 +03:00
Hagit Segev
e931d28673 release: prepare for 4.2.rc3 2020-08-19 14:39:08 +03:00
Botond Dénes
ec71688ff2 view_update_generator: fix race between registering and processing sstables
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested in sstables that were added concurrently with processing a
batch for the same sstables being dropped and the semaphore units
associated with them not returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.

This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.

Fixes: #6892

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
(cherry picked from commit 22a6493716)
2020-08-19 00:11:48 +03:00
Botond Dénes
9710a91100 table: get_sstables_by_partition_key(): don't make a copy of selected sstables
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keep the reference.

Fixes: #7060

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
(cherry picked from commit 78f94ba36a)
2020-08-19 00:01:36 +03:00
Nadav Har'El
b052f3f5ce Update Seastar submodule module
> http: add "Expect: 100-continue" handling

Refs #6844.
2020-08-11 13:06:03 +03:00
Calle Wilund
d70cab0444 database: Do not assert on replay positions if truncate does not flush
Fixes #6995

In c2c6c71 the assert on replay positions in flushed sstables discarded by
truncate was broken, by the fact that we no longer flush all sstables
unless auto snapshot is enabled.

This means the low_mark assertion does not hold, because we maybe/probably
never got around to creating the sstables that would hold said mark.

Note that the (old) change to not create sstables and then just delete
them is in itself good. But in that case we should not try to verify
the rp mark.

(cherry picked from commit 9620755c7f)
2020-08-11 00:00:43 +03:00
Avi Kivity
0ce3799187 Update seastar submodule
* seastar 4641f4f2d3...2775a54dcb (1):
  > memory: fix small aligned free memory corruption

Fixes #6831
2020-08-09 18:35:44 +03:00
Avi Kivity
ee113eca52 Merge 'hinted handoff: fix commitlog memory leak' from Piotr D
"
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.

This PR prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.

Fixes: #6409, Fixes #6776
"

* piodul-fix-commitlog-memory-leak-in-hinted-handoff:
  hinted handoff: disable warnings about segments left on disk
  hinted handoff: release memory on commitlog termination

(cherry picked from commit 4c221855a1)
2020-08-09 17:25:20 +03:00
Tomasz Grabiec
be11514985 thrift: Fix crash on unsorted column names in SlicePredicate
The column names in SlicePredicate can be passed in arbitrary order.
We converted them to clustering ranges in read_command preserving the
original order. As a result, the clustering ranges in read command may
appear out of order. This violates storage engine's assumptions and
lead to undefined behavior.

It was seen manifesting as a SIGSEGV or an abort in sstable reader
when executing a get_slice() thrift verb:

scylla: sstables/consumer.hh:476: seastar::future<> data_consumer::continuous_data_consumer<StateProcessor>::fast_forward_to(size_t, size_t) [with StateProcessor = sstables::data_consume_rows_context_m; size_t = long unsigned int]: Assertion `end >= _stream_position.position' failed.

Fixes #6486.

Tests:

   - added a new dtest to thrift_tests.py which reproduces the problem

Message-Id: <1596725657-15802-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit bfd129cffe)
2020-08-08 19:47:57 +03:00
Rafael Ávila de Espíndola
ec874bdc31 alternator: Fix use after return
Avoid a copy of timeout so that we don't end up with a reference to a
stack allocated variable.

Fixes #6897

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200721184939.111665-1-espindola@scylladb.com>
(cherry picked from commit e83e91e352)
2020-08-03 22:24:12 +03:00
Nadav Har'El
43169ffa2c alternator: fix Expected's "NULL" operator with missing AttributeValueList
The "NULL" operator in Expected (old-style conditional operations) doesn't
have any parameters, so we insisted that the AttributeValueList be empty.
However, we forgot to allow it to also be missing - a possibility which
DynamoDB allows.

This patch adds a test to reproduce this case (the test passes on DyanmoDB,
fails on Alternator before this patch, and succeeds after this patch), and
a fix.

Fixes #6816.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709161254.618755-1-nyh@scylladb.com>
(cherry picked from commit f549d147ea)
2020-08-03 20:39:01 +03:00
Yaron Kaikov
c5ed14bff6 release: prepare for 4.2.rc2 2020-08-03 16:50:38 +03:00
Takuya ASADA
8366eda943 scylla_util.py: always use relocatable CLI tools
On some CLI tools, command options may different between latest version
vs older version.
To maximize compatibility of setup scripts, we should always use
relocatable CLI tools instead of distribution version of the tool.

Related #6954

(cherry picked from commit a19a62e6f6)
2020-08-03 10:39:26 +03:00
Takuya ASADA
5d0b0dd4c4 create-relocatable-package.py: add lsblk for relocatable CLI tools
We need latest version of lsblk that supported partition type UUID.

Fixes #6954

(cherry picked from commit 6ba2a6c42e)
2020-08-03 10:39:07 +03:00
Juliusz Stasiewicz
6f259be5f1 aggregate_fcts: Use per-type comparators for dynamic types
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of arguments.

This patch uses specific per-type comparators to provide semantically
sensible, dynamically created aggregates.

Fixes #6768

(cherry picked from commit 5b438e79be)
2020-08-03 10:26:02 +03:00
Calle Wilund
16e512e21c cql3::lists: Fix setter_by_uuid not handing null value
Fixes #6828

When using the scylla list index from UUID extension,
null values were not handled properly causing throws
from underlying layer.

(cherry picked from commit 3b74b9585f)
2020-08-03 10:19:13 +03:00
Avi Kivity
c61dc4e87d tools: toolchain: regenerate for gcc 10.2
Fixes #6813.

As a side effect, this also brings in xxhash 0.7.4.

(matches commit 66c2b4c8bf)
2020-07-31 08:48:12 +03:00
Takuya ASADA
af76a3ba79 scylla_post_install.sh: generate memory.conf for CentOS7
On CentOS7, systemd does not support percentage-based parameter.
To apply memory parameter on CentOS7, we need to override the parameter
in bytes, instead of percentage.

Fixes #6783

(cherry picked from commit 3a25e7285b)
2020-07-30 16:41:10 +03:00
Tomasz Grabiec
8fb5ebb2c6 commitlog: Fix use-after-free on mutation object during replay
The mutation object may be freed prematurely during commitlog replay
in the schema upgrading path. We will hit the problem if the memtable
is full and apply_in_memory() needs to defer.

This will typically manifest as a segfault.

Fixes #6953

Introduced in 79935df

Tests:
  - manual using scylla binary. Reproduced the problem then verified the fix makes it go away

Message-Id: <1596044010-27296-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3486eba1ce)
2020-07-30 16:36:42 +03:00
Takuya ASADA
bfb11defdd scylla_setup: skip boot partition
On GCE, /dev/sda14 reported as unused disk but it's BIOS boot partition,
should not use for scylla data partition, also cannot use for it since it's
too small.

It's better to exclude such partiotion from unsed disk list.

Fixes #6636

(cherry picked from commit d7de9518fe)
2020-07-29 09:48:10 +03:00
Asias He
2d1ddcbb6a repair: Fix race between create_writer and wait_for_writer_done
We saw scylla hit user after free in repair with the following procedure during tests:

- n1 and n2 in the cluster

- n2 ran decommission

- n2 sent data to n1 using repair

- n2 was killed forcely

- n1 tried to remove repair_meta for n1

- n1 hit use after free on repair_meta object

This was what happened on n1:

1) data was received -> do_apply_rows was called -> yield before create_writer() was called

2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
   with _writer_done[node_idx] not engaged

3) step 1 resumed, create_writer() was called and _repair_writer object was referenced

4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed

5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object

To fix, we should call wait_for_writer_done() after any pending
operations were done which were protected by repair_meta::_gate. This
prevents wait for writer done finishes before the writer is in the
process of being created.

Fixes: #6853
Fixes: #6868
Backports: 4.0, 4.1, 4.2
(cherry picked from commit e6f640441a)
2020-07-29 09:48:10 +03:00
Raphael S. Carvalho
4c560b63f0 sstable: index_reader: Make sure streams are all properly closed on failure
Turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to context, but it misses
all the input streams that can be opened in the consumer for promote index
reading. Consumer stores a list of indexes, where each of them has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes as that could cause
memory corruption due to read ahead.

Fixes #6924.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
(cherry picked from commit 0d70efa58e)
2020-07-29 09:48:10 +03:00
Nadav Har'El
00155e32b1 merge: db/view: view_update_generator: make staging reader evictable
Merged patch set by Botond Dénes:

The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The

staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.

This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.

  test/boost: view_build_test: add test_view_update_generator_buffering
  test/boost: view_build_test: add test test_view_update_generator_deadlock
  reader_permit: reader_resources: add operator- and operator+
  reader_concurrency_semaphore: add initial_resources()
  test: cql_test_env: allow overriding database_config
  mutation_reader: expose new_reader_base_cost
  db/view: view_updating_consumer: allow passing custom update pusher
  db/view: view_update_generator: make staging reader evictable
  db/view: view_updating_consumer: move implementation from table.cc to view.cc
  database: add make_restricted_range_sstable_reader()

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

(cherry picked from commit f488eaebaf)

Fixes #6892.
2020-07-28 17:02:09 +03:00
Avi Kivity
b06dffcc19 Merge "messaging: make verb handler registering independent of current scheduling group" from Botond
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.

In particular this caused severe problems with ranges scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle between readers and that of
the slice and range they were referencing.

This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.

I manually checked that each individual defense added is already
preventing the crash from happening.

Fixes: #6613
Fixes: #6907
Fixes: #6908

Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"

* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: use cached semaphore
  messaging: make verb handler registering independent of current scheduling group
  multishard_mutation_query: validate the semaphore of the looked-up reader
  reader_concurrency_semaphore: make inactive read handles unique across semaphores
  reader_concurrency_semaphore: add name() accessor
  reader_concurrency_semaphore: allow passing name to no-limit constructor

(cherry picked from commit 3f84d41880)
2020-07-27 17:41:51 +03:00
Botond Dénes
508e58ef9e sstables: clamp estimated_partitions to [1, +inf) in writers
In some cases estimated number of partitions can be 0, which is albeit a
legit estimation result, breaks many low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone as #6913 has shown it. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.

Fixes #6913

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
(cherry picked from commit fe127a2155)
2020-07-27 15:00:00 +03:00
Piotr Sarna
776faa809f Merge 'view_update_generator: use partitioned sstable set'
from Botond.

Recently it was observed (#6603) that since 4e6400293ea, the staging
reader is reading from a lot of sstables (200+). This consumes a lot of
memory, and after this reaches a certain threshold -- the entire memory
amount of the streaming reader concurrency semaphore -- it can cause a
deadlock within the view update generation. To reduce this memory usage,
we exploit the fact that the staging sstables are usually disjoint, and
use the partitioned sstable set to create the staging reader. This
should ensure that only the minimum number of sstable readers will be
opened at any time.

Refs: #6603
Fixes: #6707

Tests: unit(dev)

* 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla:
  db/view: view_update_generator: use partitioned sstable set
  sstables: make_partitioned_sstable_set(): return an sstable_set

(cherry picked from commit e4b74356bb)
2020-07-21 15:40:02 +03:00
Raphael S. Carvalho
7037f43a17 table: Fix Staging SSTables being incorrectly added or removed from the backlog tracker
Staging SSTables can be incorrectly added or removed from the backlog tracker,
after an ALTER TABLE or TRUNCATE, because the add and removal don't take
into account if the SSTable requires view building, so a Staging SSTable can
be added to the tracker after a ALTER table, or removed after a TRUNCATE,
even though not added previously, potentially causing the backlog to
become negative.

Fixes #6798.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
(cherry picked from commit b67066cae2)
2020-07-21 12:57:09 +03:00
Avi Kivity
bd713959ce Update seastar submodule
* seastar 8aad24a5f8...4641f4f2d3 (4):
  > httpd: Don't warn on ECONNABORTED
  > httpd: Avoid calling future::then twice on the same future
Fixes #6709.
  > httpd: Use handle_exception instead of then_wrapped
  > httpd: Use std::unique_ptr instead of a raw pointer
2020-07-19 11:49:02 +03:00
Rafael Ávila de Espíndola
b7c5a918cb mutation_reader_test: Wait for a future
Nothing was waiting for this future. Found while testing another
patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630183929.1704908-1-espindola@scylladb.com>
(cherry picked from commit 6fe7706fce)

Fixes #6858.
2020-07-16 14:44:31 +03:00
Asias He
fb2ae9e66b repair: Relax node selection in bootstrap when nodes are less than RF
Consider a cluster with two nodes:

 - n1 (dc1)
 - n2 (dc2)

A third node is bootstrapped:

 - n3 (dc2)

The n3 fails to bootstrap as follows:

 [shard 0] init - Startup failed: std::runtime_error
 (bootstrap_with_repair: keyspace=system_distributed,
 range=(9183073555191895134, 9196226903124807343], no existing node in
 local dc)

The system_distributed keyspace is using SimpleStrategy with RF 3. For
the keyspace that does not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.

Fixes: #6744
Backports: 4.0 4.1, 4.2
(cherry picked from commit 38d964352d)
2020-07-16 12:02:38 +03:00
Asias He
7a7ed8c65d repair: Relax size check of get_row_diff and set_diff
In case a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.

For example,

Node1 (Repair master):
row1  -> hash1
row2  -> hash2
row3  -> hash3
row3' -> hash3

Node2 (Repair follower):
row1  -> hash1
row2  -> hash2

We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:

   repair - Got error in row level repair: std::runtime_error
   (row_diff.size() != set_diff.size())

In this case, node1 should send both row3 and row3' to peer node
instead of fail the whole repair. Because node2 does not have row3 or
row3', otherwise node1 won't send row with hash3 to node1 in the first
place.

Refs: #6252
(cherry picked from commit a00ab8688f)
2020-07-15 14:48:49 +03:00
Nadav Har'El
7b9be752ec alternator test: configurable temporary directory
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.

But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectry is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.

The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.

Fixes #6750

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
(cherry picked from commit 8e3be5e7d6)
2020-07-14 12:34:26 +03:00
Konstantin Osipov
903e967a16 Export TMPDIR pointing at subdir of testlog/
Export TMPDIR environment variable pointing at a subdir of testlog.
This variable is used by seastar/scylla tests to create a
a subdirectory with temporary test data. Normally a test cleans
up the temporary directory, but if it crashes or is killed the
directory remains.

By resetting the default location from /tmp to testlog/{mode}
we allow test.py we consolidate all test artefacts in a single
place.

Fixes #6062, "test.py uses tmpfs"

(cherry picked from commit e628da863d)
2020-07-14 12:34:06 +03:00
Avi Kivity
b84946895c Update seastar submodule
* seastar 1e762652c4...8aad24a5f8 (2):
  > futures: Add a test for a broken promise in a parallel_for_each
  > future: Call set_to_broken_promise earlier

Fixes #6749 (probably).
2020-07-13 20:08:16 +03:00
Asias He
a27188886a repair: Switch to btree_set for repair_hash.
In one of the longevity tests, we observed 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered big memory allocation and
reclaim.

I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one that seastar
oversized allocation checker did not warn in my tests where around 300K
repair hashes were inserted into the container.

- unordered_set:
hash_sets=295634, time=333029199 ns

- flat_hash_set:
hash_sets=295634, time=312484711 ns

- node_hash_set:
hash_sets=295634, time=346195835 ns

- btree_set:
hash_sets=295634, time=341379801 ns

The btree_set is a bit slower than unordered_set but it does not have
huge memory allocation. I do not measure real difference of total time
to finish repair of the same dataset with unordered_set and btree_set.

To fix, switch to absl btree_set container.

Fixes #6190

(cherry picked from commit 67f6da6466)
2020-07-13 10:09:23 +03:00
Dmitry Kropachev
51d4efc321 dist/common/scripts/scylla-housekeeping: wrap urllib.request with try ... except
We could hit "cannot serialize '_io.BufferedReader' object" when request get 404 error from the server
	Now you will get legit error message in the case.

	Fixes #6690

(cherry picked from commit de82b3efae)
2020-07-09 18:24:55 +03:00
Avi Kivity
0847eea8d6 Update seastar submodule
* seastar 11e86172ba...1e762652c4 (1):
  > sharded: Do not hang on never set freed promise

Fixes #6606.
2020-07-09 15:52:26 +03:00
Avi Kivity
35ad57cb9c Point seastar submodule at scylla-seastar.git
This allows us to backport seastar patches to 4.2.
2020-07-09 15:50:25 +03:00
Hagit Segev
42b0b9ad08 release: prepare for 4.2.rc1 2020-07-08 23:01:10 +03:00
Dejan Mircevski
68b95bf2ac cql/restrictions: Handle WHERE a>0 AND a<0
WHERE clauses with start point above the end point were handled
incorrectly.  When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0.  This is clearly incorrect, as the user's intent was to filter out
all possible values of a.

Fix it by explicitly short-circuiting to false when start > end.  Add
a test case.

Fixes #5799.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 921dbd0978)
2020-07-08 13:20:10 +03:00
Botond Dénes
fea83f6ae0 db/view: view_update_generator: re-balance wait/signal on the register semaphore
The view update generator has a semaphore to limit concurrency. This
semaphore is waited on in `register_staging_sstable()` and later the
unit is returned after the sstable is processed in the loop inside
`start()`.
This was broken by 4e64002, which changed the loop inside `start()` to
process sstables in per table batches, however didn't change the
`signal()` call to return the amount of units according to the number of
sstables processed. This can cause the semaphore units to dry up, as the
loop can process multiple sstables per table but return just a single
unit. This can also block callers of `register_staging_sstable()`
indefinitely as some waiters will never be released as under the right
circumstances the units on the semaphore can permanently go below 0.
In addition to this, 4e64002 introduced another bug: table entries from
the `_sstables_with_tables` are never removed, so they are processed
every turn. If the sstable list is empty, there won't be any update
generated but due to the unconditional `signal()` described above, this
can cause the units on the semaphore to grow to infinity, allowing
future staging sstables producers to register a huge amount of sstables,
causing memory problems due to the amount of sstable readers that have
to be opened (#6603, #6707).
Both outcomes are equally bad. This patch fixes both issues and modifies
the `test_view_update_generator` unit test to reproduce them and hence
to verify that this doesn't happen in the future.

Fixes: #6774
Refs: #6707
Refs: #6603

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200706135108.116134-1-bdenes@scylladb.com>
(cherry picked from commit 5ebe2c28d1)
2020-07-08 11:13:24 +03:00
Takuya ASADA
76618a7e06 scylla_setup: don't add same disk device twice
We shouldn't accept adding same disk twice for RAID prompt.

Fixes #6711

(cherry picked from commit 835e76fdfc)
2020-07-07 13:07:59 +03:00
Takuya ASADA
189a08ac72 scylla_setup: follow hugepages package name change on Ubuntu 20.04LTS
hugepages package now renamed to libhugetlbfs-bin, we need to follow
the change.

Fixes #6673

(cherry picked from commit 03ce19d53a)
2020-07-05 14:41:33 +03:00
Takuya ASADA
a3e9915a83 dist/debian: apply generated package version for .orig.tar.gz file
We currently does not able to apply version number fixup for .orig.tar.gz file,
even we applied correct fixup on debian/changelog, becuase it just reading
SCYLLA-VERSION-FILE.
We should parse debian/{changelog,control} instead.

Fixes #6736

(cherry picked from commit a107f086bc)
2020-07-05 14:08:37 +03:00
Asias He
e4bc14ec1a boot_strapper: Ignore node to be replaced explicitly as stream source
After commit 7d86a3b208 (storage_service:
Make replacing node take writes), during replace operation, tokens in
_token_metadata for node being replaced are updated only after the replace
operation is finished. As a result, in range_streamer::add_ranges, the
node being replaced will be considered as a source to stream data from.

Before commit 7d86a3b208, the node being
replaced will not be considered as a source node because it is already
replaced by the replacing node before the replace operation is finished.
This is the reason why it works in the past.

To fix, filter out the node being replaced as a source node explicitly.

Tests: replace_first_boot_test and replace_stopped_node_test
Backports: 4.1
Fixes: #6728
(cherry picked from commit e338028b7e22b0a80be7f80c337c52f958bfe1d7)
2020-07-01 14:36:43 +03:00
Takuya ASADA
972acb6d56 scylla_swap_setup: handle <1GB environment
Show better error message and exit with non-zero status when memory size <1GB.

Fixes #6659

(cherry picked from commit a9de438b1f)
2020-07-01 12:40:25 +03:00
Yaron Kaikov
7fbfedf025 dist/docker/redhat/Dockerfile: update 4.2 params
Set SCYLLA_REPO and VERSION values for scylla-4.2
2020-06-30 13:09:06 +03:00
Avi Kivity
5f175f8103 Merge "Fix handling of decimals with negative scales" from Rafael
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.

Fixes #6720
"

* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
  big_decimal: Add a test for a corner case
  big_decimal: Correctly handle negative scales
  big_decimal: Add a as_rational member function
  big_decimal: Move constructors out of line

(cherry picked from commit 3e2eeec83a)
2020-06-29 12:05:17 +03:00
Benny Halevy
674ad6656a comapction: restore % in compaction completion message
The % sign fell off in c4841fa735

Fixes #6727.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200625151352.736561-1-bhalevy@scylladb.com>
(cherry picked from commit a843945115)
2020-06-28 12:10:21 +03:00
Hagit Segev
58498b4b6c release: prepare for 4.2.rc0 2020-06-26 13:06:07 +03:00
167 changed files with 4724 additions and 1011 deletions

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=666.development
VERSION=4.2.4
if test -f version
then

View File

@@ -129,7 +129,7 @@ future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string us
auth::meta::roles_table::qualified_name(), auth::meta::roles_table::role_col_name);
auto cl = auth::password_authenticator::consistency_for_user(username);
auto timeout = auth::internal_distributed_timeout_config();
auto& timeout = auth::internal_distributed_timeout_config();
return qp.execute_internal(query, cl, timeout, {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
auto res = f.get0();
auto salted_hash = std::optional<sstring>();

View File

@@ -98,6 +98,11 @@ struct nonempty : public size_check {
// Check that array has the expected number of elements
static void verify_operand_count(const rjson::value* array, const size_check& expected, const rjson::value& op) {
if (!array && expected(0)) {
// If expected() allows an empty AttributeValueList, it is also fine
// that it is missing.
return;
}
if (!array || !array->IsArray()) {
throw api_error("ValidationException", "With ComparisonOperator, AttributeValueList must be given and an array");
}
@@ -154,23 +159,40 @@ static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException", format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error("ValidationException", format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
}
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error("ValidationException", "begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error("ValidationException", format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error("ValidationException", "begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error("ValidationException", format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
}
if (bad) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
@@ -274,24 +296,38 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparion operators - lt, le, gt, ge, and between.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error("ValidationException", format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error("ValidationException", format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (bad) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -305,7 +341,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
}
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
return false;
}
@@ -336,57 +373,71 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
template <typename T>
static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
if (cmp_lt()(ub, lb)) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
if (bounds_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error("ValidationException", "between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
throw api_error(
"ValidationException",
if (bounds_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
}
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
if (v_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -433,19 +484,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -457,7 +508,8 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
case comparison_operator_type::CONTAINS:
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -569,7 +621,8 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -600,13 +653,17 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));

View File

@@ -52,6 +52,7 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,

View File

@@ -626,11 +626,8 @@ void rmw_operation::set_default_write_isolation(std::string_view value) {
default_write_isolation = parse_write_isolation(value);
}
// FIXME: Updating tags currently relies on updating schema, which may be subject
// to races during concurrent updates of the same table. Once Scylla schema updates
// are fixed, this issue will automatically get fixed as well.
enum class update_tags_action { add_tags, delete_tags };
static future<> update_tags(service::migration_manager& mm, const rjson::value& tags, schema_ptr schema, std::map<sstring, sstring>&& tags_map, update_tags_action action) {
static void update_tags_map(const rjson::value& tags, std::map<sstring, sstring>& tags_map, update_tags_action action) {
if (action == update_tags_action::add_tags) {
for (auto it = tags.Begin(); it != tags.End(); ++it) {
const rjson::value& key = (*it)["Key"];
@@ -652,28 +649,20 @@ static future<> update_tags(service::migration_manager& mm, const rjson::value&
}
if (tags_map.size() > 50) {
return make_exception_future<>(api_error("ValidationException", "Number of Tags exceed the current limit for the provided ResourceArn"));
throw api_error("ValidationException", "Number of Tags exceed the current limit for the provided ResourceArn");
}
validate_tags(tags_map);
}
// FIXME: Updating tags currently relies on updating schema, which may be subject
// to races during concurrent updates of the same table. Once Scylla schema updates
// are fixed, this issue will automatically get fixed as well.
static future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map) {
schema_builder builder(schema);
builder.set_extensions(schema::extensions_map{{sstring(tags_extension::NAME), ::make_shared<tags_extension>(std::move(tags_map))}});
return mm.announce_column_family_update(builder.build(), false, std::vector<view_ptr>(), false);
}
static future<> add_tags(service::migration_manager& mm, service::storage_proxy& proxy, schema_ptr schema, rjson::value& request_info) {
const rjson::value* tags = rjson::find(request_info, "Tags");
if (!tags || !tags->IsArray()) {
return make_exception_future<>(api_error("ValidationException", format("Cannot parse tags")));
}
if (tags->Size() < 1) {
return make_exception_future<>(api_error("ValidationException", "The number of tags must be at least 1"));
}
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
return update_tags(mm, rjson::copy(*tags), schema, std::move(tags_map), update_tags_action::add_tags);
}
future<executor::request_return_type> executor::tag_resource(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.tag_resource++;
@@ -683,7 +672,16 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
return api_error("AccessDeniedException", "Incorrect resource identifier");
}
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
add_tags(_mm, _proxy, schema, request).get();
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
const rjson::value* tags = rjson::find(request, "Tags");
if (!tags || !tags->IsArray()) {
return api_error("ValidationException", format("Cannot parse tags"));
}
if (tags->Size() < 1) {
return api_error("ValidationException", "The number of tags must be at least 1") ;
}
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
update_tags(_mm, schema, std::move(tags_map)).get();
return json_string("");
});
}
@@ -704,7 +702,8 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
update_tags(_mm, *tags, schema, std::move(tags_map), update_tags_action::delete_tags).get();
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
update_tags(_mm, schema, std::move(tags_map)).get();
return json_string("");
});
}
@@ -891,9 +890,22 @@ future<executor::request_return_type> executor::create_table(client_state& clien
view_builders.emplace_back(std::move(view_builder));
}
}
if (rjson::find(request, "SSESpecification")) {
return make_ready_future<request_return_type>(api_error("ValidationException", "SSESpecification: configuring encryption-at-rest is not yet supported."));
// We don't yet support configuring server-side encryption (SSE) via the
// SSESpecifiction attribute, but an SSESpecification with Enabled=false
// is simply the default, and should be accepted:
rjson::value* sse_specification = rjson::find(request, "SSESpecification");
if (sse_specification && sse_specification->IsObject()) {
rjson::value* enabled = rjson::find(*sse_specification, "Enabled");
if (!enabled || !enabled->IsBool()) {
return make_ready_future<request_return_type>(api_error("ValidationException", "SSESpecification needs boolean Enabled"));
}
if (enabled->GetBool()) {
// TODO: full support for SSESpecification
return make_ready_future<request_return_type>(api_error("ValidationException", "SSESpecification: configuring encryption-at-rest is not yet supported."));
}
}
// We don't yet support streams (CDC), but a StreamSpecification asking
// *not* to use streams should be accepted:
rjson::value* stream_specification = rjson::find(request, "StreamSpecification");
@@ -908,6 +920,14 @@ future<executor::request_return_type> executor::create_table(client_state& clien
}
}
// Parse the "Tags" parameter early, so we can avoid creating the table
// at all if this parsing failed.
const rjson::value* tags = rjson::find(request, "Tags");
std::map<sstring, sstring> tags_map;
if (tags && tags->IsArray()) {
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
}
builder.set_extensions(schema::extensions_map{{sstring(tags_extension::NAME), ::make_shared<tags_extension>()}});
schema_ptr schema = builder.build();
auto where_clause_it = where_clauses.begin();
@@ -928,14 +948,14 @@ future<executor::request_return_type> executor::create_table(client_state& clien
return create_keyspace(keyspace_name).handle_exception_type([] (exceptions::already_exists_exception&) {
// Ignore the fact that the keyspace may already exist. See discussion in #6340
}).then([this, table_name, request = std::move(request), schema, view_builders = std::move(view_builders)] () mutable {
return futurize_invoke([&] { return _mm.announce_new_column_family(schema, false); }).then([this, table_info = std::move(request), schema, view_builders = std::move(view_builders)] () mutable {
}).then([this, table_name, request = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
return futurize_invoke([&] { return _mm.announce_new_column_family(schema, false); }).then([this, table_info = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
return parallel_for_each(std::move(view_builders), [this, schema] (schema_builder builder) {
return _mm.announce_new_view(view_ptr(builder.build()));
}).then([this, table_info = std::move(table_info), schema] () mutable {
}).then([this, table_info = std::move(table_info), schema, tags_map = std::move(tags_map)] () mutable {
future<> f = make_ready_future<>();
if (rjson::find(table_info, "Tags")) {
f = add_tags(_mm, _proxy, schema, table_info);
if (!tags_map.empty()) {
f = update_tags(_mm, schema, std::move(tags_map));
}
return f.then([this] {
return wait_for_schema_agreement(_mm, db::timeout_clock::now() + 10s);
@@ -963,15 +983,24 @@ class attribute_collector {
void add(bytes&& name, atomic_cell&& cell) {
collected.emplace(std::move(name), std::move(cell));
}
void add(const bytes& name, atomic_cell&& cell) {
collected.emplace(name, std::move(cell));
}
public:
attribute_collector() : collected(attrs_type()->get_keys_type()->as_less_comparator()) { }
void put(bytes&& name, bytes&& val, api::timestamp_type ts) {
add(std::move(name), atomic_cell::make_live(*bytes_type, ts, std::move(val), atomic_cell::collection_member::yes));
void put(bytes&& name, const bytes& val, api::timestamp_type ts) {
add(std::move(name), atomic_cell::make_live(*bytes_type, ts, val, atomic_cell::collection_member::yes));
}
void put(const bytes& name, const bytes& val, api::timestamp_type ts) {
add(name, atomic_cell::make_live(*bytes_type, ts, val, atomic_cell::collection_member::yes));
}
void del(bytes&& name, api::timestamp_type ts) {
add(std::move(name), atomic_cell::make_dead(ts, gc_clock::now()));
}
void del(const bytes& name, api::timestamp_type ts) {
add(name, atomic_cell::make_dead(ts, gc_clock::now()));
}
collection_mutation_description to_mut() {
collection_mutation_description ret;
for (auto&& e : collected) {
@@ -1048,7 +1077,7 @@ public:
put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item);
// put_or_delete_item doesn't keep a reference to schema (so it can be
// moved between shards for LWT) so it needs to be given again to build():
mutation build(schema_ptr schema, api::timestamp_type ts);
mutation build(schema_ptr schema, api::timestamp_type ts) const;
const partition_key& pk() const { return _pk; }
const clustering_key& ck() const { return _ck; }
};
@@ -1077,7 +1106,7 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
}
}
mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) {
mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) const {
mutation m(schema, _pk);
// If there's no clustering key, a tombstone should be created directly
// on a partition, not on a clustering row - otherwise it will look like
@@ -1099,7 +1128,7 @@ mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) {
for (auto& c : *_cells) {
const column_definition* cdef = schema->get_column_definition(c.column_name);
if (!cdef) {
attrs_collector.put(std::move(c.column_name), std::move(c.value), ts);
attrs_collector.put(c.column_name, c.value, ts);
} else {
row.cells().apply(*cdef, atomic_cell::make_live(*cdef->type, ts, std::move(c.value)));
}
@@ -1410,7 +1439,7 @@ public:
check_needs_read_before_write(_condition_expression) ||
_returnvalues == returnvalues::ALL_OLD;
}
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) override {
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const override {
if (!verify_expected(_request, previous_item.get()) ||
!verify_condition_expression(_condition_expression, previous_item.get())) {
// If the update is to be cancelled because of an unfulfilled Expected
@@ -1420,6 +1449,8 @@ public:
}
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
_return_attributes = std::move(*previous_item);
} else {
_return_attributes = {};
}
return _mutation_builder.build(_schema, ts);
}
@@ -1493,7 +1524,7 @@ public:
check_needs_read_before_write(_condition_expression) ||
_returnvalues == returnvalues::ALL_OLD;
}
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) override {
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const override {
if (!verify_expected(_request, previous_item.get()) ||
!verify_condition_expression(_condition_expression, previous_item.get())) {
// If the update is to be cancelled because of an unfulfilled Expected
@@ -1503,6 +1534,8 @@ public:
}
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
_return_attributes = std::move(*previous_item);
} else {
_return_attributes = {};
}
return _mutation_builder.build(_schema, ts);
}
@@ -1577,7 +1610,7 @@ public:
virtual ~put_or_delete_item_cas_request() = default;
virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override {
std::optional<mutation> ret;
for (put_or_delete_item& mutation_builder : _mutation_builders) {
for (const put_or_delete_item& mutation_builder : _mutation_builders) {
// We assume all these builders have the same partition.
if (ret) {
ret->apply(mutation_builder.build(schema, ts));
@@ -1762,7 +1795,8 @@ static std::string get_item_type_string(const rjson::value& v) {
// calculate_attrs_to_get() takes either AttributesToGet or
// ProjectionExpression parameters (having both is *not* allowed),
// and returns the list of cells we need to read.
// and returns the list of cells we need to read, or an empty set when
// *all* attributes are to be returned.
// In our current implementation, only top-level attributes are stored
// as cells, and nested documents are stored serialized as JSON.
// So this function currently returns only the the top-level attributes
@@ -1906,7 +1940,7 @@ public:
update_item_operation(service::storage_proxy& proxy, rjson::value&& request);
virtual ~update_item_operation() = default;
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) override;
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const override;
bool needs_read_before_write() const;
};
@@ -1984,7 +2018,7 @@ update_item_operation::needs_read_before_write() const {
}
std::optional<mutation>
update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) {
update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const {
if (!verify_expected(_request, previous_item.get()) ||
!verify_condition_expression(_condition_expression, previous_item.get())) {
// If the update is to be cancelled because of an unfulfilled
@@ -2100,19 +2134,30 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value result;
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
// An ADD can be used to create a new attribute (when
// v1.IsNull()) or to add to a pre-existing attribute:
if (v1.IsNull()) {
std::string v2_type = get_item_type_string(v2);
if (v2_type == "N" || v2_type == "SS" || v2_type == "NS" || v2_type == "BS") {
result = v2;
} else {
throw api_error("ValidationException", format("An operand in the update expression has an incorrect data type: {}", v2));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error("ValidationException", format("An operand in the update expression has an incorrect data type: {}", v1));
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error("ValidationException", format("An operand in the update expression has an incorrect data type: {}", v1));
}
}
do_update(to_bytes(column_name), result);
},
@@ -2426,6 +2471,10 @@ public:
std::unordered_set<std::string>& used_attribute_values);
bool check(const rjson::value& item) const;
bool filters_on(std::string_view attribute) const;
// for_filters_on() runs the given function on the attributes that the
// filter works on. It may run for the same attribute more than once if
// used more than once in the filter.
void for_filters_on(const noncopyable_function<void(std::string_view)>& func) const;
operator bool() const { return bool(_imp); }
};
@@ -2506,10 +2555,26 @@ bool filter::filters_on(std::string_view attribute) const {
}, *_imp);
}
void filter::for_filters_on(const noncopyable_function<void(std::string_view)>& func) const {
if (_imp) {
std::visit(overloaded_functor {
[&] (const conditions_filter& f) -> void {
for (auto it = f.conditions.MemberBegin(); it != f.conditions.MemberEnd(); ++it) {
func(rjson::to_string_view(it->name));
}
},
[&] (const expression_filter& f) -> void {
return for_condition_expression_on(f.expression, func);
}
}, *_imp);
}
}
class describe_items_visitor {
typedef std::vector<const column_definition*> columns_t;
const columns_t& _columns;
const std::unordered_set<std::string>& _attrs_to_get;
std::unordered_set<std::string> _extra_filter_attrs;
const filter& _filter;
typename columns_t::const_iterator _column_it;
rjson::value _item;
@@ -2525,7 +2590,20 @@ public:
, _item(rjson::empty_object())
, _items(rjson::empty_array())
, _scanned_count(0)
{ }
{
// _filter.check() may need additional attributes not listed in
// _attrs_to_get (i.e., not requested as part of the output).
// We list those in _extra_filter_attrs. We will include them in
// the JSON but take them out before finally returning the JSON.
if (!_attrs_to_get.empty()) {
_filter.for_filters_on([&] (std::string_view attr) {
std::string a(attr); // no heterogenous maps searches :-(
if (!_attrs_to_get.contains(a)) {
_extra_filter_attrs.emplace(std::move(a));
}
});
}
}
void start_row() {
_column_it = _columns.begin();
@@ -2539,7 +2617,7 @@ public:
result_bytes_view->with_linearized([this] (bytes_view bv) {
std::string column_name = (*_column_it)->name_as_text();
if (column_name != executor::ATTRS_COLUMN_NAME) {
if (_attrs_to_get.empty() || _attrs_to_get.count(column_name) > 0) {
if (_attrs_to_get.empty() || _attrs_to_get.contains(column_name) || _extra_filter_attrs.contains(column_name)) {
if (!_item.HasMember(column_name.c_str())) {
rjson::set_with_string_name(_item, column_name, rjson::empty_object());
}
@@ -2551,7 +2629,7 @@ public:
auto keys_and_values = value_cast<map_type_impl::native_type>(deserialized);
for (auto entry : keys_and_values) {
std::string attr_name = value_cast<sstring>(entry.first);
if (_attrs_to_get.empty() || _attrs_to_get.count(attr_name) > 0) {
if (_attrs_to_get.empty() || _attrs_to_get.contains(attr_name) || _extra_filter_attrs.contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
rjson::set_with_string_name(_item, attr_name, deserialize_item(value));
}
@@ -2563,6 +2641,11 @@ public:
void end_row() {
if (_filter.check(_item)) {
// Remove the extra attributes _extra_filter_attrs which we had
// to add just for the filter, and not requested to be returned:
for (const auto& attr : _extra_filter_attrs) {
rjson::remove_member(_item, attr);
}
rjson::push_back(_items, std::move(_item));
}
_item = rjson::empty_object();
@@ -2597,7 +2680,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.partition_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_pk_it)));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_pk_it, cdef));
++exploded_pk_it;
}
auto ck = paging_state.get_clustering_key();
@@ -2607,7 +2690,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.clustering_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_ck_it)));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_ck_it, cdef));
++exploded_ck_it;
}
}

View File

@@ -348,6 +348,39 @@ bool condition_expression_on(const parsed::condition_expression& ce, std::string
}, ce._expression);
}
// for_condition_expression_on() runs a given function over all the attributes
// mentioned in the expression. If the same attribute is mentioned more than
// once, the function will be called more than once for the same attribute.
static void for_value_on(const parsed::value& v, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::constant& c) { },
[&] (const parsed::value::function_call& f) {
for (const parsed::value& value : f._parameters) {
for_value_on(value, func);
}
},
[&] (const parsed::path& p) {
func(p.root());
}
}, v._value);
}
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) {
for (const parsed::value& value : cond._values) {
for_value_on(value, func);
}
},
[&] (const parsed::condition_expression::condition_list& list) {
for (const parsed::condition_expression& cond : list.conditions) {
for_condition_expression_on(cond, func);
}
}
}, ce._expression);
}
// The following calculate_value() functions calculate, or evaluate, a parsed
// expression. The parsed expression is assumed to have been "resolved", with
// the matching resolve_* function.
@@ -570,52 +603,8 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
// TODO: There's duplication here with check_BEGINS_WITH().
// But unfortunately, the two functions differ a bit.
// If one of v1 or v2 is malformed or has an unsupported type
// (not B or S), what we do depends on whether it came from
// the user's query (is_constant()), or the item. Unsupported
// values in the query result in an error, but if they are in
// the item, we silently return false (no match).
bool bad = false;
if (!v1.IsObject() || v1.MemberCount() != 1) {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
}
} else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
}
}
bool ret = false;
if (!bad) {
auto it1 = v1.MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name == it2->name) {
if (it2->name == "S") {
std::string_view val1 = rjson::to_string_view(it1->value);
std::string_view val2 = rjson::to_string_view(it2->value);
ret = val1.starts_with(val2);
} else /* it2->name == "B" */ {
ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
}
return to_bool_json(ret);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {

View File

@@ -27,6 +27,8 @@
#include <unordered_set>
#include <string_view>
#include <seastar/util/noncopyable_function.hh>
#include "expressions_types.hh"
#include "rjson.hh"
@@ -59,6 +61,11 @@ void validate_value(const rjson::value& v, const char* caller);
bool condition_expression_on(const parsed::condition_expression& ce, std::string_view attribute);
// for_condition_expression_on() runs the given function on the attributes
// that the expression uses. It may run for the same attribute more than once
// if the same attribute is used more than once in the expression.
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func);
// calculate_value() behaves slightly different (especially, different
// functions supported) when used in different types of expressions, as
// enumerated in this enum:

View File

@@ -87,7 +87,11 @@ protected:
// When _returnvalues != NONE, apply() should store here, in JSON form,
// the values which are to be returned in the "Attributes" field.
// The default null JSON means do not return an Attributes field at all.
rjson::value _return_attributes;
// This field is marked "mutable" so that the const apply() can modify
// it (see explanation below), but note that because apply() may be
// called more than once, if apply() will sometimes set this field it
// must set it (even if just to the default empty value) every time.
mutable rjson::value _return_attributes;
public:
// The constructor of a rmw_operation subclass should parse the request
// and try to discover as many input errors as it can before really
@@ -100,7 +104,12 @@ public:
// conditional expression, apply() should return an empty optional.
// apply() may throw if it encounters input errors not discovered during
// the constructor.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) = 0;
// apply() may be called more than once in case of contention, so it must
// not change the state saved in the object (issue #7218 was caused by
// violating this). We mark apply() "const" to let the compiler validate
// this for us. The output-only field _return_attributes is marked
// "mutable" above so that apply() can still write to it.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const = 0;
// Convert the above apply() into the signature needed by cas_request:
virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override;
virtual ~rmw_operation() = default;

View File

@@ -650,7 +650,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -658,7 +658,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -666,7 +666,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -674,7 +674,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -682,7 +682,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
@@ -690,7 +690,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});

View File

@@ -322,8 +322,8 @@ void set_storage_service(http_context& ctx, routes& r) {
for (auto cf : column_families) {
column_families_vec.push_back(&db.find_column_family(keyspace, cf));
}
return parallel_for_each(column_families_vec, [&cm] (column_family* cf) {
return cm.perform_cleanup(cf);
return parallel_for_each(column_families_vec, [&cm, &db] (column_family* cf) {
return cm.perform_cleanup(db, cf);
});
}).then([]{
return make_ready_future<json::json_return_type>(0);

View File

@@ -386,6 +386,7 @@ scylla_tests = set([
'test/boost/view_schema_ckey_test',
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/stall_free_test',
'test/manual/ec2_snitch_test',
'test/manual/gce_snitch_test',
'test/manual/gossip',

View File

@@ -267,10 +267,13 @@ public:
}
};
/// The same as `impl_max_function_for' but without knowledge of `Type'.
/// The same as `impl_max_function_for' but without compile-time dependency on `Type'.
class impl_max_dynamic_function final : public aggregate_function::aggregate {
data_type _io_type;
opt_bytes _max;
public:
impl_max_dynamic_function(data_type io_type) : _io_type(std::move(io_type)) {}
virtual void reset() override {
_max = {};
}
@@ -278,12 +281,11 @@ public:
return _max.value_or(bytes{});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
if (values.empty() || !values[0]) {
return;
}
const auto val = *values[0];
if (!_max || *_max < val) {
_max = val;
if (!_max || _io_type->less(*_max, *values[0])) {
_max = values[0];
}
}
};
@@ -298,10 +300,13 @@ public:
};
class max_dynamic_function final : public native_aggregate_function {
data_type _io_type;
public:
max_dynamic_function(data_type io_type) : native_aggregate_function("max", io_type, { io_type }) {}
max_dynamic_function(data_type io_type)
: native_aggregate_function("max", io_type, { io_type })
, _io_type(std::move(io_type)) {}
virtual std::unique_ptr<aggregate> new_aggregate() override {
return std::make_unique<impl_max_dynamic_function>();
return std::make_unique<impl_max_dynamic_function>(_io_type);
}
};
@@ -358,10 +363,13 @@ public:
}
};
/// The same as `impl_min_function_for' but without knowledge of `Type'.
/// The same as `impl_min_function_for' but without compile-time dependency on `Type'.
class impl_min_dynamic_function final : public aggregate_function::aggregate {
data_type _io_type;
opt_bytes _min;
public:
impl_min_dynamic_function(data_type io_type) : _io_type(std::move(io_type)) {}
virtual void reset() override {
_min = {};
}
@@ -369,12 +377,11 @@ public:
return _min.value_or(bytes{});
}
virtual void add_input(cql_serialization_format sf, const std::vector<opt_bytes>& values) override {
if (!values[0]) {
if (values.empty() || !values[0]) {
return;
}
const auto val = *values[0];
if (!_min || val < *_min) {
_min = val;
if (!_min || _io_type->less(*values[0], *_min)) {
_min = values[0];
}
}
};
@@ -389,10 +396,13 @@ public:
};
class min_dynamic_function final : public native_aggregate_function {
data_type _io_type;
public:
min_dynamic_function(data_type io_type) : native_aggregate_function("min", io_type, { io_type }) {}
min_dynamic_function(data_type io_type)
: native_aggregate_function("min", io_type, { io_type })
, _io_type(std::move(io_type)) {}
virtual std::unique_ptr<aggregate> new_aggregate() override {
return std::make_unique<impl_min_dynamic_function>();
return std::make_unique<impl_min_dynamic_function>(_io_type);
}
};

View File

@@ -88,16 +88,13 @@ static data_value castas_fctn_simple(data_value from) {
template<typename ToType>
static data_value castas_fctn_from_decimal_to_float(data_value from) {
auto val_from = value_cast<big_decimal>(from);
boost::multiprecision::cpp_int ten(10);
boost::multiprecision::cpp_rational r = val_from.unscaled_value();
r /= boost::multiprecision::pow(ten, val_from.scale());
return static_cast<ToType>(r);
return static_cast<ToType>(val_from.as_rational());
}
static utils::multiprecision_int from_decimal_to_cppint(const data_value& from) {
const auto& val_from = value_cast<big_decimal>(from);
boost::multiprecision::cpp_int ten(10);
return boost::multiprecision::cpp_int(val_from.unscaled_value() / boost::multiprecision::pow(ten, val_from.scale()));
auto r = val_from.as_rational();
return utils::multiprecision_int(numerator(r)/denominator(r));
}
template<typename ToType>

View File

@@ -357,7 +357,12 @@ lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix,
collection_mutation_description mut;
mut.cells.reserve(1);
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
if (!value) {
mut.cells.emplace_back(to_bytes(*index), params.make_dead_cell());
} else {
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
}
m.set_cell(prefix, column, mut.serialize(*ltype));
}

View File

@@ -417,7 +417,7 @@ std::vector<const column_definition*> statement_restrictions::get_column_defs_fo
_clustering_columns_restrictions->num_prefix_columns_that_need_not_be_filtered();
for (auto&& cdef : _clustering_columns_restrictions->get_column_defs()) {
::shared_ptr<single_column_restriction> restr;
if (single_pk_restrs) {
if (single_ck_restrs) {
auto it = single_ck_restrs->restrictions().find(cdef);
if (it != single_ck_restrs->restrictions().end()) {
restr = dynamic_pointer_cast<single_column_restriction>(it->second);
@@ -688,6 +688,11 @@ static query::range<bytes_view> to_range(const term_slice& slice, const query_op
extract_bound(statements::bound::END));
}
static bool contains_without_wraparound(
const query::range<bytes_view>& range, bytes_view value, const serialized_tri_compare& cmp) {
return !range.is_wrap_around(cmp) && range.contains(value, cmp);
}
bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
@@ -702,13 +707,13 @@ bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
return false;
}
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return to_range(_slice, options, _column_def.name_as_text()).contains(
return contains_without_wraparound(to_range(_slice, options, _column_def.name_as_text()),
cell_value_bv, _column_def.type->as_tri_comparator());
});
}
bool single_column_restriction::slice::is_satisfied_by(bytes_view data, const query_options& options) const {
return to_range(_slice, options, _column_def.name_as_text()).contains(
return contains_without_wraparound(to_range(_slice, options, _column_def.name_as_text()),
data, _column_def.type->underlying_type()->as_tri_comparator());
}

View File

@@ -207,6 +207,9 @@ void alter_table_statement::add_column(const schema& schema, const table& cf, sc
"because a collection with the same name and a different type has already been used in the past", column_name));
}
}
if (type->is_counter() && !schema.is_counter()) {
throw exceptions::configuration_exception(format("Cannot add a counter column ({}) in a non counter column family", column_name));
}
cfm.with_column(column_name.name(), type, is_static ? column_kind::static_column : column_kind::regular_column);
@@ -222,7 +225,7 @@ void alter_table_statement::add_column(const schema& schema, const table& cf, sc
schema_builder builder(view);
if (view->view_info()->include_all_columns()) {
builder.with_column(column_name.name(), type);
} else if (view->view_info()->base_non_pk_columns_in_view_pk().empty()) {
} else if (!view->view_info()->has_base_non_pk_columns_in_view_pk()) {
db::view::create_virtual_column(builder, column_name.name(), type);
}
view_updates.push_back(view_ptr(builder.build()));

View File

@@ -59,6 +59,7 @@
#include "db/timeout_clock.hh"
#include "db/consistency_level_validations.hh"
#include "database.hh"
#include "test/lib/select_statement_utils.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
bool is_system_keyspace(const sstring& name);
@@ -67,6 +68,8 @@ namespace cql3 {
namespace statements {
static constexpr int DEFAULT_INTERNAL_PAGING_SIZE = select_statement::DEFAULT_COUNT_PAGE_SIZE;
thread_local int internal_paging_size = DEFAULT_INTERNAL_PAGING_SIZE;
thread_local const lw_shared_ptr<const select_statement::parameters> select_statement::_default_parameters = make_lw_shared<select_statement::parameters>();
select_statement::parameters::parameters()
@@ -333,7 +336,7 @@ select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
const bool nonpaged_filtering = restrictions_need_filtering && page_size <= 0;
if (aggregate || nonpaged_filtering) {
page_size = DEFAULT_COUNT_PAGE_SIZE;
page_size = internal_paging_size;
}
auto key_ranges = _restrictions->get_partition_key_ranges(options);
@@ -530,13 +533,29 @@ indexed_table_select_statement::do_execute_base_query(
if (old_paging_state && concurrency == 1) {
auto base_pk = generate_base_key_from_index_pk<partition_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
auto row_ranges = command->slice.default_row_ranges();
if (old_paging_state->get_clustering_key() && _schema->clustering_key_size() > 0) {
auto base_ck = generate_base_key_from_index_pk<clustering_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
command->slice.set_range(*_schema, base_pk,
std::vector<query::clustering_range>{query::clustering_range::make_starting_with(range_bound<clustering_key>(base_ck, false))});
query::trim_clustering_row_ranges_to(*_schema, row_ranges, base_ck, false);
command->slice.set_range(*_schema, base_pk, row_ranges);
} else {
command->slice.set_range(*_schema, base_pk, std::vector<query::clustering_range>{query::clustering_range::make_open_ended_both_sides()});
// There is no clustering key in old_paging_state and/or no clustering key in
// _schema, therefore read an entire partition (whole clustering range).
//
// The only exception to applying no restrictions on clustering key
// is a case when we have a secondary index on the first column
// of clustering key. In such a case we should not read the
// entire clustering range - only a range in which first column
// of clustering key has the correct value.
//
// This means that we should not set a open_ended_both_sides
// clustering range on base_pk, instead intersect it with
// _row_ranges (which contains the restrictions neccessary for the
// case described above). The result of such intersection is just
// _row_ranges, which we explicity set on base_pk.
command->slice.set_range(*_schema, base_pk, row_ranges);
}
}
concurrency *= 2;
@@ -974,12 +993,16 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
if (aggregate) {
const bool restrictions_need_filtering = _restrictions->need_filtering();
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format()), std::make_unique<cql3::query_options>(cql3::query_options(options)),
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format(), *_group_by_cell_indices), std::make_unique<cql3::query_options>(cql3::query_options(options)),
[this, &options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] (cql3::selection::result_set_builder& builder, std::unique_ptr<cql3::query_options>& internal_options) {
// page size is set to the internal count page size, regardless of the user-provided value
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), DEFAULT_COUNT_PAGE_SIZE));
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), internal_paging_size));
return repeat([this, &builder, &options, &internal_options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] () {
auto consume_results = [this, &builder, &options, &internal_options, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
auto consume_results = [this, &builder, &options, &internal_options, &proxy, &state, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd, lw_shared_ptr<const service::pager::paging_state> paging_state) {
if (paging_state) {
paging_state = generate_view_paging_state_from_base_query_results(paging_state, results, proxy, state, options);
}
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
if (restrictions_need_filtering) {
_stats.filtered_rows_read_total += *results->row_count();
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection,
@@ -987,24 +1010,24 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
} else {
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection));
}
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
return stop_iteration(!has_more_pages);
};
if (whole_partitions || partition_slices) {
return find_index_partition_ranges(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (dht::partition_range_vector partition_ranges, lw_shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, paging_state)
.then([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
});
});
} else {
return find_index_clustering_rows(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (std::vector<primary_key> primary_keys, lw_shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, paging_state)
.then([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
});
});
}
@@ -1661,6 +1684,16 @@ std::vector<size_t> select_statement::prepare_group_by(const schema& schema, sel
}
future<> set_internal_paging_size(int paging_size) {
return seastar::smp::invoke_on_all([paging_size] {
internal_paging_size = paging_size;
});
}
future<> reset_internal_paging_size() {
return set_internal_paging_size(DEFAULT_INTERNAL_PAGING_SIZE);
}
}
namespace util {

View File

@@ -552,14 +552,14 @@ database::~database() {
}
void database::update_version(const utils::UUID& version) {
if (_version != version) {
if (_version.get() != version) {
_schema_change_count++;
}
_version = version;
_version.set(version);
}
const utils::UUID& database::get_version() const {
return _version;
return _version.get();
}
static future<>
@@ -767,7 +767,7 @@ future<> database::drop_column_family(const sstring& ks_name, const sstring& cf_
remove(*cf);
cf->clear_views();
auto& ks = find_keyspace(ks_name);
return when_all_succeed(cf->await_pending_writes(), cf->await_pending_reads()).then_unpack([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return cf->await_pending_ops().then([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return truncate(ks, *cf, std::move(tsf), snapshot).finally([this, cf] {
return cf->stop();
});
@@ -1851,7 +1851,11 @@ future<> database::truncate(const keyspace& ks, column_family& cf, timestamp_fun
// TODO: indexes.
// Note: since discard_sstables was changed to only count tables owned by this shard,
// we can get zero rp back. Changed assert, and ensure we save at least low_mark.
assert(low_mark <= rp || rp == db::replay_position());
// #6995 - the assert below was broken in c2c6c71 and remained so for many years.
// We nowadays do not flush tables with sstables but autosnapshot=false. This means
// the low_mark assertion does not hold, because we maybe/probably never got around to
// creating the sstables that would create them.
assert(!should_flush || low_mark <= rp || rp == db::replay_position());
rp = std::max(low_mark, rp);
return truncate_views(cf, truncated_at, should_flush).then([&cf, truncated_at, rp] {
// save_truncation_record() may actually fail after we cached the truncation time
@@ -1957,31 +1961,6 @@ future<> database::clear_snapshot(sstring tag, std::vector<sstring> keyspace_nam
});
}
future<utils::UUID> update_schema_version(distributed<service::storage_proxy>& proxy, schema_features features)
{
return db::schema_tables::calculate_schema_digest(proxy, features).then([&proxy] (utils::UUID uuid) {
return proxy.local().get_db().invoke_on_all([uuid] (database& db) {
db.update_version(uuid);
}).then([uuid] {
return db::system_keyspace::update_schema_version(uuid);
}).then([uuid] {
dblog.info("Schema version changed to {}", uuid);
return uuid;
});
});
}
future<> announce_schema_version(utils::UUID schema_version) {
return service::migration_manager::passive_announce(schema_version);
}
future<> update_schema_version_and_announce(distributed<service::storage_proxy>& proxy, schema_features features)
{
return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
return announce_schema_version(uuid);
});
}
std::ostream& operator<<(std::ostream& os, const user_types_metadata& m) {
os << "org.apache.cassandra.config.UTMetaData@" << &m;
return os;

View File

@@ -55,6 +55,7 @@
#include <limits>
#include <cstddef>
#include "schema_fwd.hh"
#include "db/view/view.hh"
#include "db/schema_features.hh"
#include "gms/feature.hh"
#include "timestamp.hh"
@@ -95,6 +96,7 @@
#include "user_types_metadata.hh"
#include "query_class_config.hh"
#include "absl-flat_hash_map.hh"
#include "utils/updateable_value.hh"
class cell_locker;
class cell_locker_stats;
@@ -539,6 +541,8 @@ private:
utils::phased_barrier _pending_reads_phaser;
// Corresponding phaser for in-progress streams
utils::phased_barrier _pending_streams_phaser;
// Corresponding phaser for in-progress flushes
utils::phased_barrier _pending_flushes_phaser;
// This field cashes the last truncation time for the table.
// The master resides in system.truncated table
@@ -901,7 +905,7 @@ public:
lw_shared_ptr<const sstable_list> get_sstables_including_compacted_undeleted() const;
const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const;
std::vector<sstables::shared_sstable> select_sstables(const dht::partition_range& range) const;
std::vector<sstables::shared_sstable> candidates_for_compaction() const;
std::vector<sstables::shared_sstable> non_staging_sstables() const;
std::vector<sstables::shared_sstable> sstables_need_rewrite() const;
size_t sstables_count() const;
std::vector<uint64_t> sstable_count_per_level() const;
@@ -984,6 +988,14 @@ public:
return _pending_streams_phaser.advance_and_await();
}
future<> await_pending_flushes() {
return _pending_flushes_phaser.advance_and_await();
}
future<> await_pending_ops() {
return when_all(await_pending_reads(), await_pending_writes(), await_pending_streams(), await_pending_flushes()).discard_result();
}
void add_or_update_view(view_ptr v);
void remove_view(view_ptr v);
void clear_views();
@@ -1008,8 +1020,9 @@ public:
return *_config.sstables_manager;
}
// Reader's schema must be the same as the base schema of each of the views.
future<> populate_views(
std::vector<view_ptr>,
std::vector<db::view::view_and_base>,
dht::token base_token,
flat_mutation_reader&&,
gc_clock::time_point);
@@ -1027,7 +1040,7 @@ private:
tracing::trace_state_ptr tr_state, reader_concurrency_semaphore& sem, const io_priority_class& io_priority, query::partition_slice::option_set custom_opts) const;
std::vector<view_ptr> affected_views(const schema_ptr& base, const mutation& update, gc_clock::time_point now) const;
future<> generate_and_propagate_view_updates(const schema_ptr& base,
std::vector<view_ptr>&& views,
std::vector<db::view::view_and_base>&& views,
mutation&& m,
flat_mutation_reader_opt existings,
tracing::trace_state_ptr tr_state,
@@ -1099,6 +1112,10 @@ flat_mutation_reader make_local_shard_sstable_reader(schema_ptr s,
mutation_reader::forwarding fwd_mr,
sstables::read_monitor_generator& monitor_generator = sstables::default_read_monitor_generator());
/// Read a range from the passed-in sstables.
///
/// The reader is unrestricted, but will account its resource usage on the
/// semaphore belonging to the passed-in permit.
flat_mutation_reader make_range_sstable_reader(schema_ptr s,
reader_permit permit,
lw_shared_ptr<sstables::sstable_set> sstables,
@@ -1110,6 +1127,21 @@ flat_mutation_reader make_range_sstable_reader(schema_ptr s,
mutation_reader::forwarding fwd_mr,
sstables::read_monitor_generator& monitor_generator = sstables::default_read_monitor_generator());
/// Read a range from the passed-in sstables.
///
/// The reader is restricted, that is it will wait for admission on the semaphore
/// belonging to the passed-in permit, before starting to read.
flat_mutation_reader make_restricted_range_sstable_reader(schema_ptr s,
reader_permit permit,
lw_shared_ptr<sstables::sstable_set> sstables,
const dht::partition_range& pr,
const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding fwd,
mutation_reader::forwarding fwd_mr,
sstables::read_monitor_generator& monitor_generator = sstables::default_read_monitor_generator());
class user_types_metadata;
class keyspace_metadata final {
@@ -1376,7 +1408,7 @@ private:
flat_hash_map<std::pair<sstring, sstring>, utils::UUID, utils::tuple_hash, string_pair_eq>;
ks_cf_to_uuid_t _ks_cf_to_uuid;
std::unique_ptr<db::commitlog> _commitlog;
utils::UUID _version;
utils::updateable_value_source<utils::UUID> _version;
uint32_t _schema_change_count = 0;
// compaction_manager object is referenced by all column families of a database.
std::unique_ptr<compaction_manager> _compaction_manager;
@@ -1447,6 +1479,7 @@ public:
void update_version(const utils::UUID& version);
const utils::UUID& get_version() const;
utils::observable<utils::UUID>& observable_schema_version() const { return _version.as_observable(); }
db::commitlog* commitlog() const {
return _commitlog.get();
@@ -1664,10 +1697,6 @@ future<> stop_database(sharded<database>& db);
flat_mutation_reader make_multishard_streaming_reader(distributed<database>& db, schema_ptr schema,
std::function<std::optional<dht::partition_range>()> range_generator);
future<utils::UUID> update_schema_version(distributed<service::storage_proxy>& proxy, db::schema_features);
future<> announce_schema_version(utils::UUID schema_version);
future<> update_schema_version_and_announce(distributed<service::storage_proxy>& proxy, db::schema_features);
bool is_internal_keyspace(const sstring& name);
#endif /* DATABASE_HH_ */

View File

@@ -290,7 +290,7 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
mutation m(schema, key);
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr());
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
});
};

View File

@@ -521,7 +521,7 @@ public:
_segment_manager->totals.total_size_on_disk -= size_on_disk();
_segment_manager->totals.total_size -= (size_on_disk() + _buffer.size_bytes());
_segment_manager->add_file_to_delete(_file_name, _desc);
} else {
} else if (_segment_manager->cfg.warn_about_segments_left_on_disk_after_shutdown) {
clogger.warn("Segment {} is dirty and is left on disk.", *this);
}
}

View File

@@ -137,6 +137,7 @@ public:
bool reuse_segments = true;
bool use_o_dsync = false;
bool warn_about_segments_left_on_disk_after_shutdown = true;
const db::extensions * extensions = nullptr;
};

View File

@@ -304,7 +304,7 @@ future<> db::commitlog_replayer::impl::process(stats* s, commitlog::buffer_and_r
mutation m(cf.schema(), fm.decorated_key(*cf.schema()));
converting_mutation_partition_applier v(cm, *cf.schema(), m.partition());
fm.partition().accept(cm, v);
return do_with(std::move(m), [&db, &cf] (mutation m) {
return do_with(std::move(m), [&db, &cf] (const mutation& m) {
return db.apply_in_memory(m, cf, db::rp_handle(), db::no_timeout);
});
} else {

View File

@@ -662,6 +662,15 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"\tpriority_string : GnuTLS priority string controlling TLS algorithms used/allowed.\n"
"\trequire_client_auth : (Default: false ) Enables or disables certificate authentication.\n"
"Related information: Client-to-node encryption")
, alternator_encryption_options(this, "alternator_encryption_options", value_status::Used, {/*none*/},
"When Alternator via HTTPS is enabled with alternator_https_port, where to take the key and certificate. The available options are:\n"
"\n"
"\tcertificate: (Default: conf/scylla.crt) The location of a PEM-encoded x509 certificate used to identify and encrypt the client/server communication.\n"
"\tkeyfile: (Default: conf/scylla.key) PEM Key file associated with certificate.\n"
"\n"
"The advanced settings are:\n"
"\n"
"\tpriority_string : GnuTLS priority string controlling TLS algorithms used/allowed.")
, ssl_storage_port(this, "ssl_storage_port", value_status::Used, 7001,
"The SSL port for encrypted communication. Unused unless enabled in encryption_options.")
, enable_in_memory_data_store(this, "enable_in_memory_data_store", value_status::Used, false, "Enable in memory mode (system tables are always persisted)")
@@ -681,7 +690,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, replace_address(this, "replace_address", value_status::Used, "", "The listen_address or broadcast_address of the dead node to replace. Same as -Dcassandra.replace_address.")
, replace_address_first_boot(this, "replace_address_first_boot", value_status::Used, "", "Like replace_address option, but if the node has been bootstrapped successfully it will be ignored. Same as -Dcassandra.replace_address_first_boot.")
, override_decommission(this, "override_decommission", value_status::Used, false, "Set true to force a decommissioned node to join the cluster")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, true, "Set true to use enable repair based node operations instead of streaming based")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, false, "Set true to use enable repair based node operations instead of streaming based")
, ring_delay_ms(this, "ring_delay_ms", value_status::Used, 30 * 1000, "Time a node waits to hear from other nodes before joining the ring in milliseconds. Same as -Dcassandra.ring_delay_ms in cassandra.")
, shadow_round_ms(this, "shadow_round_ms", value_status::Used, 300 * 1000, "The maximum gossip shadow round time. Can be used to reduce the gossip feature check time during node boot up.")
, fd_max_interval_ms(this, "fd_max_interval_ms", value_status::Used, 2 * 1000, "The maximum failure_detector interval time in milliseconds. Interval larger than the maximum will be ignored. Larger cluster may need to increase the default.")

View File

@@ -252,6 +252,7 @@ public:
named_value<uint32_t> permissions_cache_max_entries;
named_value<string_map> server_encryption_options;
named_value<string_map> client_encryption_options;
named_value<string_map> alternator_encryption_options;
named_value<uint32_t> ssl_storage_port;
named_value<bool> enable_in_memory_data_store;
named_value<bool> enable_cache;

View File

@@ -61,7 +61,8 @@ namespace db {
logging::logger cl_logger("consistency");
size_t quorum_for(const keyspace& ks) {
return (ks.get_replication_strategy().get_replication_factor() / 2) + 1;
size_t replication_factor = ks.get_replication_strategy().get_replication_factor();
return replication_factor ? (replication_factor / 2) + 1 : 0;
}
size_t local_quorum_for(const keyspace& ks, const sstring& dc) {
@@ -72,8 +73,8 @@ size_t local_quorum_for(const keyspace& ks, const sstring& dc) {
if (rs.get_type() == replication_strategy_type::network_topology) {
const network_topology_strategy* nrs =
static_cast<const network_topology_strategy*>(&rs);
return (nrs->get_replication_factor(dc) / 2) + 1;
size_t replication_factor = nrs->get_replication_factor(dc);
return replication_factor ? (replication_factor / 2) + 1 : 0;
}
return quorum_for(ks);

View File

@@ -224,7 +224,9 @@ future<> manager::end_point_hints_manager::stop(drain should_drain) noexcept {
with_lock(file_update_mutex(), [this] {
if (_hints_store_anchor) {
hints_store_ptr tmp = std::exchange(_hints_store_anchor, nullptr);
return tmp->shutdown().finally([tmp] {});
return tmp->shutdown().finally([tmp] {
return tmp->release();
}).finally([tmp] {});
}
return make_ready_future<>();
}).handle_exception([&eptr] (auto e) { eptr = std::move(e); }).get();
@@ -290,7 +292,7 @@ inline bool manager::have_ep_manager(ep_key_type ep) const noexcept {
}
bool manager::store_hint(ep_key_type ep, schema_ptr s, lw_shared_ptr<const frozen_mutation> fm, tracing::trace_state_ptr tr_state) noexcept {
if (stopping() || !started() || !can_hint_for(ep)) {
if (stopping() || draining_all() || !started() || !can_hint_for(ep)) {
manager_logger.trace("Can't store a hint to {}", ep);
++_stats.dropped;
return false;
@@ -326,6 +328,10 @@ future<db::commitlog> manager::end_point_hints_manager::add_store() noexcept {
// HH doesn't utilize the flow that benefits from reusing segments.
// Therefore let's simply disable it to avoid any possible confusion.
cfg.reuse_segments = false;
// HH leaves segments on disk after commitlog shutdown, and later reads
// them when commitlog is re-created. This is expected to happen regularly
// during standard HH workload, so no need to print a warning about it.
cfg.warn_about_segments_left_on_disk_after_shutdown = false;
return commitlog::create_commitlog(std::move(cfg)).then([this] (commitlog l) {
// add_store() is triggered every time hint files are forcefully flushed to I/O (every hints_flush_period).
@@ -352,7 +358,9 @@ future<> manager::end_point_hints_manager::flush_current_hints() noexcept {
return futurize_invoke([this] {
return with_lock(file_update_mutex(), [this]() -> future<> {
return get_or_load().then([] (hints_store_ptr cptr) {
return cptr->shutdown();
return cptr->shutdown().finally([cptr] {
return cptr->release();
}).finally([cptr] {});
}).then([this] {
// Un-hold the commitlog object. Since we are under the exclusive _file_update_mutex lock there are no
// other hints_store_ptr copies and this would destroy the commitlog shared value.
@@ -529,7 +537,7 @@ bool manager::check_dc_for(ep_key_type ep) const noexcept {
}
void manager::drain_for(gms::inet_address endpoint) {
if (stopping()) {
if (stopping() || draining_all()) {
return;
}
@@ -540,6 +548,7 @@ void manager::drain_for(gms::inet_address endpoint) {
return with_semaphore(drain_lock(), 1, [this, endpoint] {
return futurize_invoke([this, endpoint] () {
if (utils::fb_utilities::is_me(endpoint)) {
set_draining_all();
return parallel_for_each(_ep_managers, [] (auto& pair) {
return pair.second.stop(drain::yes).finally([&pair] {
return with_file_update_mutex(pair.second, [&pair] {
@@ -808,7 +817,7 @@ void manager::end_point_hints_manager::sender::send_hints_maybe() noexcept {
int replayed_segments_count = 0;
try {
while (replay_allowed() && have_segments()) {
while (replay_allowed() && have_segments() && can_send()) {
if (!send_one_file(*_segments_to_replay.begin())) {
break;
}

View File

@@ -424,12 +424,14 @@ public:
enum class state {
started, // hinting is currently allowed (start() call is complete)
replay_allowed, // replaying (hints sending) is allowed
draining_all, // hinting is not allowed - all ep managers are being stopped because this node is leaving the cluster
stopping // hinting is not allowed - stopping is in progress (stop() method has been called)
};
using state_set = enum_set<super_enum<state,
state::started,
state::replay_allowed,
state::draining_all,
state::stopping>>;
private:
@@ -690,6 +692,14 @@ private:
return _state.contains(state::replay_allowed);
}
void set_draining_all() noexcept {
_state.set(state::draining_all);
}
bool draining_all() noexcept {
return _state.contains(state::draining_all);
}
public:
ep_managers_map_type::iterator find_ep_manager(ep_key_type ep_key) noexcept {
return _ep_managers.find(ep_key);

View File

@@ -113,7 +113,7 @@ future<> cql_table_large_data_handler::record_large_cells(const sstables::sstabl
auto ck_str = key_to_str(*clustering_key, s);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, format("{} {}", ck_str, column_name), extra_fields, ck_str, column_name);
} else {
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, nullptr, column_name);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, data_value::make_null(utf8_type), column_name);
}
}
@@ -125,7 +125,7 @@ future<> cql_table_large_data_handler::record_large_rows(const sstables::sstable
std::string ck_str = key_to_str(*clustering_key, s);
return try_record("row", sst, partition_key, int64_t(row_size), "row", ck_str, extra_fields, ck_str);
} else {
return try_record("row", sst, partition_key, int64_t(row_size), "static row", "", extra_fields, nullptr);
return try_record("row", sst, partition_key, int64_t(row_size), "static row", "", extra_fields, data_value::make_null(utf8_type));
}
}

View File

@@ -111,27 +111,12 @@ public:
return make_ready_future<>();
}
future<> maybe_delete_large_data_entries(const schema& s, sstring filename, uint64_t data_size) {
future<> maybe_delete_large_data_entries(const schema& /*s*/, sstring /*filename*/, uint64_t /*data_size*/) {
assert(running());
future<> large_partitions = make_ready_future<>();
if (__builtin_expect(data_size > _partition_threshold_bytes, false)) {
large_partitions = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_PARTITIONS);
});
}
future<> large_rows = make_ready_future<>();
if (__builtin_expect(data_size > _row_threshold_bytes, false)) {
large_rows = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_ROWS);
});
}
future<> large_cells = make_ready_future<>();
if (__builtin_expect(data_size > _cell_threshold_bytes, false)) {
large_cells = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_CELLS);
});
}
return when_all(std::move(large_partitions), std::move(large_rows), std::move(large_cells)).discard_result();
// Deletion of large data entries is disabled due to #7668
// They will evetually expire based on the 30 days TTL.
return make_ready_future<>();
}
const large_data_handler::stats& stats() const { return _stats; }

View File

@@ -801,6 +801,19 @@ future<> merge_unlock() {
return smp::submit_to(0, [] { the_merge_lock.signal(); });
}
static
future<> update_schema_version_and_announce(distributed<service::storage_proxy>& proxy, schema_features features) {
return calculate_schema_digest(proxy, features).then([&proxy] (utils::UUID uuid) {
return db::system_keyspace::update_schema_version(uuid).then([&proxy, uuid] {
return proxy.local().get_db().invoke_on_all([uuid] (database& db) {
db.update_version(uuid);
});
}).then([uuid] {
slogger.info("Schema version changed to {}", uuid);
});
});
}
/**
* Merge remote schema in form of mutations with local and mutate ks/cf metadata objects
* (which also involves fs operations on add/drop ks/cf)
@@ -821,6 +834,14 @@ future<> merge_schema(distributed<service::storage_proxy>& proxy, gms::feature_s
});
}
future<> recalculate_schema_version(distributed<service::storage_proxy>& proxy, gms::feature_service& feat) {
return merge_lock().then([&proxy, &feat] {
return update_schema_version_and_announce(proxy, feat.cluster_schema_features());
}).finally([] {
return merge_unlock();
});
}
future<> merge_schema(distributed<service::storage_proxy>& proxy, std::vector<mutation> mutations, bool do_flush)
{
return merge_lock().then([&proxy, mutations = std::move(mutations), do_flush] () mutable {
@@ -2904,10 +2925,6 @@ future<> maybe_update_legacy_secondary_index_mv_schema(service::migration_manage
// format, where "token" is not marked as computed. Once we're sure that all indexes have their
// columns marked as computed (because they were either created on a node that supports computed
// columns or were fixed by this utility function), it's safe to remove this function altogether.
if (!db.features().cluster_supports_computed_columns()) {
return make_ready_future<>();
}
if (v->clustering_key_size() == 0) {
return make_ready_future<>();
}

View File

@@ -170,6 +170,13 @@ future<> merge_schema(distributed<service::storage_proxy>& proxy, gms::feature_s
future<> merge_schema(distributed<service::storage_proxy>& proxy, std::vector<mutation> mutations, bool do_flush);
// Recalculates the local schema version.
//
// It is safe to call concurrently with recalculate_schema_version() and merge_schema() in which case it
// is guaranteed that the schema version we end up with after all calls will reflect the most recent state
// of feature_service and schema tables.
future<> recalculate_schema_version(distributed<service::storage_proxy>& proxy, gms::feature_service& feat);
future<std::set<sstring>> merge_keyspaces(distributed<service::storage_proxy>& proxy, schema_result&& before, schema_result&& after);
std::vector<mutation> make_create_keyspace_mutations(lw_shared_ptr<keyspace_metadata> keyspace, api::timestamp_type timestamp, bool with_tables_and_types_and_functions = true);

View File

@@ -201,10 +201,10 @@ static future<std::vector<token_range>> get_local_ranges(database& db) {
// All queries will be on that table, where all entries are text and there's no notion of
// token ranges form the CQL point of view.
auto left_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.start() || r.start()->value() == dht::minimum_token();
return r.end() && (!r.start() || r.start()->value() == dht::minimum_token());
});
auto right_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.end() || r.start()->value() == dht::maximum_token();
return r.start() && (!r.end() || r.end()->value() == dht::maximum_token());
});
if (left_inf != right_inf && left_inf != ranges.end() && right_inf != ranges.end()) {
local_ranges.push_back(token_range{to_bytes(right_inf->start()), to_bytes(left_inf->end())});

View File

@@ -58,6 +58,7 @@
#include "cql3/util.hh"
#include "db/view/view.hh"
#include "db/view/view_builder.hh"
#include "db/view/view_updating_consumer.hh"
#include "db/system_keyspace_view_types.hh"
#include "db/system_keyspace.hh"
#include "frozen_mutation.hh"
@@ -136,17 +137,90 @@ const column_definition* view_info::view_column(const column_definition& base_de
return _schema.get_column_definition(base_def.name());
}
const std::vector<column_id>& view_info::base_non_pk_columns_in_view_pk() const {
void view_info::set_base_info(db::view::base_info_ptr base_info) {
_base_info = std::move(base_info);
}
// A constructor for a base info that can facilitate reads and writes from the materialized view.
db::view::base_dependent_view_info::base_dependent_view_info(schema_ptr base_schema, std::vector<column_id>&& base_non_pk_columns_in_view_pk)
: _base_schema{std::move(base_schema)}
, _base_non_pk_columns_in_view_pk{std::move(base_non_pk_columns_in_view_pk)}
, has_base_non_pk_columns_in_view_pk{!_base_non_pk_columns_in_view_pk.empty()}
, use_only_for_reads{false} {
}
// A constructor for a base info that can facilitate only reads from the materialized view.
db::view::base_dependent_view_info::base_dependent_view_info(bool has_base_non_pk_columns_in_view_pk)
: _base_schema{nullptr}
, has_base_non_pk_columns_in_view_pk{has_base_non_pk_columns_in_view_pk}
, use_only_for_reads{true} {
}
const std::vector<column_id>& db::view::base_dependent_view_info::base_non_pk_columns_in_view_pk() const {
if (use_only_for_reads) {
on_internal_error(vlogger, "base_non_pk_columns_in_view_pk(): operation unsupported when initialized only for view reads.");
}
return _base_non_pk_columns_in_view_pk;
}
void view_info::initialize_base_dependent_fields(const schema& base) {
const schema_ptr& db::view::base_dependent_view_info::base_schema() const {
if (use_only_for_reads) {
on_internal_error(vlogger, "base_schema(): operation unsupported when initialized only for view reads.");
}
return _base_schema;
}
db::view::base_info_ptr view_info::make_base_dependent_view_info(const schema& base) const {
std::vector<column_id> base_non_pk_columns_in_view_pk;
bool has_base_non_pk_columns_in_view_pk = false;
bool can_only_read_from_view = false;
for (auto&& view_col : boost::range::join(_schema.partition_key_columns(), _schema.clustering_key_columns())) {
if (view_col.is_computed()) {
// we are not going to find it in the base table...
continue;
}
auto* base_col = base.get_column_definition(view_col.name());
if (base_col && !base_col->is_primary_key()) {
_base_non_pk_columns_in_view_pk.push_back(base_col->id);
base_non_pk_columns_in_view_pk.push_back(base_col->id);
has_base_non_pk_columns_in_view_pk = true;
} else if (!base_col) {
// If we didn't find the column in the base column then it must have been deleted
// or not yet added (by alter command), this means it is for sure not a pk column
// in the base table. This can happen if the version of the base schema is not the
// one that the view was created with. Seting this schema as the base can't harm since
// if we got to such a situation then it means it is only going to be used for reading
// (computation of shadowable tombstones) and in that case the existence of such a column
// is the only thing that is of interest to us.
has_base_non_pk_columns_in_view_pk = true;
can_only_read_from_view = true;
// We can break the loop here since we have the info we wanted and the list
// of columns is not going to be reliable anyhow.
break;
}
}
if (can_only_read_from_view) {
return make_lw_shared<db::view::base_dependent_view_info>(has_base_non_pk_columns_in_view_pk);
} else {
return make_lw_shared<db::view::base_dependent_view_info>(base.shared_from_this(), std::move(base_non_pk_columns_in_view_pk));
}
}
bool view_info::has_base_non_pk_columns_in_view_pk() const {
// The base info is not always available, this is because
// the base info initialization is separate from the view
// info construction. If we are trying to get this info without
// initializing the base information it means that we have a
// schema integrity problem as the creator of owning view schema
// didn't make sure to initialize it with base information.
if (!_base_info) {
on_internal_error(vlogger, "Tried to perform a view query which is base info dependant without initializing it");
}
return _base_info->has_base_non_pk_columns_in_view_pk;
}
namespace db {
@@ -194,12 +268,12 @@ bool may_be_affected_by(const schema& base, const view_info& view, const dht::de
}
static bool update_requires_read_before_write(const schema& base,
const std::vector<view_ptr>& views,
const std::vector<view_and_base>& views,
const dht::decorated_key& key,
const rows_entry& update,
gc_clock::time_point now) {
for (auto&& v : views) {
view_info& vf = *v->view_info();
view_info& vf = *v.view->view_info();
if (may_be_affected_by(base, vf, key, update, now)) {
return true;
}
@@ -246,12 +320,14 @@ class view_updates final {
view_ptr _view;
const view_info& _view_info;
schema_ptr _base;
base_info_ptr _base_info;
std::unordered_map<partition_key, mutation_partition, partition_key::hashing, partition_key::equality> _updates;
public:
explicit view_updates(view_ptr view, schema_ptr base)
: _view(std::move(view))
explicit view_updates(view_and_base vab)
: _view(std::move(vab.view))
, _view_info(*_view->view_info())
, _base(std::move(base))
, _base(vab.base->base_schema())
, _base_info(vab.base)
, _updates(8, partition_key::hashing(*_view), partition_key::equality(*_view)) {
}
@@ -313,7 +389,7 @@ row_marker view_updates::compute_row_marker(const clustering_row& base_row) cons
// they share liveness information. It's true especially in the only case currently allowed by CQL,
// which assumes there's up to one non-pk column in the view key. It's also true in alternator,
// which does not carry TTL information.
const auto& col_ids = _view_info.base_non_pk_columns_in_view_pk();
const auto& col_ids = _base_info->base_non_pk_columns_in_view_pk();
if (!col_ids.empty()) {
auto& def = _base->regular_column_at(col_ids[0]);
// Note: multi-cell columns can't be part of the primary key.
@@ -544,7 +620,7 @@ void view_updates::delete_old_entry(const partition_key& base_key, const cluster
void view_updates::do_delete_old_entry(const partition_key& base_key, const clustering_row& existing, const clustering_row& update, gc_clock::time_point now) {
auto& r = get_view_row(base_key, existing);
const auto& col_ids = _view_info.base_non_pk_columns_in_view_pk();
const auto& col_ids = _base_info->base_non_pk_columns_in_view_pk();
if (!col_ids.empty()) {
// We delete the old row using a shadowable row tombstone, making sure that
// the tombstone deletes everything in the row (or it might still show up).
@@ -685,7 +761,7 @@ void view_updates::generate_update(
return;
}
const auto& col_ids = _view_info.base_non_pk_columns_in_view_pk();
const auto& col_ids = _base_info->base_non_pk_columns_in_view_pk();
if (col_ids.empty()) {
// The view key is necessarily the same pre and post update.
if (existing && existing->is_live(*_base)) {
@@ -940,12 +1016,17 @@ future<stop_iteration> view_update_builder::on_results() {
future<std::vector<frozen_mutation_and_schema>> generate_view_updates(
const schema_ptr& base,
std::vector<view_ptr>&& views_to_update,
std::vector<view_and_base>&& views_to_update,
flat_mutation_reader&& updates,
flat_mutation_reader_opt&& existings,
gc_clock::time_point now) {
auto vs = boost::copy_range<std::vector<view_updates>>(views_to_update | boost::adaptors::transformed([&] (auto&& v) {
return view_updates(std::move(v), base);
auto vs = boost::copy_range<std::vector<view_updates>>(views_to_update | boost::adaptors::transformed([&] (view_and_base v) {
if (base->version() != v.base->base_schema()->version()) {
on_internal_error(vlogger, format("Schema version used for view updates ({}) does not match the current"
" base schema version of the view ({}) for view {}.{} of {}.{}",
base->version(), v.base->base_schema()->version(), v.view->ks_name(), v.view->cf_name(), base->ks_name(), base->cf_name()));
}
return view_updates(std::move(v));
}));
auto builder = std::make_unique<view_update_builder>(base, std::move(vs), std::move(updates), std::move(existings), now);
auto f = builder->build();
@@ -955,7 +1036,7 @@ future<std::vector<frozen_mutation_and_schema>> generate_view_updates(
query::clustering_row_ranges calculate_affected_clustering_ranges(const schema& base,
const dht::decorated_key& key,
const mutation_partition& mp,
const std::vector<view_ptr>& views,
const std::vector<view_and_base>& views,
gc_clock::time_point now) {
std::vector<nonwrapping_range<clustering_key_prefix_view>> row_ranges;
std::vector<nonwrapping_range<clustering_key_prefix_view>> view_row_ranges;
@@ -963,11 +1044,11 @@ query::clustering_row_ranges calculate_affected_clustering_ranges(const schema&
if (mp.partition_tombstone() || !mp.row_tombstones().empty()) {
for (auto&& v : views) {
// FIXME: #2371
if (v->view_info()->select_statement().get_restrictions()->has_unrestricted_clustering_columns()) {
if (v.view->view_info()->select_statement().get_restrictions()->has_unrestricted_clustering_columns()) {
view_row_ranges.push_back(nonwrapping_range<clustering_key_prefix_view>::make_open_ended_both_sides());
break;
}
for (auto&& r : v->view_info()->partition_slice().default_row_ranges()) {
for (auto&& r : v.view->view_info()->partition_slice().default_row_ranges()) {
view_row_ranges.push_back(r.transform(std::mem_fn(&clustering_key_prefix::view)));
}
}
@@ -1210,25 +1291,61 @@ view_builder::view_builder(database& db, db::system_distributed_keyspace& sys_di
}
future<> view_builder::start(service::migration_manager& mm) {
_started = seastar::async([this, &mm] {
// Guard the whole startup routine with a semaphore,
// so that it's not intercepted by `on_drop_view`, `on_create_view`
// or `on_update_view` events.
auto units = get_units(_sem, 1).get0();
// Wait for schema agreement even if we're a seed node.
while (!mm.have_schema_agreement()) {
if (_as.abort_requested()) {
return;
_started = do_with(view_builder_init_state{}, [this, &mm] (view_builder_init_state& vbi) {
return seastar::async([this, &mm, &vbi] {
// Guard the whole startup routine with a semaphore,
// so that it's not intercepted by `on_drop_view`, `on_create_view`
// or `on_update_view` events.
auto units = get_units(_sem, 1).get0();
// Wait for schema agreement even if we're a seed node.
while (!mm.have_schema_agreement()) {
seastar::sleep_abortable(500ms, _as).get();
}
seastar::sleep(500ms).get();
}
auto built = system_keyspace::load_built_views().get0();
auto in_progress = system_keyspace::load_view_build_progress().get0();
calculate_shard_build_step(std::move(built), std::move(in_progress)).get();
_mnotifier.register_listener(this);
_current_step = _base_to_build_step.begin();
// Waited on indirectly in stop().
(void)_build_step.trigger();
auto built = system_keyspace::load_built_views().get0();
auto in_progress = system_keyspace::load_view_build_progress().get0();
setup_shard_build_step(vbi, std::move(built), std::move(in_progress));
}).then_wrapped([this] (future<>&& f) {
// All shards need to arrive at the same decisions on whether or not to
// restart a view build at some common token (reshard), and which token
// to restart at. So we need to wait until all shards have read the view
// build statuses before they can all proceed to make the (same) decision.
// If we don't synchronize here, a fast shard may make a decision, start
// building and finish a build step - before the slowest shard even read
// the view build information.
std::exception_ptr eptr;
if (f.failed()) {
eptr = f.get_exception();
}
return container().invoke_on(0, [eptr = std::move(eptr)] (view_builder& builder) {
// The &builder is alive, because it can only be destroyed in
// sharded<view_builder>::stop(), which, in turn, waits for all
// view_builder::stop()-s to finish, and each stop() waits for
// the shard's current future (called _started) to resolve.
if (!eptr) {
if (++builder._shards_finished_read == smp::count) {
builder._shards_finished_read_promise.set_value();
}
} else {
if (builder._shards_finished_read < smp::count) {
builder._shards_finished_read = smp::count;
builder._shards_finished_read_promise.set_exception(std::move(eptr));
}
}
return builder._shards_finished_read_promise.get_shared_future();
});
}).then([this, &vbi] {
return calculate_shard_build_step(vbi);
}).then([this] {
_mnotifier.register_listener(this);
_current_step = _base_to_build_step.begin();
// Waited on indirectly in stop().
(void)_build_step.trigger();
return make_ready_future<>();
});
}).handle_exception([] (std::exception_ptr eptr) {
vlogger.error("start failed: {}", eptr);
return make_ready_future<>();
});
return make_ready_future<>();
}
@@ -1236,7 +1353,7 @@ future<> view_builder::start(service::migration_manager& mm) {
future<> view_builder::stop() {
vlogger.info("Stopping view builder");
_as.request_abort();
return _started.finally([this] {
return _started.then([this] {
return _mnotifier.unregister_listener(this).then([this] {
return _sem.wait();
}).then([this] {
@@ -1370,12 +1487,12 @@ void view_builder::reshard(
}
}
future<> view_builder::calculate_shard_build_step(
void view_builder::setup_shard_build_step(
view_builder_init_state& vbi,
std::vector<system_keyspace::view_name> built,
std::vector<system_keyspace::view_build_progress> in_progress) {
// Shard 0 makes cleanup changes to the system tables, but none that could conflict
// with the other shards; everyone is thus able to proceed independently.
auto bookkeeping_ops = std::make_unique<std::vector<future<>>>();
auto base_table_exists = [this] (const view_ptr& view) {
// This is a safety check in case this node missed a create MV statement
// but got a drop table for the base, and another node didn't get the
@@ -1402,9 +1519,9 @@ future<> view_builder::calculate_shard_build_step(
// Fall-through
}
if (this_shard_id() == 0) {
bookkeeping_ops->push_back(_sys_dist_ks.remove_view(name.first, name.second));
bookkeeping_ops->push_back(system_keyspace::remove_built_view(name.first, name.second));
bookkeeping_ops->push_back(
vbi.bookkeeping_ops.push_back(_sys_dist_ks.remove_view(name.first, name.second));
vbi.bookkeeping_ops.push_back(system_keyspace::remove_built_view(name.first, name.second));
vbi.bookkeeping_ops.push_back(
system_keyspace::remove_view_build_progress_across_all_shards(
std::move(name.first),
std::move(name.second)));
@@ -1412,50 +1529,48 @@ future<> view_builder::calculate_shard_build_step(
return view_ptr(nullptr);
};
auto built_views = boost::copy_range<std::unordered_set<utils::UUID>>(built
vbi.built_views = boost::copy_range<std::unordered_set<utils::UUID>>(built
| boost::adaptors::transformed(maybe_fetch_view)
| boost::adaptors::filtered([] (const view_ptr& v) { return bool(v); })
| boost::adaptors::transformed([] (const view_ptr& v) { return v->id(); }));
std::vector<std::vector<view_build_status>> view_build_status_per_shard;
for (auto& [view_name, first_token, next_token_opt, cpu_id] : in_progress) {
if (auto view = maybe_fetch_view(view_name)) {
if (built_views.find(view->id()) != built_views.end()) {
if (vbi.built_views.find(view->id()) != vbi.built_views.end()) {
if (this_shard_id() == 0) {
auto f = _sys_dist_ks.finish_view_build(std::move(view_name.first), std::move(view_name.second)).then([view = std::move(view)] {
return system_keyspace::remove_view_build_progress_across_all_shards(view->cf_name(), view->ks_name());
});
bookkeeping_ops->push_back(std::move(f));
vbi.bookkeeping_ops.push_back(std::move(f));
}
continue;
}
view_build_status_per_shard.resize(std::max(view_build_status_per_shard.size(), size_t(cpu_id + 1)));
view_build_status_per_shard[cpu_id].emplace_back(view_build_status{
vbi.status_per_shard.resize(std::max(vbi.status_per_shard.size(), size_t(cpu_id + 1)));
vbi.status_per_shard[cpu_id].emplace_back(view_build_status{
std::move(view),
std::move(first_token),
std::move(next_token_opt)});
}
}
}
// All shards need to arrive at the same decisions on whether or not to
// restart a view build at some common token (reshard), and which token
// to restart at. So we need to wait until all shards have read the view
// build statuses before they can all proceed to make the (same) decision.
// If we don't synchronoize here, a fast shard may make a decision, start
// building and finish a build step - before the slowest shard even read
// the view build information.
container().invoke_on(0, [] (view_builder& builder) {
if (++builder._shards_finished_read == smp::count) {
builder._shards_finished_read_promise.set_value();
future<> view_builder::calculate_shard_build_step(view_builder_init_state& vbi) {
auto base_table_exists = [this] (const view_ptr& view) {
// This is a safety check in case this node missed a create MV statement
// but got a drop table for the base, and another node didn't get the
// drop notification and sent us the view schema.
try {
_db.find_schema(view->view_info()->base_id());
return true;
} catch (const no_such_column_family&) {
return false;
}
return builder._shards_finished_read_promise.get_shared_future();
}).get();
};
std::unordered_set<utils::UUID> loaded_views;
if (view_build_status_per_shard.size() != smp::count) {
reshard(std::move(view_build_status_per_shard), loaded_views);
} else if (!view_build_status_per_shard.empty()) {
for (auto& status : view_build_status_per_shard[this_shard_id()]) {
if (vbi.status_per_shard.size() != smp::count) {
reshard(std::move(vbi.status_per_shard), loaded_views);
} else if (!vbi.status_per_shard.empty()) {
for (auto& status : vbi.status_per_shard[this_shard_id()]) {
load_view_status(std::move(status), loaded_views);
}
}
@@ -1472,18 +1587,18 @@ future<> view_builder::calculate_shard_build_step(
auto all_views = _db.get_views();
auto is_new = [&] (const view_ptr& v) {
return base_table_exists(v) && loaded_views.find(v->id()) == loaded_views.end()
&& built_views.find(v->id()) == built_views.end();
&& vbi.built_views.find(v->id()) == vbi.built_views.end();
};
for (auto&& view : all_views | boost::adaptors::filtered(is_new)) {
bookkeeping_ops->push_back(add_new_view(view, get_or_create_build_step(view->view_info()->base_id())));
vbi.bookkeeping_ops.push_back(add_new_view(view, get_or_create_build_step(view->view_info()->base_id())));
}
for (auto& [_, build_step] : _base_to_build_step) {
initialize_reader_at_current_token(build_step);
}
auto f = seastar::when_all_succeed(bookkeeping_ops->begin(), bookkeeping_ops->end());
return f.handle_exception([this, bookkeeping_ops = std::move(bookkeeping_ops)] (std::exception_ptr ep) {
auto f = seastar::when_all_succeed(vbi.bookkeeping_ops.begin(), vbi.bookkeeping_ops.end());
return f.handle_exception([this] (std::exception_ptr ep) {
log_level severity = _as.abort_requested() ? log_level::warn : log_level::error;
vlogger.log(severity, "Failed to update materialized view bookkeeping ({}), continuing anyway.", ep);
});
@@ -1732,7 +1847,7 @@ public:
return stop_iteration::yes;
}
_fragments_memory_usage += cr.memory_usage(*_step.base->schema());
_fragments_memory_usage += cr.memory_usage(*_step.reader.schema());
_fragments.push_back(std::move(cr));
if (_fragments_memory_usage > batch_memory_max) {
// Although we have not yet completed the batch of base rows that
@@ -1754,10 +1869,14 @@ public:
_builder._as.check();
if (!_fragments.empty()) {
_fragments.push_front(partition_start(_step.current_key, tombstone()));
auto base_schema = _step.base->schema();
auto views = with_base_info_snapshot(_views_to_build);
auto reader = make_flat_mutation_reader_from_fragments(_step.reader.schema(), std::move(_fragments));
reader.upgrade_schema(base_schema);
_step.base->populate_views(
_views_to_build,
std::move(views),
_step.current_token(),
make_flat_mutation_reader_from_fragments(_step.base->schema(), std::move(_fragments)),
std::move(reader),
_now).get();
_fragments.clear();
_fragments_memory_usage = 0;
@@ -1909,5 +2028,54 @@ future<bool> check_needs_view_update_path(db::system_distributed_keyspace& sys_d
});
}
const size_t view_updating_consumer::buffer_size_soft_limit{1 * 1024 * 1024};
const size_t view_updating_consumer::buffer_size_hard_limit{2 * 1024 * 1024};
void view_updating_consumer::do_flush_buffer() {
_staging_reader_handle.pause();
if (_buffer.front().partition().empty()) {
// If we flushed mid-partition we can have an empty mutation if we
// flushed right before getting the end-of-partition fragment.
_buffer.pop_front();
}
while (!_buffer.empty()) {
try {
auto lock_holder = _view_update_pusher(std::move(_buffer.front())).get();
} catch (...) {
vlogger.warn("Failed to push replica updates for table {}.{}: {}", _schema->ks_name(), _schema->cf_name(), std::current_exception());
}
_buffer.pop_front();
}
_buffer_size = 0;
_m = nullptr;
}
void view_updating_consumer::maybe_flush_buffer_mid_partition() {
if (_buffer_size >= buffer_size_hard_limit) {
auto m = mutation(_schema, _m->decorated_key(), mutation_partition(_schema));
do_flush_buffer();
_buffer.emplace_back(std::move(m));
_m = &_buffer.back();
}
}
view_updating_consumer::view_updating_consumer(schema_ptr schema, table& table, std::vector<sstables::shared_sstable> excluded_sstables, const seastar::abort_source& as,
evictable_reader_handle& staging_reader_handle)
: view_updating_consumer(std::move(schema), as, staging_reader_handle,
[table = table.shared_from_this(), excluded_sstables = std::move(excluded_sstables)] (mutation m) mutable {
auto s = m.schema();
return table->stream_view_replica_updates(std::move(s), std::move(m), db::no_timeout, excluded_sstables);
})
{ }
std::vector<db::view::view_and_base> with_base_info_snapshot(std::vector<view_ptr> vs) {
return boost::copy_range<std::vector<db::view::view_and_base>>(vs | boost::adaptors::transformed([] (const view_ptr& v) {
return db::view::view_and_base{v, v->view_info()->base_info()};
}));
}
} // namespace view
} // namespace db

View File

@@ -43,6 +43,46 @@ namespace db {
namespace view {
// Part of the view description which depends on the base schema version.
//
// This structure may change even though the view schema doesn't change, so
// it needs to live outside view_ptr.
struct base_dependent_view_info {
private:
schema_ptr _base_schema;
// Id of a regular base table column included in the view's PK, if any.
// Scylla views only allow one such column, alternator can have up to two.
std::vector<column_id> _base_non_pk_columns_in_view_pk;
public:
const std::vector<column_id>& base_non_pk_columns_in_view_pk() const;
const schema_ptr& base_schema() const;
// Indicates if the view hase pk columns which are not part of the base
// pk, it seems that !base_non_pk_columns_in_view_pk.empty() is the same,
// but actually there are cases where we can compute this boolean without
// succeeding to reliably build the former.
const bool has_base_non_pk_columns_in_view_pk;
// If base_non_pk_columns_in_view_pk couldn't reliably be built, this base
// info can't be used for computing view updates, only for reading the materialized
// view.
const bool use_only_for_reads;
// A constructor for a base info that can facilitate reads and writes from the materialized view.
base_dependent_view_info(schema_ptr base_schema, std::vector<column_id>&& base_non_pk_columns_in_view_pk);
// A constructor for a base info that can facilitate only reads from the materialized view.
base_dependent_view_info(bool has_base_non_pk_columns_in_view_pk);
};
// Immutable snapshot of view's base-schema-dependent part.
using base_info_ptr = lw_shared_ptr<const base_dependent_view_info>;
// Snapshot of the view schema and its base-schema-dependent part.
struct view_and_base {
view_ptr view;
base_info_ptr base;
};
/**
* Whether the view filter considers the specified partition key.
*
@@ -94,7 +134,7 @@ bool clustering_prefix_matches(const schema& base, const partition_key& key, con
future<std::vector<frozen_mutation_and_schema>> generate_view_updates(
const schema_ptr& base,
std::vector<view_ptr>&& views_to_update,
std::vector<view_and_base>&& views_to_update,
flat_mutation_reader&& updates,
flat_mutation_reader_opt&& existings,
gc_clock::time_point now);
@@ -103,7 +143,7 @@ query::clustering_row_ranges calculate_affected_clustering_ranges(
const schema& base,
const dht::decorated_key& key,
const mutation_partition& mp,
const std::vector<view_ptr>& views,
const std::vector<view_and_base>& views,
gc_clock::time_point now);
struct wait_for_all_updates_tag {};
@@ -133,6 +173,13 @@ future<> mutate_MV(
*/
void create_virtual_column(schema_builder& builder, const bytes& name, const data_type& type);
/**
* Converts a collection of view schema snapshots into a collection of
* view_and_base objects, which are snapshots of both the view schema
* and the base-schema-dependent part of view description.
*/
std::vector<view_and_base> with_base_info_snapshot(std::vector<view_ptr>);
}
}

View File

@@ -165,6 +165,12 @@ class view_builder final : public service::migration_listener::only_view_notific
// Used for testing.
std::unordered_map<std::pair<sstring, sstring>, seastar::shared_promise<>, utils::tuple_hash> _build_notifiers;
struct view_builder_init_state {
std::vector<future<>> bookkeeping_ops;
std::vector<std::vector<view_build_status>> status_per_shard;
std::unordered_set<utils::UUID> built_views;
};
public:
// The view builder processes the base table in steps of batch_size rows.
// However, if the individual rows are large, there is no real need to
@@ -201,7 +207,8 @@ private:
void initialize_reader_at_current_token(build_step&);
void load_view_status(view_build_status, std::unordered_set<utils::UUID>&);
void reshard(std::vector<std::vector<view_build_status>>, std::unordered_set<utils::UUID>&);
future<> calculate_shard_build_step(std::vector<system_keyspace::view_name>, std::vector<system_keyspace::view_build_progress>);
void setup_shard_build_step(view_builder_init_state& vbi, std::vector<system_keyspace::view_name>, std::vector<system_keyspace::view_build_progress>);
future<> calculate_shard_build_step(view_builder_init_state& vbi);
future<> add_new_view(view_ptr, build_step&);
future<> do_build_step();
void execute(build_step&, exponential_backoff_retry);

View File

@@ -42,35 +42,52 @@ future<> view_update_generator::start() {
_pending_sstables.wait().get();
}
// To ensure we don't race with updates, move the entire content
// into a local variable.
auto sstables_with_tables = std::exchange(_sstables_with_tables, {});
// If we got here, we will process all tables we know about so far eventually so there
// is no starvation
for (auto& t : _sstables_with_tables | boost::adaptors::map_keys) {
for (auto table_it = sstables_with_tables.begin(); table_it != sstables_with_tables.end(); table_it = sstables_with_tables.erase(table_it)) {
auto& [t, sstables] = *table_it;
schema_ptr s = t->schema();
// Copy what we have so far so we don't miss new updates
auto sstables = std::exchange(_sstables_with_tables[t], {});
vug_logger.trace("Processing {}.{}: {} sstables", s->ks_name(), s->cf_name(), sstables.size());
const auto num_sstables = sstables.size();
try {
// temporary: need an sstable set for the flat mutation reader, but the
// compaction_descriptor takes a vector. Soon this will become a compaction
// so the transformation to the SSTable set will not be needed.
auto ssts = make_lw_shared(t->get_compaction_strategy().make_sstable_set(s));
// Exploit the fact that sstables in the staging directory
// are usually non-overlapping and use a partitioned set for
// the read.
auto ssts = make_lw_shared(sstables::make_partitioned_sstable_set(s, make_lw_shared<sstable_list>(sstable_list{}), false));
for (auto& sst : sstables) {
ssts->insert(sst);
}
flat_mutation_reader staging_sstable_reader = ::make_range_sstable_reader(s,
auto ms = mutation_source([this, ssts] (
schema_ptr s,
reader_permit permit,
const dht::partition_range& pr,
const query::partition_slice& ps,
const io_priority_class& pc,
tracing::trace_state_ptr ts,
streamed_mutation::forwarding fwd_ms,
mutation_reader::forwarding fwd_mr) {
return ::make_restricted_range_sstable_reader(s, std::move(permit), std::move(ssts), pr, ps, pc, std::move(ts), fwd_ms, fwd_mr);
});
auto [staging_sstable_reader, staging_sstable_reader_handle] = make_manually_paused_evictable_reader(
std::move(ms),
s,
_db.make_query_class_config().semaphore.make_permit(),
std::move(ssts),
query::full_partition_range,
s->full_slice(),
service::get_local_streaming_priority(),
nullptr,
::streamed_mutation::forwarding::no,
::mutation_reader::forwarding::no);
inject_failure("view_update_generator_consume_staging_sstable");
auto result = staging_sstable_reader.consume_in_thread(view_updating_consumer(s, *t, sstables, _as), db::no_timeout);
auto result = staging_sstable_reader.consume_in_thread(view_updating_consumer(s, *t, sstables, _as, staging_sstable_reader_handle), db::no_timeout);
if (result == stop_iteration::yes) {
break;
}
@@ -89,7 +106,7 @@ future<> view_update_generator::start() {
// Move from staging will be retried upon restart.
vug_logger.warn("Moving {} from staging failed: {}:{}. Ignoring...", s->ks_name(), s->cf_name(), std::current_exception());
}
_registration_sem.signal();
_registration_sem.signal(num_sstables);
}
// For each table, move the processed staging sstables into the table's base dir.
for (auto it = _sstables_to_move.begin(); it != _sstables_to_move.end(); ) {

View File

@@ -32,7 +32,10 @@
namespace db::view {
class view_update_generator {
public:
static constexpr size_t registration_queue_size = 5;
private:
database& _db;
seastar::abort_source _as;
future<> _started = make_ready_future<>();
@@ -51,6 +54,8 @@ public:
future<> start();
future<> stop();
future<> register_staging_sstable(sstables::shared_sstable sst, lw_shared_ptr<table> table);
ssize_t available_register_units() const { return _registration_sem.available_units(); }
private:
bool should_throttle() const;
};

View File

@@ -27,6 +27,8 @@
#include "sstables/shared_sstable.hh"
#include "database.hh"
class evictable_reader_handle;
namespace db::view {
/*
@@ -34,22 +36,46 @@ namespace db::view {
* It is expected to be run in seastar::async threaded context through consume_in_thread()
*/
class view_updating_consumer {
schema_ptr _schema;
lw_shared_ptr<table> _table;
std::vector<sstables::shared_sstable> _excluded_sstables;
const seastar::abort_source* _as;
std::optional<mutation> _m;
public:
view_updating_consumer(schema_ptr schema, table& table, std::vector<sstables::shared_sstable> excluded_sstables, const seastar::abort_source& as)
// We prefer flushing on partition boundaries, so at the end of a partition,
// we flush on reaching the soft limit. Otherwise we continue accumulating
// data. We flush mid-partition if we reach the hard limit.
static const size_t buffer_size_soft_limit;
static const size_t buffer_size_hard_limit;
private:
schema_ptr _schema;
const seastar::abort_source* _as;
evictable_reader_handle& _staging_reader_handle;
circular_buffer<mutation> _buffer;
mutation* _m{nullptr};
size_t _buffer_size{0};
noncopyable_function<future<row_locker::lock_holder>(mutation)> _view_update_pusher;
private:
void do_flush_buffer();
void maybe_flush_buffer_mid_partition();
public:
// Push updates with a custom pusher. Mainly for tests.
view_updating_consumer(schema_ptr schema, const seastar::abort_source& as, evictable_reader_handle& staging_reader_handle,
noncopyable_function<future<row_locker::lock_holder>(mutation)> view_update_pusher)
: _schema(std::move(schema))
, _table(table.shared_from_this())
, _excluded_sstables(std::move(excluded_sstables))
, _as(&as)
, _m()
, _staging_reader_handle(staging_reader_handle)
, _view_update_pusher(std::move(view_update_pusher))
{ }
view_updating_consumer(schema_ptr schema, table& table, std::vector<sstables::shared_sstable> excluded_sstables, const seastar::abort_source& as,
evictable_reader_handle& staging_reader_handle);
view_updating_consumer(view_updating_consumer&&) = default;
view_updating_consumer& operator=(view_updating_consumer&&) = delete;
void consume_new_partition(const dht::decorated_key& dk) {
_m = mutation(_schema, dk, mutation_partition(_schema));
_buffer.emplace_back(_schema, dk, mutation_partition(_schema));
_m = &_buffer.back();
}
void consume(tombstone t) {
@@ -60,7 +86,9 @@ public:
if (_as->abort_requested()) {
return stop_iteration::yes;
}
_buffer_size += sr.memory_usage(*_schema);
_m->partition().apply(*_schema, std::move(sr));
maybe_flush_buffer_mid_partition();
return stop_iteration::no;
}
@@ -68,7 +96,9 @@ public:
if (_as->abort_requested()) {
return stop_iteration::yes;
}
_buffer_size += cr.memory_usage(*_schema);
_m->partition().apply(*_schema, std::move(cr));
maybe_flush_buffer_mid_partition();
return stop_iteration::no;
}
@@ -76,14 +106,27 @@ public:
if (_as->abort_requested()) {
return stop_iteration::yes;
}
_buffer_size += rt.memory_usage(*_schema);
_m->partition().apply(*_schema, std::move(rt));
maybe_flush_buffer_mid_partition();
return stop_iteration::no;
}
// Expected to be run in seastar::async threaded context (consume_in_thread())
stop_iteration consume_end_of_partition();
stop_iteration consume_end_of_partition() {
if (_as->abort_requested()) {
return stop_iteration::yes;
}
if (_buffer_size >= buffer_size_soft_limit) {
do_flush_buffer();
}
return stop_iteration::no;
}
stop_iteration consume_end_of_stream() {
if (!_buffer.empty()) {
do_flush_buffer();
}
return stop_iteration(_as->abort_requested());
}
};

View File

@@ -59,7 +59,12 @@ future<> boot_strapper::bootstrap(streaming::stream_reason reason) {
return make_exception_future<>(std::runtime_error("Wrong stream_reason provided: it can only be replace or bootstrap"));
}
auto streamer = make_lw_shared<range_streamer>(_db, _token_metadata, _abort_source, _tokens, _address, description, reason);
streamer->add_source_filter(std::make_unique<range_streamer::failure_detector_source_filter>(gms::get_local_gossiper().get_unreachable_members()));
auto nodes_to_filter = gms::get_local_gossiper().get_unreachable_members();
if (reason == streaming::stream_reason::replace && _db.local().get_replace_address()) {
nodes_to_filter.insert(_db.local().get_replace_address().value());
}
blogger.debug("nodes_to_filter={}", nodes_to_filter);
streamer->add_source_filter(std::make_unique<range_streamer::failure_detector_source_filter>(nodes_to_filter));
auto keyspaces = make_lw_shared<std::vector<sstring>>(_db.local().get_non_system_keyspaces());
return do_for_each(*keyspaces, [this, keyspaces, streamer] (sstring& keyspace_name) {
auto& ks = _db.local().find_keyspace(keyspace_name);

View File

@@ -28,7 +28,8 @@ namespace query {
enum class digest_algorithm : uint8_t {
none = 0, // digest not required
MD5 = 1,
xxHash = 2,// default algorithm
legacy_xxHash_without_null_digest = 2,
xxHash = 3, // default algorithm
};
}

View File

@@ -36,7 +36,7 @@ struct noop_hasher {
};
class digester final {
std::variant<noop_hasher, md5_hasher, xx_hasher> _impl;
std::variant<noop_hasher, md5_hasher, xx_hasher, legacy_xx_hasher_without_null_digest> _impl;
public:
explicit digester(digest_algorithm algo) {
@@ -47,6 +47,9 @@ public:
case digest_algorithm::xxHash:
_impl = xx_hasher();
break;
case digest_algorithm::legacy_xxHash_without_null_digest:
_impl = legacy_xx_hasher_without_null_digest();
break;
case digest_algorithm ::none:
_impl = noop_hasher();
break;

View File

@@ -42,6 +42,11 @@ if __name__ == '__main__':
if node_exporter_p.exists() or (bindir_p() / 'prometheus-node_exporter').exists():
if force:
print('node_exporter already installed, reinstalling')
try:
node_exporter = systemd_unit('node-exporter.service')
node_exporter.stop()
except:
pass
else:
print('node_exporter already installed, you can use `--force` to force reinstallation')
sys.exit(1)

View File

@@ -61,7 +61,15 @@ def sh_command(*args):
return out
def get_url(path):
return urllib.request.urlopen(path).read().decode('utf-8')
# If server returns any error, like 403, or 500 urllib.request throws exception, which is not serializable.
# When multiprocessing routines fail to serialize it, it throws ambiguous serialization exception
# from get_json_from_url.
# In order to see legit error we catch it from the inside of process, covert to string and
# pass it as part of return value
try:
return 0, urllib.request.urlopen(path).read().decode('utf-8')
except Exception as exc:
return 1, str(exc)
def get_json_from_url(path):
pool = mp.Pool(processes=1)
@@ -71,13 +79,16 @@ def get_json_from_url(path):
# to enforce a wallclock timeout.
result = pool.apply_async(get_url, args=(path,))
try:
retval = result.get(timeout=5)
status, retval = result.get(timeout=5)
except mp.TimeoutError as err:
pool.terminate()
pool.join()
raise
if status == 1:
raise RuntimeError(f'Failed to get "{path}" due to the following error: {retval}')
return json.loads(retval)
def get_api(path):
return get_json_from_url("http://" + api_address + path)

View File

@@ -27,6 +27,7 @@ import glob
import shutil
import io
import stat
import distro
from scylla_util import *
interactive = False
@@ -385,6 +386,9 @@ if __name__ == '__main__':
if not stat.S_ISBLK(os.stat(dsk).st_mode):
print('{} is not block device'.format(dsk))
continue
if dsk in selected:
print(f'{dsk} is already added')
continue
selected.append(dsk)
devices.remove(dsk)
disks = ','.join(selected)
@@ -468,5 +472,10 @@ if __name__ == '__main__':
print('Please restart your machine before using ScyllaDB, as you have disabled')
print(' SELinux.')
if dist_name() == 'Ubuntu':
run('apt-get install -y hugepages')
if distro.id() == 'ubuntu':
# Ubuntu version is 20.04 or later
if int(distro.major_version()) >= 20:
hugepkg = 'libhugetlbfs-bin'
else:
hugepkg = 'hugepages'
run(f'apt-get install -y {hugepkg}')

View File

@@ -40,6 +40,10 @@ if __name__ == '__main__':
sys.exit(1)
memtotal = get_memtotal_gb()
if memtotal == 0:
print('memory too small: {} KB'.format(get_memtotal()))
sys.exit(1)
# Scylla document says 'swap size should be set to either total_mem/3 or
# 16GB - lower of the two', so we need to compare 16g vs memtotal/3 and
# choose lower one

View File

@@ -98,7 +98,7 @@ def curl(url, byte=False):
return res.read()
else:
return res.read().decode('utf-8')
except urllib.error.HTTPError:
except urllib.error.URLError:
logging.warn("Failed to grab %s..." % url)
time.sleep(5)
retries += 1
@@ -133,6 +133,8 @@ class aws_instance:
raise Exception("found more than one disk mounted at root'".format(root_dev_candidates))
root_dev = root_dev_candidates[0].device
if root_dev == '/dev/root':
root_dev = run('findmnt -n -o SOURCE /', shell=True, check=True, capture_output=True, encoding='utf-8').stdout.strip()
nvmes_present = list(filter(nvme_re.match, os.listdir("/dev")))
return {"root": [ root_dev ], "ephemeral": [ x for x in nvmes_present if not root_dev.startswith(os.path.join("/dev/", x)) ] }
@@ -184,7 +186,7 @@ class aws_instance:
instance_size = self.instance_size()
if instance_class in ['c3', 'c4', 'd2', 'i2', 'r3']:
return 'ixgbevf'
if instance_class in ['a1', 'c5', 'c5d', 'f1', 'g3', 'g4', 'h1', 'i3', 'i3en', 'inf1', 'm5', 'm5a', 'm5ad', 'm5d', 'm5dn', 'm5n', 'm6g', 'p2', 'p3', 'r4', 'r5', 'r5a', 'r5ad', 'r5d', 'r5dn', 'r5n', 't3', 't3a', 'u-6tb1', 'u-9tb1', 'u-12tb1', 'u-18tn1', 'u-24tb1', 'x1', 'x1e', 'z1d']:
if instance_class in ['a1', 'c5', 'c5a', 'c5d', 'c5n', 'c6g', 'c6gd', 'f1', 'g3', 'g4', 'h1', 'i3', 'i3en', 'inf1', 'm5', 'm5a', 'm5ad', 'm5d', 'm5dn', 'm5n', 'm6g', 'm6gd', 'p2', 'p3', 'r4', 'r5', 'r5a', 'r5ad', 'r5d', 'r5dn', 'r5n', 't3', 't3a', 'u-6tb1', 'u-9tb1', 'u-12tb1', 'u-18tn1', 'u-24tb1', 'x1', 'x1e', 'z1d']:
return 'ena'
if instance_class == 'm4':
if instance_size == '16xlarge':
@@ -221,7 +223,7 @@ class aws_instance:
def ebs_disks(self):
"""Returns all EBS disks"""
return set(self._disks["ephemeral"])
return set(self._disks["ebs"])
def public_ipv4(self):
"""Returns the public IPv4 address of this instance"""
@@ -329,9 +331,7 @@ class scylla_cpuinfo:
return len(self._cpu_data["system"])
# When a CLI tool is not installed, use relocatable CLI tool provided by Scylla
scylla_env = os.environ.copy()
scylla_env['PATH'] = '{}:{}'.format(scylla_env['PATH'], scyllabindir())
def run(cmd, shell=False, silent=False, exception=True):
stdout = subprocess.DEVNULL if silent else None
@@ -446,6 +446,19 @@ def dist_ver():
return distro.version()
SYSTEM_PARTITION_UUIDS = [
'21686148-6449-6e6f-744e-656564454649', # BIOS boot partition
'c12a7328-f81f-11d2-ba4b-00a0c93ec93b', # EFI system partition
'024dee41-33e7-11d3-9d69-0008c781f39f' # MBR partition scheme
]
def get_partition_uuid(dev):
return out(f'lsblk -n -oPARTTYPE {dev}')
def is_system_partition(dev):
uuid = get_partition_uuid(dev)
return (uuid in SYSTEM_PARTITION_UUIDS)
def is_unused_disk(dev):
# dev is not in /sys/class/block/, like /dev/nvme[0-9]+
if not os.path.isdir('/sys/class/block/{dev}'.format(dev=dev.replace('/dev/', ''))):
@@ -453,7 +466,8 @@ def is_unused_disk(dev):
try:
fd = os.open(dev, os.O_EXCL)
os.close(fd)
return True
# dev is not reserved for system
return not is_system_partition(dev)
except OSError:
return False

View File

@@ -0,0 +1,4 @@
# allocate enough inotify instances for large machines
# each tls instance needs 1 inotify instance, and there can be
# multiple tls instances per shard.
fs.inotify.max_user_instances = 1200

View File

@@ -39,6 +39,7 @@ override_dh_strip:
# The binaries (ethtool...patchelf) don't pass dh_strip after going through patchelf. Since they are
# already stripped, nothing is lost if we exclude them, so that's what we do.
dh_strip -Xlibprotobuf.so.15 -Xld.so -Xethtool -Xgawk -Xgzip -Xhwloc-calc -Xhwloc-distrib -Xifconfig -Xlscpu -Xnetstat -Xpatchelf --dbg-package=$(product)-server-dbg
find $(CURDIR)/debian/$(product)-server-dbg/usr/lib/debug/.build-id/ -name "*.debug" -exec objcopy --decompress-debug-sections {} \;
override_dh_makeshlibs:

View File

@@ -11,6 +11,7 @@ else
sysctl -p/usr/lib/sysctl.d/99-scylla-sched.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-aio.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-vm.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-inotify.conf || :
fi
#DEBHELPER#

View File

@@ -5,8 +5,8 @@ MAINTAINER Avi Kivity <avi@cloudius-systems.com>
ENV container docker
# The SCYLLA_REPO_URL argument specifies the URL to the RPM repository this Docker image uses to install Scylla. The default value is the Scylla's unstable RPM repository, which contains the daily build.
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/master/latest/scylla.repo
ARG VERSION=666.development
ARG SCYLLA_REPO_URL=http://downloads.scylladb.com/rpm/unstable/centos/scylla-4.2/latest/scylla.repo
ARG VERSION=4.2
ADD scylla_bashrc /scylla_bashrc

View File

@@ -129,10 +129,9 @@ rm -rf $RPM_BUILD_ROOT
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla-housekeeping
%ghost /etc/systemd/system/scylla-helper.slice.d/
%ghost /etc/systemd/system/scylla-helper.slice.d/memory.conf
%ghost /etc/systemd/system/scylla-server.service.d/
%ghost /etc/systemd/system/scylla-server.service.d/capabilities.conf
%ghost /etc/systemd/system/scylla-server.service.d/mounts.conf
%ghost /etc/systemd/system/scylla-server.service.d/dependencies.conf
/etc/systemd/system/scylla-server.service.d/dependencies.conf
%ghost /etc/systemd/system/var-lib-systemd-coredump.mount
%ghost /etc/systemd/system/scylla-cpupower.service
@@ -189,6 +188,8 @@ Summary: Scylla configuration package for the Linux kernel
License: AGPLv3
URL: http://www.scylladb.com/
Requires: kmod
# tuned overwrites our sysctl settings
Obsoletes: tuned
%description kernel-conf
This package contains Linux kernel configuration changes for the Scylla database. Install this package
@@ -200,6 +201,7 @@ if Scylla is the main application on your server and you wish to optimize its la
/usr/lib/systemd/systemd-sysctl 99-scylla-sched.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-aio.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-vm.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-inotify.conf >/dev/null 2>&1 || :
%files kernel-conf
%defattr(-,root,root)

View File

@@ -25,6 +25,15 @@ By default, Scylla listens on this port on all network interfaces.
To listen only on a specific interface, pass also an "`alternator-address`"
option.
In addition to (or instead of) serving HTTP requests on `alternator-port`,
Scylla can accept DynamoDB API requests over HTTPS (encrypted), on the port
specified by `alternator-https-port`. As usual for HTTPS servers, the
operator must specify certificate and key files. By default these should
be placed in `/etc/scylla/scylla.crt` and `/etc/scylla/scylla.key`, but
these default locations can overridden by specifying
`--alternator-encryption-options keyfile="..."` and
`--alternator-encryption-options certificate="..."`.
As we explain below in the "Write isolation policies", Alternator has
four different choices for the implementation of writes, each with
different advantages. You should consider which of the options makes

View File

@@ -137,10 +137,9 @@ TODO: is there an SSL version of Thrift?
# DynamoDB client protocol
Scylla also supports, as an experimental feature, Amazon's DynamoDB API.
The DynamoDB API is a JSON over HTTP (unencrypted) or HTTPS (encrypted)
protocol. Because Scylla's support for this protocol is experimental,
it is not turned on by default, and must be turned on manually by setting
Scylla also supports Amazon's DynamoDB API. The DynamoDB API is a JSON over
HTTP (unencrypted) or HTTPS (encrypted) protocol. Support for this protocol
is not turned on by default, and must be turned on manually by setting
the `alternator_port` and/or `alternator_https_port` configuration option.
"Alternator" is the codename of Scylla's DynamoDB API support, and is
documented in more detail in [alternator.md](alternator/alternator.md).
@@ -153,6 +152,13 @@ There is also an `alternator_address` configuration option to set the IP
address (and therefore network interface) on which Scylla should listen
for the DynamoDB protocol. This address defaults to 0.0.0.0.
When the HTTPS-based protocol is enabled, the server also needs to know
the certificate and key files to use. The default locations of these files
are `/etc/scylla/scylla.crt` and `/etc/scylla/scylla.key` respectively, but
can be overridden by specifying in `alternator_encryption_options` the
`keyfile` and `certificate` options. For example,
`--alternator-encryption-options keyfile="..."`.
# Redis client protocol
Scylla also has partial and experimental support for the Redis API.

View File

@@ -468,6 +468,9 @@ public:
size_t buffer_size() const {
return _impl->buffer_size();
}
const circular_buffer<mutation_fragment>& buffer() const {
return _impl->buffer();
}
// Detach the internal buffer of the reader.
// Roughly equivalent to depleting it by calling pop_mutation_fragment()
// until is_buffer_empty() returns true.

View File

@@ -141,6 +141,7 @@ extern const std::string_view HINTED_HANDOFF_SEPARATE_CONNECTION;
extern const std::string_view LWT;
extern const std::string_view PER_TABLE_PARTITIONERS;
extern const std::string_view PER_TABLE_CACHING;
extern const std::string_view DIGEST_FOR_NULL_VALUES;
}

View File

@@ -56,6 +56,7 @@ constexpr std::string_view features::HINTED_HANDOFF_SEPARATE_CONNECTION = "HINTE
constexpr std::string_view features::LWT = "LWT";
constexpr std::string_view features::PER_TABLE_PARTITIONERS = "PER_TABLE_PARTITIONERS";
constexpr std::string_view features::PER_TABLE_CACHING = "PER_TABLE_CACHING";
constexpr std::string_view features::DIGEST_FOR_NULL_VALUES = "DIGEST_FOR_NULL_VALUES";
static logging::logger logger("features");
@@ -90,8 +91,9 @@ feature_service::feature_service(feature_config cfg) : _config(cfg)
, _hinted_handoff_separate_connection(*this, features::HINTED_HANDOFF_SEPARATE_CONNECTION)
, _lwt_feature(*this, features::LWT)
, _per_table_partitioners_feature(*this, features::PER_TABLE_PARTITIONERS)
, _per_table_caching_feature(*this, features::PER_TABLE_CACHING) {
}
, _per_table_caching_feature(*this, features::PER_TABLE_CACHING)
, _digest_for_null_values_feature(*this, features::DIGEST_FOR_NULL_VALUES)
{}
feature_config feature_config_from_db_config(db::config& cfg, std::set<sstring> disabled) {
feature_config fcfg;
@@ -179,6 +181,7 @@ std::set<std::string_view> feature_service::known_feature_set() {
gms::features::MC_SSTABLE,
gms::features::UDF,
gms::features::CDC,
gms::features::DIGEST_FOR_NULL_VALUES,
};
for (const sstring& s : _config._disabled_features) {
@@ -269,6 +272,7 @@ void feature_service::enable(const std::set<std::string_view>& list) {
std::ref(_lwt_feature),
std::ref(_per_table_partitioners_feature),
std::ref(_per_table_caching_feature),
std::ref(_digest_for_null_values_feature),
})
{
if (list.count(f.name())) {

View File

@@ -103,6 +103,7 @@ private:
gms::feature _lwt_feature;
gms::feature _per_table_partitioners_feature;
gms::feature _per_table_caching_feature;
gms::feature _digest_for_null_values_feature;
public:
bool cluster_supports_range_tombstones() const {
@@ -177,6 +178,10 @@ public:
return _per_table_caching_feature;
}
const feature& cluster_supports_digest_for_null_values() const {
return _digest_for_null_values_feature;
}
bool cluster_supports_row_level_repair() const {
return bool(_row_level_repair_feature);
}

View File

@@ -428,6 +428,7 @@ future<> gossiper::handle_shutdown_msg(inet_address from) {
return make_ready_future<>();
}
return seastar::async([this, from] {
auto permit = this->lock_endpoint(from).get0();
this->mark_as_shutdown(from);
});
}

View File

@@ -98,6 +98,7 @@ fedora_packages=(
debhelper
fakeroot
file
dpkg-dev
)
centos_packages=(

View File

@@ -132,6 +132,7 @@ relocate_python3() {
cp "$script" "$relocateddir"
cat > "$install"<<EOF
#!/usr/bin/env bash
export LC_ALL=en_US.UTF-8
x="\$(readlink -f "\$0")"
b="\$(basename "\$x")"
d="\$(dirname "\$x")"
@@ -143,7 +144,7 @@ DEBIAN_SSL_CERT_FILE="/etc/ssl/certs/ca-certificates.crt"
if [ -f "\${DEBIAN_SSL_CERT_FILE}" ]; then
c=\${DEBIAN_SSL_CERT_FILE}
fi
PYTHONPATH="\${d}:\${d}/libexec:\$PYTHONPATH" PATH="\${d}/$pythonpath:\${PATH}" SSL_CERT_FILE="\${c}" exec -a "\$0" "\${d}/libexec/\${b}" "\$@"
PYTHONPATH="\${d}:\${d}/libexec:\$PYTHONPATH" PATH="\${d}/../bin:\${d}/$pythonpath:\${PATH}" SSL_CERT_FILE="\${c}" exec -a "\$0" "\${d}/libexec/\${b}" "\$@"
EOF
chmod +x "$install"
}
@@ -378,5 +379,9 @@ elif ! $packaging; then
chown -R scylla:scylla $rdata
chown -R scylla:scylla $rhkdata
for file in dist/common/sysctl.d/*.conf; do
bn=$(basename "$file")
sysctl -p "$rusr"/lib/sysctl.d/"$bn"
done
$rprefix/scripts/scylla_post_install.sh
fi

View File

@@ -168,15 +168,33 @@ insert_token_range_to_sorted_container_while_unwrapping(
dht::token_range_vector
abstract_replication_strategy::get_ranges(inet_address ep) const {
return get_ranges(ep, _token_metadata);
return do_get_ranges(ep, _token_metadata, false);
}
dht::token_range_vector
abstract_replication_strategy::get_ranges_in_thread(inet_address ep) const {
return do_get_ranges(ep, _token_metadata, true);
}
dht::token_range_vector
abstract_replication_strategy::get_ranges(inet_address ep, token_metadata& tm) const {
return do_get_ranges(ep, tm, false);
}
dht::token_range_vector
abstract_replication_strategy::get_ranges_in_thread(inet_address ep, token_metadata& tm) const {
return do_get_ranges(ep, tm, true);
}
dht::token_range_vector
abstract_replication_strategy::do_get_ranges(inet_address ep, token_metadata& tm, bool can_yield) const {
dht::token_range_vector ret;
auto prev_tok = tm.sorted_tokens().back();
for (auto tok : tm.sorted_tokens()) {
for (inet_address a : calculate_natural_endpoints(tok, tm)) {
if (can_yield) {
seastar::thread::maybe_yield();
}
if (a == ep) {
insert_token_range_to_sorted_container_while_unwrapping(prev_tok, tok, ret);
break;

View File

@@ -113,10 +113,15 @@ public:
// It the analogue of Origin's getAddressRanges().get(endpoint).
// This function is not efficient, and not meant for the fast path.
dht::token_range_vector get_ranges(inet_address ep) const;
dht::token_range_vector get_ranges_in_thread(inet_address ep) const;
// Use the token_metadata provided by the caller instead of _token_metadata
dht::token_range_vector get_ranges(inet_address ep, token_metadata& tm) const;
dht::token_range_vector get_ranges_in_thread(inet_address ep, token_metadata& tm) const;
private:
dht::token_range_vector do_get_ranges(inet_address ep, token_metadata& tm, bool can_yield) const;
public:
// get_primary_ranges() returns the list of "primary ranges" for the given
// endpoint. "Primary ranges" are the ranges that the node is responsible
// for storing replica primarily, which means this is the first node

10
lua.cc
View File

@@ -262,14 +262,12 @@ static auto visit_lua_raw_value(lua_State* l, int index, Func&& f) {
template <typename Func>
static auto visit_decimal(const big_decimal &v, Func&& f) {
boost::multiprecision::cpp_int ten(10);
const auto& dividend = v.unscaled_value();
auto divisor = boost::multiprecision::pow(ten, v.scale());
boost::multiprecision::cpp_rational r = v.as_rational();
const boost::multiprecision::cpp_int& dividend = numerator(r);
const boost::multiprecision::cpp_int& divisor = denominator(r);
if (dividend % divisor == 0) {
return f(utils::multiprecision_int(boost::multiprecision::cpp_int(dividend/divisor)));
return f(utils::multiprecision_int(dividend/divisor));
}
boost::multiprecision::cpp_rational r = dividend;
r /= divisor;
return f(r.convert_to<double>());
}

42
main.cc
View File

@@ -830,6 +830,7 @@ int main(int ac, char** av) {
storage_proxy_smp_service_group_config.max_nonlocal_requests = 5000;
spcfg.read_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get0();
spcfg.write_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get0();
spcfg.hints_write_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get0();
spcfg.write_ack_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get0();
static db::view::node_update_backlog node_backlog(smp::count, 10ms);
scheduling_group_key_config storage_proxy_stats_cfg =
@@ -967,12 +968,16 @@ int main(int ac, char** av) {
mm.init_messaging_service();
}).get();
supervisor::notify("initializing storage proxy RPC verbs");
proxy.invoke_on_all([] (service::storage_proxy& p) {
p.init_messaging_service();
}).get();
proxy.invoke_on_all(&service::storage_proxy::init_messaging_service).get();
auto stop_proxy_handlers = defer_verbose_shutdown("storage proxy RPC verbs", [&proxy] {
proxy.invoke_on_all(&service::storage_proxy::uninit_messaging_service).get();
});
supervisor::notify("starting streaming service");
streaming::stream_session::init_streaming_service(db, sys_dist_ks, view_update_generator).get();
auto stop_streaming_service = defer_verbose_shutdown("streaming service", [] {
streaming::stream_session::uninit_streaming_service().get();
});
api::set_server_stream_manager(ctx).get();
supervisor::notify("starting hinted handoff manager");
@@ -1005,6 +1010,9 @@ int main(int ac, char** av) {
rs.stop().get();
});
repair_init_messaging_service_handler(rs, sys_dist_ks, view_update_generator).get();
auto stop_repair_messages = defer_verbose_shutdown("repair message handlers", [] {
repair_uninit_messaging_service_handler().get();
});
supervisor::notify("starting storage service", true);
auto& ss = service::get_local_storage_service();
ss.init_messaging_service_part().get();
@@ -1196,22 +1204,30 @@ int main(int ac, char** av) {
std::optional<uint16_t> alternator_https_port;
std::optional<tls::credentials_builder> creds;
if (cfg->alternator_https_port()) {
creds.emplace();
alternator_https_port = cfg->alternator_https_port();
creds->set_dh_level(tls::dh_params::level::MEDIUM);
creds->set_x509_key_file(cert, key, tls::x509_crt_format::PEM).get();
if (trust_store.empty()) {
creds->set_system_trust().get();
} else {
creds->set_x509_trust_file(trust_store, tls::x509_crt_format::PEM).get();
creds.emplace();
auto opts = cfg->alternator_encryption_options();
if (opts.empty()) {
// Earlier versions mistakenly configured Alternator's
// HTTPS parameters via the "server_encryption_option"
// configuration parameter. We *temporarily* continue
// to allow this, for backward compatibility.
opts = cfg->server_encryption_options();
if (!opts.empty()) {
startlog.warn("Setting server_encryption_options to configure "
"Alternator's HTTPS encryption is deprecated. Please "
"switch to setting alternator_encryption_options instead.");
}
}
creds->set_dh_level(tls::dh_params::level::MEDIUM);
auto cert = get_or_default(opts, "certificate", db::config::get_conf_sub("scylla.crt").string());
auto key = get_or_default(opts, "keyfile", db::config::get_conf_sub("scylla.key").string());
creds->set_x509_key_file(cert, key, tls::x509_crt_format::PEM).get();
auto prio = get_or_default(opts, "priority_string", sstring());
creds->set_priority_string(db::config::default_tls_priority);
if (!prio.empty()) {
creds->set_priority_string(prio);
}
if (clauth) {
creds->set_client_auth(seastar::tls::client_auth::REQUIRE);
}
}
bool alternator_enforce_authorization = cfg->alternator_enforce_authorization();
with_scheduling_group(dbcfg.statement_scheduling_group,

View File

@@ -572,7 +572,12 @@ messaging_service::initial_scheduling_info() const {
scheduling_group
messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
// We are not using get_rpc_client_idx() because it figures out the client
// index based on the current scheduling group, which is relevant when
// selecting the right client for sending a message, but is not relevant
// when registering handlers.
const auto idx = s_rpc_client_idx_table[static_cast<size_t>(verb)];
return _scheduling_info_for_connection_index[idx].sched_group;
}
scheduling_group
@@ -791,6 +796,10 @@ void messaging_service::register_stream_mutation_fragments(std::function<future<
register_handler(this, messaging_verb::STREAM_MUTATION_FRAGMENTS, std::move(func));
}
future<> messaging_service::unregister_stream_mutation_fragments() {
return unregister_handler(messaging_verb::STREAM_MUTATION_FRAGMENTS);
}
template<class SinkType, class SourceType>
future<rpc::sink<SinkType>, rpc::source<SourceType>>
do_make_sink_source(messaging_verb verb, uint32_t repair_meta_id, shared_ptr<messaging_service::rpc_protocol_client_wrapper> rpc_client, std::unique_ptr<messaging_service::rpc_protocol_wrapper>& rpc) {
@@ -822,6 +831,9 @@ rpc::sink<repair_row_on_wire_with_cmd> messaging_service::make_sink_for_repair_g
void messaging_service::register_repair_get_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_hash_with_cmd> source)>&& func) {
register_handler(this, messaging_verb::REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM, std::move(func));
}
future<> messaging_service::unregister_repair_get_row_diff_with_rpc_stream() {
return unregister_handler(messaging_verb::REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM);
}
// Wrapper for REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
future<rpc::sink<repair_row_on_wire_with_cmd>, rpc::source<repair_stream_cmd>>
@@ -841,6 +853,9 @@ rpc::sink<repair_stream_cmd> messaging_service::make_sink_for_repair_put_row_dif
void messaging_service::register_repair_put_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_stream_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_row_on_wire_with_cmd> source)>&& func) {
register_handler(this, messaging_verb::REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM, std::move(func));
}
future<> messaging_service::unregister_repair_put_row_diff_with_rpc_stream() {
return unregister_handler(messaging_verb::REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM);
}
// Wrapper for REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM
future<rpc::sink<repair_stream_cmd>, rpc::source<repair_hash_with_cmd>>
@@ -860,6 +875,9 @@ rpc::sink<repair_hash_with_cmd> messaging_service::make_sink_for_repair_get_full
void messaging_service::register_repair_get_full_row_hashes_with_rpc_stream(std::function<future<rpc::sink<repair_hash_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_stream_cmd> source)>&& func) {
register_handler(this, messaging_verb::REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM, std::move(func));
}
future<> messaging_service::unregister_repair_get_full_row_hashes_with_rpc_stream() {
return unregister_handler(messaging_verb::REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM);
}
// Send a message for verb
template <typename MsgIn, typename... MsgOut>
@@ -943,6 +961,9 @@ future<streaming::prepare_message> messaging_service::send_prepare_message(msg_a
return send_message<streaming::prepare_message>(this, messaging_verb::PREPARE_MESSAGE, id,
std::move(msg), plan_id, std::move(description), reason);
}
future<> messaging_service::unregister_prepare_message() {
return unregister_handler(messaging_verb::PREPARE_MESSAGE);
}
// PREPARE_DONE_MESSAGE
void messaging_service::register_prepare_done_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id)>&& func) {
@@ -952,6 +973,9 @@ future<> messaging_service::send_prepare_done_message(msg_addr id, UUID plan_id,
return send_message<void>(this, messaging_verb::PREPARE_DONE_MESSAGE, id,
plan_id, dst_cpu_id);
}
future<> messaging_service::unregister_prepare_done_message() {
return unregister_handler(messaging_verb::PREPARE_DONE_MESSAGE);
}
// STREAM_MUTATION
void messaging_service::register_stream_mutation(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, frozen_mutation fm, unsigned dst_cpu_id, rpc::optional<bool> fragmented, rpc::optional<streaming::stream_reason> reason)>&& func) {
@@ -976,6 +1000,9 @@ future<> messaging_service::send_stream_mutation_done(msg_addr id, UUID plan_id,
return send_message<void>(this, messaging_verb::STREAM_MUTATION_DONE, id,
plan_id, std::move(ranges), cf_id, dst_cpu_id);
}
future<> messaging_service::unregister_stream_mutation_done() {
return unregister_handler(messaging_verb::STREAM_MUTATION_DONE);
}
// COMPLETE_MESSAGE
void messaging_service::register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id, rpc::optional<bool> failed)>&& func) {
@@ -985,6 +1012,9 @@ future<> messaging_service::send_complete_message(msg_addr id, UUID plan_id, uns
return send_message<void>(this, messaging_verb::COMPLETE_MESSAGE, id,
plan_id, dst_cpu_id, failed);
}
future<> messaging_service::unregister_complete_message() {
return unregister_handler(messaging_verb::COMPLETE_MESSAGE);
}
void messaging_service::register_gossip_echo(std::function<future<> ()>&& func) {
register_handler(this, messaging_verb::GOSSIP_ECHO, std::move(func));
@@ -1199,14 +1229,14 @@ future<partition_checksum> messaging_service::send_repair_checksum_range(
}
// Wrapper for REPAIR_GET_FULL_ROW_HASHES
void messaging_service::register_repair_get_full_row_hashes(std::function<future<std::unordered_set<repair_hash>> (const rpc::client_info& cinfo, uint32_t repair_meta_id)>&& func) {
void messaging_service::register_repair_get_full_row_hashes(std::function<future<repair_hash_set> (const rpc::client_info& cinfo, uint32_t repair_meta_id)>&& func) {
register_handler(this, messaging_verb::REPAIR_GET_FULL_ROW_HASHES, std::move(func));
}
future<> messaging_service::unregister_repair_get_full_row_hashes() {
return unregister_handler(messaging_verb::REPAIR_GET_FULL_ROW_HASHES);
}
future<std::unordered_set<repair_hash>> messaging_service::send_repair_get_full_row_hashes(msg_addr id, uint32_t repair_meta_id) {
return send_message<future<std::unordered_set<repair_hash>>>(this, messaging_verb::REPAIR_GET_FULL_ROW_HASHES, std::move(id), repair_meta_id);
future<repair_hash_set> messaging_service::send_repair_get_full_row_hashes(msg_addr id, uint32_t repair_meta_id) {
return send_message<future<repair_hash_set>>(this, messaging_verb::REPAIR_GET_FULL_ROW_HASHES, std::move(id), repair_meta_id);
}
// Wrapper for REPAIR_GET_COMBINED_ROW_HASH
@@ -1231,13 +1261,13 @@ future<get_sync_boundary_response> messaging_service::send_repair_get_sync_bound
}
// Wrapper for REPAIR_GET_ROW_DIFF
void messaging_service::register_repair_get_row_diff(std::function<future<repair_rows_on_wire> (const rpc::client_info& cinfo, uint32_t repair_meta_id, std::unordered_set<repair_hash> set_diff, bool needs_all_rows)>&& func) {
void messaging_service::register_repair_get_row_diff(std::function<future<repair_rows_on_wire> (const rpc::client_info& cinfo, uint32_t repair_meta_id, repair_hash_set set_diff, bool needs_all_rows)>&& func) {
register_handler(this, messaging_verb::REPAIR_GET_ROW_DIFF, std::move(func));
}
future<> messaging_service::unregister_repair_get_row_diff() {
return unregister_handler(messaging_verb::REPAIR_GET_ROW_DIFF);
}
future<repair_rows_on_wire> messaging_service::send_repair_get_row_diff(msg_addr id, uint32_t repair_meta_id, std::unordered_set<repair_hash> set_diff, bool needs_all_rows) {
future<repair_rows_on_wire> messaging_service::send_repair_get_row_diff(msg_addr id, uint32_t repair_meta_id, repair_hash_set set_diff, bool needs_all_rows) {
return send_message<future<repair_rows_on_wire>>(this, messaging_verb::REPAIR_GET_ROW_DIFF, std::move(id), repair_meta_id, std::move(set_diff), needs_all_rows);
}

View File

@@ -297,10 +297,12 @@ public:
streaming::prepare_message msg, UUID plan_id, sstring description, rpc::optional<streaming::stream_reason> reason)>&& func);
future<streaming::prepare_message> send_prepare_message(msg_addr id, streaming::prepare_message msg, UUID plan_id,
sstring description, streaming::stream_reason);
future<> unregister_prepare_message();
// Wrapper for PREPARE_DONE_MESSAGE verb
void register_prepare_done_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id)>&& func);
future<> send_prepare_done_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id);
future<> unregister_prepare_done_message();
// Wrapper for STREAM_MUTATION verb
void register_stream_mutation(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, frozen_mutation fm, unsigned dst_cpu_id, rpc::optional<bool>, rpc::optional<streaming::stream_reason>)>&& func);
@@ -309,6 +311,7 @@ public:
// Wrapper for STREAM_MUTATION_FRAGMENTS
// The receiver of STREAM_MUTATION_FRAGMENTS sends status code to the sender to notify any error on the receiver side. The status code is of type int32_t. 0 means successful, -1 means error, other status code value are reserved for future use.
void register_stream_mutation_fragments(std::function<future<rpc::sink<int32_t>> (const rpc::client_info& cinfo, UUID plan_id, UUID schema_id, UUID cf_id, uint64_t estimated_partitions, rpc::optional<streaming::stream_reason> reason_opt, rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>> source)>&& func);
future<> unregister_stream_mutation_fragments();
rpc::sink<int32_t> make_sink_for_stream_mutation_fragments(rpc::source<frozen_mutation_fragment, rpc::optional<streaming::stream_mutation_fragments_cmd>>& source);
future<rpc::sink<frozen_mutation_fragment, streaming::stream_mutation_fragments_cmd>, rpc::source<int32_t>> make_sink_and_source_for_stream_mutation_fragments(utils::UUID schema_id, utils::UUID plan_id, utils::UUID cf_id, uint64_t estimated_partitions, streaming::stream_reason reason, msg_addr id);
@@ -316,22 +319,27 @@ public:
future<rpc::sink<repair_hash_with_cmd>, rpc::source<repair_row_on_wire_with_cmd>> make_sink_and_source_for_repair_get_row_diff_with_rpc_stream(uint32_t repair_meta_id, msg_addr id);
rpc::sink<repair_row_on_wire_with_cmd> make_sink_for_repair_get_row_diff_with_rpc_stream(rpc::source<repair_hash_with_cmd>& source);
void register_repair_get_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_hash_with_cmd> source)>&& func);
future<> unregister_repair_get_row_diff_with_rpc_stream();
// Wrapper for REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
future<rpc::sink<repair_row_on_wire_with_cmd>, rpc::source<repair_stream_cmd>> make_sink_and_source_for_repair_put_row_diff_with_rpc_stream(uint32_t repair_meta_id, msg_addr id);
rpc::sink<repair_stream_cmd> make_sink_for_repair_put_row_diff_with_rpc_stream(rpc::source<repair_row_on_wire_with_cmd>& source);
void register_repair_put_row_diff_with_rpc_stream(std::function<future<rpc::sink<repair_stream_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_row_on_wire_with_cmd> source)>&& func);
future<> unregister_repair_put_row_diff_with_rpc_stream();
// Wrapper for REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM
future<rpc::sink<repair_stream_cmd>, rpc::source<repair_hash_with_cmd>> make_sink_and_source_for_repair_get_full_row_hashes_with_rpc_stream(uint32_t repair_meta_id, msg_addr id);
rpc::sink<repair_hash_with_cmd> make_sink_for_repair_get_full_row_hashes_with_rpc_stream(rpc::source<repair_stream_cmd>& source);
void register_repair_get_full_row_hashes_with_rpc_stream(std::function<future<rpc::sink<repair_hash_with_cmd>> (const rpc::client_info& cinfo, uint32_t repair_meta_id, rpc::source<repair_stream_cmd> source)>&& func);
future<> unregister_repair_get_full_row_hashes_with_rpc_stream();
void register_stream_mutation_done(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id)>&& func);
future<> send_stream_mutation_done(msg_addr id, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id);
future<> unregister_stream_mutation_done();
void register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id, rpc::optional<bool> failed)>&& func);
future<> send_complete_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id, bool failed = false);
future<> unregister_complete_message();
// Wrapper for REPAIR_CHECKSUM_RANGE verb
void register_repair_checksum_range(std::function<future<partition_checksum> (sstring keyspace, sstring cf, dht::token_range range, rpc::optional<repair_checksum> hash_version)>&& func);
@@ -339,9 +347,9 @@ public:
future<partition_checksum> send_repair_checksum_range(msg_addr id, sstring keyspace, sstring cf, dht::token_range range, repair_checksum hash_version);
// Wrapper for REPAIR_GET_FULL_ROW_HASHES
void register_repair_get_full_row_hashes(std::function<future<std::unordered_set<repair_hash>> (const rpc::client_info& cinfo, uint32_t repair_meta_id)>&& func);
void register_repair_get_full_row_hashes(std::function<future<repair_hash_set> (const rpc::client_info& cinfo, uint32_t repair_meta_id)>&& func);
future<> unregister_repair_get_full_row_hashes();
future<std::unordered_set<repair_hash>> send_repair_get_full_row_hashes(msg_addr id, uint32_t repair_meta_id);
future<repair_hash_set> send_repair_get_full_row_hashes(msg_addr id, uint32_t repair_meta_id);
// Wrapper for REPAIR_GET_COMBINED_ROW_HASH
void register_repair_get_combined_row_hash(std::function<future<get_combined_row_hash_response> (const rpc::client_info& cinfo, uint32_t repair_meta_id, std::optional<repair_sync_boundary> common_sync_boundary)>&& func);
@@ -354,9 +362,9 @@ public:
future<get_sync_boundary_response> send_repair_get_sync_boundary(msg_addr id, uint32_t repair_meta_id, std::optional<repair_sync_boundary> skipped_sync_boundary);
// Wrapper for REPAIR_GET_ROW_DIFF
void register_repair_get_row_diff(std::function<future<repair_rows_on_wire> (const rpc::client_info& cinfo, uint32_t repair_meta_id, std::unordered_set<repair_hash> set_diff, bool needs_all_rows)>&& func);
void register_repair_get_row_diff(std::function<future<repair_rows_on_wire> (const rpc::client_info& cinfo, uint32_t repair_meta_id, repair_hash_set set_diff, bool needs_all_rows)>&& func);
future<> unregister_repair_get_row_diff();
future<repair_rows_on_wire> send_repair_get_row_diff(msg_addr id, uint32_t repair_meta_id, std::unordered_set<repair_hash> set_diff, bool needs_all_rows);
future<repair_rows_on_wire> send_repair_get_row_diff(msg_addr id, uint32_t repair_meta_id, repair_hash_set set_diff, bool needs_all_rows);
// Wrapper for REPAIR_PUT_ROW_DIFF
void register_repair_put_row_diff(std::function<future<> (const rpc::client_info& cinfo, uint32_t repair_meta_id, repair_rows_on_wire row_diff)>&& func);

View File

@@ -300,10 +300,9 @@ flat_mutation_reader read_context::create_reader(
}
auto& table = _db.local().find_column_family(schema);
auto class_config = _db.local().make_query_class_config();
if (!rm.rparts) {
rm.rparts = make_foreign(std::make_unique<reader_meta::remote_parts>(class_config.semaphore));
rm.rparts = make_foreign(std::make_unique<reader_meta::remote_parts>(semaphore()));
}
rm.rparts->range = std::make_unique<const dht::partition_range>(pr);
@@ -513,18 +512,28 @@ future<> read_context::lookup_readers() {
}
return parallel_for_each(boost::irange(0u, smp::count), [this] (shard_id shard) {
return _db.invoke_on(shard, [shard, cmd = &_cmd, ranges = &_ranges, gs = global_schema_ptr(_schema),
return _db.invoke_on(shard, [this, shard, cmd = &_cmd, ranges = &_ranges, gs = global_schema_ptr(_schema),
gts = tracing::global_trace_state_ptr(_trace_state)] (database& db) mutable {
auto schema = gs.get();
auto querier_opt = db.get_querier_cache().lookup_shard_mutation_querier(cmd->query_uuid, *schema, *ranges, cmd->slice, gts.get());
auto& table = db.find_column_family(schema);
auto& semaphore = db.make_query_class_config().semaphore;
auto& semaphore = this->semaphore();
if (!querier_opt) {
return reader_meta(reader_state::inexistent, reader_meta::remote_parts(semaphore));
}
auto& q = *querier_opt;
if (&q.permit().semaphore() != &semaphore) {
on_internal_error(mmq_log, format("looked-up reader belongs to different semaphore than the one appropriate for this query class: "
"looked-up reader belongs to {} (0x{:x}) the query class appropriate is {} (0x{:x})",
q.permit().semaphore().name(),
reinterpret_cast<uintptr_t>(&q.permit().semaphore()),
semaphore.name(),
reinterpret_cast<uintptr_t>(&semaphore)));
}
auto handle = pause(semaphore, std::move(q).reader());
return reader_meta(
reader_state::successful_lookup,

View File

@@ -734,56 +734,78 @@ void write_counter_cell(RowWriter& w, const query::partition_slice& slice, ::ato
});
}
// Used to return the timestamp of the latest update to the row
struct max_timestamp {
api::timestamp_type max = api::missing_timestamp;
void update(api::timestamp_type ts) {
max = std::max(max, ts);
}
};
template<>
struct appending_hash<row> {
template<typename Hasher>
void operator()(Hasher& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const {
for (auto id : columns) {
const cell_and_hash* cell_and_hash = cells.find_cell_and_hash(id);
if (!cell_and_hash) {
return;
}
auto&& def = s.column_at(kind, id);
if (def.is_atomic()) {
max_ts.update(cell_and_hash->cell.as_atomic_cell(def).timestamp());
if constexpr (query::using_hash_of_hash_v<Hasher>) {
if (cell_and_hash->hash) {
feed_hash(h, *cell_and_hash->hash);
} else {
query::default_hasher cellh;
feed_hash(cellh, cell_and_hash->cell.as_atomic_cell(def), def);
feed_hash(h, cellh.finalize_uint64());
}
template<typename Hasher>
void appending_hash<row>::operator()(Hasher& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const {
for (auto id : columns) {
const cell_and_hash* cell_and_hash = cells.find_cell_and_hash(id);
if (!cell_and_hash) {
feed_hash(h, appending_hash<row>::null_hash_value);
continue;
}
auto&& def = s.column_at(kind, id);
if (def.is_atomic()) {
max_ts.update(cell_and_hash->cell.as_atomic_cell(def).timestamp());
if constexpr (query::using_hash_of_hash_v<Hasher>) {
if (cell_and_hash->hash) {
feed_hash(h, *cell_and_hash->hash);
} else {
feed_hash(h, cell_and_hash->cell.as_atomic_cell(def), def);
query::default_hasher cellh;
feed_hash(cellh, cell_and_hash->cell.as_atomic_cell(def), def);
feed_hash(h, cellh.finalize_uint64());
}
} else {
auto cm = cell_and_hash->cell.as_collection_mutation();
max_ts.update(cm.last_update(*def.type));
if constexpr (query::using_hash_of_hash_v<Hasher>) {
if (cell_and_hash->hash) {
feed_hash(h, *cell_and_hash->hash);
} else {
query::default_hasher cellh;
feed_hash(cellh, cm, def);
feed_hash(h, cellh.finalize_uint64());
}
feed_hash(h, cell_and_hash->cell.as_atomic_cell(def), def);
}
} else {
auto cm = cell_and_hash->cell.as_collection_mutation();
max_ts.update(cm.last_update(*def.type));
if constexpr (query::using_hash_of_hash_v<Hasher>) {
if (cell_and_hash->hash) {
feed_hash(h, *cell_and_hash->hash);
} else {
feed_hash(h, cm, def);
query::default_hasher cellh;
feed_hash(cellh, cm, def);
feed_hash(h, cellh.finalize_uint64());
}
} else {
feed_hash(h, cm, def);
}
}
}
};
}
// Instantiation for mutation_test.cc
template void appending_hash<row>::operator()<xx_hasher>(xx_hasher& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const;
template<>
void appending_hash<row>::operator()<legacy_xx_hasher_without_null_digest>(legacy_xx_hasher_without_null_digest& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const {
for (auto id : columns) {
const cell_and_hash* cell_and_hash = cells.find_cell_and_hash(id);
if (!cell_and_hash) {
return;
}
auto&& def = s.column_at(kind, id);
if (def.is_atomic()) {
max_ts.update(cell_and_hash->cell.as_atomic_cell(def).timestamp());
if (cell_and_hash->hash) {
feed_hash(h, *cell_and_hash->hash);
} else {
query::default_hasher cellh;
feed_hash(cellh, cell_and_hash->cell.as_atomic_cell(def), def);
feed_hash(h, cellh.finalize_uint64());
}
} else {
auto cm = cell_and_hash->cell.as_collection_mutation();
max_ts.update(cm.last_update(*def.type));
if (cell_and_hash->hash) {
feed_hash(h, *cell_and_hash->hash);
} else {
query::default_hasher cellh;
feed_hash(cellh, cm, def);
feed_hash(h, cellh.finalize_uint64());
}
}
}
}
cell_hash_opt row::cell_hash_for(column_id id) const {
if (_type == storage_type::vector) {
@@ -1721,7 +1743,7 @@ void row::apply_monotonically(const schema& s, column_kind kind, row&& other) {
// we erase the live cells according to the shadowable_tombstone rules.
static bool dead_marker_shadows_row(const schema& s, column_kind kind, const row_marker& marker) {
return s.is_view()
&& !s.view_info()->base_non_pk_columns_in_view_pk().empty()
&& s.view_info()->has_base_non_pk_columns_in_view_pk()
&& !marker.is_live()
&& kind == column_kind::regular_column; // not applicable to static rows
}

View File

@@ -649,6 +649,22 @@ public:
};
};
// Used to return the timestamp of the latest update to the row
struct max_timestamp {
api::timestamp_type max = api::missing_timestamp;
void update(api::timestamp_type ts) {
max = std::max(max, ts);
}
};
template<>
struct appending_hash<row> {
static constexpr int null_hash_value = 0xbeefcafe;
template<typename Hasher>
void operator()(Hasher& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const;
};
class row_marker;
int compare_row_marker_for_merge(const row_marker& left, const row_marker& right) noexcept;

View File

@@ -114,9 +114,6 @@ class reconcilable_result_builder {
const schema& _schema;
const query::partition_slice& _slice;
utils::chunked_vector<partition> _result;
uint32_t _live_rows{};
bool _return_static_content_on_partition_with_no_rows{};
bool _static_row_is_alive{};
uint32_t _total_live_rows = 0;
@@ -124,6 +121,10 @@ class reconcilable_result_builder {
stop_iteration _stop;
bool _short_read_allowed;
std::optional<streamed_mutation_freezer> _mutation_consumer;
uint32_t _live_rows{};
// make this the last member so it is destroyed first. #7240
utils::chunked_vector<partition> _result;
public:
reconcilable_result_builder(const schema& s, const query::partition_slice& slice,
query::result_memory_accounter&& accounter)

View File

@@ -30,6 +30,7 @@
#include "schema_registry.hh"
#include "mutation_compactor.hh"
logging::logger mrlog("mutation_reader");
static constexpr size_t merger_small_vector_size = 4;
@@ -659,6 +660,8 @@ flat_mutation_reader make_combined_reader(schema_ptr schema,
return make_combined_reader(std::move(schema), std::move(v), fwd_sm, fwd_mr);
}
const ssize_t new_reader_base_cost{16 * 1024};
class restricting_mutation_reader : public flat_mutation_reader::impl {
struct mutation_source_and_params {
mutation_source _ms;
@@ -685,8 +688,6 @@ class restricting_mutation_reader : public flat_mutation_reader::impl {
};
std::variant<pending_state, admitted_state> _state;
static const ssize_t new_reader_base_cost{16 * 1024};
template<typename Function>
requires std::is_move_constructible<Function>::value
&& requires(Function fn, flat_mutation_reader& reader) {
@@ -1026,6 +1027,13 @@ private:
bool _reader_created = false;
bool _drop_partition_start = false;
bool _drop_static_row = false;
// Trim range tombstones on the start of the buffer to the start of the read
// range (_next_position_in_partition). Set after reader recreation.
// Also validate the first not-trimmed mutation fragment's position.
bool _trim_range_tombstones = false;
// Validate the partition key of the first emitted partition, set after the
// reader was recreated.
bool _validate_partition_key = false;
position_in_partition::tri_compare _tri_cmp;
std::optional<dht::decorated_key> _last_pkey;
@@ -1047,7 +1055,10 @@ private:
void adjust_partition_slice();
flat_mutation_reader recreate_reader();
flat_mutation_reader resume_or_create_reader();
void maybe_validate_partition_start(const circular_buffer<mutation_fragment>& buffer);
void validate_position_in_partition(position_in_partition_view pos) const;
bool should_drop_fragment(const mutation_fragment& mf);
bool maybe_trim_range_tombstone(mutation_fragment& mf) const;
future<> do_fill_buffer(flat_mutation_reader& reader, db::timeout_clock::time_point timeout);
future<> fill_buffer(flat_mutation_reader& reader, db::timeout_clock::time_point timeout);
@@ -1120,16 +1131,11 @@ void evictable_reader::update_next_position(flat_mutation_reader& reader) {
_next_position_in_partition = position_in_partition::before_all_clustered_rows();
break;
case partition_region::clustered:
if (reader.is_buffer_empty()) {
_next_position_in_partition = position_in_partition::after_key(last_pos);
} else {
const auto& next_frag = reader.peek_buffer();
if (next_frag.is_end_of_partition()) {
if (!reader.is_buffer_empty() && reader.peek_buffer().is_end_of_partition()) {
push_mutation_fragment(reader.pop_mutation_fragment());
_next_position_in_partition = position_in_partition::for_partition_start();
} else {
_next_position_in_partition = position_in_partition(next_frag.position());
}
} else {
_next_position_in_partition = position_in_partition::after_key(last_pos);
}
break;
case partition_region::partition_end:
@@ -1154,6 +1160,9 @@ flat_mutation_reader evictable_reader::recreate_reader() {
const dht::partition_range* range = _pr;
const query::partition_slice* slice = &_ps;
_range_override.reset();
_slice_override.reset();
if (_last_pkey) {
bool partition_range_is_inclusive = true;
@@ -1190,6 +1199,9 @@ flat_mutation_reader evictable_reader::recreate_reader() {
range = &*_range_override;
}
_trim_range_tombstones = true;
_validate_partition_key = true;
return _ms.make_reader(
_schema,
_permit,
@@ -1216,6 +1228,78 @@ flat_mutation_reader evictable_reader::resume_or_create_reader() {
return recreate_reader();
}
template <typename... Arg>
static void require(bool condition, const char* msg, const Arg&... arg) {
if (!condition) {
on_internal_error(mrlog, format(msg, arg...));
}
}
void evictable_reader::maybe_validate_partition_start(const circular_buffer<mutation_fragment>& buffer) {
if (!_validate_partition_key || buffer.empty()) {
return;
}
// If this is set we can assume the first fragment is a partition-start.
const auto& ps = buffer.front().as_partition_start();
const auto tri_cmp = dht::ring_position_comparator(*_schema);
// If we recreated the reader after fast-forwarding it we won't have
// _last_pkey set. In this case it is enough to check if the partition
// is in range.
if (_last_pkey) {
const auto cmp_res = tri_cmp(*_last_pkey, ps.key());
if (_drop_partition_start) { // should be the same partition
require(
cmp_res == 0,
"{}(): validation failed, expected partition with key equal to _last_pkey {} due to _drop_partition_start being set, but got {}",
__FUNCTION__,
*_last_pkey,
ps.key());
} else { // should be a larger partition
require(
cmp_res < 0,
"{}(): validation failed, expected partition with key larger than _last_pkey {} due to _drop_partition_start being unset, but got {}",
__FUNCTION__,
*_last_pkey,
ps.key());
}
}
const auto& prange = _range_override ? *_range_override : *_pr;
require(
// TODO: somehow avoid this copy
prange.contains(ps.key(), tri_cmp),
"{}(): validation failed, expected partition with key that falls into current range {}, but got {}",
__FUNCTION__,
prange,
ps.key());
_validate_partition_key = false;
}
void evictable_reader::validate_position_in_partition(position_in_partition_view pos) const {
require(
_tri_cmp(_next_position_in_partition, pos) <= 0,
"{}(): validation failed, expected position in partition that is larger-than-equal than _next_position_in_partition {}, but got {}",
__FUNCTION__,
_next_position_in_partition,
pos);
if (_slice_override && pos.region() == partition_region::clustered) {
const auto ranges = _slice_override->row_ranges(*_schema, _last_pkey->key());
const bool any_contains = std::any_of(ranges.begin(), ranges.end(), [this, &pos] (const query::clustering_range& cr) {
// TODO: somehow avoid this copy
auto range = position_range(cr);
return range.contains(*_schema, pos);
});
require(
any_contains,
"{}(): validation failed, expected clustering fragment that is included in the slice {}, but got {}",
__FUNCTION__,
*_slice_override,
pos);
}
}
bool evictable_reader::should_drop_fragment(const mutation_fragment& mf) {
if (_drop_partition_start && mf.is_partition_start()) {
_drop_partition_start = false;
@@ -1228,12 +1312,50 @@ bool evictable_reader::should_drop_fragment(const mutation_fragment& mf) {
return false;
}
bool evictable_reader::maybe_trim_range_tombstone(mutation_fragment& mf) const {
// We either didn't read a partition yet (evicted after fast-forwarding) or
// didn't stop in a clustering region. We don't need to trim range
// tombstones in either case.
if (!_last_pkey || _next_position_in_partition.region() != partition_region::clustered) {
return false;
}
if (!mf.is_range_tombstone()) {
validate_position_in_partition(mf.position());
return false;
}
if (_tri_cmp(mf.position(), _next_position_in_partition) >= 0) {
validate_position_in_partition(mf.position());
return false; // rt in range, no need to trim
}
auto& rt = mf.as_mutable_range_tombstone();
require(
_tri_cmp(_next_position_in_partition, rt.end_position()) <= 0,
"{}(): validation failed, expected range tombstone with end pos larger than _next_position_in_partition {}, but got {}",
__FUNCTION__,
_next_position_in_partition,
rt.end_position());
rt.set_start(*_schema, position_in_partition_view::before_key(_next_position_in_partition));
return true;
}
future<> evictable_reader::do_fill_buffer(flat_mutation_reader& reader, db::timeout_clock::time_point timeout) {
if (!_drop_partition_start && !_drop_static_row) {
return reader.fill_buffer(timeout);
auto fill_buf_fut = reader.fill_buffer(timeout);
if (_validate_partition_key) {
fill_buf_fut = fill_buf_fut.then([this, &reader] {
maybe_validate_partition_start(reader.buffer());
});
}
return fill_buf_fut;
}
return repeat([this, &reader, timeout] {
return reader.fill_buffer(timeout).then([this, &reader] {
maybe_validate_partition_start(reader.buffer());
while (!reader.is_buffer_empty() && should_drop_fragment(reader.peek_buffer())) {
reader.pop_mutation_fragment();
}
@@ -1247,6 +1369,11 @@ future<> evictable_reader::fill_buffer(flat_mutation_reader& reader, db::timeout
if (reader.is_buffer_empty()) {
return make_ready_future<>();
}
while (_trim_range_tombstones && !reader.is_buffer_empty()) {
auto mf = reader.pop_mutation_fragment();
_trim_range_tombstones = maybe_trim_range_tombstone(mf);
push_mutation_fragment(std::move(mf));
}
reader.move_buffer_content_to(*this);
auto stop = [this, &reader] {
// The only problematic fragment kind is the range tombstone.
@@ -1287,7 +1414,13 @@ future<> evictable_reader::fill_buffer(flat_mutation_reader& reader, db::timeout
if (reader.is_buffer_empty()) {
return do_fill_buffer(reader, timeout);
}
push_mutation_fragment(reader.pop_mutation_fragment());
if (_trim_range_tombstones) {
auto mf = reader.pop_mutation_fragment();
_trim_range_tombstones = maybe_trim_range_tombstone(mf);
push_mutation_fragment(std::move(mf));
} else {
push_mutation_fragment(reader.pop_mutation_fragment());
}
return make_ready_future<>();
});
}).then([this, &reader] {
@@ -1915,11 +2048,13 @@ public:
}
}
void abort(std::exception_ptr ep) {
_end_of_stream = true;
_ex = std::move(ep);
if (_full) {
_full->set_exception(_ex);
_full.reset();
} else if (_not_full) {
_not_full->set_exception(_ex);
_not_full.reset();
}
}
};

View File

@@ -304,6 +304,8 @@ public:
mutation_source make_empty_mutation_source();
snapshot_source make_empty_snapshot_source();
extern const ssize_t new_reader_base_cost;
// Creates a restricted reader whose resource usages will be tracked
// during it's lifetime. If there are not enough resources (dues to
// existing readers) to create the new reader, it's construction will

View File

@@ -36,8 +36,14 @@ future<> feed_writer(flat_mutation_reader&& rd, Writer&& wr) {
auto f2 = rd.is_buffer_empty() ? rd.fill_buffer(db::no_timeout) : make_ready_future<>();
return when_all_succeed(std::move(f1), std::move(f2)).discard_result();
});
}).finally([&wr] {
return wr.consume_end_of_stream();
}).then_wrapped([&wr] (future<> f) {
if (f.failed()) {
auto ex = f.get_exception();
wr.abort(ex);
return make_exception_future<>(ex);
} else {
return wr.consume_end_of_stream();
}
});
});
}

View File

@@ -57,6 +57,9 @@ class shard_based_splitting_mutation_writer {
}
return std::move(_consume_fut);
}
void abort(std::exception_ptr ep) {
_handle.abort(ep);
}
};
private:
@@ -108,6 +111,13 @@ public:
return shard->consume_end_of_stream();
});
}
void abort(std::exception_ptr ep) {
for (auto&& shard : _shards) {
if (shard) {
shard->abort(ep);
}
}
}
};
future<> segregate_by_shard(flat_mutation_reader producer, reader_consumer consumer) {

View File

@@ -144,6 +144,9 @@ class timestamp_based_splitting_mutation_writer {
}
return std::move(_consume_fut);
}
void abort(std::exception_ptr ep) {
_handle.abort(ep);
}
};
private:
@@ -186,6 +189,11 @@ public:
return bucket.second.consume_end_of_stream();
});
}
void abort(std::exception_ptr ep) {
for (auto&& b : _buckets) {
b.second.abort(ep);
}
}
};
future<> timestamp_based_splitting_mutation_writer::write_to_bucket(bucket_id bucket, mutation_fragment&& mf) {

View File

@@ -640,12 +640,12 @@ partition_snapshot_ptr partition_entry::read(logalloc::region& r,
return partition_snapshot_ptr(std::move(snp));
}
std::vector<range_tombstone>
partition_snapshot::range_tombstone_result
partition_snapshot::range_tombstones(position_in_partition_view start, position_in_partition_view end)
{
partition_version* v = &*version();
if (!v->next()) {
return boost::copy_range<std::vector<range_tombstone>>(
return boost::copy_range<range_tombstone_result>(
v->partition().row_tombstones().slice(*_schema, start, end));
}
range_tombstone_list list(*_schema);
@@ -655,10 +655,10 @@ partition_snapshot::range_tombstones(position_in_partition_view start, position_
}
v = v->next();
}
return boost::copy_range<std::vector<range_tombstone>>(list.slice(*_schema, start, end));
return boost::copy_range<range_tombstone_result>(list.slice(*_schema, start, end));
}
std::vector<range_tombstone>
partition_snapshot::range_tombstone_result
partition_snapshot::range_tombstones()
{
return range_tombstones(

View File

@@ -26,6 +26,7 @@
#include "utils/anchorless_list.hh"
#include "utils/logalloc.hh"
#include "utils/coroutine.hh"
#include "utils/chunked_vector.hh"
#include <boost/intrusive/parent_from_member.hpp>
#include <boost/intrusive/slist.hpp>
@@ -400,10 +401,13 @@ public:
::static_row static_row(bool digest_requested) const;
bool static_row_continuous() const;
mutation_partition squashed() const;
using range_tombstone_result = utils::chunked_vector<range_tombstone>;
// Returns range tombstones overlapping with [start, end)
std::vector<range_tombstone> range_tombstones(position_in_partition_view start, position_in_partition_view end);
range_tombstone_result range_tombstones(position_in_partition_view start, position_in_partition_view end);
// Returns all range tombstones
std::vector<range_tombstone> range_tombstones();
range_tombstone_result range_tombstones();
};
class partition_snapshot_ptr {

View File

@@ -163,6 +163,11 @@ public:
return {partition_region::clustered, bound_weight::before_all_prefixed, &ck};
}
// Returns a view to before_key(pos._ck) if pos.is_clustering_row() else returns pos as-is.
static position_in_partition_view before_key(position_in_partition_view pos) {
return {partition_region::clustered, pos._bound_weight == bound_weight::equal ? bound_weight::before_all_prefixed : pos._bound_weight, pos._ck};
}
partition_region region() const { return _type; }
bound_weight get_bound_weight() const { return _bound_weight; }
bool is_partition_start() const { return _type == partition_region::partition_start; }

View File

@@ -28,6 +28,7 @@
reader_permit::resource_units::resource_units(reader_concurrency_semaphore& semaphore, reader_resources res) noexcept
: _semaphore(&semaphore), _resources(res) {
_semaphore->consume(res);
}
reader_permit::resource_units::resource_units(resource_units&& o) noexcept
@@ -75,7 +76,6 @@ reader_permit::resource_units reader_permit::consume_memory(size_t memory) {
}
reader_permit::resource_units reader_permit::consume_resources(reader_resources res) {
_semaphore->consume(res);
return resource_units(*_semaphore, res);
}
@@ -83,7 +83,6 @@ void reader_concurrency_semaphore::signal(const resources& r) noexcept {
_resources += r;
while (!_wait_list.empty() && has_available_units(_wait_list.front().res)) {
auto& x = _wait_list.front();
_resources -= x.res;
try {
x.pr.set_value(reader_permit::resource_units(*this, x.res));
} catch (...) {
@@ -104,7 +103,7 @@ reader_concurrency_semaphore::inactive_read_handle reader_concurrency_semaphore:
const auto [it, _] = _inactive_reads.emplace(_next_id++, std::move(ir));
(void)_;
++_inactive_read_stats.population;
return inactive_read_handle(it->first);
return inactive_read_handle(*this, it->first);
}
// The evicted reader will release its permit, hopefully allowing us to
@@ -115,6 +114,17 @@ reader_concurrency_semaphore::inactive_read_handle reader_concurrency_semaphore:
}
std::unique_ptr<reader_concurrency_semaphore::inactive_read> reader_concurrency_semaphore::unregister_inactive_read(inactive_read_handle irh) {
if (irh && irh._sem != this) {
throw std::runtime_error(fmt::format(
"reader_concurrency_semaphore::unregister_inactive_read(): "
"attempted to unregister an inactive read with a handle belonging to another semaphore: "
"this is {} (0x{:x}) but the handle belongs to {} (0x{:x})",
name(),
reinterpret_cast<uintptr_t>(this),
irh._sem->name(),
reinterpret_cast<uintptr_t>(irh._sem)));
}
if (auto it = _inactive_reads.find(irh._id); it != _inactive_reads.end()) {
auto ir = std::move(it->second);
_inactive_reads.erase(it);
@@ -158,7 +168,6 @@ future<reader_permit::resource_units> reader_concurrency_semaphore::do_wait_admi
--_inactive_read_stats.population;
}
if (may_proceed(r)) {
_resources -= r;
return make_ready_future<reader_permit::resource_units>(reader_permit::resource_units(*this, r));
}
promise<reader_permit::resource_units> pr;

View File

@@ -60,18 +60,20 @@ public:
};
class inactive_read_handle {
reader_concurrency_semaphore* _sem = nullptr;
uint64_t _id = 0;
friend class reader_concurrency_semaphore;
explicit inactive_read_handle(uint64_t id)
: _id(id) {
explicit inactive_read_handle(reader_concurrency_semaphore& sem, uint64_t id)
: _sem(&sem), _id(id) {
}
public:
inactive_read_handle() = default;
inactive_read_handle(inactive_read_handle&& o) : _id(std::exchange(o._id, 0)) {
inactive_read_handle(inactive_read_handle&& o) : _sem(std::exchange(o._sem, nullptr)), _id(std::exchange(o._id, 0)) {
}
inactive_read_handle& operator=(inactive_read_handle&& o) {
_sem = std::exchange(o._sem, nullptr);
_id = std::exchange(o._id, 0);
return *this;
}
@@ -105,6 +107,7 @@ private:
};
private:
const resources _initial_resources;
resources _resources;
expiring_fifo<entry, expiry_handler, db::timeout_clock> _wait_list;
@@ -135,7 +138,8 @@ public:
sstring name,
size_t max_queue_length = std::numeric_limits<size_t>::max(),
std::function<void()> prethrow_action = nullptr)
: _resources(count, memory)
: _initial_resources(count, memory)
, _resources(count, memory)
, _wait_list(expiry_handler(name))
, _name(std::move(name))
, _max_queue_length(max_queue_length)
@@ -144,11 +148,11 @@ public:
/// Create a semaphore with practically unlimited count and memory.
///
/// And conversely, no queue limit either.
explicit reader_concurrency_semaphore(no_limits)
explicit reader_concurrency_semaphore(no_limits, sstring name = "unlimited reader_concurrency_semaphore")
: reader_concurrency_semaphore(
std::numeric_limits<int>::max(),
std::numeric_limits<ssize_t>::max(),
"unlimited reader_concurrency_semaphore") {}
std::move(name)) {}
~reader_concurrency_semaphore();
@@ -158,6 +162,13 @@ public:
reader_concurrency_semaphore(reader_concurrency_semaphore&&) = delete;
reader_concurrency_semaphore& operator=(reader_concurrency_semaphore&&) = delete;
/// Returns the name of the semaphore
///
/// If the semaphore has no name, "unnamed reader concurrency semaphore" is returned.
std::string_view name() const {
return _name.empty() ? "unnamed reader concurrency semaphore" : std::string_view(_name);
}
/// Register an inactive read.
///
/// The semaphore will evict this read when there is a shortage of
@@ -193,6 +204,10 @@ public:
reader_permit make_permit();
const resources initial_resources() const {
return _initial_resources;
}
const resources available_resources() const {
return _resources;
}

View File

@@ -42,12 +42,20 @@ struct reader_resources {
return count >= other.count && memory >= other.memory;
}
reader_resources operator-(const reader_resources& other) const {
return reader_resources{count - other.count, memory - other.memory};
}
reader_resources& operator-=(const reader_resources& other) {
count -= other.count;
memory -= other.memory;
return *this;
}
reader_resources operator+(const reader_resources& other) const {
return reader_resources{count + other.count, memory + other.memory};
}
reader_resources& operator+=(const reader_resources& other) {
count += other.count;
memory += other.memory;

View File

@@ -62,7 +62,7 @@ shared_ptr<abstract_command> exists::prepare(service::storage_proxy& proxy, requ
}
future<redis_message> exists::execute(service::storage_proxy& proxy, redis::redis_options& options, service_permit permit) {
return seastar::do_for_each(_keys, [&proxy, &options, &permit, this] (bytes key) {
return seastar::do_for_each(_keys, [&proxy, &options, permit, this] (bytes& key) {
return redis::read_strings(proxy, options, key, permit).then([this] (lw_shared_ptr<strings_result> result) {
if (result->has_result()) {
_count++;

View File

@@ -44,15 +44,15 @@ mkdir -p $BUILDDIR/scylla-package
tar -C $BUILDDIR/scylla-package -xpf $RELOC_PKG
cd $BUILDDIR/scylla-package
PRODUCT=$(cat scylla/SCYLLA-PRODUCT-FILE)
SCYLLA_VERSION=$(cat scylla/SCYLLA-VERSION-FILE)
SCYLLA_RELEASE=$(cat scylla/SCYLLA-RELEASE-FILE)
ln -fv $RELOC_PKG ../$PRODUCT-server_$SCYLLA_VERSION-$SCYLLA_RELEASE.orig.tar.gz
if $DIST; then
export DEB_BUILD_OPTIONS="housekeeping"
fi
mv scylla/debian debian
PKG_NAME=$(dpkg-parsechangelog --show-field Source)
# XXX: Drop revision number from version string.
# Since it always '1', this should be okay for now.
PKG_VERSION=$(dpkg-parsechangelog --show-field Version |sed -e 's/-1$//')
ln -fv $RELOC_PKG ../"$PKG_NAME"_"$PKG_VERSION".orig.tar.gz
debuild -rfakeroot -us -uc

View File

@@ -1633,6 +1633,7 @@ future<> bootstrap_with_repair(seastar::sharded<database>& db, locator::token_me
auto& ks = db.local().find_keyspace(keyspace_name);
auto& strat = ks.get_replication_strategy();
dht::token_range_vector desired_ranges = strat.get_pending_address_ranges(tm, tokens, myip);
bool find_node_in_local_dc_only = strat.get_type() == locator::replication_strategy_type::network_topology;
//Active ranges
auto metadata_clone = tm.clone_only_token_map();
@@ -1719,6 +1720,9 @@ future<> bootstrap_with_repair(seastar::sharded<database>& db, locator::token_me
mandatory_neighbors = get_node_losing_the_ranges(old_endpoints, new_endpoints);
neighbors = mandatory_neighbors;
} else if (old_endpoints.size() < strat.get_replication_factor()) {
if (!find_node_in_local_dc_only) {
neighbors = old_endpoints;
} else {
if (old_endpoints_in_local_dc.size() == rf_in_local_dc) {
// Local DC has enough replica nodes.
mandatory_neighbors = get_node_losing_the_ranges(old_endpoints_in_local_dc, new_endpoints);
@@ -1746,6 +1750,7 @@ future<> bootstrap_with_repair(seastar::sharded<database>& db, locator::token_me
throw std::runtime_error(format("bootstrap_with_repair: keyspace={}, range={}, wrong number of old_endpoints_in_local_dc={}, rf_in_local_dc={}",
keyspace_name, desired_range, old_endpoints_in_local_dc.size(), rf_in_local_dc));
}
}
} else {
throw std::runtime_error(format("bootstrap_with_repair: keyspace={}, range={}, wrong number of old_endpoints={}, rf={}",
keyspace_name, desired_range, old_endpoints, strat.get_replication_factor()));

View File

@@ -23,6 +23,7 @@
#include <unordered_map>
#include <exception>
#include <absl/container/btree_set.h>
#include <seastar/core/sstring.hh>
#include <seastar/core/sharded.hh>
@@ -339,6 +340,8 @@ public:
}
};
using repair_hash_set = absl::btree_set<repair_hash>;
enum class repair_row_level_start_status: uint8_t {
ok,
no_such_column_family,

View File

@@ -47,6 +47,7 @@
#include "gms/gossiper.hh"
#include "repair/row_level.hh"
#include "mutation_source_metadata.hh"
#include "utils/stall_free.hh"
extern logging::logger rlogger;
@@ -459,7 +460,7 @@ public:
}
};
class repair_writer {
class repair_writer : public enable_lw_shared_from_this<repair_writer> {
schema_ptr _schema;
uint64_t _estimated_partitions;
size_t _nr_peer_nodes;
@@ -516,6 +517,7 @@ public:
auto [queue_reader, queue_handle] = make_queue_reader(_schema);
_mq[node_idx] = std::move(queue_handle);
table& t = db.local().find_column_family(_schema->id());
auto writer = shared_from_this();
_writer_done[node_idx] = mutation_writer::distribute_reader_and_consume_on_shards(_schema, std::move(queue_reader),
[&db, reason = this->_reason, estimated_partitions = this->_estimated_partitions] (flat_mutation_reader reader) {
auto& t = db.local().find_column_family(reader.schema());
@@ -529,7 +531,7 @@ public:
sstables::shared_sstable sst = use_view_update_path ? t->make_streaming_staging_sstable() : t->make_streaming_sstable_for_write();
schema_ptr s = reader.schema();
auto& pc = service::get_local_streaming_priority();
return sst->write_components(std::move(reader), std::max(1ul, adjusted_estimated_partitions), s,
return sst->write_components(std::move(reader), adjusted_estimated_partitions, s,
t->get_sstables_manager().configure_writer(),
encoding_stats{}, pc).then([sst] {
return sst->open_data();
@@ -545,13 +547,13 @@ public:
return consumer(std::move(reader));
});
},
t.stream_in_progress()).then([this, node_idx] (uint64_t partitions) {
t.stream_in_progress()).then([node_idx, writer] (uint64_t partitions) {
rlogger.debug("repair_writer: keyspace={}, table={}, managed to write partitions={} to sstable",
_schema->ks_name(), _schema->cf_name(), partitions);
}).handle_exception([this, node_idx] (std::exception_ptr ep) {
writer->_schema->ks_name(), writer->_schema->cf_name(), partitions);
}).handle_exception([node_idx, writer] (std::exception_ptr ep) {
rlogger.warn("repair_writer: keyspace={}, table={}, multishard_writer failed: {}",
_schema->ks_name(), _schema->cf_name(), ep);
_mq[node_idx]->abort(ep);
writer->_schema->ks_name(), writer->_schema->cf_name(), ep);
writer->_mq[node_idx]->abort(ep);
return make_exception_future<>(std::move(ep));
});
}
@@ -654,7 +656,7 @@ private:
size_t _nr_peer_nodes= 1;
repair_stats _stats;
repair_reader _repair_reader;
repair_writer _repair_writer;
lw_shared_ptr<repair_writer> _repair_writer;
// Contains rows read from disk
std::list<repair_row> _row_buf;
// Contains rows we are working on to sync between peers
@@ -666,7 +668,7 @@ private:
// Tracks current sync boundary
std::optional<repair_sync_boundary> _current_sync_boundary;
// Contains the hashes of rows in the _working_row_buffor for all peer nodes
std::vector<std::unordered_set<repair_hash>> _peer_row_hash_sets;
std::vector<repair_hash_set> _peer_row_hash_sets;
// Gate used to make sure pending operation of meta data is done
seastar::gate _gate;
sink_source_for_get_full_row_hashes _sink_source_for_get_full_row_hashes;
@@ -735,7 +737,7 @@ public:
_seed,
repair_reader::is_local_reader(_repair_master || _same_sharding_config)
)
, _repair_writer(_schema, _estimated_partitions, _nr_peer_nodes, _reason)
, _repair_writer(make_lw_shared<repair_writer>(_schema, _estimated_partitions, _nr_peer_nodes, _reason))
, _sink_source_for_get_full_row_hashes(_repair_meta_id, _nr_peer_nodes,
[] (uint32_t repair_meta_id, netw::messaging_service::msg_addr addr) {
return netw::get_local_messaging_service().make_sink_and_source_for_repair_get_full_row_hashes_with_rpc_stream(repair_meta_id, addr);
@@ -754,11 +756,12 @@ public:
public:
future<> stop() {
auto gate_future = _gate.close();
auto writer_future = _repair_writer.wait_for_writer_done();
auto f1 = _sink_source_for_get_full_row_hashes.close();
auto f2 = _sink_source_for_get_row_diff.close();
auto f3 = _sink_source_for_put_row_diff.close();
return when_all_succeed(std::move(gate_future), std::move(writer_future), std::move(f1), std::move(f2), std::move(f3)).discard_result();
return when_all_succeed(std::move(gate_future), std::move(f1), std::move(f2), std::move(f3)).discard_result().finally([this] {
return _repair_writer->wait_for_writer_done();
});
}
static std::unordered_map<node_repair_meta_id, lw_shared_ptr<repair_meta>>& repair_meta_map() {
@@ -886,9 +889,9 @@ public:
}
// Must run inside a seastar thread
static std::unordered_set<repair_hash>
get_set_diff(const std::unordered_set<repair_hash>& x, const std::unordered_set<repair_hash>& y) {
std::unordered_set<repair_hash> set_diff;
static repair_hash_set
get_set_diff(const repair_hash_set& x, const repair_hash_set& y) {
repair_hash_set set_diff;
// Note std::set_difference needs x and y are sorted.
std::copy_if(x.begin(), x.end(), std::inserter(set_diff, set_diff.end()),
[&y] (auto& item) { thread::maybe_yield(); return y.find(item) == y.end(); });
@@ -906,14 +909,14 @@ public:
}
std::unordered_set<repair_hash>& peer_row_hash_sets(unsigned node_idx) {
repair_hash_set& peer_row_hash_sets(unsigned node_idx) {
return _peer_row_hash_sets[node_idx];
}
// Get a list of row hashes in _working_row_buf
future<std::unordered_set<repair_hash>>
future<repair_hash_set>
working_row_hashes() {
return do_with(std::unordered_set<repair_hash>(), [this] (std::unordered_set<repair_hash>& hashes) {
return do_with(repair_hash_set(), [this] (repair_hash_set& hashes) {
return do_for_each(_working_row_buf, [&hashes] (repair_row& r) {
hashes.emplace(r.hash());
}).then([&hashes] {
@@ -1090,24 +1093,32 @@ private:
});
}
future<> clear_row_buf() {
return utils::clear_gently(_row_buf);
}
future<> clear_working_row_buf() {
return utils::clear_gently(_working_row_buf).then([this] {
_working_row_buf_combined_hash.clear();
});
}
// Read rows from disk until _max_row_buf_size of rows are filled into _row_buf.
// Calculate the combined checksum of the rows
// Calculate the total size of the rows in _row_buf
future<get_sync_boundary_response>
get_sync_boundary(std::optional<repair_sync_boundary> skipped_sync_boundary) {
auto f = make_ready_future<>();
if (skipped_sync_boundary) {
_current_sync_boundary = skipped_sync_boundary;
_row_buf.clear();
_working_row_buf.clear();
_working_row_buf_combined_hash.clear();
} else {
_working_row_buf.clear();
_working_row_buf_combined_hash.clear();
f = clear_row_buf();
}
// Here is the place we update _last_sync_boundary
rlogger.trace("SET _last_sync_boundary from {} to {}", _last_sync_boundary, _current_sync_boundary);
_last_sync_boundary = _current_sync_boundary;
return row_buf_size().then([this, sb = std::move(skipped_sync_boundary)] (size_t cur_size) {
return f.then([this, sb = std::move(skipped_sync_boundary)] () mutable {
return clear_working_row_buf().then([this, sb = sb] () mutable {
return row_buf_size().then([this, sb = std::move(sb)] (size_t cur_size) {
return read_rows_from_disk(cur_size).then([this, sb = std::move(sb)] (std::list<repair_row> new_rows, size_t new_rows_size) mutable {
size_t new_rows_nr = new_rows.size();
_row_buf.splice(_row_buf.end(), new_rows);
@@ -1124,6 +1135,8 @@ private:
});
});
});
});
});
}
future<> move_row_buf_to_working_row_buf() {
@@ -1199,9 +1212,9 @@ private:
}
future<std::list<repair_row>>
copy_rows_from_working_row_buf_within_set_diff(std::unordered_set<repair_hash> set_diff) {
copy_rows_from_working_row_buf_within_set_diff(repair_hash_set set_diff) {
return do_with(std::list<repair_row>(), std::move(set_diff),
[this] (std::list<repair_row>& rows, std::unordered_set<repair_hash>& set_diff) {
[this] (std::list<repair_row>& rows, repair_hash_set& set_diff) {
return do_for_each(_working_row_buf, [this, &set_diff, &rows] (const repair_row& r) {
if (set_diff.count(r.hash()) > 0) {
rows.push_back(r);
@@ -1216,7 +1229,7 @@ private:
// Give a set of row hashes, return the corresponding rows
// If needs_all_rows is set, return all the rows in _working_row_buf, ignore the set_diff
future<std::list<repair_row>>
get_row_diff(std::unordered_set<repair_hash> set_diff, needs_all_rows_t needs_all_rows = needs_all_rows_t::no) {
get_row_diff(repair_hash_set set_diff, needs_all_rows_t needs_all_rows = needs_all_rows_t::no) {
if (needs_all_rows) {
if (!_repair_master || _nr_peer_nodes == 1) {
return make_ready_future<std::list<repair_row>>(std::move(_working_row_buf));
@@ -1227,19 +1240,28 @@ private:
}
}
future<> do_apply_rows(std::list<repair_row>& row_diff, unsigned node_idx, update_working_row_buf update_buf) {
return with_semaphore(_repair_writer.sem(), 1, [this, node_idx, update_buf, &row_diff] {
_repair_writer.create_writer(_db, node_idx);
return do_for_each(row_diff, [this, node_idx, update_buf] (repair_row& r) {
if (update_buf) {
_working_row_buf_combined_hash.add(r.hash());
}
// The repair_row here is supposed to have
// mutation_fragment attached because we have stored it in
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
return _repair_writer.do_write(node_idx, std::move(dk_with_hash), std::move(mf));
future<> do_apply_rows(std::list<repair_row>&& row_diff, unsigned node_idx, update_working_row_buf update_buf) {
return do_with(std::move(row_diff), [this, node_idx, update_buf] (std::list<repair_row>& row_diff) {
return with_semaphore(_repair_writer->sem(), 1, [this, node_idx, update_buf, &row_diff] {
_repair_writer->create_writer(_db, node_idx);
return repeat([this, node_idx, update_buf, &row_diff] () mutable {
if (row_diff.empty()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
repair_row& r = row_diff.front();
if (update_buf) {
_working_row_buf_combined_hash.add(r.hash());
}
// The repair_row here is supposed to have
// mutation_fragment attached because we have stored it in
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
return _repair_writer->do_write(node_idx, std::move(dk_with_hash), std::move(mf)).then([&row_diff] {
row_diff.pop_front();
return make_ready_future<stop_iteration>(stop_iteration::no);
});
});
});
});
}
@@ -1257,19 +1279,17 @@ private:
stats().rx_row_nr += row_diff.size();
stats().rx_row_nr_peer[from] += row_diff.size();
if (update_buf) {
std::list<repair_row> tmp;
tmp.swap(_working_row_buf);
// Both row_diff and _working_row_buf and are ordered, merging
// two sored list to make sure the combination of row_diff
// and _working_row_buf are ordered.
std::merge(tmp.begin(), tmp.end(), row_diff.begin(), row_diff.end(), std::back_inserter(_working_row_buf),
[this] (const repair_row& x, const repair_row& y) { thread::maybe_yield(); return _cmp(x.boundary(), y.boundary()) < 0; });
utils::merge_to_gently(_working_row_buf, row_diff,
[this] (const repair_row& x, const repair_row& y) { return _cmp(x.boundary(), y.boundary()) < 0; });
}
if (update_hash_set) {
_peer_row_hash_sets[node_idx] = boost::copy_range<std::unordered_set<repair_hash>>(row_diff |
_peer_row_hash_sets[node_idx] = boost::copy_range<repair_hash_set>(row_diff |
boost::adaptors::transformed([] (repair_row& r) { thread::maybe_yield(); return r.hash(); }));
}
do_apply_rows(row_diff, node_idx, update_buf).get();
do_apply_rows(std::move(row_diff), node_idx, update_buf).get();
}
future<>
@@ -1277,11 +1297,9 @@ private:
if (rows.empty()) {
return make_ready_future<>();
}
return to_repair_rows_list(rows).then([this] (std::list<repair_row> row_diff) {
return do_with(std::move(row_diff), [this] (std::list<repair_row>& row_diff) {
unsigned node_idx = 0;
return do_apply_rows(row_diff, node_idx, update_working_row_buf::no);
});
return to_repair_rows_list(std::move(rows)).then([this] (std::list<repair_row> row_diff) {
unsigned node_idx = 0;
return do_apply_rows(std::move(row_diff), node_idx, update_working_row_buf::no);
});
}
@@ -1360,13 +1378,13 @@ private:
public:
// RPC API
// Return the hashes of the rows in _working_row_buf
future<std::unordered_set<repair_hash>>
future<repair_hash_set>
get_full_row_hashes(gms::inet_address remote_node) {
if (remote_node == _myip) {
return get_full_row_hashes_handler();
}
return netw::get_local_messaging_service().send_repair_get_full_row_hashes(msg_addr(remote_node),
_repair_meta_id).then([this, remote_node] (std::unordered_set<repair_hash> hashes) {
_repair_meta_id).then([this, remote_node] (repair_hash_set hashes) {
rlogger.debug("Got full hashes from peer={}, nr_hashes={}", remote_node, hashes.size());
_metrics.rx_hashes_nr += hashes.size();
stats().rx_hashes_nr += hashes.size();
@@ -1377,7 +1395,7 @@ public:
private:
future<> get_full_row_hashes_source_op(
lw_shared_ptr<std::unordered_set<repair_hash>> current_hashes,
lw_shared_ptr<repair_hash_set> current_hashes,
gms::inet_address remote_node,
unsigned node_idx,
rpc::source<repair_hash_with_cmd>& source) {
@@ -1415,12 +1433,12 @@ private:
}
public:
future<std::unordered_set<repair_hash>>
future<repair_hash_set>
get_full_row_hashes_with_rpc_stream(gms::inet_address remote_node, unsigned node_idx) {
if (remote_node == _myip) {
return get_full_row_hashes_handler();
}
auto current_hashes = make_lw_shared<std::unordered_set<repair_hash>>();
auto current_hashes = make_lw_shared<repair_hash_set>();
return _sink_source_for_get_full_row_hashes.get_sink_source(remote_node, node_idx).then(
[this, current_hashes, remote_node, node_idx]
(rpc::sink<repair_stream_cmd>& sink, rpc::source<repair_hash_with_cmd>& source) mutable {
@@ -1435,7 +1453,7 @@ public:
}
// RPC handler
future<std::unordered_set<repair_hash>>
future<repair_hash_set>
get_full_row_hashes_handler() {
return with_gate(_gate, [this] {
return working_row_hashes();
@@ -1585,7 +1603,7 @@ public:
// RPC API
// Return rows in the _working_row_buf with hash within the given sef_diff
// Must run inside a seastar thread
void get_row_diff(std::unordered_set<repair_hash> set_diff, needs_all_rows_t needs_all_rows, gms::inet_address remote_node, unsigned node_idx) {
void get_row_diff(repair_hash_set set_diff, needs_all_rows_t needs_all_rows, gms::inet_address remote_node, unsigned node_idx) {
if (needs_all_rows || !set_diff.empty()) {
if (remote_node == _myip) {
return;
@@ -1654,11 +1672,11 @@ private:
}
future<> get_row_diff_sink_op(
std::unordered_set<repair_hash> set_diff,
repair_hash_set set_diff,
needs_all_rows_t needs_all_rows,
rpc::sink<repair_hash_with_cmd>& sink,
gms::inet_address remote_node) {
return do_with(std::move(set_diff), [needs_all_rows, remote_node, &sink] (std::unordered_set<repair_hash>& set_diff) mutable {
return do_with(std::move(set_diff), [needs_all_rows, remote_node, &sink] (repair_hash_set& set_diff) mutable {
if (inject_rpc_stream_error) {
return make_exception_future<>(std::runtime_error("get_row_diff: Inject sender error in sink loop"));
}
@@ -1685,7 +1703,7 @@ private:
public:
// Must run inside a seastar thread
void get_row_diff_with_rpc_stream(
std::unordered_set<repair_hash> set_diff,
repair_hash_set set_diff,
needs_all_rows_t needs_all_rows,
update_peer_row_hash_sets update_hash_set,
gms::inet_address remote_node,
@@ -1711,7 +1729,7 @@ public:
}
// RPC handler
future<repair_rows_on_wire> get_row_diff_handler(std::unordered_set<repair_hash> set_diff, needs_all_rows_t needs_all_rows) {
future<repair_rows_on_wire> get_row_diff_handler(repair_hash_set set_diff, needs_all_rows_t needs_all_rows) {
return with_gate(_gate, [this, set_diff = std::move(set_diff), needs_all_rows] () mutable {
return get_row_diff(std::move(set_diff), needs_all_rows).then([this] (std::list<repair_row> row_diff) {
return to_repair_rows_on_wire(std::move(row_diff));
@@ -1721,15 +1739,16 @@ public:
// RPC API
// Send rows in the _working_row_buf with hash within the given sef_diff
future<> put_row_diff(std::unordered_set<repair_hash> set_diff, needs_all_rows_t needs_all_rows, gms::inet_address remote_node) {
future<> put_row_diff(repair_hash_set set_diff, needs_all_rows_t needs_all_rows, gms::inet_address remote_node) {
if (!set_diff.empty()) {
if (remote_node == _myip) {
return make_ready_future<>();
}
auto sz = set_diff.size();
size_t sz = set_diff.size();
return get_row_diff(std::move(set_diff), needs_all_rows).then([this, remote_node, sz] (std::list<repair_row> row_diff) {
if (row_diff.size() != sz) {
throw std::runtime_error("row_diff.size() != set_diff.size()");
rlogger.warn("Hash conflict detected, keyspace={}, table={}, range={}, row_diff.size={}, set_diff.size={}. It is recommended to compact the table and rerun repair for the range.",
_schema->ks_name(), _schema->cf_name(), _range, row_diff.size(), sz);
}
return do_with(std::move(row_diff), [this, remote_node] (std::list<repair_row>& row_diff) {
return get_repair_rows_size(row_diff).then([this, remote_node, &row_diff] (size_t row_bytes) mutable {
@@ -1796,17 +1815,18 @@ private:
public:
future<> put_row_diff_with_rpc_stream(
std::unordered_set<repair_hash> set_diff,
repair_hash_set set_diff,
needs_all_rows_t needs_all_rows,
gms::inet_address remote_node, unsigned node_idx) {
if (!set_diff.empty()) {
if (remote_node == _myip) {
return make_ready_future<>();
}
auto sz = set_diff.size();
size_t sz = set_diff.size();
return get_row_diff(std::move(set_diff), needs_all_rows).then([this, remote_node, node_idx, sz] (std::list<repair_row> row_diff) {
if (row_diff.size() != sz) {
throw std::runtime_error("row_diff.size() != set_diff.size()");
rlogger.warn("Hash conflict detected, keyspace={}, table={}, range={}, row_diff.size={}, set_diff.size={}. It is recommended to compact the table and rerun repair for the range.",
_schema->ks_name(), _schema->cf_name(), _range, row_diff.size(), sz);
}
return do_with(std::move(row_diff), [this, remote_node, node_idx] (std::list<repair_row>& row_diff) {
return get_repair_rows_size(row_diff).then([this, remote_node, node_idx, &row_diff] (size_t row_bytes) mutable {
@@ -1845,7 +1865,7 @@ static future<stop_iteration> repair_get_row_diff_with_rpc_stream_process_op(
rpc::sink<repair_row_on_wire_with_cmd> sink,
rpc::source<repair_hash_with_cmd> source,
bool &error,
std::unordered_set<repair_hash>& current_set_diff,
repair_hash_set& current_set_diff,
std::optional<std::tuple<repair_hash_with_cmd>> hash_cmd_opt) {
repair_hash_with_cmd hash_cmd = std::get<0>(hash_cmd_opt.value());
rlogger.trace("Got repair_hash_with_cmd from peer={}, hash={}, cmd={}", from, hash_cmd.hash, int(hash_cmd.cmd));
@@ -1858,7 +1878,7 @@ static future<stop_iteration> repair_get_row_diff_with_rpc_stream_process_op(
}
bool needs_all_rows = hash_cmd.cmd == repair_stream_cmd::needs_all_rows;
_metrics.rx_hashes_nr += current_set_diff.size();
auto fp = make_foreign(std::make_unique<std::unordered_set<repair_hash>>(std::move(current_set_diff)));
auto fp = make_foreign(std::make_unique<repair_hash_set>(std::move(current_set_diff)));
return smp::submit_to(src_cpu_id % smp::count, [from, repair_meta_id, needs_all_rows, fp = std::move(fp)] {
auto rm = repair_meta::get_repair_meta(from, repair_meta_id);
if (fp.get_owner_shard() == this_shard_id()) {
@@ -1936,12 +1956,12 @@ static future<stop_iteration> repair_get_full_row_hashes_with_rpc_stream_process
if (status == repair_stream_cmd::get_full_row_hashes) {
return smp::submit_to(src_cpu_id % smp::count, [from, repair_meta_id] {
auto rm = repair_meta::get_repair_meta(from, repair_meta_id);
return rm->get_full_row_hashes_handler().then([] (std::unordered_set<repair_hash> hashes) {
return rm->get_full_row_hashes_handler().then([] (repair_hash_set hashes) {
_metrics.tx_hashes_nr += hashes.size();
return hashes;
});
}).then([sink] (std::unordered_set<repair_hash> hashes) mutable {
return do_with(std::move(hashes), [sink] (std::unordered_set<repair_hash>& hashes) mutable {
}).then([sink] (repair_hash_set hashes) mutable {
return do_with(std::move(hashes), [sink] (repair_hash_set& hashes) mutable {
return do_for_each(hashes, [sink] (const repair_hash& hash) mutable {
return sink(repair_hash_with_cmd{repair_stream_cmd::hash_data, hash});
}).then([sink] () mutable {
@@ -1964,7 +1984,7 @@ static future<> repair_get_row_diff_with_rpc_stream_handler(
uint32_t repair_meta_id,
rpc::sink<repair_row_on_wire_with_cmd> sink,
rpc::source<repair_hash_with_cmd> source) {
return do_with(false, std::unordered_set<repair_hash>(), [from, src_cpu_id, repair_meta_id, sink, source] (bool& error, std::unordered_set<repair_hash>& current_set_diff) mutable {
return do_with(false, repair_hash_set(), [from, src_cpu_id, repair_meta_id, sink, source] (bool& error, repair_hash_set& current_set_diff) mutable {
return repeat([from, src_cpu_id, repair_meta_id, sink, source, &error, &current_set_diff] () mutable {
return source().then([from, src_cpu_id, repair_meta_id, sink, source, &error, &current_set_diff] (std::optional<std::tuple<repair_hash_with_cmd>> hash_cmd_opt) mutable {
if (hash_cmd_opt) {
@@ -2107,7 +2127,7 @@ future<> repair_init_messaging_service_handler(repair_service& rs, distributed<d
auto from = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
return smp::submit_to(src_cpu_id % smp::count, [from, repair_meta_id] {
auto rm = repair_meta::get_repair_meta(from, repair_meta_id);
return rm->get_full_row_hashes_handler().then([] (std::unordered_set<repair_hash> hashes) {
return rm->get_full_row_hashes_handler().then([] (repair_hash_set hashes) {
_metrics.tx_hashes_nr += hashes.size();
return hashes;
});
@@ -2135,11 +2155,11 @@ future<> repair_init_messaging_service_handler(repair_service& rs, distributed<d
});
});
ms.register_repair_get_row_diff([] (const rpc::client_info& cinfo, uint32_t repair_meta_id,
std::unordered_set<repair_hash> set_diff, bool needs_all_rows) {
repair_hash_set set_diff, bool needs_all_rows) {
auto src_cpu_id = cinfo.retrieve_auxiliary<uint32_t>("src_cpu_id");
auto from = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
_metrics.rx_hashes_nr += set_diff.size();
auto fp = make_foreign(std::make_unique<std::unordered_set<repair_hash>>(std::move(set_diff)));
auto fp = make_foreign(std::make_unique<repair_hash_set>(std::move(set_diff)));
return smp::submit_to(src_cpu_id % smp::count, [from, repair_meta_id, fp = std::move(fp), needs_all_rows] () mutable {
auto rm = repair_meta::get_repair_meta(from, repair_meta_id);
if (fp.get_owner_shard() == this_shard_id()) {
@@ -2207,6 +2227,25 @@ future<> repair_init_messaging_service_handler(repair_service& rs, distributed<d
});
}
future<> repair_uninit_messaging_service_handler() {
return netw::get_messaging_service().invoke_on_all([] (auto& ms) {
return when_all_succeed(
ms.unregister_repair_get_row_diff_with_rpc_stream(),
ms.unregister_repair_put_row_diff_with_rpc_stream(),
ms.unregister_repair_get_full_row_hashes_with_rpc_stream(),
ms.unregister_repair_get_full_row_hashes(),
ms.unregister_repair_get_combined_row_hash(),
ms.unregister_repair_get_sync_boundary(),
ms.unregister_repair_get_row_diff(),
ms.unregister_repair_put_row_diff(),
ms.unregister_repair_row_level_start(),
ms.unregister_repair_row_level_stop(),
ms.unregister_repair_get_estimated_partitions(),
ms.unregister_repair_set_estimated_partitions(),
ms.unregister_repair_get_diff_algorithms()).discard_result();
});
}
class row_level_repair {
repair_info& _ri;
sstring _cf_name;
@@ -2439,7 +2478,7 @@ private:
// sequentially because the rows from repair follower 1 to
// repair master might reduce the amount of missing data
// between repair master and repair follower 2.
std::unordered_set<repair_hash> set_diff = repair_meta::get_set_diff(master.peer_row_hash_sets(node_idx), master.working_row_hashes().get0());
repair_hash_set set_diff = repair_meta::get_set_diff(master.peer_row_hash_sets(node_idx), master.working_row_hashes().get0());
// Request missing sets from peer node
rlogger.debug("Before get_row_diff to node {}, local={}, peer={}, set_diff={}",
node, master.working_row_hashes().get0().size(), master.peer_row_hash_sets(node_idx).size(), set_diff.size());
@@ -2462,9 +2501,9 @@ private:
// So we can figure out which rows peer node are missing and send the missing rows to them
check_in_shutdown();
_ri.check_in_abort();
std::unordered_set<repair_hash> local_row_hash_sets = master.working_row_hashes().get0();
repair_hash_set local_row_hash_sets = master.working_row_hashes().get0();
auto sz = _all_live_peer_nodes.size();
std::vector<std::unordered_set<repair_hash>> set_diffs(sz);
std::vector<repair_hash_set> set_diffs(sz);
for (size_t idx : boost::irange(size_t(0), sz)) {
set_diffs[idx] = repair_meta::get_set_diff(local_row_hash_sets, master.peer_row_hash_sets(idx));
}

View File

@@ -45,6 +45,7 @@ private:
};
future<> repair_init_messaging_service_handler(repair_service& rs, distributed<db::system_distributed_keyspace>& sys_dist_ks, distributed<db::view::view_update_generator>& view_update_generator);
future<> repair_uninit_messaging_service_handler();
class repair_info;

View File

@@ -1272,7 +1272,9 @@ flat_mutation_reader cache_entry::read(row_cache& rc, read_context& reader, row_
// Assumes reader is in the corresponding partition
flat_mutation_reader cache_entry::do_read(row_cache& rc, read_context& reader) {
auto snp = _pe.read(rc._tracker.region(), rc._tracker.cleaner(), _schema, &rc._tracker, reader.phase());
auto ckr = query::clustering_key_filter_ranges::get_ranges(*_schema, reader.slice(), _key.key());
auto ckr = with_linearized_managed_bytes([&] {
return query::clustering_key_filter_ranges::get_ranges(*_schema, reader.slice(), _key.key());
});
auto r = make_cache_flat_mutation_reader(_schema, _key, std::move(ckr), rc, reader.shared_from_this(), std::move(snp));
r.upgrade_schema(rc.schema());
r.upgrade_schema(reader.schema());

View File

@@ -19,6 +19,7 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <seastar/core/on_internal_error.hh>
#include <map>
#include "utils/UUID_gen.hh"
#include "cql3/column_identifier.hh"
@@ -43,6 +44,8 @@
constexpr int32_t schema::NAME_LENGTH;
extern logging::logger dblog;
sstring to_sstring(column_kind k) {
switch (k) {
case column_kind::partition_key: return "PARTITION_KEY";
@@ -592,11 +595,15 @@ schema::get_column_definition(const bytes& name) const {
const column_definition&
schema::column_at(column_kind kind, column_id id) const {
return _raw._columns.at(column_offset(kind) + id);
return column_at(static_cast<ordinal_column_id>(column_offset(kind) + id));
}
const column_definition&
schema::column_at(ordinal_column_id ordinal_id) const {
if (size_t(ordinal_id) >= _raw._columns.size()) [[unlikely]] {
on_internal_error(dblog, format("{}.{}@{}: column id {:d} >= {:d}",
ks_name(), cf_name(), version(), size_t(ordinal_id), _raw._columns.size()));
}
return _raw._columns.at(static_cast<column_count_type>(ordinal_id));
}
@@ -820,7 +827,7 @@ std::ostream& schema::describe(std::ostream& os) const {
os << "}";
os << "\n AND comment = '" << comment()<< "'";
os << "\n AND compaction = {'class': '" << sstables::compaction_strategy::name(compaction_strategy()) << "'";
map_as_cql_param(os, compaction_strategy_options()) << "}";
map_as_cql_param(os, compaction_strategy_options(), false) << "}";
os << "\n AND compression = {";
map_as_cql_param(os, get_compressor_params().get_options());
os << "}";

View File

@@ -92,7 +92,8 @@ executables = ['build/{}/scylla'.format(args.mode),
'/usr/sbin/ethtool',
'/usr/bin/netstat',
'/usr/bin/hwloc-distrib',
'/usr/bin/hwloc-calc']
'/usr/bin/hwloc-calc',
'/usr/bin/lsblk']
output = args.dest

View File

@@ -597,7 +597,7 @@ def current_shard():
def find_db(shard=None):
if not shard:
if shard is None:
shard = current_shard()
return gdb.parse_and_eval('::debug::db')['_instances']['_M_impl']['_M_start'][shard]['service']['_p']

View File

@@ -63,6 +63,17 @@ MemoryHigh=1200M
MemoryMax=1400M
MemoryLimit=1400M
EOS
# On CentOS7, systemd does not support percentage-based parameter.
# To apply memory parameter on CentOS7, we need to override the parameter
# in bytes, instead of percentage.
elif [ "$RHEL" -a "$VERSION_ID" = "7" ]; then
MEMORY_LIMIT=$((MEMTOTAL_BYTES / 100 * 5))
mkdir -p /etc/systemd/system/scylla-helper.slice.d/
cat << EOS > /etc/systemd/system/scylla-helper.slice.d/memory.conf
[Slice]
MemoryLimit=$MEMORY_LIMIT
EOS
fi
systemctl --system daemon-reload >/dev/null || true

Submodule seastar updated: 11e86172ba...0fba7da929

View File

@@ -25,6 +25,7 @@
#include <seastar/util/bool_class.hh>
#include <boost/range/algorithm/for_each.hpp>
#include "utils/small_vector.hh"
#include <absl/container/btree_set.h>
namespace ser {
@@ -81,6 +82,17 @@ static inline void serialize_array(Output& out, const Container& v) {
template<typename Container>
struct container_traits;
template<typename T>
struct container_traits<absl::btree_set<T>> {
struct back_emplacer {
absl::btree_set<T>& c;
back_emplacer(absl::btree_set<T>& c_) : c(c_) {}
void operator()(T&& v) {
c.emplace(std::move(v));
}
};
};
template<typename T>
struct container_traits<std::unordered_set<T>> {
struct back_emplacer {
@@ -253,6 +265,27 @@ struct serializer<std::list<T>> {
}
};
template<typename T>
struct serializer<absl::btree_set<T>> {
template<typename Input>
static absl::btree_set<T> read(Input& in) {
auto sz = deserialize(in, boost::type<uint32_t>());
absl::btree_set<T> v;
deserialize_array_helper<false, T>::doit(in, v, sz);
return v;
}
template<typename Output>
static void write(Output& out, const absl::btree_set<T>& v) {
safe_serialize_as_uint32(out, v.size());
serialize_array_helper<false, T>::doit(out, v);
}
template<typename Input>
static void skip(Input& in) {
auto sz = deserialize(in, boost::type<uint32_t>());
skip_array<T>(in, sz);
}
};
template<typename T>
struct serializer<std::unordered_set<T>> {
template<typename Input>

View File

@@ -92,7 +92,7 @@ void migration_manager::init_messaging_service()
//FIXME: future discarded.
(void)with_gate(_background_tasks, [this] {
mlogger.debug("features changed, recalculating schema version");
return update_schema_version_and_announce(get_storage_proxy(), _feat.cluster_schema_features());
return db::schema_tables::recalculate_schema_version(get_storage_proxy(), _feat);
});
};
@@ -1104,6 +1104,20 @@ future<schema_ptr> get_schema_definition(table_schema_version v, netw::messaging
mlogger.debug("Requesting schema {} from {}", v, dst);
auto& ms = netw::get_local_messaging_service();
return ms.send_get_schema_version(dst, v);
}).then([] (schema_ptr s) {
// If this is a view so this schema also needs a reference to the base
// table.
if (s->is_view()) {
if (!s->view_info()->base_info()) {
auto& db = service::get_local_storage_proxy().get_db().local();
// This line might throw a no_such_column_family
// It should be fine since if we tried to register a view for which
// we don't know the base table, our registry is broken.
schema_ptr base_schema = db.find_schema(s->view_info()->base_id());
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(*base_schema));
}
}
return s;
});
}

View File

@@ -347,7 +347,7 @@ public:
_max = _max - row_count;
_exhausted = (row_count < page_size && !results->is_short_read()) || _max == 0;
if (!_exhausted || row_count > 0) {
if (!_exhausted && row_count > 0) {
if (_last_pkey) {
update_slice(*_last_pkey);
}

View File

@@ -120,9 +120,11 @@ using fbu = utils::fb_utilities;
static inline
query::digest_algorithm digest_algorithm(service::storage_proxy& proxy) {
return proxy.features().cluster_supports_xxhash_digest_algorithm()
? query::digest_algorithm::xxHash
: query::digest_algorithm::MD5;
return proxy.features().cluster_supports_digest_for_null_values()
? query::digest_algorithm::xxHash
: proxy.features().cluster_supports_xxhash_digest_algorithm()
? query::digest_algorithm::legacy_xxHash_without_null_digest
: query::digest_algorithm::MD5;
}
static inline
@@ -1760,6 +1762,7 @@ storage_proxy::storage_proxy(distributed<database>& db, storage_proxy::config cf
, _token_metadata(tm)
, _read_smp_service_group(cfg.read_smp_service_group)
, _write_smp_service_group(cfg.write_smp_service_group)
, _hints_write_smp_service_group(cfg.hints_write_smp_service_group)
, _write_ack_smp_service_group(cfg.write_ack_smp_service_group)
, _next_response_id(std::chrono::system_clock::now().time_since_epoch()/1ms)
, _hints_resource_manager(cfg.available_memory / 10)
@@ -1803,39 +1806,48 @@ storage_proxy::response_id_type storage_proxy::unique_response_handler::release(
}
future<>
storage_proxy::mutate_locally(const mutation& m, tracing::trace_state_ptr tr_state, clock_type::time_point timeout) {
storage_proxy::mutate_locally(const mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout, smp_service_group smp_grp) {
auto shard = _db.local().shard_of(m);
get_stats().replica_cross_shard_ops += shard != this_shard_id();
return _db.invoke_on(shard, {_write_smp_service_group, timeout},
[s = global_schema_ptr(m.schema()), m = freeze(m), gtr = tracing::global_trace_state_ptr(std::move(tr_state)), timeout] (database& db) mutable -> future<> {
return db.apply(s, m, gtr.get(), db::commitlog::force_sync::no, timeout);
return _db.invoke_on(shard, {smp_grp, timeout},
[s = global_schema_ptr(m.schema()),
m = freeze(m),
gtr = tracing::global_trace_state_ptr(std::move(tr_state)),
timeout,
sync] (database& db) mutable -> future<> {
return db.apply(s, m, gtr.get(), sync, timeout);
});
}
future<>
storage_proxy::mutate_locally(const schema_ptr& s, const frozen_mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout) {
storage_proxy::mutate_locally(const schema_ptr& s, const frozen_mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout,
smp_service_group smp_grp) {
auto shard = _db.local().shard_of(m);
get_stats().replica_cross_shard_ops += shard != this_shard_id();
return _db.invoke_on(shard, {_write_smp_service_group, timeout},
return _db.invoke_on(shard, {smp_grp, timeout},
[&m, gs = global_schema_ptr(s), gtr = tracing::global_trace_state_ptr(std::move(tr_state)), timeout, sync] (database& db) mutable -> future<> {
return db.apply(gs, m, gtr.get(), sync, timeout);
});
}
future<>
storage_proxy::mutate_locally(std::vector<mutation> mutations, tracing::trace_state_ptr tr_state, clock_type::time_point timeout) {
return do_with(std::move(mutations), [this, timeout, tr_state = std::move(tr_state)] (std::vector<mutation>& pmut) mutable {
return parallel_for_each(pmut.begin(), pmut.end(), [this, tr_state = std::move(tr_state), timeout] (const mutation& m) mutable {
return mutate_locally(m, tr_state, timeout);
storage_proxy::mutate_locally(std::vector<mutation> mutations, tracing::trace_state_ptr tr_state, clock_type::time_point timeout, smp_service_group smp_grp) {
return do_with(std::move(mutations), [this, timeout, tr_state = std::move(tr_state), smp_grp] (std::vector<mutation>& pmut) mutable {
return parallel_for_each(pmut.begin(), pmut.end(), [this, tr_state = std::move(tr_state), timeout, smp_grp] (const mutation& m) mutable {
return mutate_locally(m, tr_state, db::commitlog::force_sync::no, timeout, smp_grp);
});
});
}
future<>
storage_proxy::mutate_locally(std::vector<mutation> mutation, tracing::trace_state_ptr tr_state, clock_type::time_point timeout) {
return mutate_locally(std::move(mutation), tr_state, timeout, _write_smp_service_group);
}
future<>
storage_proxy::mutate_hint(const schema_ptr& s, const frozen_mutation& m, tracing::trace_state_ptr tr_state, clock_type::time_point timeout) {
auto shard = _db.local().shard_of(m);
get_stats().replica_cross_shard_ops += shard != this_shard_id();
return _db.invoke_on(shard, {_write_smp_service_group, timeout}, [&m, gs = global_schema_ptr(s), tr_state = std::move(tr_state), timeout] (database& db) mutable -> future<> {
return _db.invoke_on(shard, {_hints_write_smp_service_group, timeout}, [&m, gs = global_schema_ptr(s), tr_state = std::move(tr_state), timeout] (database& db) mutable -> future<> {
return db.apply_hint(gs, m, std::move(tr_state), timeout);
});
}
@@ -2573,14 +2585,13 @@ future<> storage_proxy::send_hint_to_endpoint(frozen_mutation_and_schema fm_a_s,
}
future<> storage_proxy::send_hint_to_all_replicas(frozen_mutation_and_schema fm_a_s) {
const auto timeout = db::timeout_clock::now() + 1h;
if (!_features.cluster_supports_hinted_handoff_separate_connection()) {
std::array<mutation, 1> ms{fm_a_s.fm.unfreeze(fm_a_s.s)};
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit(), timeout);
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit());
}
std::array<hint_wrapper, 1> ms{hint_wrapper { std::move(fm_a_s.fm.unfreeze(fm_a_s.s)) }};
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit(), timeout);
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit());
}
/**
@@ -4488,6 +4499,12 @@ future<bool> storage_proxy::cas(schema_ptr schema, shared_ptr<cas_request> reque
paxos::paxos_state::logger.debug("CAS[{}] successful", handler->id());
tracing::trace(handler->tr_state, "CAS successful");
return std::optional<bool>(condition_met);
}).handle_exception_type([handler] (unavailable_exception& e) {
// if learning stage encountered unavailablity error lets re-map it to a write error
// since unavailable error means that operation has never ever started which is not the case here
schema_ptr schema = handler->schema();
return make_exception_future<std::optional<bool>>(mutation_write_timeout_exception(schema->ks_name(), schema->cf_name(),
e.consistency, e.alive, e.required, db::write_type::CAS));
});
}
paxos::paxos_state::logger.debug("CAS[{}] PAXOS proposal not accepted (pre-empted by a higher ballot)",
@@ -4849,7 +4866,7 @@ void storage_proxy::init_messaging_service() {
});
};
auto receive_mutation_handler = [] (const rpc::client_info& cinfo, rpc::opt_time_point t, frozen_mutation in, std::vector<gms::inet_address> forward,
auto receive_mutation_handler = [] (smp_service_group smp_grp, const rpc::client_info& cinfo, rpc::opt_time_point t, frozen_mutation in, std::vector<gms::inet_address> forward,
gms::inet_address reply_to, unsigned shard, storage_proxy::response_id_type response_id, rpc::optional<std::optional<tracing::trace_info>> trace_info) {
tracing::trace_state_ptr trace_state_ptr;
auto src_addr = netw::messaging_service::get_source(cinfo);
@@ -4857,9 +4874,9 @@ void storage_proxy::init_messaging_service() {
utils::UUID schema_version = in.schema_version();
return handle_write(src_addr, t, schema_version, std::move(in), std::move(forward), reply_to, shard, response_id,
trace_info ? *trace_info : std::nullopt,
/* apply_fn */ [] (shared_ptr<storage_proxy>& p, tracing::trace_state_ptr tr_state, schema_ptr s, const frozen_mutation& m,
/* apply_fn */ [smp_grp] (shared_ptr<storage_proxy>& p, tracing::trace_state_ptr tr_state, schema_ptr s, const frozen_mutation& m,
clock_type::time_point timeout) {
return p->mutate_locally(std::move(s), m, std::move(tr_state), db::commitlog::force_sync::no, timeout);
return p->mutate_locally(std::move(s), m, std::move(tr_state), db::commitlog::force_sync::no, timeout, smp_grp);
},
/* forward_fn */ [] (netw::messaging_service::msg_addr addr, clock_type::time_point timeout, const frozen_mutation& m,
gms::inet_address reply_to, unsigned shard, response_id_type response_id,
@@ -4868,8 +4885,8 @@ void storage_proxy::init_messaging_service() {
return ms.send_mutation(addr, timeout, m, {}, reply_to, shard, response_id, std::move(trace_info));
});
};
ms.register_mutation(receive_mutation_handler);
ms.register_hint_mutation(receive_mutation_handler);
ms.register_mutation(std::bind_front<>(receive_mutation_handler, _write_smp_service_group));
ms.register_hint_mutation(std::bind_front<>(receive_mutation_handler, _hints_write_smp_service_group));
ms.register_paxos_learn([] (const rpc::client_info& cinfo, rpc::opt_time_point t, paxos::proposal decision,
std::vector<gms::inet_address> forward, gms::inet_address reply_to, unsigned shard,
@@ -5112,18 +5129,22 @@ void storage_proxy::init_messaging_service() {
future<> storage_proxy::uninit_messaging_service() {
auto& ms = netw::get_local_messaging_service();
return when_all_succeed(
ms.unregister_counter_mutation(),
ms.unregister_mutation(),
ms.unregister_hint_mutation(),
ms.unregister_mutation_done(),
ms.unregister_mutation_failed(),
ms.unregister_read_data(),
ms.unregister_read_mutation_data(),
ms.unregister_read_digest(),
ms.unregister_truncate(),
ms.unregister_get_schema_version(),
ms.unregister_paxos_prepare(),
ms.unregister_paxos_accept(),
ms.unregister_paxos_learn(),
ms.unregister_paxos_prune()
).discard_result();
}
future<rpc::tuple<foreign_ptr<lw_shared_ptr<reconcilable_result>>, cache_temperature>>
@@ -5217,8 +5238,7 @@ future<> storage_proxy::drain_on_shutdown() {
future<>
storage_proxy::stop() {
// FIXME: hints manager should be stopped here but it seems like this function is never called
return uninit_messaging_service();
return make_ready_future<>();
}
}

View File

@@ -166,6 +166,7 @@ public:
size_t available_memory;
smp_service_group read_smp_service_group = default_smp_service_group();
smp_service_group write_smp_service_group = default_smp_service_group();
smp_service_group hints_write_smp_service_group = default_smp_service_group();
// Write acknowledgments might not be received on the correct shard, and
// they need a separate smp_service_group to prevent an ABBA deadlock
// with writes.
@@ -256,6 +257,7 @@ private:
locator::token_metadata& _token_metadata;
smp_service_group _read_smp_service_group;
smp_service_group _write_smp_service_group;
smp_service_group _hints_write_smp_service_group;
smp_service_group _write_ack_smp_service_group;
response_id_type _next_response_id;
response_handlers_map _response_handlers;
@@ -314,7 +316,6 @@ private:
cdc_stats _cdc_stats;
private:
future<> uninit_messaging_service();
future<coordinator_query_result> query_singular(lw_shared_ptr<query::read_command> cmd,
dht::partition_range_vector&& partition_ranges,
db::consistency_level cl,
@@ -469,13 +470,31 @@ public:
return next;
}
void init_messaging_service();
future<> uninit_messaging_service();
private:
// Applies mutation on this node.
// Resolves with timed_out_error when timeout is reached.
future<> mutate_locally(const mutation& m, tracing::trace_state_ptr tr_state, clock_type::time_point timeout = clock_type::time_point::max());
future<> mutate_locally(const mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout, smp_service_group smp_grp);
// Applies mutation on this node.
// Resolves with timed_out_error when timeout is reached.
future<> mutate_locally(const schema_ptr&, const frozen_mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout = clock_type::time_point::max());
future<> mutate_locally(const schema_ptr&, const frozen_mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout,
smp_service_group smp_grp);
// Applies mutations on this node.
// Resolves with timed_out_error when timeout is reached.
future<> mutate_locally(std::vector<mutation> mutation, tracing::trace_state_ptr tr_state, clock_type::time_point timeout, smp_service_group smp_grp);
public:
// Applies mutation on this node.
// Resolves with timed_out_error when timeout is reached.
future<> mutate_locally(const mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout = clock_type::time_point::max()) {
return mutate_locally(m, tr_state, sync, timeout, _write_smp_service_group);
}
// Applies mutation on this node.
// Resolves with timed_out_error when timeout is reached.
future<> mutate_locally(const schema_ptr& s, const frozen_mutation& m, tracing::trace_state_ptr tr_state, db::commitlog::force_sync sync, clock_type::time_point timeout = clock_type::time_point::max()) {
return mutate_locally(s, m, tr_state, sync, timeout, _write_smp_service_group);
}
// Applies mutations on this node.
// Resolves with timed_out_error when timeout is reached.
future<> mutate_locally(std::vector<mutation> mutation, tracing::trace_state_ptr tr_state, clock_type::time_point timeout = clock_type::time_point::max());

Some files were not shown because too many files have changed in this diff Show More