Compare commits

...

55 Commits

Author SHA1 Message Date
Amos Kong
44ec73cfc4 schema.cc/describe: fix invalid compaction options in schema
There is a typo in the snapshot's schema.cql: a comma is missing after
the compaction strategy class, so restoring the schema from that file fails.

    AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}

The map_as_cql_param() function has a `first` parameter that decides
whether to prepend a comma; compaction_strategy_options is never the
first entry, so it must always be preceded by a comma.
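The comma handling can be sketched like this (a hypothetical simplification of map_as_cql_param(); the name and signature are illustrative, not the actual Scylla code):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical simplification: render a CQL map literal, emitting ", "
// before every entry except the first. If a caller mishandles `first`
// for entries that are never first (like compaction_strategy_options),
// the comma is dropped and the resulting CQL is invalid, as above.
inline std::string map_as_cql_literal(const std::map<std::string, std::string>& m,
                                      bool first = true) {
    std::ostringstream os;
    os << '{';
    for (const auto& [k, v] : m) {
        if (!first) {
            os << ", ";
        }
        first = false;
        os << '\'' << k << "': '" << v << '\'';
    }
    os << '}';
    return os.str();
}
```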

Fixes #7741

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7734

(cherry picked from commit 6b1659ee80)
2021-03-24 12:58:11 +02:00
Tomasz Grabiec
df6f9a200f sstable: writer: ka/la: Write row marker cell after row tombstone
The row marker has a cell name which sorts after the row tombstone's
start bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmless to our sstable reader, which recognizes both atoms as
belonging to the current clustering row fragment and collects them both
correctly.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. It's very unlikely to hit this in practice: to trigger
promoted index entries for both atoms, the clustering key would have to
be so large that the size of the marker cell exceeds the desired
promoted index block size, which is 64KB by default (but
user-controlled via the column_index_size_in_kb option). 64KB is also
the limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using two different
methods: slicing with clustering key restrictions, and fast forwarding.
The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, so both are
wrong!

They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so the binary search gets
confused and selects the row tombstone's position, which is emitted
after the marker, thus skipping over the row marker.

The explanation for why the test started to fail after dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which the cursor touches during binary search depend
on the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to differ between subsequent
reads, which allows one of the reads to walk over the corrupted entries
and read the correct data by selecting the entry corresponding to the
row marker.
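As a toy illustration (not the actual promoted-index code): a textbook lower-bound binary search over positions silently skips an element when the array violates the ordering the search assumes:

```cpp
#include <cassert>
#include <vector>

// Classic lower_bound over a supposedly sorted vector. If the ordering
// invariant is violated (here 30 appears before 20), the search can
// land on the later, larger element and skip the one we wanted --
// analogous to the cursor selecting the row tombstone's position and
// skipping the row marker.
inline int lower_bound_index(const std::vector<int>& v, int key) {
    int lo = 0, hi = static_cast<int>(v.size());
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (v[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```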

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>

(cherry picked from commit 9272e74e8c)
2021-03-24 10:42:11 +02:00
Nadav Har'El
2f4a3c271c storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the
useless explanation "<no exception>". The problem was a typo in the
logging call, which used std::current_exception() - which was not
relevant at that point - instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
(cherry picked from commit d73934372d)
2021-03-21 10:51:36 +02:00
Raphael S. Carvalho
6a11c20b4a LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Reshape's relaxed mode, used during initialization, only tolerates
min_threshold (default: 4) sstables in level 0. However, relaxed mode
should tolerate more L0 sstables, otherwise boot will have to reshape
level 0 every time it crosses the min threshold. So let's make LCS
reshape tolerate up to the maximum of max_threshold and 32. This change
is beneficial because once the table is populated, LCS regular
compaction can decide to merge those level-0 sstables into level 1
instead, thereby reducing write amplification (WA).
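The new tolerance can be sketched as (illustrative, not the exact Scylla code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Relaxed-mode L0 tolerance for LCS reshape: allow up to
// max(max_threshold, 32) sstables in level 0 before reshaping.
inline std::size_t relaxed_l0_limit(std::size_t max_threshold) {
    return std::max<std::size_t>(max_threshold, 32);
}
```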

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
(cherry picked from commit e53cedabb1)
2021-03-18 19:20:10 +02:00
Raphael S. Carvalho
cccdd6aaae compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a
given shard. Since then, all tables belonging to a given keyspace are
cleaned up in parallel. Cleanup serialization on each shard was
enforced with a semaphore, which was incorrectly removed by the
aforementioned patch.

As a result, the disk space required for cleanup to succeed can be up
to the size of the whole keyspace, increasing the chances of the node
running out of space.

The node could also run out of memory if there are many tables in the
keyspace. The memory requirement is at least #_of_tables * 128k (not
taking write-behind etc. into account). With 5k tables, it's ~0.64G per
shard.
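The back-of-envelope arithmetic (using the decimal units of the commit message, 128k per table):

```cpp
#include <cassert>

// 5000 tables x ~128k of rewrite buffers each ~= 0.64G per shard.
constexpr unsigned long long tables = 5000;
constexpr unsigned long long per_table_bytes = 128'000;
constexpr unsigned long long total_bytes = tables * per_table_bytes;
static_assert(total_bytes == 640'000'000ULL, "~0.64G per shard");
```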

Also, all tables being cleaned up in parallel compete for the same
disk and CPU bandwidth, making them all much slower and consequently
increasing the operation time significantly.

This problem was detected with cleanup, but scrub and upgrade go
through the same rewrite procedure, so they are affected by exactly the
same problem.
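The per-shard serialization can be sketched with a one-unit lock (a minimal stand-in for the Seastar semaphore the patch restores; names are illustrative):

```cpp
#include <cassert>
#include <mutex>

// Minimal stand-in for the per-shard semaphore: only one
// cleanup/scrub/upgrade rewrite runs at a time on a shard, bounding
// the extra disk space and memory to roughly one table's worth.
class rewrite_serializer {
    std::mutex _m; // stand-in for a one-unit seastar::semaphore
public:
    template <typename Func>
    auto with_lock(Func&& f) {
        std::lock_guard<std::mutex> g(_m); // serializes rewrites on this shard
        return f();
    }
};
```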

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
(cherry picked from commit 7171244844)
2021-03-18 14:29:38 +02:00
Raphael S. Carvalho
92871a88c3 compaction: Prevent cleanup and regular from compacting the same sstable
Due to a regression introduced by 463d0ab, regular compaction can
compact, in parallel, an sstable that is already being compacted by
cleanup, scrub or upgrade.

This redundancy wastes resources and increases both write
amplification and the operation time.

That's a potential source of data resurrection: data no longer owned
by the node, from an sstable being compacted by both cleanup and
regular compaction, will still exist in the node afterwards, so
resurrection can happen if the node regains ownership.

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
(cherry picked from commit 2cf0c4bbf1)

Includes fixup patch:

compaction_manager: Fix use-after-free in rewrite_sstables()

Use-after-free introduced by 2cf0c4bbf1: `compacting` is moved into
the then_wrapped() lambda, so it's potentially freed on the next
iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
(cherry picked from commit f7cc431477)
2021-03-11 08:24:56 +02:00
Benny Halevy
85bbf6751d repair: repair_writer: do not capture lw_shared_ptr cross-shard
The shared_from_this lw_shared_ptr must not be accessed
across shards.  Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that, since the captured lw_shared_ptr
is copied on other shards, and this ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out of sync when incremented/decremented in parallel
on other shards with no synchronization).

This was introduced in 289a08072a.

The writer is not needed in the body of this lambda anyway, so the
lambda doesn't need to capture it. The writer is already held by the
continuations until the end of the chain.
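A minimal single-threaded sketch of the fixed pattern, using std::shared_ptr as a stand-in for lw_shared_ptr (names and the scenario are illustrative): the caller's continuation chain keeps the only owning reference alive, and the lambda captures only a raw pointer, so no reference-count traffic happens on other shards:

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct repair_writer { int rows = 0; };

// Capturing the shared pointer would copy it wherever the lambda is
// copied -- in Seastar, potentially another shard, corrupting the
// non-atomic reference count. Instead, capture a raw pointer; this is
// safe because the owner (held by the caller's continuation chain)
// outlives the lambda.
inline int run_with_writer(std::shared_ptr<repair_writer> w,
                           const std::function<void(repair_writer*)>& work) {
    auto task = [raw = w.get()](const std::function<void(repair_writer*)>& f) {
        f(raw); // no refcount traffic here
    };
    task(work);
    return w->rows; // w still owns the writer
}
```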

Fixes #7535

Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
(cherry picked from commit f93fb55726)
2021-03-03 21:27:44 +02:00
Hagit Segev
0ac069fdcc release: prepare for 4.2.4 2021-03-02 14:52:31 +02:00
Avi Kivity
738f8eaccd Update seastar submodule
* seastar 1266e42c82...0fba7da929 (1):
  > io_queue: Fix "delay" metrics

Fixes #8166.
2021-03-01 13:59:02 +02:00
Avi Kivity
5d32e91e16 Update seastar submodule
* seastar f760efe0a0...1266e42c82 (1):
  > rpc: streaming sink: order outgoing messages

Fixes #7552.
2021-03-01 12:22:17 +02:00
Benny Halevy
6c5f6b3f69 large_data_handler: disable deletion of large data entries
Currently we decide whether to delete large data entries
based on the overall sstable data_size, since the entries
themselves are typically much smaller than the whole sstable
(especially cells and rows), this causes overzealous
deletions (#7668) and inefficiency in the rows cache
due to the large number of range tombstones created.

Refs #7575

Test: sstable_3_x_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

This patch is targeted at branch-4.3 or earlier.
In 4.4, the problem was fixed in #7669, but that fix
is out of scope for backporting.

Branch: 4.3
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201203130018.1920271-1-bhalevy@scylladb.com>
(cherry picked from commit bb99d7ced6)
2021-03-01 10:54:33 +02:00
Raphael S. Carvalho
fba26b78d2 sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (this can happen with data segregation).
When resizing candidates, the size of multi_window was incorrectly
used; it was always empty in this path, which means candidates was
always cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
(cherry picked from commit 21608bd677)
2021-02-28 16:43:02 +02:00
Pavel Solodovnikov
06e785994f large_data_handler: fix segmentation fault when constructing data_value from a nullptr
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:

In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.

These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.

This picks the `const char*` overload, which delegates to the
`std::string_view` ctor. It is UB to pass a `nullptr` pointer to the
`std::string_view` ctor, hence the segmentation faults in the
aforementioned large data reporting code.

What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.
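A minimal sketch of the fix, using a simplified stand-in for data_value (not the actual Scylla class):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

struct data_value {
    std::optional<std::string> value; // nullopt models a CQL null

    // const char* overload: passing a null pointer here would construct
    // a std::string_view from nullptr, which is undefined behavior.
    data_value(const char* s) : value(std::string(std::string_view(s))) {}

    // Dedicated std::nullptr_t overload: wins overload resolution for a
    // literal nullptr and produces a null value instead of UB.
    data_value(std::nullptr_t) : value(std::nullopt) {}

    bool is_null() const { return !value.has_value(); }
};
```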

A regression test is provided for the issue (written in
`cql-pytest` framework).

Tests: test/cql-pytest/test_large_cells_rows.py

Fixes: #6780

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 219ac2bab5)
2021-02-23 12:14:12 +02:00
Takuya ASADA
5bc48673aa scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_partitions() reports that / is /dev/root,
aws_instance mistakenly reports the root partition as part of the
ephemeral disks, and RAID construction fails.
This change prevents the error and reports the correct free disks.

Fixes #8055

Closes #8040

(cherry picked from commit 32d4ec6b8a)
2021-02-21 16:23:45 +02:00
Nadav Har'El
59a01b2981 alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted in a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.
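A sketch of the operand-origin idea (illustrative C++, not the actual Alternator code): comparison helpers receive flags saying where each operand came from, and only query-side type errors become hard errors:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

struct validation_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class dyn_type { number, string, list }; // only number comparable here

// Hypothetical check_compare()-style helper: a non-comparable type is a
// hard ValidationException only when it came from the query; when it
// came from the stored item, the comparison just doesn't match.
inline bool check_lt(dyn_type t1, double v1, bool from_query1,
                     dyn_type t2, double v2, bool from_query2) {
    auto comparable = [](dyn_type t) { return t == dyn_type::number; };
    if (!comparable(t1)) {
        if (from_query1) throw validation_exception("bad operand type in query");
        return false; // item-side bad type: silently no match
    }
    if (!comparable(t2)) {
        if (from_query2) throw validation_exception("bad operand type in query");
        return false;
    }
    return v1 < v2;
}
```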

The old duplicated code for check_BEGINS_WITH() - which a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 653610f4bc)
2021-02-21 10:06:50 +02:00
Nadav Har'El
5dd49788c1 alternator: fix UpdateItem ADD for non-existent attribute
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).

We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.

Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(

Fixes #7763.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
(cherry picked from commit a8fdbf31cd)
2021-02-21 08:58:49 +02:00
Benny Halevy
56cbc9f3ed stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
(cherry picked from commit d01e7e7b58)
2021-02-14 13:11:43 +02:00
Avi Kivity
7469896017 table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
on_compaction_completion() updates _sstables_compacted_but_not_deleted
through a temporary to avoid an exception causing a partial update:

  1. copy _sstables_compacted_but_not_deleted to a temporary
  2. update temporary
  3. do dangerous stuff
  4. move temporary to _sstables_compacted_but_not_deleted

This is racy when we have parallel compactions, since step 3 yields.
We can have two invocations running in parallel, taking snapshots
of the same _sstables_compacted_but_not_deleted in step 1, each
modifying it in different ways, and only one of them winning the
race and assigning in step 4. With the right timing we can end
with extra sstables in _sstables_compacted_but_not_deleted.

Before a5369881b3, this was a benign race (only resulting in
deleted file space not being reclaimed until the service is shut
down), but afterwards, extra sstable references result in the service
refusing to shut down. This was observed in database_test in debug
mode, where the race more or less reliably happens for system.truncated.

Fix by using a different method to protect
_sstables_compacted_but_not_deleted. We unconditionally update it,
and also unconditionally fix it up (on success or failure) using
seastar::defer(). The fixup includes a call to rebuild_statistics()
which must happen every time we touch the sstable list.
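The defer-based pattern can be sketched with a plain scope guard (a stand-in for seastar::defer; names and the fixup are illustrative):

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for seastar::defer: runs the fixup on scope exit, whether we
// leave by return or by exception.
struct defer_guard {
    std::function<void()> fixup;
    ~defer_guard() { fixup(); }
};

// Mutate the shared list in place and unconditionally repair it (in the
// real code, also rebuild_statistics()) afterwards -- instead of
// copy/modify/move-back, which races when the "dangerous" step yields
// to another invocation.
inline bool compact_once(std::vector<std::string>& compacted_but_not_deleted,
                         const std::string& sst,
                         const std::function<void()>& dangerous_step) {
    compacted_but_not_deleted.push_back(sst);
    defer_guard g{[&] {
        // fixup runs on success *and* failure
        compacted_but_not_deleted.pop_back();
    }};
    dangerous_step(); // may throw or yield; the guard still runs
    return true;
}
```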

Ref #7331.
Fixes #8038.

BACKPORT NOTES:
- Turns out this race prevented deletion of expired sstables because the leaked
deleted sstables would be accounted when checking if an expired sstable can
be purged.
- Switch to unordered_set<>::count() as it's not supported by older compilers.

(cherry picked from commit a43d5079f3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210212203832.45846-1-raphaelsc@scylladb.com>
2021-02-14 11:35:57 +02:00
Piotr Wojtczak
c7e2711dd4 Validate ascii values when creating from CQL
Although the code for it already existed, the validation function was
not being invoked. This change fixes that, adding a validating check
when converting from text to the specific value type and throwing a
marshal exception if some characters are not ASCII.
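The check itself is simple; a sketch (illustrative, not the actual Scylla function):

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string_view>

struct marshal_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Reject any byte outside the 7-bit ASCII range when converting text to
// an ascii-typed value.
inline void validate_ascii(std::string_view v) {
    if (!std::all_of(v.begin(), v.end(),
                     [](unsigned char c) { return c < 0x80; })) {
        throw marshal_exception("Invalid ASCII character in string literal");
    }
}
```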

Fixes #5421

Closes #7532

(cherry picked from commit caa3c471c0)
2021-02-10 19:37:56 +02:00
Piotr Dulikowski
a2355a35db hinted handoff: use default timeout for sending orphaned hints
This patch causes orphaned hints (hints that were written towards a node
that is no longer their replica) to be sent with a default write
timeout. This is what is currently done for non-orphaned hints.

Previously, the timeout was hardcoded to one hour. This could cause a
long delay while shutting down, as the hints manager waits until all
ongoing hint-sending operations finish before stopping itself.

Fixes: #7051
(cherry picked from commit b111fa98ca)
2021-02-10 10:15:01 +02:00
Piotr Sarna
9e225ab447 Merge 'select_statement: Fix aggregate results on indexed selects (timeouts fixed) ' from Piotr Grabowski
Overview
Fixes #7355.

Before these changes, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).

Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves the fixing of that issue forward, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.

It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page).

GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
 pk
----
  1
  1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.

Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).

The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
 count
-------
     2
```
The second commit fixes this issue by properly trimming `row_ranges`.

The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`.

The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query).

The second commit fixes the first two failing examples from issue #7355.

Tests (fourth commit)
Extensively tests queries on tables with secondary indices with aggregates and `GROUP BY`s.

Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).

I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact of those fixed bugs, therefore the tests validate a large number of those scenarios.

Configurable internal_paging_size (third commit)
Before this change, the internal `page_size` used for aggregate, `GROUP BY` or non-paged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). This change adds a new internal_paging_size variable, configurable via the `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.

Closes #7497

* github.com:scylladb/scylla:
  tests: Add secondary index aggregates tests
  select_statement: Introduce internal_paging_size
  select_statement: Fix paging on indexed selects
  select_statement: Fix GROUP BY on indexed select

(cherry picked from commit 8c645f74ce)
2021-02-08 20:32:36 +02:00
Amnon Heiman
e1205d1d5b API: Fix aggregation in column_family
A few methods in the column_family API were doing the aggregation
wrong, specifically the bloom filter disk size.

The issue is not always visible; it happens when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007

(cherry picked from commit 4498bb0a48)
2021-02-08 17:04:27 +02:00
Avi Kivity
a78402efae Merge 'Add waiting for flushes on table drops' from Piotr Sarna
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables default to `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush  (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault
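The phaser the series adds can be sketched as a tiny gate (a minimal stand-in for the Seastar gate/phaser; names are illustrative): every flush enters before touching memtables and leaves when done, and the drop path waits for the gate to empty:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Minimal stand-in for the flush phaser: each flush enters the gate
// before touching memtables and leaves when done; dropping a table
// blocks until every in-flight flush has left, so a flush can never
// wake up on freed memtables.
class flush_gate {
    std::mutex _m;
    std::condition_variable _cv;
    int _in_flight = 0;
public:
    void enter() {
        std::lock_guard<std::mutex> g(_m);
        ++_in_flight;
    }
    void leave() {
        std::lock_guard<std::mutex> g(_m);
        --_in_flight;
        _cv.notify_all();
    }
    void wait_for_drop() { // called by the drop path
        std::unique_lock<std::mutex> l(_m);
        _cv.wait(l, [this] { return _in_flight == 0; });
    }
    int in_flight() {
        std::lock_guard<std::mutex> g(_m);
        return _in_flight;
    }
};
```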

Tests: manual (artificially preempting for a long time at step 2 to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)

Fixes #7792

Closes #7798

* github.com:scylladb/scylla:
  database: add flushes to waiting for pending operations
  table: unify waiting for pending operations
  database: add a phaser for flush operations
  database: add waiting for pending streams on table drop

(cherry picked from commit 7636799b18)
2021-02-02 17:23:34 +02:00
Avi Kivity
9fcf790234 row_cache: linearize key in cache_entry::do_read()
do_read() does not linearize cache_entry::_key; this can cause a crash
with keys larger than 13k.
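A sketch of what linearization means here (illustrative only): large keys are stored as multiple fragments, and code needing a contiguous view must copy the fragments into one buffer rather than assume the first fragment is the whole key:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Large keys are stored fragmented; linearize() produces the contiguous
// bytes that comparison/reading code must operate on.
inline std::string linearize(const std::vector<std::string>& fragments) {
    std::string out;
    for (const auto& f : fragments) {
        out.append(f);
    }
    return out;
}
```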

Fixes #7897.

Closes #7898

(cherry picked from commit d508a63d4b)
2021-01-17 09:30:44 +02:00
Hagit Segev
24346215c2 release: prepare for 4.2.3 2021-01-04 19:51:12 +02:00
Benny Halevy
918ec5ecb3 compaction: compaction_writer: destroy shared_sstable after the sstable_writer
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer, we might hit a
use-after-free as in the following case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
 (inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
 (inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
 (inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
 (inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
 (inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
 (inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
 (inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
 (inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
 (inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
 (inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
 (inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
 (inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
 (inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
 (inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
 (inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
 (inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
 (inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
 (inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```

What happens here is that:

    compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
            : _out(std::move(out))
            , _compression_metadata(cm)
            , _offsets(_compression_metadata->offsets.get_writer())
            , _compression(lc)
            , _full_checksum(ChecksumType::init_checksum())

_compression_metadata points to a buffer held by the sstable object,
and _compression_metadata->offsets.get_writer() returns a writer that keeps
a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.

Fixes #7821

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
(cherry picked from commit 8a745a0ee0)
2021-01-04 15:04:34 +02:00
Avi Kivity
7457683328 Revert "Merge 'Move temporaries to value view' from Piotr S"
This reverts commit d1fa0adcbe. It causes
a regression when processing some bind variables.

Fixes #7761.
2020-12-24 12:40:46 +02:00
Gleb Natapov
567889d283 mutation_writer: pass exceptions through feed_writer
feed_writer() eats the exception and transforms it into an end of stream
instead. Downstream validators hate when this happens.
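The failure mode can be sketched in a few lines (a hypothetical illustration, not the actual mutation_writer code):

```python
# Minimal sketch: a wrapper that swallows exceptions turns an error
# into a silent end-of-stream, so consumers see a truncated stream
# that looks complete.
def feed_writer_buggy(source):
    try:
        for fragment in source:
            yield fragment
    except Exception:
        return  # bug: the error becomes an ordinary EOS

def feed_writer_fixed(source):
    # Let the exception propagate so downstream consumers see the failure.
    for fragment in source:
        yield fragment

def broken_source():
    yield 1
    yield 2
    raise IOError("disk error")

assert list(feed_writer_buggy(broken_source())) == [1, 2]  # error hidden
try:
    list(feed_writer_fixed(broken_source()))
    raised = False
except IOError:
    raised = True
assert raised
```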

Fixes #7482
Message-Id: <20201216090038.GB3244976@scylladb.com>

(cherry picked from commit 61520a33d6)
2020-12-16 17:20:11 +02:00
Aleksandr Bykov
c605ed73bf dist: scylla_util: fix aws_instance.ebs_disks method
The aws_instance.ebs_disks() method should return EBS disks
instead of ephemeral ones.

Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com>

Closes #7780

(cherry picked from commit e74dc311e7)
2020-12-16 11:58:47 +02:00
Takuya ASADA
d0530d8ac2 node_exporter_install: stop service before force installing
Stop node-exporter.service before re-installing it, to avoid a 'Text file busy' error.

Fixes #6782

(cherry picked from commit ef05ea8e91)
2020-12-15 16:28:25 +02:00
Hagit Segev
696ef24226 release: prepare for 4.2.2 2020-12-13 20:34:03 +02:00
Avi Kivity
b8fe144301 dist: rpm: uninstall tuned when installing scylla-kernel-conf
tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).

Not needed for .deb, since Debian/Ubuntu do not install tuned by
default.

Fixes #7696

Closes #7776

(cherry picked from commit 615b8e8184)
2020-12-12 14:30:38 +02:00
Nadav Har'El
62f783be87 alternator: fix broken Scan/Query paging with bytes keys
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail - we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page), and this incorrect function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).
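The bytes case can be illustrated with a small sketch (hypothetical helper names, not Alternator's actual serialization code):

```python
import base64
import json

def encode_key_attr(attr_type, value):
    # Strings and numbers go into the JSON as-is, but bytes must be
    # base64-encoded ("B" type) to survive JSON transport.
    if attr_type in ("S", "N"):
        return {attr_type: value}
    if attr_type == "B":
        return {"B": base64.b64encode(value).decode("ascii")}
    raise ValueError(f"unsupported type {attr_type}")

# A LastEvaluatedKey with a bytes partition key round-trips through JSON:
last_key = {"p": encode_key_attr("B", b"\x00\xffkey")}
json.dumps(last_key)  # raw bytes here would raise a TypeError
assert base64.b64decode(last_key["p"]["B"]) == b"\x00\xffkey"
```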

This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.

Fixes #7768

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
(cherry picked from commit 86779664f4)
2020-12-09 15:16:41 +02:00
Piotr Sarna
863e784951 db: fix getting local ranges for size estimates table
When getting local ranges, an assumption is made that
if a range does not contain an end or when its end is a maximum token,
then it must contain a start. This assumption was proven untrue
during manual tests, so it's now fortified with an additional check.

Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:

(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
            _data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
      _start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
            _kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}
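The fortified check can be sketched as follows (hypothetical shape, simplified from the optional-bound token ranges in the gdb dump above): a wrapping range may lack a start, an end, or both, so neither bound may be dereferenced unconditionally.

```python
def split_point(rng):
    # rng is (start, end); each bound is None when absent, mirroring
    # the std::optional<interval_bound> fields shown in the gdb output.
    start, end = rng
    if end is not None:
        return end
    if start is not None:   # the old code assumed this unconditionally
        return start
    return None             # full ring: neither bound present

assert split_point((None, 7)) == 7
assert split_point((3, None)) == 3
assert split_point((None, None)) is None  # no assertion failure
```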

Closes #7764

(cherry picked from commit 1cc4ed50c1)
2020-12-09 15:16:14 +02:00
Nadav Har'El
e5a6199b4d alternator, test: make test_fetch_from_system_tables faster
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.

This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.

As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.

Fixes #7706

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
(cherry picked from commit 220d6dde17)
2020-12-09 15:15:15 +02:00
Nadav Har'El
abaf6c192a alternator: fix query with both projection and filtering
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.

The solution in this patch is to add the generated JSON item also
the extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.

The two tests

 test_query_filter.py::test_query_filter_and_attributes_to_get
 test_filter_expression.py::test_filter_expression_and_projection_expression

which failed before this patch now pass, so we drop their "xfail" tag.

Fixes #6951.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 282742a469)
2020-12-09 14:39:17 +02:00
Eliran Sinvani
ef2f5ed434 consistency level: fix wrong quorum calculation when RF = 0
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0; in this situation the quorum should
also be 0.
This commit adds the missing corner case.
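The corrected formula, as a one-line sketch:

```python
def quorum_for(rf):
    # floor(rf/2) + 1, except that an empty replica set needs zero acks.
    return 0 if rf == 0 else rf // 2 + 1

assert quorum_for(0) == 0  # the corner case this commit fixes
assert quorum_for(1) == 1
assert quorum_for(3) == 2
assert quorum_for(5) == 3
```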

Tests: Unit Tests (dev)
Fixes #6905

Closes #7296

(cherry picked from commit 925cdc9ae1)
2020-11-29 16:45:14 +02:00
Raphael S. Carvalho
bac40e2512 sstable_directory: Fix 50% space requirement for resharding
This is a regression caused by aebd965f0.

After the sstable_directory changes, resharding now waits for all sstables
to be exhausted before releasing reference to them, which prevents their
resources like disk space and fd from being released. Let's restore the
old behavior of incrementally releasing resources, reducing the space
requirement significantly.

Fixes #7463.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com>
(cherry picked from commit 6f805bd123)
2020-11-29 15:26:14 +02:00
Asias He
681c4d77bb repair: Make repair_writer a shared pointer
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:

class repair_writer {
   _writer_done[node_idx] =
      mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
         ...
      }).handle_exception([this] {
         ...
      });
}

The fiber accesses the repair_writer object in the error handling path. We
wait for the _writer_done to finish before we destroy repair_meta
object which contains the repair_writer object to avoid the fiber
accessing already freed repair_writer object.

To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.

Fixes #7406

Closes #7430

(cherry picked from commit 289a08072a)
2020-11-29 13:30:49 +02:00
Pavel Emelyanov
8572ee9da2 query_pager: Fix continuation handling for noop visitor
Before updating the _last_[cp]key (for subsequent .fetch_page())
the pager checks 'if the pager is not exhausted OR the result
has data'.

The check seems broken: if the pager is not exhausted, but the
result is empty the call for keys will unconditionally try to
reference the last element from empty vector. The not exhausted
condition for empty result can happen if the short_read is set,
which, in turn, unconditionally happens upon meeting partition
end when visiting the partition with result builder.

The correct check should be 'if the pager is not exhausted AND
the result has data': the _last_[pc]key-s should be taken for
continuation (not exhausted), but can only be taken if the result is
not empty (has data).
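The corrected condition can be sketched as (hypothetical names):

```python
def should_take_last_key(exhausted, page_rows):
    # AND, not OR: only dereference page_rows[-1] when the pager can
    # continue AND the page actually produced rows.
    return (not exhausted) and len(page_rows) > 0

# Short read at a partition end: not exhausted, but the page is empty;
# the old OR check would have indexed into an empty vector here.
assert not should_take_last_key(False, [])
assert should_take_last_key(False, ["row1"])
assert not should_take_last_key(True, ["row1"])
```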

fixes: #7263
tests: unit(dev), but tests don't trigger this corner case

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921124329.21209-1-xemul@scylladb.com>
(cherry picked from commit 550fc734d9)
2020-11-29 12:01:37 +02:00
Takuya ASADA
95dbac56e5 install.sh: set PATH for relocatable CLI tools in python thunk
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but that doesn't work for perftune.py, since it's not part of
Scylla and does not use the scylla_util module.
We can set PATH in the python thunk instead, which sets it for all python scripts.

Fixes #7350

(cherry picked from commit 5867af4edd)
2020-11-29 11:54:42 +02:00
Bentsi Magidovich
eeadeff0dc scylla_util.py: fix exception handling in curl
The retry mechanism didn't work when a URLError happened. For example:

  urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>

Let's catch URLError instead of HTTPError, since URLError is the base exception
for all exceptions in the urllib module.

Fixes: #7569

Closes #7567

(cherry picked from commit 956b97b2a8)
2020-11-29 11:48:30 +02:00
Takuya ASADA
62f3caab18 dist/redhat: packaging dependencies.conf as normal file, not ghost
When we introduced dependencies.conf, we mistakenly added it to the rpm as %ghost,
but it should be a normal file, installed normally on package installation.

Fixes #7703

Closes #7704

(cherry picked from commit ba4d54efa3)
2020-11-29 11:40:22 +02:00
Takuya ASADA
1a4869231a install.sh: apply sysctl.d files on non-packaging installation
We don't apply sysctl.d files on non-packaging installations; apply them
just like the rpm/deb packaging does.

Fixes #7702

Closes #7705

(cherry picked from commit 5f81f97773)
2020-11-29 11:35:37 +02:00
Avi Kivity
3568d0cbb6 dist: sysctl: configure more inotify instances
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.

This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.

Increase to 1200, which is enough for 6 instances * 200 shards.
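The arithmetic behind both limits:

```python
default_limit = 128
instances_per_shard = 2          # one inotify instance each for rpc and cql
assert default_limit // instances_per_shard == 64   # shards before startup fails

new_limit = 6 * 200              # 6 inotify instances * 200 shards
assert new_limit == 1200
```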

Fixes #7700.

Closes #7701

(cherry picked from commit 390e07d591)
2020-11-29 11:04:45 +02:00
Raphael S. Carvalho
030c2e3270 compaction: Make sure a partition is filtered out only by producer
If an interposer consumer is enabled, partition filtering will be done by the
consumer instead. But that's not possible, because only the producer is able
to skip to the next partition when the current one is filtered out, so Scylla
crashes with a bad function call in queue_reader when that happens.
This is a regression which started here: 55a8b6e3c9

To fix this problem, let's make sure that partition filtering will only
happen on the producer side.

Fixes #7590.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
(cherry picked from commit 13fa2bec4c)
2020-11-19 14:08:25 +02:00
Piotr Dulikowski
37a5e9ab15 hints: don't read hint files when it's not allowed to send
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:

- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
  target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
  this operation after sleeping 1 second.

This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.

This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.

First observed in #6964

Tests:
- unit(dev)
- hinted handoff dtests

Closes #7407

(cherry picked from commit 77a0f1a153)
2020-11-16 14:30:07 +02:00
Botond Dénes
a15b5d514d mutation_reader: queue_reader: don't set EOS flag on abort
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.

Additionally make sure the producer is aborted too on abort. In theory
this is not needed as they are the one initiating the abort, but better
to be safe than sorry.

Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
(cherry picked from commit f5323b29d9)
2020-11-15 11:07:38 +02:00
Botond Dénes
064f8f8bcf types: validate(): linearize values lazily
Instead of eagerly linearizing all values as they are passed to
validate(), defer linearization to those validators that actually need
linearized values. Linearizing large values puts pressure on the memory
allocator with large contiguous allocation requests. This is something
we are trying to actively avoid, especially if it is not really needed.
It turns out that the types whose validators really need linearized values are
a minority, as most validators just look at the size of the value, and
some like bytes don't need validation at all, while usually having large
values.

This is achieved by templating the validator struct on the view and
using the FragmentedRange concept to treat all passed in views
(`bytes_view` and `fragmented_temporary_buffer_view`) uniformly.
This patch makes no attempt at converting existing validators to work
with fragmented buffers, only trivial cases are converted. The major
offenders still left are ascii/utf8 and collections.
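The distinction can be sketched with fragmented values (a list of byte chunks): size-only validators never linearize, while content validators join the fragments lazily, only when actually required.

```python
def validate_int32(fragments):
    # Size-only check: no need to copy fragments into one buffer.
    return sum(len(f) for f in fragments) == 4

def validate_utf8(fragments):
    # Content check: linearize lazily, only here where it's needed.
    try:
        b"".join(fragments).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert validate_int32([b"\x00\x00", b"\x00\x01"])
assert not validate_int32([b"\x00"])
assert validate_utf8([b"\xc3", b"\xa9"])   # one code point split in two
assert not validate_utf8([b"\xff"])
```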

Fixes: #7318

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201007054524.909420-1-bdenes@scylladb.com>
(cherry picked from commit db56ae695c)
2020-11-11 10:55:54 +02:00
Amnon Heiman
04fe0a7395 scyllatop/livedata.py: Safe iteration over metrics
This patch changes the code that iterates over the metrics to use a copy
of the metrics names to make it safe to remove the metrics from the
metrics object.

Fixes #7488

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 52db99f25f)
2020-11-08 19:16:13 +02:00
Calle Wilund
ef26d90868 partition_version: Change range_tombstones() to return chunked_vector
Refs #7364

The number of tombstones can be large. As a stopgap measure to
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.

Closes #7433

(cherry picked from commit 4b65d67a1a)
2020-11-08 14:38:32 +02:00
Tomasz Grabiec
790f51c210 sstables: ka/la: Fix abort when next_partition() is called with certain reader state
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().

The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:

  scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.

This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.

The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.
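A minimal sketch of the fix (a hypothetical Python mirror of the C++ consumer state, not the actual mp_row_consumer):

```python
class RowConsumer:
    def __init__(self):
        self._ready = None           # fragment not yet pushed to the buffer
        self._range_tombstones = []

    def on_next_partition(self):
        self._range_tombstones = []  # was already cleared here
        self._ready = None           # the fix: drop the held-back fragment

consumer = RowConsumer()
consumer._ready = "last fragment of previous partition"
consumer.on_next_partition()
assert consumer._ready is None       # nothing carried into the new partition
```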

Fixes #7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit fb9b5cae05)
2020-11-08 14:25:47 +02:00
Yaron Kaikov
4fb8ebccff release: prepare for 4.2.1 2020-11-08 12:41:06 +02:00
Avi Kivity
d1fa0adcbe Merge 'Move temporaries to value view' from Piotr S
"
Issue https://github.com/scylladb/scylla/issues/7019 describes a problem of an ever-growing map of temporary values stored in query_options. In order to mitigate this kind of problems, the storage for temporary values is moved from an external data structure to the value views itself. This way, the temporary lives only as long as it's accessible and is automatically destroyed once a request finishes. The downside is that each temporary is now allocated separately, while previously they were bundled in a single byte stream.

Tests: unit(dev)
Fixes https://github.com/scylladb/scylla/issues/7019
"

7055297649 ("cql3: remove query_options::linearize and _temporaries")
is reverted from this backport since linearize() is still used in
this branch.

* psarna-move_temporaries_to_value_view:
  cql3: remove query_options::linearize and _temporaries
  cql3: remove make_temporary helper function
  cql3: store temporaries in-place instead of in query_options
  cql3: add temporary_value to value view
  cql3: allow moving data out of raw_value
  cql3: split values.hh into a .cc file

(cherry picked from commit 2b308a973f)
2020-11-05 19:24:23 +02:00
Piotr Sarna
46b56d885e schema_tables: fix fixing old secondary index schemas
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.

Fixes #7515

Closes #7516

(cherry picked from commit b66c285f94)
2020-11-05 17:53:08 +02:00
64 changed files with 1392 additions and 318 deletions


@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=4.2.0
VERSION=4.2.4
if test -f version
then


@@ -159,23 +159,40 @@ static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException", format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error("ValidationException", format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
}
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error("ValidationException", "begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error("ValidationException", format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error("ValidationException", "begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error("ValidationException", format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
}
if (bad) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
@@ -279,24 +296,38 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparison operators - lt, le, gt, ge, and between.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error("ValidationException", format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error("ValidationException", format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (bad) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -310,7 +341,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
}
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
return false;
}
@@ -341,57 +373,71 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
template <typename T>
static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
if (cmp_lt()(ub, lb)) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
if (bounds_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error("ValidationException", "between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
throw api_error(
"ValidationException",
if (bounds_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
}
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
if (v_from_query) {
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -438,19 +484,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -462,7 +508,8 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
case comparison_operator_type::CONTAINS:
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -574,7 +621,8 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -605,13 +653,17 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));


@@ -52,6 +52,7 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,


@@ -1795,7 +1795,8 @@ static std::string get_item_type_string(const rjson::value& v) {
// calculate_attrs_to_get() takes either AttributesToGet or
// ProjectionExpression parameters (having both is *not* allowed),
// and returns the list of cells we need to read.
// and returns the list of cells we need to read, or an empty set when
// *all* attributes are to be returned.
// In our current implementation, only top-level attributes are stored
// as cells, and nested documents are stored serialized as JSON.
// So this function currently returns only the top-level attributes
@@ -2133,19 +2134,30 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value result;
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
// An ADD can be used to create a new attribute (when
// v1.IsNull()) or to add to a pre-existing attribute:
if (v1.IsNull()) {
std::string v2_type = get_item_type_string(v2);
if (v2_type == "N" || v2_type == "SS" || v2_type == "NS" || v2_type == "BS") {
result = v2;
} else {
throw api_error("ValidationException", format("An operand in the update expression has an incorrect data type: {}", v2));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error("ValidationException", format("An operand in the update expression has an incorrect data type: {}", v1));
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error("ValidationException", format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error("ValidationException", format("An operand in the update expression has an incorrect data type: {}", v1));
}
}
do_update(to_bytes(column_name), result);
},
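The rewritten ADD branch above distinguishes two cases: creating a new attribute (only numbers and sets are accepted) and adding to an existing one (numeric addition for N, set union for SS/NS/BS). A minimal Python sketch of these type rules, under the assumption of DynamoDB-style single-key value dicts; `add_update` is an illustrative name, not Alternator's actual API:

```python
def add_update(v1, v2):
    """Sketch of DynamoDB-style ADD semantics. v1 is the existing
    attribute value (None if absent), v2 the operand; both are
    single-key dicts like {"N": "1"} or {"SS": ["a"]}."""
    if v1 is None:
        # Creating a new attribute: only numbers and sets are allowed.
        t2 = next(iter(v2))
        if t2 not in ("N", "SS", "NS", "BS"):
            raise ValueError("incorrect data type for ADD: %s" % t2)
        return v2
    t1 = next(iter(v1))
    if t1 == "N":
        if next(iter(v2)) != "N":
            raise ValueError("expected operand of type N")
        return {"N": str(float(v1["N"]) + float(v2["N"]))}
    if t1 in ("SS", "NS", "BS"):
        if next(iter(v2)) != t1:
            raise ValueError("expected operand of type %s" % t1)
        return {t1: sorted(set(v1[t1]) | set(v2[t1]))}
    raise ValueError("incorrect data type for ADD: %s" % t1)
```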
@@ -2459,6 +2471,10 @@ public:
std::unordered_set<std::string>& used_attribute_values);
bool check(const rjson::value& item) const;
bool filters_on(std::string_view attribute) const;
// for_filters_on() runs the given function on the attributes that the
// filter works on. It may run for the same attribute more than once if
// the attribute is used more than once in the filter.
void for_filters_on(const noncopyable_function<void(std::string_view)>& func) const;
operator bool() const { return bool(_imp); }
};
@@ -2539,10 +2555,26 @@ bool filter::filters_on(std::string_view attribute) const {
}, *_imp);
}
void filter::for_filters_on(const noncopyable_function<void(std::string_view)>& func) const {
if (_imp) {
std::visit(overloaded_functor {
[&] (const conditions_filter& f) -> void {
for (auto it = f.conditions.MemberBegin(); it != f.conditions.MemberEnd(); ++it) {
func(rjson::to_string_view(it->name));
}
},
[&] (const expression_filter& f) -> void {
return for_condition_expression_on(f.expression, func);
}
}, *_imp);
}
}
class describe_items_visitor {
typedef std::vector<const column_definition*> columns_t;
const columns_t& _columns;
const std::unordered_set<std::string>& _attrs_to_get;
std::unordered_set<std::string> _extra_filter_attrs;
const filter& _filter;
typename columns_t::const_iterator _column_it;
rjson::value _item;
@@ -2558,7 +2590,20 @@ public:
, _item(rjson::empty_object())
, _items(rjson::empty_array())
, _scanned_count(0)
{ }
{
// _filter.check() may need additional attributes not listed in
// _attrs_to_get (i.e., not requested as part of the output).
// We list those in _extra_filter_attrs. We will include them in
// the JSON but take them out before finally returning the JSON.
if (!_attrs_to_get.empty()) {
_filter.for_filters_on([&] (std::string_view attr) {
std::string a(attr); // no heterogeneous map searches :-(
if (!_attrs_to_get.contains(a)) {
_extra_filter_attrs.emplace(std::move(a));
}
});
}
}
void start_row() {
_column_it = _columns.begin();
@@ -2572,7 +2617,7 @@ public:
result_bytes_view->with_linearized([this] (bytes_view bv) {
std::string column_name = (*_column_it)->name_as_text();
if (column_name != executor::ATTRS_COLUMN_NAME) {
if (_attrs_to_get.empty() || _attrs_to_get.count(column_name) > 0) {
if (_attrs_to_get.empty() || _attrs_to_get.contains(column_name) || _extra_filter_attrs.contains(column_name)) {
if (!_item.HasMember(column_name.c_str())) {
rjson::set_with_string_name(_item, column_name, rjson::empty_object());
}
@@ -2584,7 +2629,7 @@ public:
auto keys_and_values = value_cast<map_type_impl::native_type>(deserialized);
for (auto entry : keys_and_values) {
std::string attr_name = value_cast<sstring>(entry.first);
if (_attrs_to_get.empty() || _attrs_to_get.count(attr_name) > 0) {
if (_attrs_to_get.empty() || _attrs_to_get.contains(attr_name) || _extra_filter_attrs.contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
rjson::set_with_string_name(_item, attr_name, deserialize_item(value));
}
@@ -2596,6 +2641,11 @@ public:
void end_row() {
if (_filter.check(_item)) {
// Remove the extra attributes _extra_filter_attrs which we had
// to add just for the filter, and not requested to be returned:
for (const auto& attr : _extra_filter_attrs) {
rjson::remove_member(_item, attr);
}
rjson::push_back(_items, std::move(_item));
}
_item = rjson::empty_object();
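The mechanism added in the two hunks above — read extra attributes for the filter's benefit, then strip them before returning the row — can be sketched in Python; the helper names are hypothetical, mirroring the constructor and end_row() logic:

```python
def compute_extra_filter_attrs(attrs_to_get, filter_attrs):
    """Attributes the filter needs but the caller did not request.
    An empty attrs_to_get means "return everything", so nothing
    extra is needed in that case."""
    if not attrs_to_get:
        return set()
    return {a for a in filter_attrs if a not in attrs_to_get}

def finish_row(item, passes_filter, extra_filter_attrs):
    """Mirror of end_row(): drop the extra attributes from a row
    that passed the filter, before it is returned to the caller."""
    if not passes_filter(item):
        return None
    for a in extra_filter_attrs:
        item.pop(a, None)
    return item
```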
@@ -2630,7 +2680,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.partition_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_pk_it)));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_pk_it, cdef));
++exploded_pk_it;
}
auto ck = paging_state.get_clustering_key();
@@ -2640,7 +2690,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.clustering_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_ck_it)));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_ck_it, cdef));
++exploded_ck_it;
}
}


@@ -348,6 +348,39 @@ bool condition_expression_on(const parsed::condition_expression& ce, std::string
}, ce._expression);
}
// for_condition_expression_on() runs a given function over all the attributes
// mentioned in the expression. If the same attribute is mentioned more than
// once, the function will be called more than once for the same attribute.
static void for_value_on(const parsed::value& v, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::constant& c) { },
[&] (const parsed::value::function_call& f) {
for (const parsed::value& value : f._parameters) {
for_value_on(value, func);
}
},
[&] (const parsed::path& p) {
func(p.root());
}
}, v._value);
}
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) {
for (const parsed::value& value : cond._values) {
for_value_on(value, func);
}
},
[&] (const parsed::condition_expression::condition_list& list) {
for (const parsed::condition_expression& cond : list.conditions) {
for_condition_expression_on(cond, func);
}
}
}, ce._expression);
}
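The two visitors above recurse mutually: values may contain function calls whose parameters are values, and condition expressions may contain lists of sub-expressions. A Python sketch of the same traversal, modeling parsed nodes (hypothetically) as tagged tuples:

```python
def for_value_on(value, func):
    """Visit attribute names used in a parsed value. Values are
    modeled as ("const", x), ("path", name) or ("call", [params])."""
    kind = value[0]
    if kind == "call":
        for p in value[1]:
            for_value_on(p, func)
    elif kind == "path":
        func(value[1])
    # "const" mentions no attributes.

def for_condition_expression_on(expr, func):
    """Visit every attribute an expression mentions; duplicates are
    reported once per mention, as in the C++ version."""
    kind = expr[0]
    if kind == "cond":      # primitive condition with operand values
        for v in expr[1]:
            for_value_on(v, func)
    else:                   # "list": a list of sub-expressions
        for sub in expr[1]:
            for_condition_expression_on(sub, func)
```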
// The following calculate_value() functions calculate, or evaluate, a parsed
// expression. The parsed expression is assumed to have been "resolved", with
// the matching resolve_* function.
@@ -570,52 +603,8 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
// TODO: There's duplication here with check_BEGINS_WITH().
// But unfortunately, the two functions differ a bit.
// If one of v1 or v2 is malformed or has an unsupported type
// (not B or S), what we do depends on whether it came from
// the user's query (is_constant()), or the item. Unsupported
// values in the query result in an error, but if they are in
// the item, we silently return false (no match).
bool bad = false;
if (!v1.IsObject() || v1.MemberCount() != 1) {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
}
} else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
}
}
bool ret = false;
if (!bad) {
auto it1 = v1.MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name == it2->name) {
if (it2->name == "S") {
std::string_view val1 = rjson::to_string_view(it1->value);
std::string_view val2 = rjson::to_string_view(it2->value);
ret = val1.starts_with(val2);
} else /* it2->name == "B" */ {
ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
}
return to_bool_json(ret);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {

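The duplicated logic removed above is folded into the shared check_BEGINS_WITH(), whose key subtlety is the comment's "bad" handling: a malformed or non-S/B operand raises an error only when it came from the user's query, but silently yields no match when it came from the stored item. A hedged Python sketch of that behavior (representation and names are illustrative):

```python
import base64

def check_begins_with(v1, v2, v1_from_query, v2_from_query):
    """Sketch of check_BEGINS_WITH() semantics: operands are
    single-key dicts like {"S": "abc"} or {"B": <base64>}."""
    def validate(v, from_query):
        # Return the type tag, or None if the value is unusable.
        if not (isinstance(v, dict) and len(v) == 1):
            if from_query:
                raise ValueError("malformed AttributeValue: %r" % (v,))
            return None
        t = next(iter(v))
        if t not in ("S", "B"):
            if from_query:
                raise ValueError("only string or binary supported: %r" % (v,))
            return None
        return t
    t1 = validate(v1, v1_from_query)
    t2 = validate(v2, v2_from_query)
    if t1 is None or t1 != t2:
        return False
    if t1 == "S":
        return v1["S"].startswith(v2["S"])
    # B values are base64-encoded; compare the decoded bytes.
    return base64.b64decode(v1["B"]).startswith(base64.b64decode(v2["B"]))
```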

@@ -27,6 +27,8 @@
#include <unordered_set>
#include <string_view>
#include <seastar/util/noncopyable_function.hh>
#include "expressions_types.hh"
#include "rjson.hh"
@@ -59,6 +61,11 @@ void validate_value(const rjson::value& v, const char* caller);
bool condition_expression_on(const parsed::condition_expression& ce, std::string_view attribute);
// for_condition_expression_on() runs the given function on the attributes
// that the expression uses. It may run for the same attribute more than once
// if the same attribute is used more than once in the expression.
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func);
// calculate_value() behaves slightly different (especially, different
// functions supported) when used in different types of expressions, as
// enumerated in this enum:


@@ -650,7 +650,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -658,7 +658,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
@@ -666,7 +666,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -674,7 +674,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
@@ -682,7 +682,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
@@ -690,7 +690,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});

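Each fix above is the same classic fold bug: the combining lambda passed to std::accumulate ignored its accumulator `s`, so the "sum" collapsed to the last sstable's value. A Python analogue using functools.reduce shows the difference:

```python
from functools import reduce

sizes = [10, 20, 30]

# Buggy: the accumulator s is ignored, so the fold
# returns only the last element's value.
buggy = reduce(lambda s, x: x, sizes, 0)

# Fixed: the accumulator is carried through, yielding the sum.
fixed = reduce(lambda s, x: s + x, sizes, 0)
```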

@@ -59,6 +59,7 @@
#include "db/timeout_clock.hh"
#include "db/consistency_level_validations.hh"
#include "database.hh"
#include "test/lib/select_statement_utils.hh"
#include <boost/algorithm/cxx11/any_of.hpp>
bool is_system_keyspace(const sstring& name);
@@ -67,6 +68,8 @@ namespace cql3 {
namespace statements {
static constexpr int DEFAULT_INTERNAL_PAGING_SIZE = select_statement::DEFAULT_COUNT_PAGE_SIZE;
thread_local int internal_paging_size = DEFAULT_INTERNAL_PAGING_SIZE;
thread_local const lw_shared_ptr<const select_statement::parameters> select_statement::_default_parameters = make_lw_shared<select_statement::parameters>();
select_statement::parameters::parameters()
@@ -333,7 +336,7 @@ select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
const bool nonpaged_filtering = restrictions_need_filtering && page_size <= 0;
if (aggregate || nonpaged_filtering) {
page_size = DEFAULT_COUNT_PAGE_SIZE;
page_size = internal_paging_size;
}
auto key_ranges = _restrictions->get_partition_key_ranges(options);
@@ -530,13 +533,29 @@ indexed_table_select_statement::do_execute_base_query(
if (old_paging_state && concurrency == 1) {
auto base_pk = generate_base_key_from_index_pk<partition_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
auto row_ranges = command->slice.default_row_ranges();
if (old_paging_state->get_clustering_key() && _schema->clustering_key_size() > 0) {
auto base_ck = generate_base_key_from_index_pk<clustering_key>(old_paging_state->get_partition_key(),
old_paging_state->get_clustering_key(), *_schema, *_view_schema);
command->slice.set_range(*_schema, base_pk,
std::vector<query::clustering_range>{query::clustering_range::make_starting_with(range_bound<clustering_key>(base_ck, false))});
query::trim_clustering_row_ranges_to(*_schema, row_ranges, base_ck, false);
command->slice.set_range(*_schema, base_pk, row_ranges);
} else {
command->slice.set_range(*_schema, base_pk, std::vector<query::clustering_range>{query::clustering_range::make_open_ended_both_sides()});
// There is no clustering key in old_paging_state and/or no clustering key in
// _schema, therefore read an entire partition (whole clustering range).
//
// The only exception to applying no restrictions on clustering key
// is a case when we have a secondary index on the first column
// of clustering key. In such a case we should not read the
// entire clustering range - only a range in which first column
// of clustering key has the correct value.
//
// This means that we should not set an open_ended_both_sides
// clustering range on base_pk, but instead intersect it with
// _row_ranges (which contains the restrictions necessary for the
// case described above). The result of such an intersection is just
// _row_ranges, which we explicitly set on base_pk.
command->slice.set_range(*_schema, base_pk, row_ranges);
}
}
concurrency *= 2;
@@ -974,12 +993,16 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
const bool aggregate = _selection->is_aggregate() || has_group_by();
if (aggregate) {
const bool restrictions_need_filtering = _restrictions->need_filtering();
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format()), std::make_unique<cql3::query_options>(cql3::query_options(options)),
return do_with(cql3::selection::result_set_builder(*_selection, now, options.get_cql_serialization_format(), *_group_by_cell_indices), std::make_unique<cql3::query_options>(cql3::query_options(options)),
[this, &options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] (cql3::selection::result_set_builder& builder, std::unique_ptr<cql3::query_options>& internal_options) {
// page size is set to the internal count page size, regardless of the user-provided value
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), DEFAULT_COUNT_PAGE_SIZE));
internal_options.reset(new cql3::query_options(std::move(internal_options), options.get_paging_state(), internal_paging_size));
return repeat([this, &builder, &options, &internal_options, &proxy, &state, now, whole_partitions, partition_slices, restrictions_need_filtering] () {
auto consume_results = [this, &builder, &options, &internal_options, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
auto consume_results = [this, &builder, &options, &internal_options, &proxy, &state, restrictions_need_filtering] (foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd, lw_shared_ptr<const service::pager::paging_state> paging_state) {
if (paging_state) {
paging_state = generate_view_paging_state_from_base_query_results(paging_state, results, proxy, state, options);
}
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
if (restrictions_need_filtering) {
_stats.filtered_rows_read_total += *results->row_count();
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection,
@@ -987,24 +1010,24 @@ indexed_table_select_statement::do_execute(service::storage_proxy& proxy,
} else {
query::result_view::consume(*results, cmd->slice, cql3::selection::result_set_builder::visitor(builder, *_schema, *_selection));
}
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
return stop_iteration(!has_more_pages);
};
if (whole_partitions || partition_slices) {
return find_index_partition_ranges(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (dht::partition_range_vector partition_ranges, lw_shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
return do_execute_base_query(proxy, std::move(partition_ranges), state, *internal_options, now, paging_state)
.then([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
});
});
} else {
return find_index_clustering_rows(proxy, state, *internal_options).then(
[this, now, &state, &internal_options, &proxy, consume_results = std::move(consume_results)] (std::vector<primary_key> primary_keys, lw_shared_ptr<const service::pager::paging_state> paging_state) {
bool has_more_pages = paging_state && paging_state->get_remaining() > 0;
internal_options.reset(new cql3::query_options(std::move(internal_options), paging_state ? make_lw_shared<service::pager::paging_state>(*paging_state) : nullptr));
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, std::move(paging_state)).then(consume_results).then([has_more_pages] {
return stop_iteration(!has_more_pages);
return this->do_execute_base_query(proxy, std::move(primary_keys), state, *internal_options, now, paging_state)
.then([paging_state, consume_results = std::move(consume_results)](foreign_ptr<lw_shared_ptr<query::result>> results, lw_shared_ptr<query::read_command> cmd) {
return consume_results(std::move(results), std::move(cmd), std::move(paging_state));
});
});
}
@@ -1661,6 +1684,16 @@ std::vector<size_t> select_statement::prepare_group_by(const schema& schema, sel
}
future<> set_internal_paging_size(int paging_size) {
return seastar::smp::invoke_on_all([paging_size] {
internal_paging_size = paging_size;
});
}
future<> reset_internal_paging_size() {
return set_internal_paging_size(DEFAULT_INTERNAL_PAGING_SIZE);
}
}
namespace util {


@@ -767,7 +767,7 @@ future<> database::drop_column_family(const sstring& ks_name, const sstring& cf_
remove(*cf);
cf->clear_views();
auto& ks = find_keyspace(ks_name);
return when_all_succeed(cf->await_pending_writes(), cf->await_pending_reads()).then_unpack([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return cf->await_pending_ops().then([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return truncate(ks, *cf, std::move(tsf), snapshot).finally([this, cf] {
return cf->stop();
});


@@ -541,6 +541,8 @@ private:
utils::phased_barrier _pending_reads_phaser;
// Corresponding phaser for in-progress streams
utils::phased_barrier _pending_streams_phaser;
// Corresponding phaser for in-progress flushes
utils::phased_barrier _pending_flushes_phaser;
// This field caches the last truncation time for the table.
// The master copy resides in the system.truncated table
@@ -986,6 +988,14 @@ public:
return _pending_streams_phaser.advance_and_await();
}
future<> await_pending_flushes() {
return _pending_flushes_phaser.advance_and_await();
}
future<> await_pending_ops() {
return when_all(await_pending_reads(), await_pending_writes(), await_pending_streams(), await_pending_flushes()).discard_result();
}
void add_or_update_view(view_ptr v);
void remove_view(view_ptr v);
void clear_views();

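The new await_pending_ops() above combines the four phased barriers with when_all, completing only when reads, writes, streams, and flushes have all drained. A rough asyncio analogue of that pattern, with a toy Barrier stand-in (not Seastar's API):

```python
import asyncio

class Barrier:
    """Toy stand-in for a phased barrier."""
    def __init__(self, name, log):
        self.name, self.log = name, log
    async def wait(self):
        self.log.append(self.name)

async def await_pending_ops(*barriers):
    """Wait for all in-progress operation classes concurrently;
    completes when every barrier does."""
    await asyncio.gather(*(b.wait() for b in barriers))

log = []
asyncio.run(await_pending_ops(Barrier("reads", log), Barrier("writes", log),
                              Barrier("streams", log), Barrier("flushes", log)))
```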

@@ -61,7 +61,8 @@ namespace db {
logging::logger cl_logger("consistency");
size_t quorum_for(const keyspace& ks) {
return (ks.get_replication_strategy().get_replication_factor() / 2) + 1;
size_t replication_factor = ks.get_replication_strategy().get_replication_factor();
return replication_factor ? (replication_factor / 2) + 1 : 0;
}
size_t local_quorum_for(const keyspace& ks, const sstring& dc) {
@@ -72,8 +73,8 @@ size_t local_quorum_for(const keyspace& ks, const sstring& dc) {
if (rs.get_type() == replication_strategy_type::network_topology) {
const network_topology_strategy* nrs =
static_cast<const network_topology_strategy*>(&rs);
return (nrs->get_replication_factor(dc) / 2) + 1;
size_t replication_factor = nrs->get_replication_factor(dc);
return replication_factor ? (replication_factor / 2) + 1 : 0;
}
return quorum_for(ks);

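The guard added above keeps a zero replication factor from producing a nonsensical quorum of one. A one-line Python sketch of the guarded computation:

```python
def quorum_for(replication_factor):
    """Quorum size with the zero-RF guard: a keyspace with no
    replicas needs no acknowledgements at all."""
    return replication_factor // 2 + 1 if replication_factor else 0
```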

@@ -817,7 +817,7 @@ void manager::end_point_hints_manager::sender::send_hints_maybe() noexcept {
int replayed_segments_count = 0;
try {
while (replay_allowed() && have_segments()) {
while (replay_allowed() && have_segments() && can_send()) {
if (!send_one_file(*_segments_to_replay.begin())) {
break;
}


@@ -113,7 +113,7 @@ future<> cql_table_large_data_handler::record_large_cells(const sstables::sstabl
auto ck_str = key_to_str(*clustering_key, s);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, format("{} {}", ck_str, column_name), extra_fields, ck_str, column_name);
} else {
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, nullptr, column_name);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, data_value::make_null(utf8_type), column_name);
}
}
@@ -125,7 +125,7 @@ future<> cql_table_large_data_handler::record_large_rows(const sstables::sstable
std::string ck_str = key_to_str(*clustering_key, s);
return try_record("row", sst, partition_key, int64_t(row_size), "row", ck_str, extra_fields, ck_str);
} else {
return try_record("row", sst, partition_key, int64_t(row_size), "static row", "", extra_fields, nullptr);
return try_record("row", sst, partition_key, int64_t(row_size), "static row", "", extra_fields, data_value::make_null(utf8_type));
}
}


@@ -111,27 +111,12 @@ public:
return make_ready_future<>();
}
future<> maybe_delete_large_data_entries(const schema& s, sstring filename, uint64_t data_size) {
future<> maybe_delete_large_data_entries(const schema& /*s*/, sstring /*filename*/, uint64_t /*data_size*/) {
assert(running());
future<> large_partitions = make_ready_future<>();
if (__builtin_expect(data_size > _partition_threshold_bytes, false)) {
large_partitions = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_PARTITIONS);
});
}
future<> large_rows = make_ready_future<>();
if (__builtin_expect(data_size > _row_threshold_bytes, false)) {
large_rows = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_ROWS);
});
}
future<> large_cells = make_ready_future<>();
if (__builtin_expect(data_size > _cell_threshold_bytes, false)) {
large_cells = with_sem([&s, filename, this] () mutable {
return delete_large_data_entries(s, std::move(filename), db::system_keyspace::LARGE_CELLS);
});
}
return when_all(std::move(large_partitions), std::move(large_rows), std::move(large_cells)).discard_result();
// Deletion of large data entries is disabled due to #7668
// They will eventually expire based on the 30-day TTL.
return make_ready_future<>();
}
const large_data_handler::stats& stats() const { return _stats; }


@@ -2925,10 +2925,6 @@ future<> maybe_update_legacy_secondary_index_mv_schema(service::migration_manage
// format, where "token" is not marked as computed. Once we're sure that all indexes have their
// columns marked as computed (because they were either created on a node that supports computed
// columns or were fixed by this utility function), it's safe to remove this function altogether.
if (!db.features().cluster_supports_computed_columns()) {
return make_ready_future<>();
}
if (v->clustering_key_size() == 0) {
return make_ready_future<>();
}


@@ -201,10 +201,10 @@ static future<std::vector<token_range>> get_local_ranges(database& db) {
// All queries will be on that table, where all entries are text and there's no notion of
token ranges from the CQL point of view.
auto left_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.start() || r.start()->value() == dht::minimum_token();
return r.end() && (!r.start() || r.start()->value() == dht::minimum_token());
});
auto right_inf = boost::find_if(ranges, [] (auto&& r) {
return !r.end() || r.start()->value() == dht::maximum_token();
return r.start() && (!r.end() || r.end()->value() == dht::maximum_token());
});
if (left_inf != right_inf && left_inf != ranges.end() && right_inf != ranges.end()) {
local_ranges.push_back(token_range{to_bytes(right_inf->start()), to_bytes(left_inf->end())});

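The corrected predicates above find the range that is unbounded on the left (but bounded on the right) and the one unbounded on the right (but bounded on the left), then stitch them into a single wraparound range. A Python sketch of that logic, assuming ranges as (start, end) tuples with None for an unbounded side; names and sentinels are illustrative:

```python
def merge_wraparound(ranges, min_token="MIN", max_token="MAX"):
    """Find the left-infinite and right-infinite ranges and merge
    them into one wraparound range, as get_local_ranges() does.
    Returns None if there is nothing to merge."""
    left_inf = next((r for r in ranges
                     if r[1] is not None
                     and (r[0] is None or r[0] == min_token)), None)
    right_inf = next((r for r in ranges
                      if r[0] is not None
                      and (r[1] is None or r[1] == max_token)), None)
    if left_inf and right_inf and left_inf != right_inf:
        return (right_inf[0], left_inf[1])
    return None
```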

@@ -42,6 +42,11 @@ if __name__ == '__main__':
if node_exporter_p.exists() or (bindir_p() / 'prometheus-node_exporter').exists():
if force:
print('node_exporter already installed, reinstalling')
try:
node_exporter = systemd_unit('node-exporter.service')
node_exporter.stop()
except:
pass
else:
print('node_exporter already installed, you can use `--force` to force reinstallation')
sys.exit(1)


@@ -98,7 +98,7 @@ def curl(url, byte=False):
return res.read()
else:
return res.read().decode('utf-8')
except urllib.error.HTTPError:
except urllib.error.URLError:
logging.warn("Failed to grab %s..." % url)
time.sleep(5)
retries += 1
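The fix above widens the handler from HTTPError to its base class URLError, so connection-level failures (DNS errors, refused connections) are also retried instead of escaping the loop. A hedged sketch of the pattern; `fetch_with_retries` and its injectable `opener` are illustrative, not the script's actual interface:

```python
import urllib.error
import urllib.request

def fetch_with_retries(url, retries=3, opener=urllib.request.urlopen):
    """Retry on any URLError; HTTPError is a subclass of URLError,
    so both HTTP status errors and connection failures are caught."""
    for _ in range(retries):
        try:
            with opener(url) as res:
                return res.read()
        except urllib.error.URLError:
            continue
    return None
```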
@@ -133,6 +133,8 @@ class aws_instance:
raise Exception("found more than one disk mounted at root: {}".format(root_dev_candidates))
root_dev = root_dev_candidates[0].device
if root_dev == '/dev/root':
root_dev = run('findmnt -n -o SOURCE /', shell=True, check=True, capture_output=True, encoding='utf-8').stdout.strip()
nvmes_present = list(filter(nvme_re.match, os.listdir("/dev")))
return {"root": [ root_dev ], "ephemeral": [ x for x in nvmes_present if not root_dev.startswith(os.path.join("/dev/", x)) ] }
@@ -221,7 +223,7 @@ class aws_instance:
def ebs_disks(self):
"""Returns all EBS disks"""
return set(self._disks["ephemeral"])
return set(self._disks["ebs"])
def public_ipv4(self):
"""Returns the public IPv4 address of this instance"""
@@ -329,9 +331,7 @@ class scylla_cpuinfo:
return len(self._cpu_data["system"])
# When a CLI tool is not installed, use relocatable CLI tool provided by Scylla
scylla_env = os.environ.copy()
scylla_env['PATH'] = '{}:{}'.format(scyllabindir(), scylla_env['PATH'])
def run(cmd, shell=False, silent=False, exception=True):
stdout = subprocess.DEVNULL if silent else None


@@ -0,0 +1,4 @@
# allocate enough inotify instances for large machines
# each tls instance needs 1 inotify instance, and there can be
# multiple tls instances per shard.
fs.inotify.max_user_instances = 1200


@@ -11,6 +11,7 @@ else
sysctl -p/usr/lib/sysctl.d/99-scylla-sched.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-aio.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-vm.conf || :
sysctl -p/usr/lib/sysctl.d/99-scylla-inotify.conf || :
fi
#DEBHELPER#


@@ -129,10 +129,9 @@ rm -rf $RPM_BUILD_ROOT
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla-housekeeping
%ghost /etc/systemd/system/scylla-helper.slice.d/
%ghost /etc/systemd/system/scylla-helper.slice.d/memory.conf
%ghost /etc/systemd/system/scylla-server.service.d/
%ghost /etc/systemd/system/scylla-server.service.d/capabilities.conf
%ghost /etc/systemd/system/scylla-server.service.d/mounts.conf
%ghost /etc/systemd/system/scylla-server.service.d/dependencies.conf
/etc/systemd/system/scylla-server.service.d/dependencies.conf
%ghost /etc/systemd/system/var-lib-systemd-coredump.mount
%ghost /etc/systemd/system/scylla-cpupower.service
@@ -189,6 +188,8 @@ Summary: Scylla configuration package for the Linux kernel
License: AGPLv3
URL: http://www.scylladb.com/
Requires: kmod
# tuned overwrites our sysctl settings
Obsoletes: tuned
%description kernel-conf
This package contains Linux kernel configuration changes for the Scylla database. Install this package
@@ -200,6 +201,7 @@ if Scylla is the main application on your server and you wish to optimize its la
/usr/lib/systemd/systemd-sysctl 99-scylla-sched.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-aio.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-vm.conf >/dev/null 2>&1 || :
/usr/lib/systemd/systemd-sysctl 99-scylla-inotify.conf >/dev/null 2>&1 || :
%files kernel-conf
%defattr(-,root,root)


@@ -144,7 +144,7 @@ DEBIAN_SSL_CERT_FILE="/etc/ssl/certs/ca-certificates.crt"
if [ -f "\${DEBIAN_SSL_CERT_FILE}" ]; then
c=\${DEBIAN_SSL_CERT_FILE}
fi
PYTHONPATH="\${d}:\${d}/libexec:\$PYTHONPATH" PATH="\${d}/$pythonpath:\${PATH}" SSL_CERT_FILE="\${c}" exec -a "\$0" "\${d}/libexec/\${b}" "\$@"
PYTHONPATH="\${d}:\${d}/libexec:\$PYTHONPATH" PATH="\${d}/../bin:\${d}/$pythonpath:\${PATH}" SSL_CERT_FILE="\${c}" exec -a "\$0" "\${d}/libexec/\${b}" "\$@"
EOF
chmod +x "$install"
}
@@ -379,5 +379,9 @@ elif ! $packaging; then
chown -R scylla:scylla $rdata
chown -R scylla:scylla $rhkdata
for file in dist/common/sysctl.d/*.conf; do
bn=$(basename "$file")
sysctl -p "$rusr"/lib/sysctl.d/"$bn"
done
$rprefix/scripts/scylla_post_install.sh
fi


@@ -2048,11 +2048,13 @@ public:
}
}
void abort(std::exception_ptr ep) {
_end_of_stream = true;
_ex = std::move(ep);
if (_full) {
_full->set_exception(_ex);
_full.reset();
} else if (_not_full) {
_not_full->set_exception(_ex);
_not_full.reset();
}
}
};
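The `abort()` added above follows a common pattern for bounded queues: record the error, mark end-of-stream, and fail whichever side is currently waiting. A minimal synchronous Python sketch of the same idea (hypothetical class, not the Scylla queue-reader API):

```python
class BoundedQueue:
    """Producer/consumer queue whose abort() makes both sides observe the error."""
    def __init__(self, capacity=1):
        self._items = []
        self._capacity = capacity
        self._end_of_stream = False
        self._ex = None

    def push(self, item):
        if self._ex:
            raise self._ex                  # producer observes the abort
        if len(self._items) >= self._capacity:
            raise RuntimeError("queue full")
        self._items.append(item)

    def pop(self):
        if self._items:
            return self._items.pop(0)       # drain buffered items first
        if self._ex:
            raise self._ex                  # consumer observes the abort
        if self._end_of_stream:
            raise StopIteration
        raise RuntimeError("queue empty")

    def abort(self, ex):
        # Mirror the C++ hunk: set end-of-stream and stash the exception
        # so both the "full" and "not full" waiters fail with it.
        self._end_of_stream = True
        self._ex = ex
```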


@@ -36,8 +36,14 @@ future<> feed_writer(flat_mutation_reader&& rd, Writer&& wr) {
auto f2 = rd.is_buffer_empty() ? rd.fill_buffer(db::no_timeout) : make_ready_future<>();
return when_all_succeed(std::move(f1), std::move(f2)).discard_result();
});
}).finally([&wr] {
return wr.consume_end_of_stream();
}).then_wrapped([&wr] (future<> f) {
if (f.failed()) {
auto ex = f.get_exception();
wr.abort(ex);
return make_exception_future<>(ex);
} else {
return wr.consume_end_of_stream();
}
});
});
}
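The `then_wrapped` change above replaces a blanket `finally` with explicit failure handling: on error the writer is aborted and the exception rethrown; only on success does `consume_end_of_stream` run. A rough synchronous Python equivalent, with a hypothetical writer interface:

```python
def feed_writer(read_all, writer):
    """Drive `writer` from the fragments yielded by read_all(); on failure
    abort the writer instead of calling consume_end_of_stream as cleanup."""
    try:
        for fragment in read_all():
            writer.consume(fragment)
    except Exception as ex:
        writer.abort(ex)    # propagate the error into the writer...
        raise               # ...and to the caller, as in the C++ hunk
    return writer.consume_end_of_stream()
```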


@@ -57,6 +57,9 @@ class shard_based_splitting_mutation_writer {
}
return std::move(_consume_fut);
}
void abort(std::exception_ptr ep) {
_handle.abort(ep);
}
};
private:
@@ -108,6 +111,13 @@ public:
return shard->consume_end_of_stream();
});
}
void abort(std::exception_ptr ep) {
for (auto&& shard : _shards) {
if (shard) {
shard->abort(ep);
}
}
}
};
future<> segregate_by_shard(flat_mutation_reader producer, reader_consumer consumer) {


@@ -144,6 +144,9 @@ class timestamp_based_splitting_mutation_writer {
}
return std::move(_consume_fut);
}
void abort(std::exception_ptr ep) {
_handle.abort(ep);
}
};
private:
@@ -186,6 +189,11 @@ public:
return bucket.second.consume_end_of_stream();
});
}
void abort(std::exception_ptr ep) {
for (auto&& b : _buckets) {
b.second.abort(ep);
}
}
};
future<> timestamp_based_splitting_mutation_writer::write_to_bucket(bucket_id bucket, mutation_fragment&& mf) {


@@ -640,12 +640,12 @@ partition_snapshot_ptr partition_entry::read(logalloc::region& r,
return partition_snapshot_ptr(std::move(snp));
}
std::vector<range_tombstone>
partition_snapshot::range_tombstone_result
partition_snapshot::range_tombstones(position_in_partition_view start, position_in_partition_view end)
{
partition_version* v = &*version();
if (!v->next()) {
return boost::copy_range<std::vector<range_tombstone>>(
return boost::copy_range<range_tombstone_result>(
v->partition().row_tombstones().slice(*_schema, start, end));
}
range_tombstone_list list(*_schema);
@@ -655,10 +655,10 @@ partition_snapshot::range_tombstones(position_in_partition_view start, position_
}
v = v->next();
}
return boost::copy_range<std::vector<range_tombstone>>(list.slice(*_schema, start, end));
return boost::copy_range<range_tombstone_result>(list.slice(*_schema, start, end));
}
std::vector<range_tombstone>
partition_snapshot::range_tombstone_result
partition_snapshot::range_tombstones()
{
return range_tombstones(


@@ -26,6 +26,7 @@
#include "utils/anchorless_list.hh"
#include "utils/logalloc.hh"
#include "utils/coroutine.hh"
#include "utils/chunked_vector.hh"
#include <boost/intrusive/parent_from_member.hpp>
#include <boost/intrusive/slist.hpp>
@@ -400,10 +401,13 @@ public:
::static_row static_row(bool digest_requested) const;
bool static_row_continuous() const;
mutation_partition squashed() const;
using range_tombstone_result = utils::chunked_vector<range_tombstone>;
// Returns range tombstones overlapping with [start, end)
std::vector<range_tombstone> range_tombstones(position_in_partition_view start, position_in_partition_view end);
range_tombstone_result range_tombstones(position_in_partition_view start, position_in_partition_view end);
// Returns all range tombstones
std::vector<range_tombstone> range_tombstones();
range_tombstone_result range_tombstones();
};
class partition_snapshot_ptr {


@@ -460,7 +460,7 @@ public:
}
};
class repair_writer {
class repair_writer : public enable_lw_shared_from_this<repair_writer> {
schema_ptr _schema;
uint64_t _estimated_partitions;
size_t _nr_peer_nodes;
@@ -517,6 +517,7 @@ public:
auto [queue_reader, queue_handle] = make_queue_reader(_schema);
_mq[node_idx] = std::move(queue_handle);
table& t = db.local().find_column_family(_schema->id());
auto writer = shared_from_this();
_writer_done[node_idx] = mutation_writer::distribute_reader_and_consume_on_shards(_schema, std::move(queue_reader),
[&db, reason = this->_reason, estimated_partitions = this->_estimated_partitions] (flat_mutation_reader reader) {
auto& t = db.local().find_column_family(reader.schema());
@@ -546,13 +547,13 @@ public:
return consumer(std::move(reader));
});
},
t.stream_in_progress()).then([this, node_idx] (uint64_t partitions) {
t.stream_in_progress()).then([node_idx, writer] (uint64_t partitions) {
rlogger.debug("repair_writer: keyspace={}, table={}, managed to write partitions={} to sstable",
_schema->ks_name(), _schema->cf_name(), partitions);
}).handle_exception([this, node_idx] (std::exception_ptr ep) {
writer->_schema->ks_name(), writer->_schema->cf_name(), partitions);
}).handle_exception([node_idx, writer] (std::exception_ptr ep) {
rlogger.warn("repair_writer: keyspace={}, table={}, multishard_writer failed: {}",
_schema->ks_name(), _schema->cf_name(), ep);
_mq[node_idx]->abort(ep);
writer->_schema->ks_name(), writer->_schema->cf_name(), ep);
writer->_mq[node_idx]->abort(ep);
return make_exception_future<>(std::move(ep));
});
}
@@ -655,7 +656,7 @@ private:
size_t _nr_peer_nodes= 1;
repair_stats _stats;
repair_reader _repair_reader;
repair_writer _repair_writer;
lw_shared_ptr<repair_writer> _repair_writer;
// Contains rows read from disk
std::list<repair_row> _row_buf;
// Contains rows we are working on to sync between peers
@@ -736,7 +737,7 @@ public:
_seed,
repair_reader::is_local_reader(_repair_master || _same_sharding_config)
)
, _repair_writer(_schema, _estimated_partitions, _nr_peer_nodes, _reason)
, _repair_writer(make_lw_shared<repair_writer>(_schema, _estimated_partitions, _nr_peer_nodes, _reason))
, _sink_source_for_get_full_row_hashes(_repair_meta_id, _nr_peer_nodes,
[] (uint32_t repair_meta_id, netw::messaging_service::msg_addr addr) {
return netw::get_local_messaging_service().make_sink_and_source_for_repair_get_full_row_hashes_with_rpc_stream(repair_meta_id, addr);
@@ -759,7 +760,7 @@ public:
auto f2 = _sink_source_for_get_row_diff.close();
auto f3 = _sink_source_for_put_row_diff.close();
return when_all_succeed(std::move(gate_future), std::move(f1), std::move(f2), std::move(f3)).discard_result().finally([this] {
return _repair_writer.wait_for_writer_done();
return _repair_writer->wait_for_writer_done();
});
}
@@ -1241,8 +1242,8 @@ private:
future<> do_apply_rows(std::list<repair_row>&& row_diff, unsigned node_idx, update_working_row_buf update_buf) {
return do_with(std::move(row_diff), [this, node_idx, update_buf] (std::list<repair_row>& row_diff) {
return with_semaphore(_repair_writer.sem(), 1, [this, node_idx, update_buf, &row_diff] {
_repair_writer.create_writer(_db, node_idx);
return with_semaphore(_repair_writer->sem(), 1, [this, node_idx, update_buf, &row_diff] {
_repair_writer->create_writer(_db, node_idx);
return repeat([this, node_idx, update_buf, &row_diff] () mutable {
if (row_diff.empty()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
@@ -1256,7 +1257,7 @@ private:
// to_repair_rows_list above where the repair_row is created.
mutation_fragment mf = std::move(r.get_mutation_fragment());
auto dk_with_hash = r.get_dk_with_hash();
return _repair_writer.do_write(node_idx, std::move(dk_with_hash), std::move(mf)).then([&row_diff] {
return _repair_writer->do_write(node_idx, std::move(dk_with_hash), std::move(mf)).then([&row_diff] {
row_diff.pop_front();
return make_ready_future<stop_iteration>(stop_iteration::no);
});


@@ -1272,7 +1272,9 @@ flat_mutation_reader cache_entry::read(row_cache& rc, read_context& reader, row_
// Assumes reader is in the corresponding partition
flat_mutation_reader cache_entry::do_read(row_cache& rc, read_context& reader) {
auto snp = _pe.read(rc._tracker.region(), rc._tracker.cleaner(), _schema, &rc._tracker, reader.phase());
auto ckr = query::clustering_key_filter_ranges::get_ranges(*_schema, reader.slice(), _key.key());
auto ckr = with_linearized_managed_bytes([&] {
return query::clustering_key_filter_ranges::get_ranges(*_schema, reader.slice(), _key.key());
});
auto r = make_cache_flat_mutation_reader(_schema, _key, std::move(ckr), rc, reader.shared_from_this(), std::move(snp));
r.upgrade_schema(rc.schema());
r.upgrade_schema(reader.schema());


@@ -827,7 +827,7 @@ std::ostream& schema::describe(std::ostream& os) const {
os << "}";
os << "\n AND comment = '" << comment()<< "'";
os << "\n AND compaction = {'class': '" << sstables::compaction_strategy::name(compaction_strategy()) << "'";
map_as_cql_param(os, compaction_strategy_options()) << "}";
map_as_cql_param(os, compaction_strategy_options(), false) << "}";
os << "\n AND compression = {";
map_as_cql_param(os, get_compressor_params().get_options());
os << "}";
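The fix above passes `first=false` so `map_as_cql_param` emits a comma before `compaction_strategy_options`, since the `'class'` entry has already opened the map literal. The comma logic, sketched in Python (a hypothetical helper mirroring only the `first` parameter's role):

```python
def map_as_cql_param(options, first=True):
    """Render dict entries as CQL map items, prefixing a comma unless
    these are the first entries in the map literal."""
    body = ", ".join("'%s': '%s'" % (k, v) for k, v in options.items())
    if not first and body:
        body = ", " + body      # the comma that was missing in issue #7741
    return body
```

With `first=False`, appending the result after `'class': 'SizeTieredCompactionStrategy'` yields a map literal that CQL can parse back.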

Submodule seastar updated: f760efe0a0...0fba7da929


@@ -347,7 +347,7 @@ public:
_max = _max - row_count;
_exhausted = (row_count < page_size && !results->is_short_read()) || _max == 0;
if (!_exhausted || row_count > 0) {
if (!_exhausted && row_count > 0) {
if (_last_pkey) {
update_slice(*_last_pkey);
}


@@ -2585,14 +2585,13 @@ future<> storage_proxy::send_hint_to_endpoint(frozen_mutation_and_schema fm_a_s,
}
future<> storage_proxy::send_hint_to_all_replicas(frozen_mutation_and_schema fm_a_s) {
const auto timeout = db::timeout_clock::now() + 1h;
if (!_features.cluster_supports_hinted_handoff_separate_connection()) {
std::array<mutation, 1> ms{fm_a_s.fm.unfreeze(fm_a_s.s)};
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit(), timeout);
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit());
}
std::array<hint_wrapper, 1> ms{hint_wrapper { std::move(fm_a_s.fm.unfreeze(fm_a_s.s)) }};
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit(), timeout);
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit());
}
/**


@@ -2546,7 +2546,7 @@ future<> storage_service::rebuild(sstring source_dc) {
slogger.info("Streaming for rebuild successful");
}).handle_exception([] (auto ep) {
// This is used exclusively through JMX, so log the full trace but only throw a simple RTE
slogger.warn("Error while rebuilding node: {}", std::current_exception());
slogger.warn("Error while rebuilding node: {}", ep);
return make_exception_future<>(std::move(ep));
});
});


@@ -145,8 +145,8 @@ static std::vector<shared_sstable> get_uncompacting_sstables(column_family& cf,
class compaction;
struct compaction_writer {
sstable_writer writer;
shared_sstable sst;
sstable_writer writer;
};
class compacting_sstable_writer {
@@ -570,10 +570,12 @@ private:
std::move(gc_consumer));
return seastar::async([cfc = std::move(cfc), reader = std::move(reader), this] () mutable {
reader.consume_in_thread(std::move(cfc), make_partition_filter(), db::no_timeout);
reader.consume_in_thread(std::move(cfc), db::no_timeout);
});
});
return consumer(make_sstable_reader());
// producer will filter out a partition before it reaches the consumer(s)
auto producer = make_filtering_reader(make_sstable_reader(), make_partition_filter());
return consumer(std::move(producer));
}
virtual reader_consumer make_interposer_consumer(reader_consumer end_consumer) {
@@ -775,7 +777,8 @@ public:
cfg.max_sstable_size = _max_sstable_size;
cfg.monitor = &default_write_monitor();
cfg.run_identifier = _run_identifier;
return compaction_writer{sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority), sst};
auto writer = sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority);
return compaction_writer{std::move(sst), std::move(writer)};
}
virtual void stop_sstable_writer(compaction_writer* writer) override {
@@ -839,7 +842,8 @@ public:
cfg.max_sstable_size = _max_sstable_size;
cfg.monitor = &_active_write_monitors.back();
cfg.run_identifier = _run_identifier;
return compaction_writer{sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority), sst};
auto writer = sst->get_writer(*_schema, partitions_per_sstable(), cfg, get_encoding_stats(), _io_priority);
return compaction_writer{std::move(sst), std::move(writer)};
}
virtual void stop_sstable_writer(compaction_writer* writer) override {
@@ -1305,7 +1309,8 @@ public:
cfg.monitor = &default_write_monitor();
// sstables generated for a given shard will share the same run identifier.
cfg.run_identifier = _run_identifiers.at(shard);
return compaction_writer{sst->get_writer(*_schema, partitions_per_sstable(shard), cfg, get_encoding_stats(), _io_priority, shard), sst};
auto writer = sst->get_writer(*_schema, partitions_per_sstable(shard), cfg, get_encoding_stats(), _io_priority, shard);
return compaction_writer{std::move(sst), std::move(writer)};
}
void on_new_partition() override {}


@@ -646,10 +646,11 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
_tasks.push_back(task);
auto sstables = std::make_unique<std::vector<sstables::shared_sstable>>(get_func(*cf));
auto compacting = make_lw_shared<compacting_sstable_registration>(this, *sstables);
auto sstables_ptr = sstables.get();
_stats.pending_tasks += sstables->size();
task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr] () mutable {
task->compaction_done = do_until([sstables_ptr] { return sstables_ptr->empty(); }, [this, task, options, sstables_ptr, compacting] () mutable {
// FIXME: lock cf here
if (!can_proceed(task)) {
@@ -659,7 +660,7 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
auto sst = sstables_ptr->back();
sstables_ptr->pop_back();
return repeat([this, task, options, sst = std::move(sst)] () mutable {
return repeat([this, task, options, sst = std::move(sst), compacting] () mutable {
column_family& cf = *task->compacting_cf;
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
@@ -667,21 +668,22 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
auto descriptor = sstables::compaction_descriptor({ sst }, cf.get_sstable_set(), service::get_local_compaction_priority(),
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, options);
auto compacting = make_lw_shared<compacting_sstable_registration>(this, descriptor.sstables);
// Releases reference to cleaned sstable such that respective used disk space can be freed.
descriptor.release_exhausted = [compacting] (const std::vector<sstables::shared_sstable>& exhausted_sstables) {
compacting->release_compacting(exhausted_sstables);
};
_stats.pending_tasks--;
_stats.active_tasks++;
task->compaction_running = true;
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)] () mutable {
return cf.run_compaction(std::move(descriptor));
return with_semaphore(_rewrite_sstables_sem, 1, [this, task, &cf, descriptor = std::move(descriptor)] () mutable {
_stats.pending_tasks--;
_stats.active_tasks++;
task->compaction_running = true;
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor)] (compaction_backlog_tracker& bt) mutable {
return with_scheduling_group(_scheduling_group, [this, &cf, descriptor = std::move(descriptor)]() mutable {
return cf.run_compaction(std::move(descriptor));
});
});
}).then_wrapped([this, task, compacting = std::move(compacting)] (future<> f) mutable {
}).then_wrapped([this, task, compacting] (future<> f) mutable {
task->compaction_running = false;
_stats.active_tasks--;
if (!can_proceed(task)) {


@@ -110,6 +110,7 @@ private:
std::unordered_map<column_family*, rwlock> _compaction_locks;
semaphore _custom_job_sem{1};
seastar::named_semaphore _rewrite_sstables_sem = {1, named_semaphore_exception_factory{"rewrite sstables"}};
std::function<void()> compaction_submission_callback();
// all registered column families are submitted for compaction at a constant interval.


@@ -176,7 +176,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
unsigned max_filled_level = 0;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t offstrategy_threshold = (mode == reshape_mode::strict) ? std::max(schema->min_compaction_threshold(), 4) : std::max(schema->max_compaction_threshold(), 32);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));
auto tolerance = [mode] (unsigned level) -> unsigned {
if (mode == reshape_mode::strict) {


@@ -374,6 +374,7 @@ private:
_fwd_end = _fwd ? position_in_partition::before_all_clustered_rows() : position_in_partition::after_all_clustered_rows();
_out_of_range = false;
_range_tombstones.reset();
_ready = {};
_first_row_encountered = false;
}
public:


@@ -250,11 +250,11 @@ sstable_directory::load_foreign_sstables(sstable_info_vector info_vec) {
}
future<>
sstable_directory::remove_input_sstables_from_resharding(const std::vector<sstables::shared_sstable>& sstlist) {
sstable_directory::remove_input_sstables_from_resharding(std::vector<sstables::shared_sstable> sstlist) {
dirlog.debug("Removing {} resharded SSTables", sstlist.size());
return parallel_for_each(sstlist, [] (sstables::shared_sstable sst) {
return parallel_for_each(std::move(sstlist), [] (const sstables::shared_sstable& sst) {
dirlog.trace("Removing resharded SSTable {}", sst->get_filename());
return sst->unlink();
return sst->unlink().then([sst] {});
});
}
@@ -393,7 +393,9 @@ sstable_directory::reshard(sstable_info_vector shared_info, compaction_manager&
desc.creator = std::move(creator);
return sstables::compact_sstables(std::move(desc), table).then([this, &sstlist] (sstables::compaction_info result) {
return when_all_succeed(collect_output_sstables_from_resharding(std::move(result.new_sstables)), remove_input_sstables_from_resharding(sstlist)).discard_result();
// input sstables are moved, to guarantee their resources are released once we're done
// resharding them.
return when_all_succeed(collect_output_sstables_from_resharding(std::move(result.new_sstables)), remove_input_sstables_from_resharding(std::move(sstlist))).discard_result();
});
});
});


@@ -119,7 +119,7 @@ private:
future<> process_descriptor(sstables::entry_descriptor desc, const ::io_priority_class& iop);
void validate(sstables::shared_sstable sst) const;
void handle_component(scan_state& state, sstables::entry_descriptor desc, std::filesystem::path filename);
future<> remove_input_sstables_from_resharding(const std::vector<sstables::shared_sstable>& sstlist);
future<> remove_input_sstables_from_resharding(std::vector<sstables::shared_sstable> sstlist);
future<> collect_output_sstables_from_resharding(std::vector<sstables::shared_sstable> resharded_sstables);
future<> remove_input_sstables_from_reshaping(std::vector<sstables::shared_sstable> sstlist);


@@ -1734,8 +1734,8 @@ void sstable::write_collection(file_writer& out, const composite& clustering_key
void sstable::write_clustered_row(file_writer& out, const schema& schema, const clustering_row& clustered_row) {
auto clustering_key = composite::from_clustering_element(schema, clustered_row.key());
maybe_write_row_marker(out, schema, clustered_row.marker(), clustering_key);
maybe_write_row_tombstone(out, clustering_key, clustered_row);
maybe_write_row_marker(out, schema, clustered_row.marker(), clustering_key);
if (schema.clustering_key_size()) {
column_name_helper::min_max_components(schema, _collector.min_column_names(), _collector.max_column_names(),


@@ -115,7 +115,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
for (auto& pair : all_buckets.first) {
auto ssts = std::move(pair.second);
if (ssts.size() > offstrategy_threshold) {
ssts.resize(std::min(multi_window.size(), max_sstables));
ssts.resize(std::min(ssts.size(), max_sstables));
compaction_descriptor desc(std::move(ssts), std::optional<sstables::sstable_set>(), iop);
desc.options = compaction_options::make_reshape();
return desc;
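The one-character-class fix above clamps each bucket to its own size rather than to the size of an unrelated container (`multi_window`); clamping with the bucket's own length also guarantees `resize` never grows the vector with default-constructed entries. The same clamp as a standalone sketch:

```python
def clamp_bucket(ssts, max_sstables):
    """Truncate a bucket to at most max_sstables entries. Using the
    bucket's own length (not another container's) means the truncation
    can only shrink it, never grow it."""
    del ssts[min(len(ssts), max_sstables):]
    return ssts
```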


@@ -432,7 +432,7 @@ future<prepare_message> stream_session::prepare(std::vector<stream_request> requ
try {
db.find_column_family(ks, cf);
} catch (no_such_column_family&) {
auto err = format("[Stream #{{}}] prepare requested ks={{}} cf={{}} does not exist", ks, cf);
auto err = format("[Stream #{{}}] prepare requested ks={{}} cf={{}} does not exist", plan_id, ks, cf);
sslog.warn(err.c_str());
throw std::runtime_error(err);
}


@@ -1048,7 +1048,7 @@ table::stop() {
return make_ready_future<>();
}
return _async_gate.close().then([this] {
return when_all(await_pending_writes(), await_pending_reads(), await_pending_streams()).discard_result().finally([this] {
return await_pending_ops().finally([this] {
return when_all(_memtables->request_flush(), _streaming_memtables->request_flush()).discard_result().finally([this] {
return _compaction_manager.remove(this).then([this] {
// Nest, instead of using when_all, so we don't lose any exceptions.
@@ -1226,9 +1226,20 @@ table::on_compaction_completion(sstables::compaction_completion_desc& desc) {
}
}
auto new_compacted_but_not_deleted = _sstables_compacted_but_not_deleted;
// rebuilding _sstables_compacted_but_not_deleted first to make the entire rebuild operation exception safe.
new_compacted_but_not_deleted.insert(new_compacted_but_not_deleted.end(), desc.old_sstables.begin(), desc.old_sstables.end());
// Precompute before so undo_compacted_but_not_deleted can be sure not to throw
std::unordered_set<sstables::shared_sstable> s(
desc.old_sstables.begin(), desc.old_sstables.end());
_sstables_compacted_but_not_deleted.insert(_sstables_compacted_but_not_deleted.end(), desc.old_sstables.begin(), desc.old_sstables.end());
// After we are done, unconditionally remove compacted sstables from _sstables_compacted_but_not_deleted,
// or they could stay forever in the set, resulting in deleted files remaining
// opened and disk space not being released until shutdown.
auto undo_compacted_but_not_deleted = defer([&] {
auto e = boost::range::remove_if(_sstables_compacted_but_not_deleted, [&] (sstables::shared_sstable sst) {
return s.count(sst);
});
_sstables_compacted_but_not_deleted.erase(e, _sstables_compacted_but_not_deleted.end());
rebuild_statistics();
});
_cache.invalidate([this, &desc] () noexcept {
// FIXME: this is not really noexcept, but we need to provide strong exception guarantees.
@@ -1239,8 +1250,6 @@ table::on_compaction_completion(sstables::compaction_completion_desc& desc) {
// to sstables files that are about to be deleted.
_cache.refresh_snapshot();
_sstables_compacted_but_not_deleted = std::move(new_compacted_but_not_deleted);
rebuild_statistics();
auto f = seastar::with_gate(_sstable_deletion_gate, [this, sstables_to_remove = desc.old_sstables] {
@@ -1256,17 +1265,6 @@ table::on_compaction_completion(sstables::compaction_completion_desc& desc) {
// Any remaining SSTables will eventually be re-compacted and re-deleted.
tlogger.error("Compacted SSTables deletion failed: {}. Ignored.", std::current_exception());
}
// unconditionally remove compacted sstables from _sstables_compacted_but_not_deleted,
// or they could stay forever in the set, resulting in deleted files remaining
// opened and disk space not being released until shutdown.
std::unordered_set<sstables::shared_sstable> s(
desc.old_sstables.begin(), desc.old_sstables.end());
auto e = boost::range::remove_if(_sstables_compacted_but_not_deleted, [&] (sstables::shared_sstable sst) -> bool {
return s.count(sst);
});
_sstables_compacted_but_not_deleted.erase(e, _sstables_compacted_but_not_deleted.end());
rebuild_statistics();
}
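The reordering above registers the cleanup of `_sstables_compacted_but_not_deleted` in a `defer` guard *before* the fallible work, so the removal runs even if a later step (such as cache invalidation) throws, instead of only on the success path. In Python the same shape falls out of `try/finally`; the names below are hypothetical stand-ins, not the table API:

```python
def on_compaction_completion(tracked, old_sstables, invalidate_cache):
    """Add old sstables to the tracked list, then guarantee their removal
    runs whether or not the fallible step throws."""
    old = set(old_sstables)        # precompute so the cleanup itself cannot throw
    tracked.extend(old_sstables)
    try:
        invalidate_cache()         # may raise
    finally:
        # Unconditional removal: without it, entries would stay forever,
        # keeping deleted files open and disk space unreleased.
        tracked[:] = [s for s in tracked if s not in old]
```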
// For replace/remove_ancestors_needed_write, note that we need to update the compaction backlog
@@ -1825,7 +1823,8 @@ future<std::unordered_map<sstring, table::snapshot_details>> table::get_snapshot
}
future<> table::flush() {
return _memtables->request_flush();
auto op = _pending_flushes_phaser.start();
return _memtables->request_flush().then([op = std::move(op)] {});
}
// FIXME: We can do much better than this in terms of cache management. Right


@@ -136,7 +136,7 @@ def test_update_condition_eq_different(test_table_s):
ConditionExpression='a = :val2',
ExpressionAttributeValues={':val1': val1, ':val2': val2})
# Also check an actual case of same time, but inequality.
# Also check an actual case of same type, but inequality.
def test_update_condition_eq_unequal(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
@@ -146,6 +146,13 @@ def test_update_condition_eq_unequal(test_table_s):
UpdateExpression='SET a = :val1',
ConditionExpression='a = :oldval',
ExpressionAttributeValues={':val1': 3, ':oldval': 2})
# If the attribute being compared doesn't exist, it's considered a failed
# condition, not an error:
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET a = :val1',
ConditionExpression='q = :oldval',
ExpressionAttributeValues={':val1': 3, ':oldval': 2})
# Check that set equality is checked correctly. Unlike string equality (for
# example), it cannot be done with just naive string comparison of the JSON
@@ -269,15 +276,44 @@ def test_update_condition_lt(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# Trying to compare an unsupported type - e.g., in the following test
# a boolean, is unfortunately caught by boto3 and cannot be tested here...
#test_table_s.update_item(Key={'p': p},
# AttributeUpdates={'d': {'Value': False, 'Action': 'PUT'}})
#with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
# test_table_s.update_item(Key={'p': p},
# UpdateExpression='SET z = :newval',
# ConditionExpression='d < :oldval',
# ExpressionAttributeValues={':newval': 2, ':oldval': True})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval < q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval < a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, if when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x < :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval < x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 4
# Test for ConditionExpression with operator "<="
@@ -341,6 +377,44 @@ def test_update_condition_le(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval <= q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval <= a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x <= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval <= x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 7
# Test for ConditionExpression with operator ">"
@@ -404,6 +478,44 @@ def test_update_condition_gt(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval > q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval > a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x > :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval > x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 4
# Test for ConditionExpression with operator ">="
@@ -467,6 +579,44 @@ def test_update_condition_ge(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '0'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval >= q',
ExpressionAttributeValues={':newval': 2, ':oldval': '17'})
# If a comparison parameter comes from a constant specified in the query,
# and it has a type not supported by the comparison (e.g., a list), it's
# not just a failed comparison - it is considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval >= a',
ExpressionAttributeValues={':newval': 2, ':oldval': [1,2]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='x >= :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression=':oldval >= x',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 7
# Test for ConditionExpression with ternary operator "BETWEEN" (checking
@@ -548,6 +698,60 @@ def test_update_condition_between(test_table_s):
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': '0', ':oldval2': '2'})
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='q BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': b'dog', ':oldval2': b'zebra'})
# If an operand comes from the query and has a type not supported by the
# comparison (e.g., a list), it's not just a failed condition - it is
# considered a ValidationException
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': [1,2], ':oldval2': [2,3]})
# However, when the wrong type comes from an item attribute, not the
# query, the comparison is simply false - not a ValidationException.
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'x': {'Value': [1,2,3], 'Action': 'PUT'},
'y': {'Value': [2,3,4], 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN x and y',
ExpressionAttributeValues={':newval': 2})
# If the two operands come from the query (":val" references) then if they
# have different types or the wrong order, this is a ValidationException.
# But if one or more of the operands come from the item, this only causes
# a false condition - not a ValidationException.
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': 2, ':oldval2': 1})
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval1 AND :oldval2',
ExpressionAttributeValues={':newval': 2, ':oldval1': 2, ':oldval2': 'dog'})
test_table_s.update_item(Key={'p': p}, AttributeUpdates={'two': {'Value': 2, 'Action': 'PUT'}})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN two AND :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 1})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN :oldval AND two',
ExpressionAttributeValues={':newval': 2, ':oldval': 3})
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET z = :newval',
ConditionExpression='a BETWEEN two AND :oldval',
ExpressionAttributeValues={':newval': 2, ':oldval': 'dog'})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['z'] == 9
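The BETWEEN rules these tests pin down can be summarized with a small pure-Python model (a hedged sketch of the observed semantics, not server code): when both bounds come from the query they must be comparable, of the same type, and correctly ordered, or the request is a ValidationException; when a bound comes from the item, any such problem merely makes the condition false.

```python
# Only numbers, strings and bytes are comparable in DynamoDB conditions.
COMPARABLE = (int, float, str, bytes)

class ValidationException(Exception):
    pass

def between_condition(value, low, high, bounds_from_query=True):
    """Model of 'a BETWEEN low AND high'. With both bounds from the
    query, mismatched types or a reversed range are a hard error; with
    a bound from the item, any problem just fails the condition."""
    bounds_ok = (isinstance(low, COMPARABLE)
                 and type(low) is type(high)
                 and low <= high)
    if not bounds_ok:
        if bounds_from_query:
            raise ValidationException('invalid BETWEEN bounds')
        return False
    if not isinstance(value, type(low)):
        # Missing/mismatched item attribute: condition is simply false.
        return False
    return low <= value <= high
```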
# Test for ConditionExpression with multi-operand operator "IN", checking
@@ -605,6 +809,13 @@ def test_update_condition_in(test_table_s):
UpdateExpression='SET c = :val37',
ConditionExpression='a IN ()',
ExpressionAttributeValues=values)
# If the attribute being compared doesn't even exist, this is also
# considered as a false condition - not an error.
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
UpdateExpression='SET c = :val37',
ConditionExpression='q IN ({})'.format(','.join(values.keys())),
ExpressionAttributeValues=values)
# Beyond the above operators, there are also test functions supported -
# attribute_exists, attribute_not_exists, attribute_type, begins_with,


@@ -237,6 +237,30 @@ def test_update_expected_1_le(test_table_s):
'AttributeValueList': [2, 3]}}
)
# Comparison operators like le work only on numbers, strings or bytes.
# As noted in issue #8043, if any other type is included in *the query*,
# the result should be a ValidationException, but if the wrong type appears
# in the item, not the query, the result is a failed condition.
def test_update_expected_1_le_validation(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'a': {'Value': 1, 'Action': 'PUT'},
'b': {'Value': [1,2], 'Action': 'PUT'}})
# Bad type (a list) in the query. Result is ValidationException.
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'a': {'ComparisonOperator': 'LE',
'AttributeValueList': [[1,2,3]]}}
)
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'LE',
'AttributeValueList': [3]}}
)
assert not 'z' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']
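The asymmetry exercised above (and in the `<`, `<=`, `>`, `>=` ConditionExpression tests) can be modeled in a few lines of plain Python. This is a hypothetical sketch of the semantics, not Scylla or DynamoDB code: a non-comparable type on the query side is a hard error, while on the item side it merely makes the condition false.

```python
# Only numbers, strings and bytes are comparable in DynamoDB conditions
# (issue #8043 was about getting this asymmetry right).
COMPARABLE = (int, float, str, bytes)

class ValidationException(Exception):
    pass

def le_condition(item_value, query_value):
    """Model of the condition 'a <= :v', with a taken from the item and
    :v taken from the query's ExpressionAttributeValues."""
    if not isinstance(query_value, COMPARABLE):
        # Bad type supplied in the query itself: hard error.
        raise ValidationException('unsupported type in comparison')
    if not isinstance(item_value, COMPARABLE):
        # Missing attribute or bad type in the item: condition is false.
        return False
    try:
        return item_value <= query_value
    except TypeError:
        # Mismatched types (e.g. str vs int): also just a false condition.
        return False
```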
# Tests for Expected with ComparisonOperator = "LT":
def test_update_expected_1_lt(test_table_s):
p = random_string()
@@ -894,6 +918,34 @@ def test_update_expected_1_between(test_table_s):
AttributeUpdates={'z': {'Value': 2, 'Action': 'PUT'}},
Expected={'d': {'ComparisonOperator': 'BETWEEN', 'AttributeValueList': [set([1]), set([2])]}})
# BETWEEN works only on numbers, strings or bytes. As noted in issue #8043,
# if any other type is included in *the query*, the result should be a
# ValidationException, but if the wrong type appears in the item, not the
# query, the result is a failed condition.
# BETWEEN should also generate ValidationException if the two ends of the
# range are not of the same type or not in the correct order, but this
# already is tested in the test above (test_update_expected_1_between).
def test_update_expected_1_between_validation(test_table_s):
p = random_string()
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'a': {'Value': 1, 'Action': 'PUT'},
'b': {'Value': [1,2], 'Action': 'PUT'}})
# Bad type (a list) in the query. Result is ValidationException.
with pytest.raises(ClientError, match='ValidationException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'a': {'ComparisonOperator': 'BETWEEN',
'AttributeValueList': [[1,2,3], [2,3,4]]}}
)
with pytest.raises(ClientError, match='ConditionalCheckFailedException'):
test_table_s.update_item(Key={'p': p},
AttributeUpdates={'z': {'Value': 17, 'Action': 'PUT'}},
Expected={'b': {'ComparisonOperator': 'BETWEEN',
'AttributeValueList': [1,2]}}
)
assert not 'z' in test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']
##############################################################################
# Instead of ComparisonOperator and AttributeValueList, one can specify either
# Value or Exists:


@@ -236,6 +236,30 @@ def test_filter_expression_ge(test_table_sn_with_data):
expected_items = [item for item in items if item[xn] >= xv]
assert(got_items == expected_items)
# Comparison operators such as >= or BETWEEN only work on numbers, strings or
# bytes. When an expression's operand comes from the item and has a wrong type
# (e.g., a list), the result is that the item is skipped - aborting the scan
# with a ValidationException is a bug (this was issue #8043).
def test_filter_expression_le_bad_type(test_table_sn_with_data):
table, p, items = test_table_sn_with_data
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='l <= :xv',
ExpressionAttributeValues={':p': p, ':xv': 3})
assert got_items == []
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression=':xv <= l',
ExpressionAttributeValues={':p': p, ':xv': 3})
assert got_items == []
def test_filter_expression_between_bad_type(test_table_sn_with_data):
table, p, items = test_table_sn_with_data
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='s between :xv and l',
ExpressionAttributeValues={':p': p, ':xv': 'cat'})
assert got_items == []
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='s between l and :xv',
ExpressionAttributeValues={':p': p, ':xv': 'cat'})
assert got_items == []
got_items = full_query(table, KeyConditionExpression='p=:p', FilterExpression='s between i and :xv',
ExpressionAttributeValues={':p': p, ':xv': 'cat'})
assert got_items == []
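A filter expression never aborts the scan over item-side type problems; items that cannot be compared are simply dropped. A minimal Python sketch of that behavior (hypothetical model, mirroring the tests above):

```python
COMPARABLE = (int, float, str, bytes)

def filter_le(items, attr, bound):
    """Model of FilterExpression 'attr <= :bound' applied to query/scan
    results: a missing attribute, a non-comparable type, or a type that
    differs from the bound silently skips the item (issue #8043), rather
    than raising an error, because the bound itself has a valid type."""
    kept = []
    for item in items:
        value = item.get(attr)
        if isinstance(value, COMPARABLE):
            try:
                if value <= bound:
                    kept.append(item)
            except TypeError:
                pass  # type mismatch: item is filtered out, not an error
    return kept
```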
# Test the "BETWEEN/AND" ternary operator on a numeric, string and bytes
# attribute. These keywords are case-insensitive.
def test_filter_expression_between(test_table_sn_with_data):
@@ -655,6 +679,46 @@ def test_filter_expression_and_sort_key_condition(test_table_sn_with_data):
expected_items = [item for item in items if item['i'] > i and item['c'] <= 7]
assert(got_items == expected_items)
# Test that a FilterExpression and ProjectionExpression may be given together.
# In particular, test that FilterExpression may inspect attributes which will
# not be returned by the query, because of the ProjectionExpression.
# This test reproduces issue #6951.
def test_filter_expression_and_projection_expression(test_table):
p = random_string()
test_table.put_item(Item={'p': p, 'c': 'hi', 'x': 'dog', 'y': 'cat'})
test_table.put_item(Item={'p': p, 'c': 'yo', 'x': 'mouse', 'y': 'horse'})
# Note that the filter is on the column x, but x is not included in the
# results because of ProjectionExpression. The filter should still work.
got_items = full_query(test_table,
KeyConditionExpression='p=:p',
FilterExpression='x=:x',
ProjectionExpression='y',
ExpressionAttributeValues={':p': p, ':x': 'mouse'})
# Note that:
# 1. Exactly one item matches the filter on x
# 2. The returned record for that item will include *only* the attribute y
# as requested by ProjectionExpression. It won't include x - it was just
# needed for the filter, but didn't appear in ProjectionExpression.
expected_items = [{'y': 'horse'}]
assert(got_items == expected_items)
# It is not allowed to combine the new-style FilterExpression with the
# old-style AttributesToGet. You must use ProjectionExpression instead
# (tested in test_filter_expression_and_projection_expression() above).
@pytest.mark.xfail(reason="missing check for expression and non-expression query parameters")
def test_filter_expression_and_attributes_to_get(test_table):
p = random_string()
# DynamoDB complains: "Can not use both expression and non-expression
# parameters in the same request: Non-expression parameters:
# {AttributesToGet} Expression parameters: {FilterExpression,
# KeyConditionExpression}.
with pytest.raises(ClientError, match='ValidationException.* both'):
full_query(test_table,
KeyConditionExpression='p=:p',
FilterExpression='x=:x',
AttributesToGet=['y'],
ExpressionAttributeValues={':p': p, ':x': 'mouse'})
# All the tests above were for the Query operation. Let's verify that
# FilterExpression also works for Scan. We will assume the same code is
# used internally, so don't need to check all the details of the


@@ -386,3 +386,38 @@ def test_query_missing_key(test_table):
full_query(test_table, KeyConditions={})
with pytest.raises(ClientError, match='ValidationException'):
full_query(test_table)
# The paging tests above used a numeric sort key. Let's now also test paging
# with a bytes sort key. We already have above a test that bytes sort keys
# work and are sorted correctly (test_query_sort_order_bytes), but the
# following test adds a check that *paging* works correctly for such keys.
# We used to have a bug in this (issue #7768) - the returned LastEvaluatedKey
# was incorrectly formatted, breaking the boto3's parsing of the response.
# Note we only check the case of bytes *sort* keys in this test. For bytes
# *partition* keys, see test_scan_paging_bytes().
def test_query_paging_bytes(test_table_sb):
p = random_string()
items = [{'p': p, 'c': random_bytes()} for i in range(10)]
with test_table_sb.batch_writer() as batch:
for item in items:
batch.put_item(item)
# Deliberately pass Limit=1 to enforce paging even though we have
# just 10 items in the partition.
got_items = full_query(test_table_sb, Limit=1,
KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}})
got_sort_keys = [x['c'] for x in got_items]
expected_sort_keys = sorted(x['c'] for x in items)
assert got_sort_keys == expected_sort_keys
# A similar test, for string sort keys
def test_query_paging_string(test_table_ss):
p = random_string()
items = [{'p': p, 'c': random_string()} for i in range(10)]
with test_table_ss.batch_writer() as batch:
for item in items:
batch.put_item(item)
got_items = full_query(test_table_ss, Limit=1,
KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}})
got_sort_keys = [x['c'] for x in got_items]
expected_sort_keys = sorted(x['c'] for x in items)
assert got_sort_keys == expected_sort_keys
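The `full_query` helper these tests rely on (from util.py) essentially follows `LastEvaluatedKey` until the server stops returning one. Below is a hedged sketch of that loop, paired with a tiny in-memory stand-in for a table so the pagination logic can be exercised without a live server; `FakeTable` and its one-item pages are illustrative assumptions, not part of the test suite.

```python
def full_query(table, **kwargs):
    """Collect all pages of a Query: re-issue it with ExclusiveStartKey
    set to the previous page's LastEvaluatedKey until that key is absent.
    A sketch of the util.full_query helper used by the tests above."""
    response = table.query(**kwargs)
    items = response['Items']
    while 'LastEvaluatedKey' in response:
        response = table.query(ExclusiveStartKey=response['LastEvaluatedKey'],
                               **kwargs)
        items.extend(response['Items'])
    return items

class FakeTable:
    """Minimal paginating stand-in: one item per page, as with Limit=1."""
    def __init__(self, items):
        self.items = items
    def query(self, ExclusiveStartKey=0, **kwargs):
        page = {'Items': self.items[ExclusiveStartKey:ExclusiveStartKey + 1]}
        if ExclusiveStartKey + 1 < len(self.items):
            page['LastEvaluatedKey'] = ExclusiveStartKey + 1
        return page
```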


@@ -534,3 +534,40 @@ def test_query_filter_paging(test_table_sn_with_data):
# passed the filter), but doesn't know it is really the last item - it
# takes one more query to discover there are no more.
assert(pages == len(items) + 1)
# Test that a QueryFilter and AttributesToGet may be given together.
# In particular, test that QueryFilter may inspect attributes which will
# not be returned by the query, because of AttributesToGet.
# This test reproduces issue #6951.
def test_query_filter_and_attributes_to_get(test_table):
p = random_string()
test_table.put_item(Item={'p': p, 'c': 'hi', 'x': 'dog', 'y': 'cat'})
test_table.put_item(Item={'p': p, 'c': 'yo', 'x': 'mouse', 'y': 'horse'})
# Note that the filter is on the column x, but x is not included in the
# results because of AttributesToGet. The filter should still work.
got_items = full_query(test_table,
KeyConditions={ 'p': { 'AttributeValueList': [p], 'ComparisonOperator': 'EQ' }},
QueryFilter={ 'x': { 'AttributeValueList': ['mouse'], 'ComparisonOperator': 'EQ' }},
AttributesToGet=['y'])
# Note that:
# 1. Exactly one item matches the filter on x
# 2. The returned record for that item will include *only* the attribute y
# as requested by AttributesToGet. It won't include x - it was just
# needed for the filter, but didn't appear in AttributesToGet.
expected_items = [{'y': 'horse'}]
assert(got_items == expected_items)
# It is not allowed to combine the old-style QueryFilter with the
# new-style ProjectionExpression. You must use AttributesToGet instead
# (tested in test_query_filter_and_attributes_to_get() above).
@pytest.mark.xfail(reason="missing check for expression and non-expression query parameters")
def test_query_filter_and_projection_expression(test_table):
p = random_string()
# DynamoDB complains: "Can not use both expression and non-expression
# parameters in the same request: Non-expression parameters:
# {QueryFilter} Expression parameters: {ProjectionExpression}.
with pytest.raises(ClientError, match='ValidationException.* both'):
full_query(test_table,
KeyConditions={ 'p': { 'AttributeValueList': [p], 'ComparisonOperator': 'EQ' }},
QueryFilter={ 'x': { 'AttributeValueList': ['mouse'], 'ComparisonOperator': 'EQ' }},
ProjectionExpression='y')


@@ -19,7 +19,7 @@
import pytest
from botocore.exceptions import ClientError
-from util import random_string, full_scan, full_scan_and_count, multiset
+from util import random_string, random_bytes, full_scan, full_scan_and_count, multiset
from boto3.dynamodb.conditions import Attr
# Test that scanning works fine with/without pagination
@@ -264,3 +264,20 @@ def test_scan_parallel_incorrect(filled_test_table):
for segment in [7, 9]:
with pytest.raises(ClientError, match='ValidationException.*Segment'):
full_scan(test_table, TotalSegments=5, Segment=segment)
# We used to have a bug with formatting of LastEvaluatedKey in the response
# of Query and Scan with bytes keys (issue #7768). In test_query_paging_bytes()
# (test_query.py) we tested the case of bytes *sort* keys. In the following
# test we check bytes *partition* keys.
def test_scan_paging_bytes(test_table_b):
# We will not Scan the entire table - we have no idea what it contains.
# But we don't need to scan the entire table - we just need the table
# to contain at least two items, and then Scan it with Limit=1 and stop
# after one page. Before #7768 was fixed, the test failed when the
# LastEvaluatedKey in the response could not be parsed.
items = [{'p': random_bytes()}, {'p': random_bytes()}]
with test_table_b.batch_writer() as batch:
for item in items:
batch.put_item(item)
response = test_table_b.scan(ConsistentRead=True, Limit=1)
assert 'LastEvaluatedKey' in response


@@ -41,8 +41,10 @@ def test_fetch_from_system_tables(scylla_only, dynamodb):
key_columns = [item['column_name'] for item in col_response['Items'] if item['kind'] == 'clustering' or item['kind'] == 'partition_key']
qualified_name = "{}{}.{}".format(internal_prefix, ks_name, table_name)
-response = client.scan(TableName=qualified_name, AttributesToGet=key_columns)
-print(ks_name, table_name, response)
+import time
+start = time.time()
+response = client.scan(TableName=qualified_name, AttributesToGet=key_columns, Limit=50)
+print(ks_name, table_name, len(str(response)), time.time()-start)
def test_block_access_to_non_system_tables_with_virtual_interface(scylla_only, test_table_s, dynamodb):
client = dynamodb.meta.client


@@ -675,6 +675,24 @@ def test_update_expression_add_numbers(test_table_s):
UpdateExpression='ADD b :val1',
ExpressionAttributeValues={':val1': 1})
# In test_update_expression_add_numbers() above we tested ADDing a number to
# an existing number. The following test checks that ADD can be used to
# create a *new* number, as if it was added to zero.
def test_update_expression_add_numbers_new(test_table_s):
# Test that "ADD" can create a new number attribute:
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': 'hello'})
test_table_s.update_item(Key={'p': p},
UpdateExpression='ADD b :val1',
ExpressionAttributeValues={':val1': 7})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['b'] == 7
# Test that "ADD" can create an entirely new item:
p = random_string()
test_table_s.update_item(Key={'p': p},
UpdateExpression='ADD b :val1',
ExpressionAttributeValues={':val1': 8})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['b'] == 8
# Test "ADD" operation for sets
def test_update_expression_add_sets(test_table_s):
p = random_string()
@@ -703,6 +721,24 @@ def test_update_expression_add_sets(test_table_s):
UpdateExpression='ADD a :val1',
ExpressionAttributeValues={':val1': 'hello'})
# In test_update_expression_add_sets() above we tested ADDing elements to an
# existing set. The following test checks that ADD can be used to create a
# *new* set, by adding its first item.
def test_update_expression_add_sets_new(test_table_s):
# Test that "ADD" can create a new set attribute:
p = random_string()
test_table_s.put_item(Item={'p': p, 'a': 'hello'})
test_table_s.update_item(Key={'p': p},
UpdateExpression='ADD b :val1',
ExpressionAttributeValues={':val1': set(['dog'])})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['b'] == set(['dog'])
# Test that "ADD" can create an entirely new item:
p = random_string()
test_table_s.update_item(Key={'p': p},
UpdateExpression='ADD b :val1',
ExpressionAttributeValues={':val1': set(['cat'])})
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['b'] == set(['cat'])
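The two "ADD creates a new attribute" tests above boil down to one rule: ADD treats a missing number as 0 and a missing set as the empty set. A pure-Python sketch of that semantics (a hypothetical model, not server code):

```python
def apply_add(item, attr, value):
    """Model of UpdateExpression 'ADD attr :value': numbers accumulate
    onto an implicit 0, sets union with an implicit empty set, so ADD
    can create the attribute (or the whole item) from scratch."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        item[attr] = item.get(attr, 0) + value
    elif isinstance(value, set):
        item[attr] = item.get(attr, set()) | value
    else:
        raise TypeError('ADD supports only numbers and sets')
    return item
```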
# Test "DELETE" operation for sets
def test_update_expression_delete_sets(test_table_s):
p = random_string()


@@ -128,12 +128,14 @@ SEASTAR_THREAD_TEST_CASE(test_large_data) {
});
}).get();
+// Since deletion of large data entries has been removed,
+// expect the records to be present.
assert_that(e.execute_cql("select partition_key from system.large_rows where table_name = 'tbl' allow filtering;").get0())
.is_rows()
-.is_empty();
+.with_size(1);
assert_that(e.execute_cql("select partition_key from system.large_cells where table_name = 'tbl' allow filtering;").get0())
.is_rows()
-.is_empty();
+.with_size(1);
return make_ready_future<>();
}, cfg).get();


@@ -2558,7 +2558,7 @@ SEASTAR_THREAD_TEST_CASE(test_queue_reader) {
}
}
-// abort()
+// abort() -- check that consumer is aborted
{
auto [reader, handle] = make_queue_reader(gen.schema());
auto fill_buffer_fut = reader.fill_buffer(db::no_timeout);
@@ -2573,6 +2573,28 @@ SEASTAR_THREAD_TEST_CASE(test_queue_reader) {
BOOST_REQUIRE_THROW(fill_buffer_fut.get(), std::runtime_error);
BOOST_REQUIRE_THROW(handle.push(partition_end{}).get(), std::runtime_error);
BOOST_REQUIRE(!reader.is_end_of_stream());
}
// abort() -- check that producer is aborted
{
auto [reader, handle] = make_queue_reader(gen.schema());
reader.set_max_buffer_size(1);
auto expected_reader = flat_mutation_reader_from_mutations(expected_muts);
auto push_fut = make_ready_future<>();
while (push_fut.available()) {
push_fut = handle.push(std::move(*expected_reader(db::no_timeout).get0()));
}
BOOST_REQUIRE(!push_fut.available());
handle.abort(std::make_exception_ptr<std::runtime_error>(std::runtime_error("error")));
BOOST_REQUIRE_THROW(reader.fill_buffer(db::no_timeout).get(), std::runtime_error);
BOOST_REQUIRE_THROW(push_fut.get(), std::runtime_error);
BOOST_REQUIRE(!reader.is_end_of_stream());
}
// Detached handle


@@ -49,7 +49,7 @@ static thread_local mutation_application_stats app_stats_for_tests;
// Verifies that tombstones in "list" are monotonic, overlap with the requested range,
// and have information equivalent with "expected" in that range.
static
-void check_tombstone_slice(const schema& s, std::vector<range_tombstone> list,
+void check_tombstone_slice(const schema& s, const utils::chunked_vector<range_tombstone>& list,
const query::clustering_range& range,
std::initializer_list<range_tombstone> expected)
{


@@ -29,6 +29,7 @@
#include "types/set.hh"
#include "test/lib/exception_utils.hh"
#include "cql3/statements/select_statement.hh"
#include "test/lib/select_statement_utils.hh"
SEASTAR_TEST_CASE(test_secondary_index_regular_column_query) {
@@ -1208,6 +1209,293 @@ SEASTAR_TEST_CASE(test_indexing_paging_and_aggregation) {
});
}
// Verifies that both "SELECT * [rest_of_query]" and "SELECT count(*) [rest_of_query]"
// return the expected count of rows.
void assert_select_count_and_select_rows_has_size(
cql_test_env& e,
const sstring& rest_of_query, int64_t expected_count,
const std::experimental::source_location& loc = std::experimental::source_location::current()) {
eventually([&] {
require_rows(e, "SELECT count(*) " + rest_of_query, {
{ long_type->decompose(expected_count) }
}, loc);
auto res = cquery_nofail(e, "SELECT * " + rest_of_query, nullptr, loc);
try {
assert_that(res).is_rows().with_size(expected_count);
} catch (const std::exception& e) {
BOOST_FAIL(format("is_rows/with_size failed: {}\n{}:{}: originally from here",
e.what(), loc.file_name(), loc.line()));
}
});
}
static constexpr int page_scenarios_page_size = 20;
static constexpr int page_scenarios_row_count = 2 * page_scenarios_page_size + 5;
static constexpr int page_scenarios_initial_count = 3;
static constexpr int page_scenarios_window_size = 4;
static constexpr int page_scenarios_just_before_first_page = page_scenarios_page_size - page_scenarios_window_size;
static constexpr int page_scenarios_just_after_first_page = page_scenarios_page_size + page_scenarios_window_size;
static constexpr int page_scenarios_just_before_second_page = 2 * page_scenarios_page_size - page_scenarios_window_size;
static constexpr int page_scenarios_just_after_second_page = 2 * page_scenarios_page_size + page_scenarios_window_size;
static_assert(page_scenarios_initial_count < page_scenarios_row_count);
static_assert(page_scenarios_window_size < page_scenarios_page_size);
static_assert(page_scenarios_just_after_second_page < page_scenarios_row_count);
// Executes `insert` lambda page_scenarios_row_count times.
// Runs `validate` lambda in a few scenarios:
//
// 1. After a small number of `insert`s
// 2. In a window from just before to just after `insert`s were executed
//    page_scenarios_page_size times
// 3. In a window from just before to just after `insert`s were executed
//    2 * page_scenarios_page_size times
// 4. After all `insert`s
void test_with_different_page_scenarios(
noncopyable_function<void (int)> insert, noncopyable_function<void (int)> validate) {
int current_row = 0;
for (; current_row < page_scenarios_initial_count; current_row++) {
insert(current_row);
validate(current_row + 1);
}
for (; current_row < page_scenarios_just_before_first_page; current_row++) {
insert(current_row);
}
for (; current_row < page_scenarios_just_after_first_page; current_row++) {
insert(current_row);
validate(current_row + 1);
}
for (; current_row < page_scenarios_just_before_second_page; current_row++) {
insert(current_row);
}
for (; current_row < page_scenarios_just_after_second_page; current_row++) {
insert(current_row);
validate(current_row + 1);
}
for (; current_row < page_scenarios_row_count; current_row++) {
insert(current_row);
}
// No +1, because we just left the for loop and current_row was incremented.
validate(current_row);
}
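Under the constants defined above, the helper validates only a sparse set of row counts: the first few inserts, a window straddling each page boundary, and the final total. That schedule can be reproduced with a short Python sketch (same constants, renamed for brevity):

```python
PAGE_SIZE = 20                   # page_scenarios_page_size
ROW_COUNT = 2 * PAGE_SIZE + 5    # page_scenarios_row_count
INITIAL = 3                      # page_scenarios_initial_count
WINDOW = 4                       # page_scenarios_window_size

def validated_counts():
    """Row counts at which `validate` runs in the C++ helper above."""
    counts = set(range(1, INITIAL + 1))          # after the first inserts
    for boundary in (PAGE_SIZE, 2 * PAGE_SIZE):  # around each page boundary
        counts |= set(range(boundary - WINDOW + 1, boundary + WINDOW + 1))
    counts.add(ROW_COUNT)                        # after all inserts
    return sorted(counts)
```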
SEASTAR_TEST_CASE(test_secondary_index_on_ck_first_column_and_aggregation) {
// Tests aggregation on table with secondary index on first column
// of clustering key. This is the "partition_slices" case of
// indexed_table_select_statement::do_execute.
return do_with_cql_env_thread([] (cql_test_env& e) {
cql3::statements::set_internal_paging_size(page_scenarios_page_size).get();
// Explicitly reproduce the first failing example in issue #7355.
cquery_nofail(e, "CREATE TABLE t1 (pk1 int, pk2 int, ck int, primary key((pk1, pk2), ck))");
cquery_nofail(e, "CREATE INDEX ON t1(ck)");
cquery_nofail(e, "INSERT INTO t1(pk1, pk2, ck) VALUES (1, 2, 3)");
assert_select_count_and_select_rows_has_size(e, "FROM t1 WHERE ck = 3", 1);
cquery_nofail(e, "INSERT INTO t1(pk1, pk2, ck) VALUES (1, 2, 4)");
cquery_nofail(e, "INSERT INTO t1(pk1, pk2, ck) VALUES (1, 2, 5)");
assert_select_count_and_select_rows_has_size(e, "FROM t1 WHERE ck = 3", 1);
cquery_nofail(e, "INSERT INTO t1(pk1, pk2, ck) VALUES (2, 2, 3)");
assert_select_count_and_select_rows_has_size(e, "FROM t1 WHERE ck = 3", 2);
cquery_nofail(e, "INSERT INTO t1(pk1, pk2, ck) VALUES (2, 1, 3)");
assert_select_count_and_select_rows_has_size(e, "FROM t1 WHERE ck = 3", 3);
// Test a case where there are many small partitions (more than a page's worth).
cquery_nofail(e, "CREATE TABLE t2 (pk int, ck int, primary key(pk, ck))");
cquery_nofail(e, "CREATE INDEX ON t2(ck)");
// "Decoy" rows - they should be not counted (previously they were incorrectly counted in,
// see issue #7355).
cquery_nofail(e, "INSERT INTO t2(pk, ck) VALUES (0, -2)");
cquery_nofail(e, "INSERT INTO t2(pk, ck) VALUES (0, 3)");
cquery_nofail(e, format("INSERT INTO t2(pk, ck) VALUES ({}, 3)", page_scenarios_just_after_first_page).c_str());
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t2(pk, ck) VALUES ({}, 1)", current_row).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t2 WHERE ck = 1", rows_inserted);
eventually([&] {
auto res = cquery_nofail(e, "SELECT pk FROM t2 WHERE ck = 1 GROUP BY pk");
assert_that(res).is_rows().with_size(rows_inserted);
res = cquery_nofail(e, "SELECT pk, ck FROM t2 WHERE ck = 1 GROUP BY pk, ck");
assert_that(res).is_rows().with_size(rows_inserted);
require_rows(e, "SELECT sum(pk) FROM t2 WHERE ck = 1", {
{ int32_type->decompose(int32_t(rows_inserted * (rows_inserted - 1) / 2)) }
});
});
});
// Test a case where there is a single large partition (larger than a page).
cquery_nofail(e, "CREATE TABLE t3 (pk int, ck1 int, ck2 int, primary key(pk, ck1, ck2))");
cquery_nofail(e, "CREATE INDEX ON t3(ck1)");
// "Decoy" rows
cquery_nofail(e, "INSERT INTO t3(pk, ck1, ck2) VALUES (1, 0, 0)");
cquery_nofail(e, "INSERT INTO t3(pk, ck1, ck2) VALUES (1, 2, 0)");
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t3(pk, ck1, ck2) VALUES (1, 1, {})", current_row).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t3 WHERE ck1 = 1", rows_inserted);
eventually([&] {
auto res = cquery_nofail(e, "SELECT pk FROM t3 WHERE ck1 = 1 GROUP BY pk");
assert_that(res).is_rows().with_size(1);
res = cquery_nofail(e, "SELECT pk, ck1 FROM t3 WHERE ck1 = 1 GROUP BY pk, ck1");
assert_that(res).is_rows().with_size(1);
res = cquery_nofail(e, "SELECT pk, ck1, ck2 FROM t3 WHERE ck1 = 1 GROUP BY pk, ck1, ck2");
assert_that(res).is_rows().with_size(rows_inserted);
require_rows(e, "SELECT avg(ck2) FROM t3 WHERE ck1 = 1", {
{ int32_type->decompose(int32_t((rows_inserted * (rows_inserted - 1) / 2) / rows_inserted)) }
});
});
});
cql3::statements::reset_internal_paging_size().get();
});
}
SEASTAR_TEST_CASE(test_secondary_index_on_pk_column_and_aggregation) {
// Tests aggregation on a table with a secondary index on a column
// of the partition key. This is the "whole_partitions" case of
// indexed_table_select_statement::do_execute.
return do_with_cql_env_thread([] (cql_test_env& e) {
cql3::statements::set_internal_paging_size(page_scenarios_page_size).get();
// Explicitly reproduce the second failing example in issue #7355.
// This is a case with a single large partition.
cquery_nofail(e, "CREATE TABLE t1 (pk1 int, pk2 int, ck int, primary key((pk1, pk2), ck))");
cquery_nofail(e, "CREATE INDEX ON t1(pk2)");
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t1(pk1, pk2, ck) VALUES (1, 1, {})", current_row).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t1 WHERE pk2 = 1", rows_inserted);
eventually([&] {
auto res = cquery_nofail(e, "SELECT pk1, pk2 FROM t1 WHERE pk2 = 1 GROUP BY pk1, pk2");
assert_that(res).is_rows().with_size(1);
res = cquery_nofail(e, "SELECT pk1, pk2, ck FROM t1 WHERE pk2 = 1 GROUP BY pk1, pk2, ck");
assert_that(res).is_rows().with_size(rows_inserted);
require_rows(e, "SELECT min(pk1) FROM t1 WHERE pk2 = 1", {
{ int32_type->decompose(1) }
});
});
});
// Test a case where there are many small partitions (more than a page's
// worth) and the base table has a clustering key.
cquery_nofail(e, "CREATE TABLE t2 (pk1 int, pk2 int, ck int, primary key((pk1, pk2), ck))");
cquery_nofail(e, "CREATE INDEX ON t2(pk2)");
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t2(pk1, pk2, ck) VALUES ({}, 1, {})",
current_row, current_row % 20).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t2 WHERE pk2 = 1", rows_inserted);
eventually([&] {
auto res = cquery_nofail(e, "SELECT pk1, pk2 FROM t2 WHERE pk2 = 1 GROUP BY pk1, pk2");
assert_that(res).is_rows().with_size(rows_inserted);
require_rows(e, "SELECT max(pk1) FROM t2 WHERE pk2 = 1", {
{ int32_type->decompose(int32_t(rows_inserted - 1)) }
});
});
});
// Test a case where there are many small partitions (more than a page's
// worth) and the base table has NO clustering key.
cquery_nofail(e, "CREATE TABLE t3 (pk1 int, pk2 int, primary key((pk1, pk2)))");
cquery_nofail(e, "CREATE INDEX ON t3(pk2)");
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t3(pk1, pk2) VALUES ({}, 1)", current_row).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t3 WHERE pk2 = 1", rows_inserted);
});
cql3::statements::reset_internal_paging_size().get();
});
}
SEASTAR_TEST_CASE(test_secondary_index_on_non_pk_ck_column_and_aggregation) {
// Tests aggregation on a table with a secondary index on a column
// that is part of neither the partition key nor the clustering key.
// This is the non-"whole_partitions" and non-"partition_slices"
// case of indexed_table_select_statement::do_execute.
return do_with_cql_env_thread([] (cql_test_env& e) {
cql3::statements::set_internal_paging_size(page_scenarios_page_size).get();
// Test a case where there are many small partitions (more than a page's
// worth) and the base table has a clustering key.
cquery_nofail(e, "CREATE TABLE t (pk int, ck int, v int, primary key(pk, ck))");
cquery_nofail(e, "CREATE INDEX ON t(v)");
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t(pk, ck, v) VALUES ({}, {}, 1)",
current_row, current_row % 20).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t WHERE v = 1", rows_inserted);
eventually([&] {
auto res = cquery_nofail(e, "SELECT pk FROM t WHERE v = 1 GROUP BY pk");
assert_that(res).is_rows().with_size(rows_inserted);
require_rows(e, "SELECT sum(v) FROM t WHERE v = 1", {
{ int32_type->decompose(int32_t(rows_inserted)) }
});
});
});
// Test a case where there are many small partitions (more than a page's
// worth) and the base table has NO clustering key.
cquery_nofail(e, "CREATE TABLE t2 (pk int, v int, primary key(pk))");
cquery_nofail(e, "CREATE INDEX ON t2(v)");
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t2(pk, v) VALUES ({}, 1)", current_row).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t2 WHERE v = 1", rows_inserted);
eventually([&] {
auto res = cquery_nofail(e, "SELECT pk FROM t2 WHERE v = 1 GROUP BY pk");
assert_that(res).is_rows().with_size(rows_inserted);
require_rows(e, "SELECT sum(pk) FROM t2 WHERE v = 1", {
{ int32_type->decompose(int32_t(rows_inserted * (rows_inserted - 1) / 2)) }
});
});
});
// Test a case where there is a single large partition (larger than a page).
cquery_nofail(e, "CREATE TABLE t3 (pk int, ck int, v int, primary key(pk, ck))");
cquery_nofail(e, "CREATE INDEX ON t3(v)");
test_with_different_page_scenarios([&](int current_row) {
cquery_nofail(e, format("INSERT INTO t3(pk, ck, v) VALUES (1, {}, 1)", current_row).c_str());
}, [&](int rows_inserted) {
assert_select_count_and_select_rows_has_size(e, "FROM t3 WHERE v = 1", rows_inserted);
eventually([&] {
auto res = cquery_nofail(e, "SELECT pk FROM t3 WHERE v = 1 GROUP BY pk");
assert_that(res).is_rows().with_size(1);
res = cquery_nofail(e, "SELECT pk, ck FROM t3 WHERE v = 1 GROUP BY pk, ck");
assert_that(res).is_rows().with_size(rows_inserted);
require_rows(e, "SELECT max(ck) FROM t3 WHERE v = 1", {
{ int32_type->decompose(int32_t(rows_inserted - 1)) }
});
});
});
cql3::statements::reset_internal_paging_size().get();
});
}
SEASTAR_TEST_CASE(test_computed_columns) {
return do_with_cql_env_thread([] (auto& e) {
e.execute_cql("CREATE TABLE t (p1 int, p2 int, c1 int, c2 int, v int, PRIMARY KEY ((p1,p2),c1,c2))").get();


@@ -100,6 +100,13 @@ BOOST_AUTO_TEST_CASE(test_byte_type_string_conversions) {
BOOST_REQUIRE_EQUAL(byte_type->to_string(bytes()), "");
}
BOOST_AUTO_TEST_CASE(test_ascii_type_string_conversions) {
BOOST_REQUIRE(ascii_type->equal(ascii_type->from_string("ascii"), ascii_type->decompose("ascii")));
BOOST_REQUIRE_EQUAL(ascii_type->to_string(ascii_type->decompose("ascii")), "ascii");
test_parsing_fails(ascii_type, "¡Hola!");
}
BOOST_AUTO_TEST_CASE(test_short_type_string_conversions) {
BOOST_REQUIRE(short_type->equal(short_type->from_string("12345"), short_type->decompose(int16_t(12345))));
BOOST_REQUIRE_EQUAL(short_type->to_string(short_type->decompose(int16_t(12345))), "12345");


@@ -0,0 +1,43 @@
# Copyright 2020 ScyllaDB
#
# This file is part of Scylla.
#
# Scylla is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Scylla is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
from util import new_test_table
import requests
def test_create_large_static_cells_and_rows(cql, test_keyspace):
'''Test that `large_data_handler` successfully reports large static cells
and static rows, and that doing so does not crash the Scylla server.
This is a regression test for https://github.com/scylladb/scylla/issues/6780'''
schema = "pk int, ck int, user_ids set<text> static, PRIMARY KEY (pk, ck)"
with new_test_table(cql, test_keyspace, schema) as table:
insert_stmt = cql.prepare(f"INSERT INTO {table} (pk, ck, user_ids) VALUES (?, ?, ?) USING TIMEOUT 5m")
# The default large-data threshold is 1 MB for cells and 10 MB for rows.
# Use a 10 MB cell to trigger the large-data reporting code for both
# static cells and static rows simultaneously.
large_set = {'x' * 1024 * 1024 * 10}
cql.execute(insert_stmt, [1, 1, large_set])
# REST API endpoint address for test scylla node
node_address = f'http://{cql.cluster.contact_points[0]}:10000'
# Force a flush of the test table to persistent storage, which is necessary
# to trigger `large_data_handler` execution.
table_without_ks = table[table.find('.') + 1:] # strip keyspace part from the table name
requests.post(f'{node_address}/storage_service/keyspace_flush/{test_keyspace}', params={'cf' : table_without_ks})
# No need to check that the Scylla server is running here, since the test will
# fail automatically in case Scylla crashes.


@@ -0,0 +1,33 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <seastar/core/future.hh>
namespace cql3 {
namespace statements {
future<> set_internal_paging_size(int internal_paging_size);
future<> reset_internal_paging_size();
}
}


@@ -39,7 +39,7 @@ class LiveData(object):
def _discoverMetrics(self):
results = metric.Metric.discover(self._metric_source)
logging.debug('_discoverMetrics: {} results discovered'.format(len(results)))
for symbol in results:
for symbol in list(results):
if not self._matches(symbol, self._metricPatterns):
results.pop(symbol)
logging.debug('_discoverMetrics: {} results matched'.format(len(results)))
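The fix above iterates over `list(results)` rather than `results` itself: `Metric.discover` evidently returns a dict-like object, and popping entries from a dict while iterating it directly raises `RuntimeError` in Python 3. A minimal standalone illustration (the metric names here are made up):

```python
results = {'cpu_load': 1.0, 'mem_free': 2.0, 'disk_io': 3.0}

# Popping during direct iteration raises
#     RuntimeError: dictionary changed size during iteration
# in Python 3. Iterating over a snapshot of the keys is safe:
for symbol in list(results):
    if not symbol.startswith('cpu'):
        results.pop(symbol)

print(results)  # {'cpu_load': 1.0}
```

`list(results)` copies only the keys, so the extra cost is negligible even for large result sets.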

types.cc

@@ -1387,42 +1387,50 @@ static void validate_aux(const tuple_type_impl& t, bytes_view v, cql_serializati
}
namespace {
template <typename UnderlyingView, typename FragmentRangeView>
struct validate_visitor {
bytes_view v;
UnderlyingView underlying;
FragmentRangeView v;
cql_serialization_format sf;
void operator()(const reversed_type_impl& t) { return t.underlying_type()->validate(v, sf); }
void operator()(const reversed_type_impl& t) { return t.underlying_type()->validate(underlying, sf); }
void operator()(const abstract_type&) {}
template <typename T> void operator()(const integer_type_impl<T>& t) {
if (v.empty()) {
return;
}
if (v.size() != sizeof(T)) {
throw marshal_exception(format("Validation failed for type {}: got {:d} bytes", t.name(), v.size()));
if (v.size_bytes() != sizeof(T)) {
throw marshal_exception(format("Validation failed for type {}: got {:d} bytes", t.name(), v.size_bytes()));
}
}
void operator()(const byte_type_impl& t) {
if (v.empty()) {
return;
}
if (v.size() != 1) {
throw marshal_exception(format("Expected 1 byte for a tinyint ({:d})", v.size()));
if (v.size_bytes() != 1) {
throw marshal_exception(format("Expected 1 byte for a tinyint ({:d})", v.size_bytes()));
}
}
void operator()(const short_type_impl& t) {
if (v.empty()) {
return;
}
if (v.size() != 2) {
throw marshal_exception(format("Expected 2 bytes for a smallint ({:d})", v.size()));
if (v.size_bytes() != 2) {
throw marshal_exception(format("Expected 2 bytes for a smallint ({:d})", v.size_bytes()));
}
}
void operator()(const ascii_type_impl&) {
if (!utils::ascii::validate(v)) {
throw marshal_exception("Validation failed - non-ASCII character in an ASCII string");
}
with_linearized(v, [this] (bytes_view bv) {
if (!utils::ascii::validate(bv)) {
throw marshal_exception("Validation failed - non-ASCII character in an ASCII string");
}
});
}
void operator()(const utf8_type_impl&) {
if (!utils::utf8::validate(v)) {
auto error = with_linearized(v, [this] (bytes_view bv) {
return !utils::utf8::validate(bv);
});
if (error) {
throw marshal_exception("Validation failed - non-UTF8 character in a UTF8 string");
}
}
@@ -1431,19 +1439,20 @@ struct validate_visitor {
if (v.empty()) {
return;
}
if (v.size() != 1) {
throw marshal_exception(format("Validation failed for boolean, got {:d} bytes", v.size()));
if (v.size_bytes() != 1) {
throw marshal_exception(format("Validation failed for boolean, got {:d} bytes", v.size_bytes()));
}
}
void operator()(const timeuuid_type_impl& t) {
if (v.empty()) {
return;
}
if (v.size() != 16) {
throw marshal_exception(format("Validation failed for timeuuid - got {:d} bytes", v.size()));
if (v.size_bytes() != 16) {
throw marshal_exception(format("Validation failed for timeuuid - got {:d} bytes", v.size_bytes()));
}
auto msb = read_simple<uint64_t>(v);
auto lsb = read_simple<uint64_t>(v);
auto in = utils::linearizing_input_stream(v);
auto msb = in.template read_trivial<uint64_t>();
auto lsb = in.template read_trivial<uint64_t>();
utils::UUID uuid(msb, lsb);
if (uuid.version() != 1) {
throw marshal_exception(format("Unsupported UUID version ({:d})", uuid.version()));
@@ -1453,17 +1462,19 @@ struct validate_visitor {
if (v.empty()) {
return;
}
if (v.size() != sizeof(uint64_t)) {
throw marshal_exception(format("Validation failed for timestamp - got {:d} bytes", v.size()));
if (v.size_bytes() != sizeof(uint64_t)) {
throw marshal_exception(format("Validation failed for timestamp - got {:d} bytes", v.size_bytes()));
}
}
void operator()(const duration_type_impl& t) {
if (v.size() < 3) {
throw marshal_exception(format("Expected at least 3 bytes for a duration, got {:d}", v.size()));
if (v.size_bytes() < 3) {
throw marshal_exception(format("Expected at least 3 bytes for a duration, got {:d}", v.size_bytes()));
}
common_counter_type months, days, nanoseconds;
std::tie(months, days, nanoseconds) = deserialize_counters(v);
std::tie(months, days, nanoseconds) = with_linearized(v, [] (bytes_view bv) {
return deserialize_counters(bv);
});
auto check_counter_range = [] (common_counter_type value, auto counter_value_type_instance,
std::string_view counter_name) {
@@ -1490,51 +1501,71 @@ struct validate_visitor {
if (v.empty()) {
return;
}
if (v.size() != sizeof(T)) {
throw marshal_exception(format("Expected {:d} bytes for a floating type, got {:d}", sizeof(T), v.size()));
if (v.size_bytes() != sizeof(T)) {
throw marshal_exception(format("Expected {:d} bytes for a floating type, got {:d}", sizeof(T), v.size_bytes()));
}
}
void operator()(const simple_date_type_impl& t) {
if (v.empty()) {
return;
}
if (v.size() != 4) {
throw marshal_exception(format("Expected 4 byte long for date ({:d})", v.size()));
if (v.size_bytes() != 4) {
throw marshal_exception(format("Expected 4 byte long for date ({:d})", v.size_bytes()));
}
}
void operator()(const time_type_impl& t) {
if (v.empty()) {
return;
}
if (v.size() != 8) {
throw marshal_exception(format("Expected 8 byte long for time ({:d})", v.size()));
if (v.size_bytes() != 8) {
throw marshal_exception(format("Expected 8 byte long for time ({:d})", v.size_bytes()));
}
}
void operator()(const uuid_type_impl& t) {
if (v.empty()) {
return;
}
if (v.size() != 16) {
throw marshal_exception(format("Validation failed for uuid - got {:d} bytes", v.size()));
if (v.size_bytes() != 16) {
throw marshal_exception(format("Validation failed for uuid - got {:d} bytes", v.size_bytes()));
}
}
void operator()(const inet_addr_type_impl& t) {
if (v.empty()) {
return;
}
if (v.size() != sizeof(uint32_t) && v.size() != 16) {
throw marshal_exception(format("Validation failed for inet_addr - got {:d} bytes", v.size()));
if (v.size_bytes() != sizeof(uint32_t) && v.size_bytes() != 16) {
throw marshal_exception(format("Validation failed for inet_addr - got {:d} bytes", v.size_bytes()));
}
}
void operator()(const map_type_impl& t) { validate_aux(t, v, sf); }
void operator()(const set_type_impl& t) { validate_aux(t, v, sf); }
void operator()(const list_type_impl& t) { validate_aux(t, v, sf); }
void operator()(const tuple_type_impl& t) { validate_aux(t, v, sf); }
void operator()(const map_type_impl& t) {
with_linearized(v, [this, &t] (bytes_view bv) {
validate_aux(t, bv, sf);
});
}
void operator()(const set_type_impl& t) {
with_linearized(v, [this, &t] (bytes_view bv) {
validate_aux(t, bv, sf);
});
}
void operator()(const list_type_impl& t) {
with_linearized(v, [this, &t] (bytes_view bv) {
validate_aux(t, bv, sf);
});
}
void operator()(const tuple_type_impl& t) {
with_linearized(v, [this, &t] (bytes_view bv) {
validate_aux(t, bv, sf);
});
}
};
}
void abstract_type::validate(const fragmented_temporary_buffer::view& view, cql_serialization_format sf) const {
visit(*this, validate_visitor{view, view, sf});
}
void abstract_type::validate(bytes_view v, cql_serialization_format sf) const {
visit(*this, validate_visitor{v, sf});
visit(*this, validate_visitor{v, single_fragment_range(v), sf});
}
static void serialize_aux(const tuple_type_impl& type, const tuple_type_impl::native_type* val, bytes::iterator& out) {
@@ -2319,6 +2350,14 @@ struct from_string_visitor {
sstring_view s;
bytes operator()(const reversed_type_impl& r) { return r.underlying_type()->from_string(s); }
template <typename T> bytes operator()(const integer_type_impl<T>& t) { return decompose_value(parse_int(t, s)); }
bytes operator()(const ascii_type_impl&) {
auto bv = bytes_view(reinterpret_cast<const int8_t*>(s.begin()), s.size());
if (utils::ascii::validate(bv)) {
return to_bytes(bv);
} else {
throw marshal_exception(format("Value not compatible with type {}: '{}'", ascii_type_name, s));
}
}
bytes operator()(const string_type_impl&) {
return to_bytes(bytes_view(reinterpret_cast<const int8_t*>(s.begin()), s.size()));
}
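The pattern introduced by this diff — keep validation working over fragmented buffers, and linearize into a contiguous view only for validators that need one — can be sketched standalone. All names below are illustrative stand-ins, not Scylla's actual API:

```python
def with_linearized(fragments, func):
    """Call func with a contiguous bytes view of a fragmented value.

    A single-fragment value is passed through without copying; only a
    genuinely fragmented value pays for the join, mirroring the intent
    of with_linearized()/single_fragment_range() in the diff above.
    """
    if len(fragments) == 1:
        return func(fragments[0])
    return func(b"".join(fragments))

def validate_ascii(fragments):
    # Counterpart of the ascii_type_impl case: reject any byte outside
    # the 7-bit ASCII range. Needs a contiguous view, so linearize.
    ok = with_linearized(fragments, lambda bv: all(b < 0x80 for b in bv))
    if not ok:
        raise ValueError("non-ASCII character in an ASCII string")

def validate_fixed_size(fragments, expected, type_name):
    # Counterpart of the size checks: size_bytes() just sums fragment
    # lengths, so no linearization is needed to check the length.
    size = sum(len(f) for f in fragments)
    if size and size != expected:
        raise ValueError(f"Validation failed for {type_name}: got {size} bytes")

validate_ascii([b"hel", b"lo"])                  # multi-fragment, linearized
validate_fixed_size([b"\x00" * 8, b""], 8, "timestamp")
```

The payoff is the same as in the C++ code: cheap checks (lengths, fixed sizes) avoid copying entirely, while character-set validation and collection parsing fall back to a one-off linearized copy.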


@@ -380,6 +380,14 @@ public:
data_value(const std::string&);
data_value(const sstring&);
// Do not allow construction of a data_value from nullptr. This is
// error-prone: for example, nullptr conflicts with the `const char*`
// overload, which would try to allocate a value from it and cause UB.
//
// We want null-value semantics here instead, so the user is forced
// to call `make_null()` explicitly.
data_value(std::nullptr_t) = delete;
data_value(ascii_native_type);
data_value(bool);
data_value(int8_t);
@@ -508,12 +516,8 @@ public:
data_value deserialize_value(bytes_view v) const {
return deserialize(v);
};
void validate(bytes_view v, cql_serialization_format sf) const;
virtual void validate(const fragmented_temporary_buffer::view& view, cql_serialization_format sf) const {
with_linearized(view, [this, sf] (bytes_view bv) {
validate(bv, sf);
});
}
void validate(const fragmented_temporary_buffer::view& view, cql_serialization_format sf) const;
void validate(bytes_view view, cql_serialization_format sf) const;
bool is_compatible_with(const abstract_type& previous) const;
/*
* Types which are wrappers over other types return the inner type.