Compare commits

...

120 Commits

Author SHA1 Message Date
Beni Peled
7c79c513d1 release: prepare for 4.6.7 2022-09-07 11:17:55 +03:00
Karol Baryła
9a8e73f0c3 transport/server.cc: Return correct size of decompressed lz4 buffer
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.
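
The kind of check described above can be sketched as follows. This is a minimal illustration, not the code from transport/server.cc: the helper name checked_decompressed_size and the idea that the caller knows the expected size up front are assumptions. It relies only on the LZ4 convention that a decompress call returns a negative value on failure and the number of bytes written otherwise.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (not from the Scylla source tree).
// `ret` models the return value of an LZ4-style decompress call:
// negative on failure, otherwise the number of bytes written.
inline std::size_t checked_decompressed_size(int ret, std::size_t expected) {
    if (ret < 0) {
        throw std::runtime_error("lz4 decompression failed");
    }
    if (static_cast<std::size_t>(ret) != expected) {
        // Assumption: the frame announced `expected` bytes up front.
        throw std::runtime_error("decompressed size does not match the announced size");
    }
    return static_cast<std::size_t>(ret);
}
```

Erroring out early keeps a bad size from being used later to index or slice the output buffer.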

Fixes #11476

(cherry picked from commit 1c2eef384d)
2022-09-07 10:58:54 +03:00
Benny Halevy
fac0443200 snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
Detecting a secondary index by checking for a dot
in the table name is wrong as tables generated by Alternator
may contain a dot in their name.

Instead, detect both materialized views and secondary indexes
using the schema()->is_view() method.
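
A rough sketch of why the name-based heuristic misfires. The struct table_schema below is a hypothetical stand-in with only the two fields needed here, and both function names are invented for illustration; only is_view() mirrors a name used in this commit.

```cpp
#include <string>

// Hypothetical stand-in for a table's schema.
struct table_schema {
    std::string name;
    bool view;
    bool is_view() const { return view; }
};

// Old heuristic: treats any dotted table name as a secondary index.
inline bool looks_like_index_by_name(const table_schema& s) {
    return s.name.find('.') != std::string::npos;
}

// Schema-based check: asks the schema instead of guessing from the name.
inline bool must_reject_for_snapshot_modify(const table_schema& s) {
    return s.is_view();
}
```

For a regular table whose name happens to contain a dot, the first function reports a false positive while the second does not.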

Fixes #10526

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aa127a2dbb)
2022-09-06 17:56:30 +03:00
Piotr Sarna
6bcfef2cfa cql3: fix misleading error message for service level timeouts
The error message incorrectly stated that the timeout value cannot
be longer than 24h, but it can - the actual restriction is that the
value cannot be expressed in units like days or months, which was done
in order to significantly simplify the parsing routines (and the fact
that timeouts counted in days are not expected to be common).

Fixes #10286

Closes #10294

(cherry picked from commit 85e95a8cc3)
2022-09-01 20:34:22 +03:00
Juliusz Stasiewicz
d2c67a2429 cdc/check_and_repair_cdc_streams: ignore LEFT endpoints
When `check_and_repair_cdc_streams` encountered a node with status LEFT, Scylla
would throw. This behavior is fixed so that LEFT nodes are simply ignored.

Fixes #9771

Closes #9778

(cherry picked from commit 351f142791)
2022-09-01 15:44:35 +03:00
Avi Kivity
d6c2f228e7 Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec
Scenario:

cache = [
    row(pos=2, continuous=false),
    row(pos=after(2), dummy=true)
]

Scanning read starts, starts populating [-inf, before(2)] from sstables.

row(pos=2) is evicted.

cache = [
    row(pos=after(2), dummy=true)
]

Scanning read finishes reading from sstables.

Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.

The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.
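
The exclusion of the starting key can be seen in a toy model. Everything below is an assumption made for illustration: keys are plain ints, a position is a (key, weight) pair with weight -1/0/+1 for before/at/after, and the function is a simplified re-implementation, not the real no_clustering_row_between().

```cpp
// Toy model: clustering keys are ints; weight -1/0/+1 means before/at/after the key.
struct position {
    int key;
    int weight;
};

// True when no clustering key lies strictly after `a` and before `b`.
// A row sitting exactly at `a` is outside the checked interval.
inline bool no_clustering_row_between(position a, position b) {
    // Smallest key whose position is strictly after `a`.
    int first = (a.weight < 0) ? a.key : a.key + 1;
    // Does that key sit before `b`?
    bool fits = first < b.key || (first == b.key && b.weight > 0);
    return !fits;
}
```

With a = pos(2) and b = after(2) the predicate is true, yet it says nothing about the row at key 2 itself, which is the row that was lost in the scenario above.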

Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).

Fixes #11239

Closes #11240

* github.com:scylladb/scylladb:
  test: mvcc: Fix illegal use of maybe_refresh()
  tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
  tests: row_cache_test: Introduce one_shot mode to throttle
  row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
2022-08-11 19:19:30 +02:00
Yaron Kaikov
a1b1df2074 release: prepare for 4.6.6 2022-08-07 16:24:51 +03:00
Avi Kivity
14e13ecbd4 Merge 'Backport: Fix map subscript crashes when map or subscript is null' from Nadav Har'El
This is a backport of https://github.com/scylladb/scylla/pull/10420 to branch 5.0.
Branch 5.0 had somewhat different code in this expression area, so the backport was not automatic, but it was nevertheless fairly straightforward: just copy the exact same checking code to its right place, and keep the exact same tests to confirm that we indeed fixed the bug.

Refs #10535.

The original cover letter from https://github.com/scylladb/scylla/pull/10420:

In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues https://github.com/scylladb/scylla/issues/10361, https://github.com/scylladb/scylla/issues/10399, https://github.com/scylladb/scylla/pull/10401. The existing test test_null.py::test_map_subscript_null reproduced all these bugs sporadically.

In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both m[NULL] and NULL[2] to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., NULL < 2 is also defined as resulting in NULL.

However, this decision differs from Cassandra, where m[NULL] is considered an error but NULL[2] is allowed. We believe that making m[NULL] be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like m[a], where the column a can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.
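
The "NULL wins" semantics can be sketched with std::optional, where an empty optional plays the role of NULL. The function name subscript and the int-to-int map are assumptions for illustration; the real code works on CQL expression values, and the result for a missing key is also an assumption of this sketch.

```cpp
#include <map>
#include <optional>

// An empty optional models NULL. Both m[NULL] and NULL[k] evaluate to NULL
// instead of dereferencing a missing object.
inline std::optional<int> subscript(const std::optional<std::map<int, int>>& m,
                                    const std::optional<int>& key) {
    if (!m || !key) {
        return std::nullopt;
    }
    auto it = m->find(*key);
    if (it == m->end()) {
        return std::nullopt;  // assumption: a missing key also yields NULL
    }
    return it->second;
}
```

A filter such as "WHERE m[?] = 2" then compares NULL with 2, which does not match, rather than crashing.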

Fixes https://github.com/scylladb/scylla/issues/10361
Fixes https://github.com/scylladb/scylla/issues/10399
Fixes https://github.com/scylladb/scylla/pull/10401

Closes #11142

* github.com:scylladb/scylla:
  test/cql-pytest: reproducer for CONTAINS NULL bug
  expressions: don't dereference invalid map subscript in filter
  expressions: fix invalid dereference in map subscript evaluation
  test/cql-pytest: improve tests for map subscripts and nulls

(cherry picked from commit 23a34d7e42)
2022-07-31 15:44:00 +03:00
Benny Halevy
b8740bde6e multishard_mutation_query: do_query: stop ctx if lookup_readers fails
lookup_readers might fail after populating some readers
and those better be closed before returning the exception.

Fixes #10351

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10425

(cherry picked from commit 055141fc2e)
2022-07-25 14:52:58 +03:00
Benny Halevy
1b23f8d038 sstables: time_series_sstable_set: insert: make exception safe
Need to erase the shared sstable from _sstables
if insertion to _sstables_reversed fails.
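
The rollback pattern can be sketched with two standard containers. The struct two_index_set, its members and the fail_second test hook are hypothetical; they only stand in for _sstables and _sstables_reversed.

```cpp
#include <functional>
#include <map>
#include <new>
#include <string>
#include <utility>

// Hypothetical container keeping the same entries in two orders.
struct two_index_set {
    std::multimap<int, std::string> by_pos;
    std::multimap<int, std::string, std::greater<int>> by_pos_reversed;

    // `fail_second` is a test hook that simulates an allocation failure
    // in the second insertion.
    void insert(int pos, std::string name, bool fail_second = false) {
        auto it = by_pos.emplace(pos, name);
        try {
            if (fail_second) {
                throw std::bad_alloc();
            }
            by_pos_reversed.emplace(pos, std::move(name));
        } catch (...) {
            by_pos.erase(it);  // undo the first insertion so both indexes stay in sync
            throw;
        }
    }
};
```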

Fixes #10787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cd68b04fbf)
2022-07-25 14:22:08 +03:00
Tomasz Grabiec
05a228e4c5 memtable: Fix missing range tombstones during reads under certain rare conditions
There is a bug introduced in e74c3c8 (4.6.0) which makes memtable
reader skip a range tombstone for a certain pattern of deletions
and under a certain sequence of events.

_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.

The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.

For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:

[a, c] @ t0
[a, b) @ t1, [b, d] @ t2

c > b

The proper sequence for such versions is (assuming d > c):

[a, b) @ t1,
[b, d] @ t2

Due to the bug, the reader will produce:

[a, b) @ t1,
[b, c] @ t0

The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).

The problem goes away once MVCC versions merge.

Fixes #10913
Fixes #10830

Closes #10914

(cherry picked from commit a6aef60b93)

[avi: backport prerequisite position_range_to_clustering_range() too]
2022-07-19 19:27:15 +03:00
Yaron Kaikov
2ec293ab0e release: prepare for 4.6.5 2022-07-19 16:02:46 +03:00
Pavel Emelyanov
b60f14601e azure_snitch: Do nothing on non-io-cpu
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All but the azure one really do so.
The azure one gets dc/rack on all shards, which is excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.

fixes: #10494

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc87d0)
2022-07-17 14:22:29 +03:00
Raphael S. Carvalho
284dd21ef7 compaction_manager: Fix race when selecting sstables for rewrite operations
Rewrite operations are scrub, cleanup and upgrade.

Race can happen because 'selection of sstables' and 'mark sstables as
compacting' are decoupled. So any deferring point in between can lead
to a parallel compaction picking the same files. After commit 2cf0c4bbf,
files are marked as compacting before rewrite starts, but it didn't
take into account the commit c84217ad which moved retrieval of
candidates to a deferring thread, before rewrite_sstables() is even
called.

Scrub isn't affected by this because it uses a coarse grained approach
where whole operation is run with compaction disabled, which isn't good
because regular compaction cannot run until its completion.

From now on, selection of files and marking them as compacting will
be serialized by running them with compaction disabled.

Now cleanup will also retrieve sstables with compaction disabled,
meaning it will no longer leave uncleaned files behind, which is
important to avoid data resurrection if node regains ownership of
data in uncleaned files.

Fixes #8168.
Refs #8155.

[backport notes:
- minor conflict around run_with_compaction_disabled()
- bumped into our old friend
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95111,
so I had to use std::ref() on local copy of lambda
- with the yielding part of candidate retrieval now happening in
rewrite_sstables(), task registration is moved to after run_with_
compaction_disabled() call, so the latter won't incorrectly try
to stop the task that called it, which triggers an assert in
debug mode.
]

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129133107.53011-1-raphaelsc@scylladb.com>
(cherry picked from commit 80a1ebf0f3)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10963
2022-07-13 18:45:36 +03:00
Pavel Emelyanov
8b52f1d6e7 view: Fix trace-state pointer use after move
It is moved into .mutate_locally() but is still captured and used in its
continuation. This only works because the moved-from pointer looks like
nullptr and all the tracing code checks that it is non-null.
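
The effect can be reproduced with std::shared_ptr, whose moved-from state is guaranteed to be empty. This is only an analogy: trace_state, consume() and the two helper functions are hypothetical names, and the real trace-state pointer is a different type.

```cpp
#include <memory>
#include <utility>

struct trace_state {
    int events = 0;
};

// Takes ownership of the pointer, like a callee the pointer is moved into.
inline void consume(std::shared_ptr<trace_state> p) {
    if (p) {
        ++p->events;
    }
}

// Buggy shape: the pointer is moved away and then inspected again.
inline bool still_valid_after_move(std::shared_ptr<trace_state> tr) {
    consume(std::move(tr));
    return static_cast<bool>(tr);  // false: tracing is silently skipped from here on
}

// Fixed shape: pass a copy and keep the original for later use.
inline bool still_valid_after_copy(std::shared_ptr<trace_state> tr) {
    consume(tr);
    return static_cast<bool>(tr);  // true: the continuation can still trace
}
```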

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/
       (CI job failed on post-actions thus it's red)

Fixes #11015

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220711134152.30346-1-xemul@scylladb.com>
(cherry picked from commit 5526738794)
2022-07-12 14:21:11 +03:00
Piotr Sarna
157951f756 view: exclude using static columns in the view filter
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).

Fixes #10851

Closes #10855

(cherry picked from commit bc3a635c42)
2022-07-11 17:07:22 +03:00
Juliusz Stasiewicz
4f643ed4a5 cdc: check_and_repair_cdc_streams: regenerate if too many streams are present
If the number of streams exceeds the number of token ranges
it indicates that some spurious streams from decommissioned
nodes are present.

In such a situation - simply regenerate.

Fixes #9772

Closes #9780

(cherry picked from commit ea46439858)
2022-07-07 18:53:14 +02:00
Avi Kivity
b598629b7f messaging: do isolate default tenants
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.

In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:

 scheduling_group
 messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
     return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
     if (must_compress) {
         opts.compressor_factory = &compressor_factory;
     }
     opts.tcp_nodelay = must_tcp_nodelay;
     opts.reuseaddr = true;
-    opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    // We send cookies only for non-default statement tenant clients.
+    if (idx > 3) {
+        opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    }

This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was changed to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.

Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load has 99th percentile increase by just
3ms during repair, compared to 10ms+ before.

Fixes #9505.

Closes #10673

(cherry picked from commit c83393e819)
2022-07-05 13:42:10 +03:00
Nadav Har'El
43f82047b9 Merge 'types: fix is_string for reversed types' from Piotr Sarna
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
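
A simplified model of the problem, assuming a reversed type is a wrapper that points at its underlying type. The struct cql_type and both functions are hypothetical and do not mirror the real type hierarchy.

```cpp
// Hypothetical, simplified type descriptor.
struct cql_type {
    enum class kind { ascii, utf8, int32, reversed };
    kind k;
    const cql_type* underlying = nullptr;  // set only for kind::reversed
};

// Broken check: looks only at the outer kind, so a reversed string is missed.
inline bool is_string_naive(const cql_type& t) {
    return t.k == cql_type::kind::ascii || t.k == cql_type::kind::utf8;
}

// Fixed check: unwraps reversed types before deciding.
inline bool is_string(const cql_type& t) {
    if (t.k == cql_type::kind::reversed && t.underlying) {
        return is_string(*t.underlying);
    }
    return is_string_naive(t);
}
```

A text column declared with DESC order corresponds to the reversed wrapper here, which is the case that made LIKE fail.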

Fixes #10183

Closes #10181

* github.com:scylladb/scylla:
  test: add a case for LIKE operator on a descending order column
  types: fix is_string for reversed types

(cherry picked from commit 733672fc54)
2022-07-03 17:59:56 +03:00
Benny Halevy
ec3c07de6e compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.

However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.

Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.

The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.

Fixes #10151

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
(cherry picked from commit 0764e511bb)
2022-07-03 14:30:54 +03:00
Takuya ASADA
82572e8cfe scylla_coredump_setup: support new format of Storage field
Storage field of "coredumpctl info" changed at systemd-v248, it added
"(present)" on the end of line when coredump file available.
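
A parser that accepts both formats could look roughly like this. The actual setup script is not written in C++; the function parse_storage_path and the exact line layout ("Storage:" followed by a path) are assumptions made for this sketch.

```cpp
#include <optional>
#include <string>

// Hypothetical parser for one "Storage:" line of `coredumpctl info` output.
// Returns the path with the optional " (present)" suffix removed.
inline std::optional<std::string> parse_storage_path(const std::string& line) {
    const std::string key = "Storage:";
    const auto pos = line.find(key);
    if (pos == std::string::npos) {
        return std::nullopt;
    }
    std::string value = line.substr(pos + key.size());
    const auto first = value.find_first_not_of(" \t");
    if (first == std::string::npos) {
        return std::nullopt;
    }
    value.erase(0, first);
    const std::string suffix = " (present)";
    if (value.size() >= suffix.size() &&
        value.compare(value.size() - suffix.size(), suffix.size(), suffix) == 0) {
        value.erase(value.size() - suffix.size());
    }
    return value;
}
```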

Fixes #10669

Closes #10714

(cherry picked from commit ad2344a864)
2022-07-03 13:55:25 +03:00
Nadav Har'El
2b9ed79c6f alternator: forbid empty AttributesToGet
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression parameters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.

Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax

However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.

So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).
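
The added check amounts to something like the sketch below. The function name and the exception type are assumptions; the sketch covers only the case where the parameter is present in the request.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical validation for an AttributesToGet list that was present in
// the request. An empty list is rejected instead of meaning "everything".
inline void validate_attributes_to_get(const std::vector<std::string>& attrs) {
    if (attrs.empty()) {
        throw std::invalid_argument("AttributesToGet must not be empty");
    }
}
```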

Fixes #10332

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
(cherry picked from commit 9c1ebdceea)
2022-07-03 13:36:02 +03:00
Avi Kivity
ab0b6fd372 Update seastar submodule (json crash in describe_ring)
* seastar 7a430a0830...8b2c13b346 (1):
  > Merge 'stream_range_as_array: always close output stream' from Benny Halevy

Fixes #10592.
2022-06-08 16:49:53 +03:00
Nadav Har'El
12f1718ef4 alternator: allow DescribeTimeToLive even without TTL enabled
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!

This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.

The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.

After this patch, the following test now works on Scylla without
experimental flags turned on:

    test/alternator/run test_ttl.py::test_describe_ttl_without_ttl

Refs #10660

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8ecf1e306f)
2022-05-30 20:40:34 +03:00
Tomasz Grabiec
322dfe8403 sstable: partition_index_cache: Fix abort on bad_alloc during page loading
When entry loading fails and there is another request blocked on the
same page, attempt to erase the failed entry will abort because that
would violate entry_ptr guarantees, which is supposed to keep the
entry alive.

The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch makes a more
general approach by relaxing the assert.

The assert manifested like this:

scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.

Fixes #10617

Closes #10653

(cherry picked from commit f87274f66a)
2022-05-30 13:00:46 +03:00
Beni Peled
11f008e8fd release: prepare for 4.6.4 2022-05-16 15:20:35 +03:00
Benny Halevy
fd7314a362 table: clear: serialize with ongoing flush
Get all flush permits to serialize with any
ongoing flushes and prevent further flushes
during table::clear, in particular while calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.

Fixes #10423

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
2022-05-15 13:43:43 +03:00
Raphael S. Carvalho
d27468f078 compaction: LCS: don't write to disengaged optional on compaction completion
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup

Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().

_last_compacted_keys is used by regular compaction for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.

Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.

To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
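
The guard can be sketched with std::optional. The struct leveled_state and the int keys are hypothetical simplifications; only the names last_compacted_keys and notify_completion echo the ones in this commit.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified strategy state: one last-compacted key per level.
struct leveled_state {
    std::optional<std::vector<int>> last_compacted_keys;

    // Returns false and does nothing when the optional is disengaged,
    // instead of throwing std::bad_optional_access.
    bool notify_completion(std::size_t level, int key) {
        if (!last_compacted_keys) {
            return false;
        }
        last_compacted_keys->at(level) = key;
        return true;
    }
};
```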

compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.

Fixes #10378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10508

(cherry picked from commit 8e99d3912e)
2022-05-15 13:20:30 +03:00
Juliusz Stasiewicz
74ef1ee961 CQL: Replace assert by exception on invalid auth opcode
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.

Fixes #10487

Closes #10503

(cherry picked from commit 603dd72f9e)
2022-05-10 14:03:03 +02:00
Benny Halevy
07549d159c compaction: time_window_compaction_strategy: reset estimated_remaining_tasks when running out of candidates
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.

Refs #10418
(to be closed when the fix reaches branch-4.6)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10419

(cherry picked from commit 01f41630a5)
2022-05-09 09:36:22 +03:00
Eliran Sinvani
189bbcd82d prepared_statements: Invalidate batch statement too
It seems that batch prepared statements always return false for
depends_on, which in turn makes the removal criterion of the
prepared statements cache always false, so the queries are
never evicted.
Here we change the function to return the true state, meaning
it returns true if one of the sub-queries depends on the
keyspace and/or column family.
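
In outline, the batch answers the question by asking each of its sub-statements. The two structs below are hypothetical stand-ins and the signature is simplified to two plain strings.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical single statement touching exactly one table.
struct simple_statement {
    std::string keyspace;
    std::string table;
    bool depends_on(const std::string& ks, const std::string& cf) const {
        return keyspace == ks && table == cf;
    }
};

// A batch depends on a table if any of its sub-statements does.
struct batch_statement {
    std::vector<simple_statement> statements;
    bool depends_on(const std::string& ks, const std::string& cf) const {
        return std::any_of(statements.begin(), statements.end(),
                           [&](const simple_statement& s) { return s.depends_on(ks, cf); });
    }
};
```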

Fixes #10129

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
2022-05-08 12:33:00 +03:00
Eliran Sinvani
70e6921125 cql3 statements: Change dependency test API to better express its purpose

CQL statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The latter took only a table name as a
parameter, which makes no sense: there could be multiple tables with the
same name, each in a different keyspace, and it doesn't make sense to
generalize the test, i.e. to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls into one, depends_on, which takes a
keyspace name and optionally also a table name, so that every logical
dependency test that makes sense is supported by a single API call.
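
The shape of the unified call might look like the following. The struct statement_target and the use of std::optional for the table name are assumptions of this sketch, not the actual signature.

```cpp
#include <optional>
#include <string>

// Hypothetical description of what a statement touches.
struct statement_target {
    std::string keyspace;
    std::string table;
};

// One call covers both questions: "does it depend on keyspace ks?" and
// "does it depend on table ks.cf?". A bare table name cannot be asked about.
inline bool depends_on(const statement_target& t, const std::string& ks,
                       const std::optional<std::string>& cf = std::nullopt) {
    return t.keyspace == ks && (!cf || t.table == *cf);
}
```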

(cherry picked from commit bf50dbd35b)

Ref #10129
2022-05-08 12:32:41 +03:00
Calle Wilund
e314158708 cdc: Ensure columns removed from log table are registered as dropped
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.

Should probably be backported to all CDC enabled versions.

Fixes #10473
Closes #10474

(cherry picked from commit 78350a7e1b)
2022-05-05 11:34:56 +02:00
Tomasz Grabiec
46586532c9 loading_cache: Make invalidation take immediate effect
There are two issues with current implementation of remove/remove_if:

  1) If it happens concurrently with get_ptr(), the latter may still
  populate the cache using value obtained from before remove() was
  called. remove() is used to invalidate caches, e.g. the prepared
  statements cache, and the expected semantic is that values
  calculated from before remove() should not be present in the cache
  after invalidation.

  2) As long as there is any active pointer to the cached value
  (obtained by get_ptr()), the old value from before remove() will be
  still accessible and returned by get_ptr(). This can make remove()
  have no effect indefinitely if there is persistent use of the cache.

One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinator will fail
to process the value with the following exception:

  User Defined Type value contained too many fields (expected 5, got 6)

The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.

The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
evaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.

Fixes #10117

Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
2022-05-04 15:38:11 +03:00
Avi Kivity
0114244363 Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.

Fixes: #10450

Closes #10451

* github.com:scylladb/scylla:
  replica/database: drop_column_family(): drop querier cache entries after waiting for ops
  replica/database: finish coroutinizing drop_column_family()
  replica/database: make remove(const column_family&) private

(cherry picked from commit 7f1e368e92)
2022-05-01 17:11:52 +03:00
Avi Kivity
f154c8b719 Update tools/java submodule (bad IPv6 addresses in nodetool)
* tools/java 05ec511bbb...46744a92ff (1):
  > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index

Fixes #10442
2022-04-28 11:35:09 +03:00
Beni Peled
8bf149fdd6 release: prepare for 4.6.3 2022-04-14 14:16:52 +03:00
Tomasz Grabiec
0265d56173 utils/chunked_managed_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no user impact.

Fixes #10364.

Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
(cherry picked from commit 0c365818c3)
2022-04-13 10:29:30 +03:00
Tomasz Grabiec
e50452ba43 utils/chunked_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no known user impact.

Fixes #10363.

Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)

[avi: make max_chunk_capacity() public for backport]
2022-04-13 10:29:03 +03:00
Avi Kivity
a205f644cb transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.

Fix by updating the error codes.
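
The mapping can be sketched as below. The numeric values are the error codes defined by the CQL binary protocol; the enum and the function code_for_protocol are hypothetical names, not the classes used in the transport layer.

```cpp
#include <cstdint>

// Error codes as defined by the CQL binary protocol.
enum class error_code : std::int32_t {
    write_timeout = 0x1100,
    read_timeout = 0x1200,
    read_failure = 0x1300,   // added in protocol v4
    write_failure = 0x1500,  // added in protocol v4
};

// Hypothetical helper: picks the code a client speaking `version` can understand.
inline error_code code_for_protocol(error_code code, int version) {
    if (version >= 4) {
        return code;
    }
    switch (code) {
    case error_code::read_failure:
        return error_code::read_timeout;
    case error_code::write_failure:
        return error_code::write_timeout;
    default:
        return code;
    }
}
```

The point is that the code on the wire changes together with the exception type, so a v3 client never sees a code it does not know.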

A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.

Fixes #5610.

Closes #10362

(cherry picked from commit 987e6533d2)
2022-04-13 09:49:02 +03:00
Tomasz Grabiec
f136b5b950 utils/chunked_managed_vector: Fix corruption in case there is more than one chunk
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.

Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.

Currently, the container is only used in the sstable partition index
cache.

Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.

Introduced in 78e5b9fd85 (4.6.0)

Fixes #10290

Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
(cherry picked from commit 41fe01ecff)
2022-04-08 10:53:52 +03:00
Takuya ASADA
69a1325884 docker: enable --log-to-stdout which was mistakenly disabled
Since our Docker image moved to Ubuntu, we mistakenly copy
dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not
used in Ubuntu (it should be /etc/default).
So /etc/default/scylla-server is just the default configuration of the
scylla-server .deb package, and --log-to-stdout is 0, the same as in a normal installation.

We don't want to keep the duplicated configuration file anyway,
so let's drop dist/docker/etc/sysconfig/scylla-server and configure
/etc/default/scylla-server in build_docker.sh.

Fixes #10270

Closes #10280

(cherry picked from commit bdefea7c82)
2022-04-07 12:13:35 +03:00
Avi Kivity
ab153c9b94 Update seastar submodule (logger deadlock with large messages)
* seastar 34e58f9995...94a462d94b (2):
  > log: Fix silencer to be shard-local and logger-global
  > log: Silence logger when logging

Fixes #10336.
2022-04-05 19:43:49 +03:00
Beni Peled
eb372d7f03 release: prepare for 4.6.2 2022-04-05 16:59:53 +03:00
Takuya ASADA
e232711e7e docker: run scylla as root
Previous versions of the Docker image ran scylla as root, but cb19048
accidentally changed it to the scylla user.
To keep compatibility we need to revert this to root.

Fixes #10261

Closes #10325

(cherry picked from commit f95a531407)
2022-04-05 12:46:12 +03:00
Takuya ASADA
0a440b6d4a docker: revert scylla-server.conf service name change
We changed supervisor service name at cb19048, but this breaks
compatibility with scylla-operator.
To fix the issue we need to revert the service name to previous one.

Fixes #10269

Closes #10323

(cherry picked from commit 41edc045d9)
2022-04-05 12:42:36 +03:00
Piotr Sarna
00bb1e8145 cql3: fix qualifying restrictions with IN for indexing
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which led to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.

Fixes #10300

Closes #10302

(cherry picked from commit c0fd53a9d7)
2022-04-03 11:21:43 +03:00
Avi Kivity
e30dbee2db Update seastar submodule (pidof command not installed)
* seastar 50e1549b2c...34e58f9995 (1):
  > seastar-cpu-map.sh: switch from pidof to pgrep
Fixes #10238.
2022-03-29 12:40:17 +03:00
Beni Peled
2309d6b51e release: prepare for 4.6.1 2022-03-28 10:57:31 +03:00
Benny Halevy
b77ca07709 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.
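
A minimal sketch of the ordering described above, assuming a simplified cell that carries only `expiry` and `ttl`. The struct and function names are hypothetical, the sketch returns an `int` rather than a `std::strong_ordering`, and a positive result is assumed to mean that the left cell is the one kept by the merge.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for an expiring atomic cell.
struct expiring_cell_sketch {
    int64_t expiry; // absolute expiration time, in seconds
    int64_t ttl;    // time-to-live the cell was written with, in seconds
};

// Three-way comparison helper: returns -1, 0 or 1.
inline int three_way_sketch(int64_t a, int64_t b) {
    return (a > b) - (a < b);
}

// Assumption: a positive result means `left` is the cell kept by the merge.
inline int compare_expiring_cells_sketch(const expiring_cell_sketch& left,
                                         const expiring_cell_sketch& right) {
    if (int cmp = three_way_sketch(left.expiry, right.expiry)) {
        return cmp;
    }
    // Reverse order: with equal expiry, the smaller ttl was written later.
    return three_way_sketch(right.ttl, left.ttl);
}
```

With equal expiry, the operands of the ttl comparison are swapped, so the cell with the smaller ttl compares as greater.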

Fixes #10173

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
2022-03-24 18:08:07 +02:00
Benny Halevy
bb0a38f889 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttls.

The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.

This may be triggered by e.g. the spark migrator that computes the ttl
based on the expiry time by subtracting the expiry time from the current
time to produce a respective ttl.

If the cell is migrated multiple times at different times, it will generate
cells that have the same expiry (by design) but different ttl values.

Fixes #10156

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
2022-03-24 18:08:07 +02:00
Benny Halevy
c48fd03463 atomic_cell: compare_atomic_cell_for_merge: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-2-bhalevy@scylladb.com>
(cherry picked from commit d43da5d6dc)

Ref #10156
2022-03-24 18:07:54 +02:00
Benny Halevy
eb78e6d4b8 atomic_cell: compare_atomic_cell_for_merge: simplify expiry/deletion_time comparison
No need to first check that the cells' expiry is different
or that deletion_time is different before comparing them
with `<=>`.

If they are the same the function returns std::strong_ordering::equal
anyhow and that is the same as `<=>` comparing identical values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com>
(cherry picked from commit be865a29b8)

Ref #10156
2022-03-24 18:07:32 +02:00
Avi Kivity
4b1b0a55c0 replica, atomic_cell: move atomic_cell merge code from replica module to atomic_cell.cc
compare_atomic_cell_for_merge() was placed in database.cc, before
atomic_cell.cc existed. Move it to its correct place.

Closes #9889

(cherry picked from commit 6c53717a39)
2022-03-24 18:07:11 +02:00
Benny Halevy
172a8628d5 main: shutdown: do not abort on certain system errors
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.

The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.

This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind.  Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.
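
A rough sketch of the decision described above. The source does not list which errors count as "environmental", so the three `errno` values below are purely illustrative assumptions, and both function names are hypothetical.

```cpp
#include <cerrno>
#include <cstdlib>
#include <system_error>

// Assumption: the set of "environmental" errors is illustrative only;
// the source does not enumerate them.
inline bool is_environmental_error_sketch(const std::system_error& e) {
    const auto& category = e.code().category();
    if (category != std::system_category() && category != std::generic_category()) {
        return false;
    }
    switch (e.code().value()) {
    case EIO:
    case EACCES:
    case ENOSPC:
        return true;
    default:
        return false;
    }
}

// Hypothetical shutdown error handler: exits without a core dump for
// environmental errors, and aborts explicitly for everything else.
[[noreturn]] inline void on_shutdown_error_sketch(const std::system_error& e) {
    if (is_environmental_error_sketch(e)) {
        std::_Exit(255);
    }
    std::abort();
}
```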

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
(cherry picked from commit 132c9d5933)
2022-03-24 14:49:24 +02:00
Nadav Har'El
5688b125e6 Seastar: backport Seastar fix for missing string escape in JSON output
Backported Seastar fix:
  > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz

Fixes #9061

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-23 21:27:13 +02:00
Piotr Sarna
6da4acb41e expression: fix get_value for mismatched column definitions
As observed in #10026, after schema changes it somehow happened
that a column definition that does not match any of the base table
columns was passed to expression verification code.
The function that looks up the index of a column happens to return
-1 when it doesn't find anything, so using this returned index
without checking if it's nonnegative results in accessing invalid
vector data, and a segfault or silent memory corruption.
Therefore, an explicit check is added to see if the column was actually
found. This serves two purposes:
 - avoiding segfaults/memory corruption
 - making it easier to investigate the root cause of #10026
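
The kind of check described above can be sketched as follows. This is not the actual expression code: the function names and the use of plain string vectors are hypothetical simplifications.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical lookup: returns the position of `name` in `columns`, or -1.
inline int find_column_index_sketch(const std::vector<std::string>& columns,
                                    const std::string& name) {
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (columns[i] == name) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Returns the value for `name`, throwing instead of indexing with -1.
inline const std::string& get_value_sketch(const std::vector<std::string>& columns,
                                           const std::vector<std::string>& values,
                                           const std::string& name) {
    const int index = find_column_index_sketch(columns, name);
    if (index < 0) {
        throw std::runtime_error("column not found: " + name);
    }
    return values.at(static_cast<std::size_t>(index));
}
```

Throwing on a negative index turns silent memory corruption into a visible error that names the missing column.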

Closes #10039

(cherry picked from commit 7b364fec9849e9a342af1c240e3a7185bf5401ef)
2022-03-21 10:46:34 +01:00
Botond Dénes
f09cc9a01d Merge 'service: storage_service: announce new CDC generation immediately with RBNO' from Kamil Braun
When a new CDC generation is created (during bootstrap or otherwise), it
is assigned a timestamp. The timestamp must be propagated as soon as
possible, so all live nodes can learn about the generation before their
clocks reach the generation's timestamp. The propagation mechanism for
generation timestamps is gossip.

When bootstrap RBNO was enabled this was not the case: the generation
timestamp was inserted into gossiper state too late, after the repair
phase finished. Fix this.

Also remove an obsolete comment.

Fixes https://github.com/scylladb/scylla/issues/10149.

Closes #10154

* github.com:scylladb/scylla:
  service: storage_service: announce new CDC generation immediately with RBNO
  service: storage_service: fix indentation

(cherry picked from commit f1b2ff1722)
2022-03-16 12:27:24 +01:00
Raphael S. Carvalho
cd2e33ede4 compaction_manager: Abort reshape for tables waiting for a chance to run
Tables waiting for a chance to run reshape wouldn't trigger stop
exception, as the exception was only being triggered for ongoing
compactions. Given that stop reshape API must abort all ongoing
tasks and all pending ones, let's change run_custom_job() to
trigger the exception if it found that the pending task was
asked to stop.

Tests:
dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces
unit: dev

Fixes #9836.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com>
(cherry picked from commit 07fba4ab5d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311183053.46625-1-raphaelsc@scylladb.com>
2022-03-15 16:58:47 +02:00
Benny Halevy
32d0698d78 compaction_manager: rewrite_sstables: do not acquire table write lock
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88045.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables) seen in #10060, and
possibly #10166.

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177

(cherry picked from commit 11ea2ffc3c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220314151416.2496374-1-bhalevy@scylladb.com>
2022-03-14 18:15:49 +02:00
Piotr Jastrzebski
93cf43ae4b cdc: Handle compact storage correctly in preimage
Base tables that use compact storage may have a special artificial
column that has an empty type.

c010cefc4d fixed the main CDC path to
handle such columns correctly and to not include them in the CDC Log
schema.

This patch makes sure that generation of preimage ignores such empty
column as well.

Fixes #9876
Closes #9910

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 09d4438a0d)
2022-03-10 14:25:02 +02:00
Nadav Har'El
2f2d22a864 cql: INSERT JSON should refuse empty-string partition key
Add the missing partition-key validation in INSERT JSON statements.

Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).

Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.

The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.

Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.
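
A simplified sketch of how an override can lose a validation step that lives in the base method. All type and function names here are hypothetical stand-ins, keys are plain strings, the length limit of 65535 is an assumption (the source only says that a limit exists), and the second error message is invented for illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the shared validation function.
inline void validate_partition_key_sketch(const std::string& key) {
    if (key.empty()) {
        throw std::invalid_argument("Key may not be empty");
    }
    if (key.size() > 65535) { // assumed limit, not taken from the source
        throw std::invalid_argument("Key length exceeds the limit");
    }
}

struct modification_statement_sketch {
    virtual ~modification_statement_sketch() = default;
    // INSERT, UPDATE and DELETE inherit this method and its validation.
    virtual std::string build_partition_key(const std::string& raw) const {
        validate_partition_key_sketch(raw);
        return raw;
    }
};

struct insert_json_statement_sketch : modification_statement_sketch {
    // The override replaces the base method entirely, so it has to call
    // the validation itself; omitting this call reproduces the loophole.
    std::string build_partition_key(const std::string& raw) const override {
        validate_partition_key_sketch(raw);
        return raw;
    }
};
```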

This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.

Reported by @FortTell
Fixes #9853.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
(cherry picked from commit 8fd5041092)
2022-03-02 22:00:15 +02:00
Avi Kivity
5f92f54f06 Merge 'utils: cached_file: Fix alloc-dealloc mismatch during eviction' from Tomasz Grabiec
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.

The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.

The series also adds two safety checks to LSA to catch such problems earlier.

Fixes #10056

\cc @slivne @bhalevy

Closes #10130

* github.com:scylladb/scylla:
  lsa: Abort when trying to free a standard allocator object not allocated through the region
  lsa: Abort when _non_lsa_memory_in_use goes negative
  tests: utils: cached_file: Validate occupancy after eviction
  test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
  utils: cached_file: Fix alloc-dealloc mismatch during eviction

(cherry picked from commit ff2cd72766)
2022-02-26 11:28:53 +02:00
Benny Halevy
395f2459b4 cql3: result_set: remove std::ref from comparator&
Applying std::ref on `RowComparator& cmp` hits the
following compilation error on Fedora 34 with
libstdc++-devel-11.2.1-9.fc34.x86_64

```
FAILED: build/dev/cql3/statements/select_statement.o
clang++ -MD -MT build/dev/cql3/statements/select_statement.o -MF build/dev/cql3/statements/select_statement.o.d -I/home/bhalevy/dev/scylla/seastar/include -I/home/bhalevy/dev/scylla/build/dev/seastar/gen/include -std=gnu++20 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=6 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_TYPE_ERASE_MORE -DFMT_LOCALE -DFMT_SHARED -I/usr/include/p11-kit-1  -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -O2 -DSCYLLA_ENABLE_WASMTIME -iquote. -iquote build/dev/gen --std=gnu++20  -ffile-prefix-map=/home/bhalevy/dev/scylla=.  -march=westmere -DBOOST_TEST_DYN_LINK   -Iabseil -fvisibility=hidden  -Wall -Werror -Wno-mismatched-tags -Wno-tautological-compare -Wno-parentheses-equality -Wno-c++11-narrowing -Wno-sometimes-uninitialized -Wno-return-stack-address -Wno-missing-braces -Wno-unused-lambda-capture -Wno-overflow -Wno-noexcept-type -Wno-error=cpp -Wno-ignored-attributes -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-defaulted-function-deleted -Wno-redeclared-class-member -Wno-unsupported-friend -Wno-unused-variable -Wno-delete-non-abstract-non-virtual-dtor -Wno-braced-scalar-init -Wno-implicit-int-float-conversion -Wno-delete-abstract-non-virtual-dtor -Wno-uninitialized-const-reference -Wno-psabi -Wno-narrowing -Wno-array-bounds -Wno-nonnull -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DHAVE_LZ4_COMPRESS_DEFAULT  -c -o build/dev/cql3/statements/select_statement.o cql3/statements/select_statement.cc
In file included from cql3/statements/select_statement.cc:14:
In file included from ./cql3/statements/select_statement.hh:16:
In file included from ./cql3/statements/raw/select_statement.hh:16:
In file included from ./cql3/statements/raw/cf_statement.hh:16:
In file included from ./cql3/cf_name.hh:16:
In file included from ./cql3/keyspace_element_name.hh:16:
In file included from /home/bhalevy/dev/scylla/seastar/include/seastar/core/sstring.hh:25:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/algorithm:74:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/pstl/glue_algorithm_defs.h:13:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/functional:58:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: error: exception specification of 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' uses itself
                = decltype(reference_wrapper::_S_fun(std::declval<_Up>()))>
                                                     ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: note: in instantiation of exception specification for 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' requested here
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:321:2: note: in instantiation of default argument for 'reference_wrapper<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' required here
        reference_wrapper(_Up&& __uref)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1017:57: note: while substituting deduced template arguments into function template 'reference_wrapper' [with _Up = __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, $1 = (no value), $2 = (no value)]
      = __bool_constant<__is_nothrow_constructible(_Tp, _Args...)>;
                                                        ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1023:14: note: in instantiation of template type alias '__is_nothrow_constructible_impl' requested here
    : public __is_nothrow_constructible_impl<_Tp, _Args...>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:153:14: note: in instantiation of template class 'std::is_nothrow_constructible<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
    : public conditional<_B1::value, _B2, _B1>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:298:11: note: (skipping 8 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
          return __and_<typename _Base::_Local_storage,
                 ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1933:13: note: in instantiation of function template specialization 'std::__partial_sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
              std::__partial_sort(__first, __last, __last, __comp);
                   ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1954:9: note: in instantiation of function template specialization 'std::__introsort_loop<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, long, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
          std::__introsort_loop(__first, __last,
               ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:4875:12: note: in instantiation of function template specialization 'std::__sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
      std::__sort(__first, __last, __gnu_cxx::__ops::__iter_comp_iter(__comp));
           ^
./cql3/result_set.hh:168:14: note: in instantiation of function template specialization 'std::sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>' requested here
        std::sort(_rows.begin(), _rows.end(), std::ref(cmp));
             ^
cql3/statements/select_statement.cc:773:21: note: in instantiation of function template specialization 'cql3::result_set::sort<std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>' requested here
                rs->sort(_ordering_comparator);
                    ^
1 error generated.
ninja: build stopped: subcommand failed.
```

Fixes #10079.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215071955.316895-3-bhalevy@scylladb.com>
(cherry picked from commit 3e20fee070)

[avi: backport for developer quality-of-life rather than as a bug fix]
2022-02-16 10:08:24 +02:00
Raphael S. Carvalho
019d50bb5c Revert "sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME"
This reverts commit 4c05e5f966.

Moving cleanup to maintenance group made its operation time up to
10x slower than in the previous release. It's a blocker for the 4.6 release,
so let's revert it until we figure this all out.

Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.

Fixes #10060.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213165147.56204-1-raphaelsc@scylladb.com>

Ref: a9427f150a
2022-02-14 12:10:38 +02:00
Avi Kivity
bbe775b926 utils: logalloc: correct and adjust timing unit in stall report
The stall report uses the millisecond unit, but actually reports
nanoseconds.

Switch to microseconds (milliseconds are a bit too coarse) and
use the safer "duration / 1us" style rather than "duration::count()"
that leads to unit confusion.
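
A small sketch of the "duration / 1us" style using only `std::chrono`. The helper name is hypothetical.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: converts a stall duration to whole microseconds.
inline int64_t stall_in_microseconds_sketch(std::chrono::nanoseconds stall) {
    using namespace std::chrono_literals;
    // Dividing by a duration literal states the unit at the point of use;
    // stall.count() would return nanoseconds here, whatever the label says.
    return stall / 1us;
}
```

Dividing one duration by another yields a plain number, so the unit printed next to the value cannot drift away from the unit actually computed.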

Fixes #9733.

Closes #9734

(cherry picked from commit f907205b92)
2022-02-12 15:56:42 +02:00
Yaron Kaikov
469c94ea17 release: prepare for 4.6.0 2022-02-08 16:45:50 +02:00
Nadav Har'El
4c780d0265 alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.
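
The behavior after the patch can be modeled roughly as below. This is a toy model, not Alternator code: every top-level attribute is assumed to be a map of strings, and the type and function names are hypothetical.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Toy item: top-level attribute name -> nested map of string values.
using item_sketch = std::map<std::string, std::map<std::string, std::string>>;

// Models "REMOVE x.y". Returns true if an attribute was actually removed.
inline bool remove_nested_sketch(item_sketch& item,
                                 const std::string& x,
                                 const std::string& y) {
    auto it = item.find(x);
    if (it == item.end()) {
        // "x" itself is missing: this remains an error.
        throw std::invalid_argument("document paths not valid for this item");
    }
    // "x" exists: removing a missing "x.y" silently does nothing.
    return it->second.erase(y) > 0;
}
```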

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
2022-02-08 11:48:18 +02:00
Michael Livshin
0181de1f2c shard_reader: check that _reader is valid before dereferencing
After fc729a804, `shard_reader::close()` is not interrupted with an
exception any more if read-ahead fails, so `_reader` may in fact be
null.

Fixes #9923

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220117120405.152927-1-michael.livshin@scylladb.com>
(cherry picked from commit d7a993043d)
2022-02-07 10:10:58 +02:00
Benny Halevy
7597a79ef9 shard_reader: Continue after read_ahead error
If read ahead failed, just issue a log warning
and proceed to close the reader.

Currently co_await will throw and the evictable reader
won't be closed.

This is seen occasionally in testing, e.g.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/1010/artifact/logs-all.debug.2/1640918573898_lwt_banking_load_test.py%3A%3ATestLWTBankingLoad%3A%3Atest_bank_with_nemesis/node2.log
```
ERROR 2021-12-31 02:40:56,160 [shard 0] mutation_reader - shard_reader::close(): failed to stop reader on shard 1: seastar::named_semaphore_timed_out (Semaphore timed out: _system_read_concurrency_sem)
```

Fixes #9865.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220102124636.2791544-1-bhalevy@scylladb.com>
(cherry picked from commit fc729a804b)
2022-02-07 10:09:05 +02:00
Nadav Har'El
8f5148e921 docker: don't repeat "--alternator-address" option twice
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.

So this patch fixes the scyllasetup.py script to only pass this
parameter once.

Fixes #10016.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
2022-02-03 18:39:47 +02:00
Yaron Kaikov
5694ec189f release: prepare for 4.6.rc5 2022-02-03 16:19:46 +02:00
Calle Wilund
34d470967a commitlog: Fix double clearing of _segment_allocating shared_future.
Fixes #10020

Previous fix 445e1d3 tried to close one double invocation,  but added
another, since it failed to ensure all potential nullings of the opt
shared_future happened before a new allocator could reset it.

This simplifies the code by making clearing the shared_future a
pre-requisite for resolving its contents (as read by waiters).

Also removes any need for try-catch etc.

Closes #10024

(cherry picked from commit 1e66043412)
2022-02-03 07:43:18 +02:00
Calle Wilund
61db571a44 commitlog: Ensure we never have more than one new_segment call at a time
Refs #9896

Found by @eliransin. Call to new_segment was wrapped in with_timeout.
This means that if primary caller timed out, we would leave new_segment
calls running, but potentially issue new ones for next caller.

This could lead to reserve segment queue being read simultaneously. And
it is not what we want.

Change to always use the shared_future wait, all callers, and clear it
only on result (exception or segment)

Closes #10001

(cherry picked from commit 445e1d3e41)
2022-02-01 09:10:27 +02:00
Tomasz Grabiec
5b5a300a9e util: cached_file: Fix corruption after memory reclamation was triggered from population
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.

Fix by running emplace() under allocating section, which disables
memory reclamation.

The bug manifests with assert failures, e.g:

./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.

Fixes #9915

Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
(cherry picked from commit b734615f51)
2022-01-31 01:24:47 +02:00
Avi Kivity
148a65d0d6 Update seastar submodule (gratuitous exceptions on allocation failure)
* seastar a189cdc45d...a375681303 (1):
  > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed

Fixes #9982.
2022-01-30 20:02:24 +02:00
Avi Kivity
e3ad14d55f Point seastar submodule at scylla-seastar.git
This allows us to backport fixes to seastar selectively.
2022-01-30 20:01:12 +02:00
Calle Wilund
2b506c2d4a commitlog: Ensure we don't run continuation (task switch) with queues modified
Fixes #9955

In #9348 we handled the problem of failing to delete segment files on disk, and
the need to recompute disk footprint to keep data flow consistent across intermittent
failures. However, because _reserve_segments and _recycled_segments are queues, we
have to empty them to inspect the contents. One would think it is ok for these
queues to be empty for a while, whilst we do some recalculating, including
disk listing -> continuation switching. But then one (i.e. I) misses the fact
that these queues use the pop_eventually mechanism, which does _not_ handle
a scenario where we push something into an empty queue, thus triggering the
future that resumes a waiting task, but then pop the element immediately, before
the waiting task is run. In fact, _iff_ one does this, not only will things break,
they will in fact start creating undefined behaviour, because the underlying
std::queue<T, circular_buffer> will _not_ do any bounds checks on the pop/push
operations -> we will pop an empty queue, immediately making it non-empty, but
using undefined memory (with luck null/zeroes).

Strictly speaking, seastar::queue::pop_eventually should be fixed to handle
the scenario, but nonetheless we can fix the usage here as well, by simply copying
the objects and doing the calculation "in background" while we potentially start
popping the queue again.

Closes #9966

(cherry picked from commit 43f51e9639)
2022-01-27 10:24:03 +02:00
Avi Kivity
50aad1c668 Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
We found that the monitor mode of mdadm does not work on RAID0, and that this is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540

----

This reverts 0d8f932 and introduces the correct fix.

Closes #9970

* github.com:scylladb/scylla:
  scylla_raid_setup: use mdmonitor only when RAID level > 0
  Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"

(cherry picked from commit df22396a34)
2022-01-27 10:21:25 +02:00
Yaron Kaikov
7bf3f37cd1 release: prepare for 4.6.rc4 2022-01-23 10:44:09 +02:00
Botond Dénes
0f7f8585f2 reader_permit: release_base_resources(): also update _resources
If the permit was admitted, _base_resources was already accounted in
_resources and therefore has to be deducted from it, otherwise the permit
will think it leaked some resources on destruction.

Test:
dtest(repair_additional_test.py.test_repair_one_missing_row_diff_shard_count)

Refs: #9751
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220119132550.532073-1-bdenes@scylladb.com>
(cherry picked from commit a65b38a9f7)
2022-01-20 18:39:25 +02:00
Pavel Emelyanov
2c65c4a569 Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec
Appending an empty range adjacent to an existing range tombstone would
not deoverlap (by dropping the empty range tombstone), resulting in
a different (non-canonical) result depending on the order of appending.

Suppose that range tombstone [a, b] covers range tombstone [x, x), and [a, x) and [x, b) are range tombstones which correspond to [a, b] split around position x.

Appending [a, x) then [x, b) then [x, x) would give [a, b)
Appending [a, x) then [x, x) then [x, b) would give [a, x), [x, x), [x, b)

The fix is to drop empty range tombstones in range_tombstone_list so that the result is canonical.
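
A toy model of why dropping empty ranges makes the result independent of append order. It is a deliberate simplification and not the real range_tombstone_list: positions are plain integers, ranges are half-open, all ranges are assumed to carry the same tombstone, and the class name is hypothetical.

```cpp
#include <utility>
#include <vector>

// Hypothetical, simplified model: half-open [start, end) ranges over ints,
// all carrying the same tombstone, appended in position order.
class range_list_sketch {
    std::vector<std::pair<int, int>> _ranges;
public:
    void append(int start, int end) {
        if (start >= end) {
            return; // an empty range covers nothing: drop it
        }
        if (!_ranges.empty() && _ranges.back().second == start) {
            _ranges.back().second = end; // adjacent: extend the previous range
            return;
        }
        _ranges.emplace_back(start, end);
    }

    const std::vector<std::pair<int, int>>& ranges() const {
        return _ranges;
    }
};
```

In this model, appending [1, 5), [5, 9), [5, 5) and appending [1, 5), [5, 5), [5, 9) both leave the single range [1, 9).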

Fixes #9661

Closes #9764

* github.com:scylladb/scylla:
  range_tombstone_list: Deoverlap adjacent empty ranges
  range_tombstone_list: Convert to work in terms of position_in_partition

(cherry picked from commit b2a62d2b59)
2022-01-20 12:35:21 +02:00
Avi Kivity
f85cd289bc Merge "repair: make sure there is one permit per repair with count res" from Botond
"
Repair obtains a permit for each repair-meta instance it creates. This
permit is supposed to track all resources consumed by that repair as
well as ensure concurrency limit is respected. However when the
non-local reader path is used (shard config of master != shard config of
follower), a second permit will be obtained -- for the shard reader of
the multishard reader. This creates a situation where the repair-meta's
permit can block the shard permit, creating a deadlock situation.
This patch solves this by dropping the count resource on the
repair-meta's permit when a non-local reader path is executed -- that is
a multishard reader is created.

Fixes: #9751
"

* 'repair-double-permit-block/v4' of https://github.com/denesb/scylla:
  repair: make sure there is one permit per repair with count res
  reader_permit: add release_base_resource()

(cherry picked from commit 52b7778ae6)
2022-01-17 16:02:55 +02:00
Beni Peled
5e661af9a4 release: prepare for 4.6.rc3 2022-01-17 13:11:54 +02:00
Calle Wilund
5629b67d25 messaging_service: Make dc/rack encryption check for connection more strict
Fixes #9653

When doing an outgoing connection, in an internode_encryption=dc/rack situation
we should not use endpoint/local broadcast solely to determine if we can
downgrade a connection.

If gossip/message_service determines that we will connect to a different
address than the "official" endpoint address, we should use this to determine
association of target node, and similarly, if we bind outgoing connection
to interface != bc we need to use this to decide local one.

Note: This will effectively _disable_ internode_encryption=dc/rack on ec2 etc
until such time that gossip can give accurate info on dc/rack for "internal"
ip addresses of nodes.

(cherry picked from commit 4778770814)
2022-01-16 19:10:57 +02:00
Takuya ASADA
ad632cf7fc dist: fix scylla-housekeeping uuid file chmod call
Should use chmod() on a file, not fchmod()

Fixes #9683

Closes #9802

(cherry picked from commit 7064ae3d90)
2022-01-10 16:57:34 +02:00
Botond Dénes
ca24bebcf2 sstables/partition_index_cache: destroy entry ptr on error
The error-handling code removes the cache entry but this leads to an
assertion because the entry is still referenced by the entry pointer
instance which is returned on the normal path. To avoid this clear the
pointer on the error path and make sure there are no additional
references kept to it.

Fixes #9887

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220105140859.586234-2-bdenes@scylladb.com>
(cherry picked from commit 92727ac36c)
2022-01-07 21:21:44 +01:00
Calle Wilund
7dc5abb6f8 commitlog: Don't allow error_handler to swallow exception
Fixes #9798

If an exception in allocate_segment_ex is (sub)type of std::system_error,
commit_error_handler might _not_ cause throw (doh), in which case the error
handling code would forget the current exception and return an unusable
segment.

Now only used as an exception pointer replacer.

Closes #9870

(cherry picked from commit 3c02cab2f7)
2022-01-06 14:10:18 +02:00
Yaron Kaikov
e8a1cfb6f8 release: prepare for 4.6.rc2 2022-01-02 09:15:47 +02:00
Tomasz Grabiec
fc312b3021 lsa: Fix segment leak on memory reclamation during alloc_buf
alloc_buf() calls new_buf_active() when there is no active segment to
allocate a new active segment. new_buf_active() allocates memory
(e.g. a new segment) so may cause memory reclamation, which may cause
segment compaction, which may call alloc_buf() and re-enter
new_buf_active(). The first call to new_buf_active() would then
override _buf_active and cause the segment allocated during segment
compaction to be leaked.

This then causes abort when objects from the leaked segment are freed
because the segment is expected to be present in _closed_segments, but
isn't. boost::intrusive::list::erase() will fail on assertion that the
object being erased is linked.

Introduced in b5ca0eb2a2.

Fixes #9821
Fixes #9192
Fixes #9825
Fixes #9544
Fixes #9508
Refs #9573

Message-Id: <20211229201443.119812-1-tgrabiec@scylladb.com>
(cherry picked from commit 7038dc7003)
2021-12-30 18:56:28 +02:00
Nadav Har'El
7b82aaf939 alternator: fix error on UpdateTable for non-existent table
When the UpdateTable operation is called for a non-existent table, the
appropriate error is ResourceNotFoundException, but before this patch
we ran into an exception, which resulted in an ugly "internal server
error".

In this patch we use the existing get_table() function which most other
operations use, and which does all the appropriate verifications and
generates the appropriate Alternator api_error instead of letting
internal Scylla exceptions escape to the user.
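
The idea of routing every lookup through one verifying helper can be sketched
like this. It is only an illustration: `toy_api_error`, `toy_database` and
`update_table_error_type` are hypothetical names, the database is a plain
std::map, and the error text is made up. Only the error type
ResourceNotFoundException comes from the message above.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical API-level error carrying the error type reported to the user.
struct toy_api_error : std::runtime_error {
    std::string type;
    toy_api_error(std::string t, const std::string& msg)
        : std::runtime_error(msg), type(std::move(t)) {}
};

using toy_schema = std::string;
using toy_database = std::map<std::string, toy_schema>;

// Single lookup helper: a missing table becomes an API-level error
// instead of an internal exception escaping to the user.
const toy_schema& get_table(const toy_database& db, const std::string& name) {
    auto it = db.find(name);
    if (it == db.end()) {
        throw toy_api_error("ResourceNotFoundException", "table not found: " + name);
    }
    return it->second;
}

// Returns the error type an UpdateTable-like operation would report,
// or an empty string when the table exists.
std::string update_table_error_type(const toy_database& db, const std::string& name) {
    try {
        get_table(db, name);
        return "";
    } catch (const toy_api_error& e) {
        return e.type;
    }
}
```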

This patch also includes a test for UpdateTable on a non-existent table,
which used to fail before this patch and pass afterwards. We also add a
test for DeleteTable in the same scenario, and see it didn't have this
bug. As usual, both tests pass on DynamoDB, which confirms we generate
the right error codes.

Fixes #9747.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211206181605.1182431-1-nyh@scylladb.com>
(cherry picked from commit 31eeb44d28)
2021-12-29 22:59:25 +02:00
Nadav Har'El
894a4abfae commitlog: fix missing wait for semaphore units
Commit dcc73c5d4e introduced a semaphore
for excluding concurrent recalculations - _reserve_recalculation_guard.

Unfortunately, the two places in the code which tried to take this
guard just called get_units() - which returns a future<units>, not
units - and never waited for this future to become available.

So this patch adds the missing "co_await" needed to wait for the
units to become available.

Fixes #9770.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211214122612.1462436-1-nyh@scylladb.com>
(cherry picked from commit b8786b96f4)
2021-12-29 13:18:59 +02:00
Takuya ASADA
4dcf023470 scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8
On CentOS8, mdmonitor.service does not work correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pin the package version to an older one
which does not cause the issue (4.1-14.el8.x86_64).

Fixes #9540

Closes #9782

(cherry picked from commit 0d8f932f0b)
2021-12-28 11:38:04 +02:00
Benny Halevy
283788828e compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested
Currently when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it - causing a crash due
to internal error check, see #9766.

Throwing a compaction_stopped_exception rather than co_return:ing
an exception means it will be handled like any other exception, including
closing the reader.

Fixes #9766

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
(cherry picked from commit c89876c975)
2021-12-15 15:03:59 +02:00
Pavel Emelyanov
730a147ba6 row-cache: Handle exception (un)safety of rows_entry insertion
The B-tree's insert_before() is a throwing operation, so its caller
must account for that. When the rows_entry's collection was
switched to B-tree all the risky places were fixed by ee9e1045,
but a few places went under the radar.

In the cache_flat_mutation_reader there's a place where a C-pointer
is inserted into the tree, thus potentially leaking the entry.

In the partition_snapshot_row_cursor there are two places that not
only leak the entry, but also leave it in the LRU list. The latter
is quite nasty, because those entries can be evicted: the eviction code
tries to get rows_entry iterator from "this", but the hook happens
to be unattached (because insertion threw) and fails the assert.
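
The leak half of the problem can be sketched as below. This is a toy model
under stated assumptions, not the Scylla B-tree: `toy_tree`, `entry`,
`live_entries` and `try_insert` are hypothetical, and the failure is simulated
with a flag. The point is that when the insertion takes a smart pointer, a
throwing insert frees the entry instead of leaking it.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>
#include <vector>

int live_entries = 0; // counts entries that are currently allocated

struct entry {
    entry() { ++live_entries; }
    ~entry() { --live_entries; }
    entry(const entry&) = delete;
    entry& operator=(const entry&) = delete;
};

// Hypothetical container whose insertion may throw.
struct toy_tree {
    std::vector<std::unique_ptr<entry>> nodes;
    bool fail_next_insert = false;

    // Takes ownership through a smart pointer. If this throws, the
    // parameter still owns the entry and destroys it during unwinding.
    entry& insert(std::unique_ptr<entry> e) {
        if (fail_next_insert) {
            fail_next_insert = false;
            throw std::bad_alloc(); // simulated allocation failure
        }
        nodes.push_back(std::move(e));
        return *nodes.back();
    }
};

// Returns true if the insertion succeeded, false if it threw.
bool try_insert(toy_tree& t) {
    try {
        t.insert(std::make_unique<entry>());
        return true;
    } catch (const std::bad_alloc&) {
        return false;
    }
}
```

Had `insert` taken a raw `entry*`, the failed insertion would have left the
entry allocated with no owner.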

fixes: #9728

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ee103636ac)
2021-12-14 15:53:42 +02:00
Pavel Emelyanov
9897e83029 partition_snapshot_row_cursor: Shuffle ensure_result creation
Both places get the C-pointer on the freshly allocated rows_entry,
insert it where needed and return back the dereferenced pointer.

The C-pointer is going to become a smart pointer that would go out
of scope before return. This change prepares for that by constructing
the ensure_result from the iterator, that's returned from insertion
of the entry.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 9fd8db318d)

Ref #9728
2021-12-14 15:52:37 +02:00
Asias He
1a9b64e6f6 storage_service: Wait for seastar::get_units in node_ops
The seastar::get_units returns a future, we have to wait for it.

Fixes #9767

Closes #9768

(cherry picked from commit 9859c76de1)
2021-12-12 18:42:20 +02:00
Takuya ASADA
49fe9e2c8e dist: allow running scylla-housekeeping with strict umask setting
To avoid failing scylla-housekeeping in strict umask environment,
we need to chmod a+r on repository file and housekeeping.uuid.

Fixes #9683

Closes #9739

(cherry picked from commit ea20f89c56)
2021-12-12 14:25:57 +02:00
Takuya ASADA
d0580c41ee dist: add support im4gn/is4gen instance on AWS
Add support for next-generation, storage-optimized ARM64 instance types.

Fixes #9711

Closes #9730

(cherry picked from commit 097a6ee245)
2021-12-08 14:29:44 +02:00
Beni Peled
542394c82f release: prepare for 4.6.rc1 2021-12-08 11:08:45 +02:00
Avi Kivity
018ad3f6f4 test: refine test suite names exposed via xunit format
The test suite names seen by Jenkins are suboptimal: there is
no distinction between modes, and the ".cc" suffix of file names
is interpreted as a class name, which is converted to a tree node
that must be clicked to expand. Massage the names to remove
unnecessary information and add the mode.

Closes #9696

(cherry picked from commit ef3edcf848)

Fixes #9738.
2021-12-05 19:58:22 +02:00
Avi Kivity
9b8b7efb54 tests: consolidate boost xunit result files
The recent parallelization of boost unit tests caused an increase
in xml result files. This is challenging to Jenkins, since it
appears to use rpc-over-ssh to read the result files, and as a result
it takes more than an hour to read all result files when the Jenkins
main node is not on the same continent as the agent.

To fix this, merge the result files in test.py and leave one result
file per mode. Later we can leave one result file overall (integrating
the mode into the testsuite name), but that can wait.

Tested on a local Jenkins instance (just reading the result files,
not the entire build).

Closes #9668

(cherry picked from commit b23af15432)

Fixes #9738
2021-12-05 19:57:39 +02:00
Botond Dénes
1c3e63975f Merge 'Backport of #9348 (Exceptions in commitlog::segment_manager::delete_segments could cause footprint counters to lose track)' from Calle Wilund
Backport of series to 4.6
Upstream merge commit: e2c27ee743.
Refs #9348

Closes #9702

* github.com:scylladb/scylla:
  commitlog: Recalculate footprint on delete_segment exceptions
  commitlog_test: Add test for exception in alloc w. deleted underlying file
  commitlog: Ensure failed-to-create-segment is re-deleted
  commitlog::allocate_segment_ex: Don't re-throw out of function
2021-12-02 09:22:19 +02:00
Calle Wilund
11bb03e46d commitlog: Recalculate footprint on delete_segment exceptions
Fixes #9348

If we get exceptions in delete_segments, we can, and probably will, lose
track of footprint counters. We need to recompute the used disk footprint,
otherwise we will flush too often, and even block indefinitely on new_seg
iff using hard limits.
2021-11-29 14:56:48 +00:00
Calle Wilund
810e410c5d commitlog_test: Add test for exception in alloc w. deleted underlying file
Tests that we can handle exception-in-alloc cleanup if the file actually
does not exist. This however uncovers another weakness (addressed in next
patch) - that we can lose track of disk footprint here, and w. hard limits
end up waiting for disk space that never comes. Thus test does not use hard
limit.
2021-11-29 14:56:43 +00:00
Calle Wilund
97f6da0c3e commitlog: Ensure failed-to-create-segment is re-deleted
Fixes #9343

If we fail in allocate_segment_ex, we should push the file opened/created
to the delete set to ensure we reclaim the disk space. We should also
ensure that if we did not recycle a file in delete_segments, we still
wake up any recycle waiters iff we made a file delete instead.

Included a small unit test.
2021-11-29 14:51:39 +00:00
Calle Wilund
c229fe9694 commitlog::allocate_segment_ex: Don't re-throw out of function
Fixes #9342

commitlog_error_handler rethrows. But we want to not. And run post-handler
cleanup (co_await)
2021-11-29 14:51:39 +00:00
Tomasz Grabiec
ee1ca8ae4d lsa: Add sanity checks around lsa_buffer operations
We've been observing hard to explain crashes recently around
lsa_buffer destruction, where the containing segment is absent in
_segment_descs which causes log_heap::adjust_up to abort. Add more
checks to catch certain impossible scenarios which can lead to this
sooner.

Refs #9192.
Message-Id: <20211116122346.814437-1-tgrabiec@scylladb.com>

(cherry picked from commit bf6898a5a0)
2021-11-24 15:17:37 +01:00
Tomasz Grabiec
6bfd322e3b lsa: Mark compact_segment_locked() as noexcept
We cannot recover from a failure in this method. The implementation
makes sure it never happens. Invariants will be broken if this
throws. Detect violations early by marking as noexcept.

We could make it exception safe and try to leave the data structures
in a consistent state but the reclaimer cannot make progress if this throws, so
it's pointless.

Refs #9192
Message-Id: <20211116122019.813418-1-tgrabiec@scylladb.com>

(cherry picked from commit 4d627affc3)
2021-11-24 15:17:35 +01:00
Tomasz Grabiec
afc18d5070 cql: Fix missing data in indexed queries with base table short reads
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.

Fix by restoring the "remaining" count when adjusting the paging state
on short read.

Tests:

  - index_with_paging_test
  - secondary_index_test

Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>

(cherry picked from commit 1e4da2dcce)
2021-11-23 11:22:00 +02:00
Tomasz Grabiec
2ec22c2404 sstables: partition_index_cache: Avoid abort due to benign bad_alloc inside allocating section
shared_promise::get_shared_future() is marked noexcept, but can
allocate memory. It is invoked by sstable partition index cache inside
an allocating section, which means that allocations can throw
bad_alloc even though there is memory to reclaim, i.e. under normal
conditions.

Fix by allocating the shared_promise in stable memory, in the
standard allocator via lw_shared_ptr<>, so that it can be accessed outside
allocating section.

Fixes #9666

Tests:

  - build/dev/test/boost/sstable_partition_index_cache_test

Message-Id: <20211122165100.1606854-1-tgrabiec@scylladb.com>
(cherry picked from commit 1d84bc6c3b)
2021-11-23 11:21:27 +02:00
Avi Kivity
19da778271 Merge "Run gossiper message handlers in a gate" from Pavel E
"
When gossiper processes its messages in the background some of
the continuations may pop up after the gossiper is shut down.
This, in turn, may cause code to be executed at a point where
it is not expected to run.

In particular, storage_service notification hooks may try to
update system keyspace (with "fresh" peer info/state/tokens/etc).
This update doesn't work after drain because drain shuts down
commitlog. The intention was that gossiper did _not_ notify
anyone after drain, because it's shut down during drain too.
But since there are background continuations left, it's not
working as expected.

refs: #9567
tests: unit(dev), dtest.concurrent_schema_changes.snapshot(dev)
"

* 'br-gossiper-background-messages-2' of https://github.com/xemul/scylla:
  gossiper: Guard background processing with gate
  gossiper: Helper for background messaging processing

(cherry picked from commit 9e2b6176a2)
2021-11-19 07:25:26 +02:00
Avi Kivity
cbd4c13ba6 Merge 'Revert "scylla_util.py: return bool value on systemd_unit.is_active()"' from Takuya ASADA
In scylla_util.py, we provide `systemd_unit.is_active()` to return the `systemctl is-active` output.
When we introduced the systemd_unit class, we just returned the `systemctl is-active` output as a string, but we later changed the return value to bool (2545d7fd43).
This was to avoid a scripting bug: `if unit.is_active():` always evaluates to True, even when it returns "failed" or "inactive".
However, this was probably a mistake,
because a systemd unit does not have just two states like "start" / "stop"; there are many states.

And we already use multiple unit states ("activating", "failed", "inactive", "active") in our Cloud image login prompt:
https://github.com/scylladb/scylla-machine-image/blob/next/common/scylla_login#L135
After we merged 2545d7fd43, the login prompt broke, because is_active() does not return a string as the script expects (https://github.com/scylladb/scylla-machine-image/issues/241).

I think we should revert 2545d7fd43, so that it returns exactly the same value as `systemctl is-active` prints.

Fixes #9627
Fixes scylladb/scylla-machine-image#241

Closes #9628

* github.com:scylladb/scylla:
  scylla_ntp_setup: use string in systemd_unit.is_active()
  Revert "scylla_util.py: return bool value on systemd_unit.is_active()"

(cherry picked from commit c17101604f)
2021-11-18 11:44:11 +02:00
Pavel Emelyanov
338871802d generic_server: Keep server alive during conn background processing
There's at least one tiny race in generic_server code. The trailing
.handle_exception after the conn->process() captures this, but since the
whole continuation chain happens in the background, that this can be
released thus causing the whole lambda to execute on freed generic_server
instance. This, in turn, is not nice because captured this is used to get
a _logger from.

The fix is based on the observation that all connections pin the server
in memory until all of them (connections) are destructed. That said, to
keep the server alive in the aforementioned lambda it's enough to make
sure the conn variable (it's lw_shared_ptr on the connection) is alive in
it. Not to generate a bunch of tiny continuations with identical set of
captures -- tail the single .then_wrapped() one and do whatever is needed
to wrap up the connection processing in it.

tests: unit(dev)
fixes: #9316

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211115105818.11348-1-xemul@scylladb.com>
(cherry picked from commit ba16318457)
2021-11-17 10:21:11 +02:00
Yaron Kaikov
8b5b1b8af6 dist/docker/debian/build_docker.sh: debian version fix for rc releases
When building a docker image we rely on the `VERSION` value from
`SCYLLA-VERSION-GEN`. For `rc` releases only, there is a difference
between the configured version (X.X.rcX) and the actual Debian package
we generate (X.X~rcX).

Using a similar solution as I did in dcb10374a5
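
The version mapping described above (X.X.rcX to X.X~rcX) can be illustrated
with a small sketch. The actual change is in a shell script and is not shown
here; the function name `debian_version` and the choice of C++ are
assumptions made purely for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps a configured version such as "4.6.rc3" to the
// Debian package form "4.6~rc3". Versions without ".rc" are returned as-is.
std::string debian_version(std::string version) {
    const std::string marker = ".rc";
    const auto pos = version.find(marker);
    if (pos != std::string::npos) {
        version.replace(pos, 1, "~"); // replace only the '.' before "rc"
    }
    return version;
}
```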

Fixes: #9616

Closes #9617

(cherry picked from commit 060a91431d)
2021-11-12 20:07:19 +02:00
Takuya ASADA
ea89eff95d dist/docker: fix bashrc filename for Ubuntu
For Debian variants, correct filename is /etc/bash.bashrc.

Fixes #9588

Closes #9589

(cherry picked from commit 201a97e4a4)
2021-11-10 14:25:27 +02:00
Michał Radwański
96421e7779 memtable: fix gcc function argument evaluation order induced use after move
clang evaluates function arguments from left to right, while gcc does so
in reverse. Therefore, this code can be correct on clang and incorrect
on gcc:
```
f(x.sth(), std::move(x))
```

This patch fixes one such instance of this bug, in memtable.cc.
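
A minimal sketch of the safe pattern, assuming a callee that takes its second
argument by value (so the move happens while the arguments are initialized).
The names `matches` and `safe_call` are hypothetical and unrelated to the
code in memtable.cc; the C++ standard leaves the evaluation order of function
arguments unspecified, so the value needed from `x` is read into a local
before the call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical callee: takes ownership of the string by value.
bool matches(std::size_t expected_size, std::string s) {
    return expected_size == s.size();
}

// Reads everything needed from x before the call expression, so the
// result does not depend on the order in which arguments are evaluated.
bool safe_call(std::string x) {
    const std::size_t size = x.size(); // sequenced before the call below
    return matches(size, std::move(x));
}
```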

Fixes #9605.

Closes #9606

(cherry picked from commit eff392073c)
2021-11-10 08:58:09 +02:00
Botond Dénes
142336ca53 mutation_writer/feed_writer: don't drop readers with small amount of content
Due to an error in transforming the above routine, readers who have <= a
buffer worth of content are dropped without consuming them.
This is due to the outer consume loop being conditioned on
`is_end_of_stream()`, which will be set for readers that eagerly
pre-fill their buffer and also have no more data than what is in their
buffer.
Change the condition to also check for `is_buffer_empty()` and only drop
the reader if both of these are true.
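
The corrected loop condition can be sketched with a toy reader. This is an
assumption-laden illustration, not Scylla's reader or feed_writer:
`toy_reader` and `drain` are hypothetical, and only the method names
is_end_of_stream() and is_buffer_empty() come from the message above.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical reader with an internal buffer in front of its data source.
struct toy_reader {
    std::deque<int> source;          // data not yet buffered
    std::deque<int> buffer;          // data already buffered
    std::size_t buffer_capacity = 4;

    bool is_end_of_stream() const { return source.empty(); }
    bool is_buffer_empty() const { return buffer.empty(); }

    void fill_buffer() {
        while (buffer.size() < buffer_capacity && !source.empty()) {
            buffer.push_back(source.front());
            source.pop_front();
        }
    }

    int pop() {
        int v = buffer.front();
        buffer.pop_front();
        return v;
    }
};

// Consumes the reader. The reader is only considered done when the stream
// has ended AND the buffer has been drained.
std::vector<int> drain(toy_reader& r) {
    std::vector<int> out;
    while (!(r.is_end_of_stream() && r.is_buffer_empty())) {
        if (r.is_buffer_empty()) {
            r.fill_buffer();
        }
        while (!r.is_buffer_empty()) {
            out.push_back(r.pop());
        }
    }
    return out;
}
```

A loop conditioned on `!r.is_end_of_stream()` alone would return nothing for a
reader whose whole content was pre-filled into the buffer.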

Fixes: #9594

Tests: unit(mutation_writer_test --repeat=200, dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211108092923.104504-1-bdenes@scylladb.com>
(cherry picked from commit 4b6c0fe592)
2021-11-09 14:13:21 +02:00
Calle Wilund
492f12248c commitlog: Add explicit track var for "wasted space" to avoid double counting
Refs #9331

In segment::close() we add space to the manager's "wasted" counter. In destructor,
if we can cleanly delete/recycle the file we remove it. However, if we never
went through close (shutdown - ok, exception in batch_cycle - not ok), we can
end up subtracting numbers that were never added in the first place.
Just keep track of the bytes added in a var.

Observed behaviour in above issue is timeouts in batch_cycle, where we
declare the segment closed early (because we cannot add anything more safely
- chunks could get partial/misplaced). Exception will propagate to caller(s),
but the segment will not go through actual close() call -> destructor should
not assume such.
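
The bookkeeping idea can be sketched as follows. This is a toy model, not the
commitlog code: `toy_manager`, `toy_segment` and the field names are
hypothetical. The segment remembers how many bytes it actually added to the
shared counter, and the destructor subtracts only that amount.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical manager holding the shared "wasted space" counter.
struct toy_manager {
    std::uint64_t wasted = 0;
};

class toy_segment {
    toy_manager& _mgr;
    std::uint64_t _capacity;
    std::uint64_t _used;
    std::uint64_t _waste_added = 0; // bytes this segment added to _mgr.wasted
    bool _closed = false;
public:
    toy_segment(toy_manager& mgr, std::uint64_t capacity, std::uint64_t used)
        : _mgr(mgr), _capacity(capacity), _used(used) {}

    // Closing accounts the unused tail of the segment as wasted space.
    void close() {
        if (!_closed) {
            _closed = true;
            _waste_added = _capacity - _used;
            _mgr.wasted += _waste_added;
        }
    }

    // Subtract only what was really added. A segment that never went
    // through close() subtracts zero.
    ~toy_segment() {
        _mgr.wasted -= _waste_added;
    }
};
```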

Closes #9598

(cherry picked from commit 3929b7da1f)
2021-11-09 14:07:04 +02:00
Yaron Kaikov
7eb7a0e5fe release: prepare for 4.6.rc0 2021-11-08 09:18:26 +02:00
115 changed files with 2234 additions and 507 deletions

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -60,7 +60,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=4.6.dev
VERSION=4.6.7
if test -f version
then

View File

@@ -1017,18 +1017,16 @@ future<executor::request_return_type> executor::update_table(client_state& clien
_stats.api_operations.update_table++;
elogger.trace("Updating table {}", request);
std::string table_name = get_table_name(request);
if (table_name.find(INTERNAL_TABLE_PREFIX) == 0) {
schema_ptr tab = get_table(_proxy, request);
// the ugly but harmless conversion to string_view here is because
// Seastar's sstring is missing a find(std::string_view) :-()
if (std::string_view(tab->cf_name()).find(INTERNAL_TABLE_PREFIX) == 0) {
return make_ready_future<request_return_type>(api_error::validation(
format("Prefix {} is reserved for accessing internal tables", INTERNAL_TABLE_PREFIX)));
}
std::string keyspace_name = executor::KEYSPACE_NAME_PREFIX + table_name;
tracing::add_table_name(trace_state, keyspace_name, table_name);
tracing::add_table_name(trace_state, tab->ks_name(), tab->cf_name());
auto& db = _proxy.get_db().local();
auto& cf = db.find_column_family(keyspace_name, table_name);
schema_builder builder(cf.schema());
schema_builder builder(tab);
rjson::value* stream_specification = rjson::find(request, "StreamSpecification");
if (stream_specification && stream_specification->IsObject()) {
@@ -2080,6 +2078,9 @@ static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unorder
for (auto it = attributes_to_get.Begin(); it != attributes_to_get.End(); ++it) {
attribute_path_map_add("AttributesToGet", ret, it->GetString());
}
if (ret.empty()) {
throw api_error::validation("Empty AttributesToGet is not allowed. Consider using Select=COUNT instead.");
}
return ret;
} else if (has_projection_expression) {
const rjson::value& projection_expression = req["ProjectionExpression"];
@@ -2481,8 +2482,8 @@ static bool hierarchy_actions(
// attr member so we can use add()
rjson::add_with_string_name(v, attr, std::move(*newv));
} else {
throw api_error::validation(format("Can't remove document path {} - not present in item",
subh.get_value()._path));
// Removing a.b when a is a map but a.b doesn't exist
// is silently ignored. It's not considered an error.
}
} else {
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));

View File

@@ -94,10 +94,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.update_time_to_live++;
if (!_proxy.get_db().local().features().cluster_supports_alternator_ttl()) {
co_return api_error::unknown_operation("DescribeTimeToLive not yet supported. Experimental support is available if the 'alternator_ttl' experimental feature is enabled on all nodes.");
}
_stats.api_operations.describe_time_to_live++;
schema_ptr schema = get_table(_proxy, request);
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
rjson::value desc = rjson::empty_object();

View File

@@ -79,6 +79,49 @@ atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
set_view(_data);
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
std::strong_ordering
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
if (left.timestamp() != right.timestamp()) {
return left.timestamp() <=> right.timestamp();
}
if (left.is_live() != right.is_live()) {
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value()) <=> 0;
if (c != 0) {
return c;
}
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
if (left.is_live_and_has_ttl()) {
if (left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
return right.ttl() <=> left.ttl();
}
}
} else {
// Both are deleted
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
<=> (uint64_t) right.deletion_time().time_since_epoch().count();
}
return std::strong_ordering::equal;
}
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (_data.empty()) {
return atomic_cell_or_collection();

View File

@@ -593,8 +593,8 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
clogger.trace("csm {}: insert dummy at {}", fmt::ptr(this), _lower_bound);
auto it = with_allocator(_lsa_manager.region().allocator(), [&] {
auto& rows = _snp->version()->partition().mutable_clustered_rows();
auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no));
return rows.insert_before(_next_row.get_iterator_in_latest_version(), std::move(new_entry));
});
_snp->tracker()->insert(*it);
_last_row = partition_snapshot_row_weakref(*_snp, it, true);

View File

@@ -765,8 +765,12 @@ future<> generation_service::check_and_repair_cdc_streams() {
std::optional<cdc::generation_id> latest = _gen_id;
const auto& endpoint_states = _gossiper.get_endpoint_states();
for (const auto& [addr, state] : endpoint_states) {
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(format("All nodes must be in NORMAL state while performing check_and_repair_cdc_streams"
if (_gossiper.is_left(addr)) {
cdc_log.info("check_and_repair_cdc_streams ignored node {} because it is in LEFT state", addr);
continue;
}
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(format("All nodes must be in NORMAL or LEFT state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
@@ -830,6 +834,11 @@ future<> generation_service::check_and_repair_cdc_streams() {
latest, db_clock::now());
should_regenerate = true;
} else {
if (tmptr->sorted_tokens().size() != gen->entries().size()) {
// We probably have garbage streams from old generations
cdc_log.info("Generation size does not match the token ring, regenerating");
should_regenerate = true;
} else {
std::unordered_set<dht::token> gen_ends;
for (const auto& entry : gen->entries()) {
gen_ends.insert(entry.token_range_end);
@@ -841,6 +850,7 @@ future<> generation_service::check_and_repair_cdc_streams() {
break;
}
}
}
}
}

View File

@@ -73,7 +73,7 @@ using namespace std::chrono_literals;
logging::logger cdc_log("cdc");
namespace cdc {
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {});
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {}, schema_ptr = nullptr);
}
static constexpr auto cdc_group_name = "cdc";
@@ -220,7 +220,7 @@ public:
return;
}
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt);
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);
auto log_mut = log_schema
? db::schema_tables::make_update_table_mutations(db, keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
@@ -503,7 +503,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
return to_bytes(cdc_deleted_elements_column_prefix) + column_name;
}
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid) {
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid, schema_ptr old) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner("com.scylladb.dht.CDCPartitioner");
b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
@@ -590,6 +590,20 @@ static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID>
b.set_uuid(*uuid);
}
/**
* #10473 - if we are redefining the log table, we need to ensure any dropped
* columns are registered in "dropped_columns" table, otherwise clients will not
* be able to read data older than now.
*/
if (old) {
// not super efficient, but we don't do this often.
for (auto& col : old->all_columns()) {
if (!b.has_column({col.name(), col.name_as_text() })) {
b.without_column(col.name_as_text(), col.type, api::new_timestamp());
}
}
}
return b.build();
}
@@ -1511,6 +1525,11 @@ public:
}
auto process_cell = [&, this] (const column_definition& cdef) {
// If table uses compact storage it may contain a column of type empty
// and we need to ignore such a field because it is not present in CDC log.
if (cdef.type->get_kind() == abstract_type::kind::empty) {
return;
}
if (auto current = get_col_from_row_state(row_state, cdef)) {
_builder->set_value(image_ck, cdef, *current);
} else if (op == operation::pre_image) {

View File

@@ -1634,7 +1634,7 @@ future<bool> scrub_validate_mode_validate_reader(flat_mutation_reader reader, co
while (auto mf_opt = co_await reader()) {
if (cdata.is_stop_requested()) [[unlikely]] {
// Compaction manager will catch this exception and re-schedule the compaction.
co_return coroutine::make_exception(compaction_stopped_exception(schema->ks_name(), schema->cf_name(), cdata.stop_requested));
throw compaction_stopped_exception(schema->ks_name(), schema->cf_name(), cdata.stop_requested);
}
const auto& mf = *mf_opt;

View File

@@ -326,6 +326,11 @@ future<> compaction_manager::run_custom_job(column_family* cf, sstables::compact
task->compaction_done = with_semaphore(_custom_job_sem, 1, [this, task, cf, &job = *job_ptr] () mutable {
// take read lock for cf, so major compaction and resharding can't proceed in parallel.
return with_lock(_compaction_locks[cf].for_read(), [this, task, cf, &job] () mutable {
// Allow caller to know that task (e.g. reshape) was asked to stop while waiting for a chance to run.
if (task->compaction_data.is_stop_requested()) {
throw sstables::compaction_stopped_exception(task->compacting_cf->schema()->ks_name(), task->compacting_cf->schema()->cf_name(),
task->compaction_data.stop_requested);
}
_stats.active_tasks++;
if (!can_proceed(task)) {
return make_ready_future<>();
@@ -676,6 +681,7 @@ void compaction_manager::submit_offstrategy(column_family* cf) {
_stats.active_tasks++;
task->setup_new_compaction();
return with_scheduling_group(_maintenance_sg.cpu, [this, task, cf] {
return cf->run_offstrategy_compaction(task->compaction_data).then_wrapped([this, task] (future<> f) mutable {
_stats.active_tasks--;
task->finish_compaction();
@@ -698,6 +704,7 @@ void compaction_manager::submit_offstrategy(column_family* cf) {
_tasks.remove(task);
return make_ready_future<stop_iteration>(stop_iteration::yes);
});
});
});
});
});
@@ -714,9 +721,20 @@ inline bool compaction_manager::check_for_cleanup(column_family* cf) {
future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compaction_type_options options, get_candidates_func get_func, can_purge_tombstones can_purge) {
auto task = make_lw_shared<compaction_manager::task>(cf, options.type());
_tasks.push_back(task);
auto sstables = std::make_unique<std::vector<sstables::shared_sstable>>(get_func(*cf));
std::unique_ptr<std::vector<sstables::shared_sstable>> sstables;
lw_shared_ptr<compacting_sstable_registration> compacting;
// since we might potentially have ongoing compactions, and we
// must ensure that all sstables created before we run are included
// in the re-write, we need to barrier out any previously running
// compaction.
auto get_and_register_candidates_func = [this, &sstables, &compacting, &get_func] () mutable -> future<> {
sstables = std::make_unique<std::vector<sstables::shared_sstable>>(co_await get_func());
compacting = make_lw_shared<compacting_sstable_registration>(this, *sstables);
};
co_await cf->run_with_compaction_disabled(std::ref(get_and_register_candidates_func));
// sort sstables by size in descending order, such that the smallest files will be rewritten first
// (as sstable to be rewritten is popped off from the back of container), so rewrite will have higher
// chance to succeed when the biggest files are reached.
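The ordering described in the comment above can be illustrated in isolation. This is a minimal sketch, not the compaction manager's code: `fake_sstable` and `rewrite_order` are hypothetical names, and only the sort-descending, pop-from-back idea is taken from the hunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an sstable: only its on-disk size matters here.
struct fake_sstable {
    uint64_t data_size;
};

// Sorts candidates largest-first, then drains them from the back of the
// container and returns the sizes in the order they would be rewritten.
std::vector<uint64_t> rewrite_order(std::vector<fake_sstable> candidates) {
    std::sort(candidates.begin(), candidates.end(),
              [](const fake_sstable& a, const fake_sstable& b) {
                  return a.data_size > b.data_size;
              });
    std::vector<uint64_t> order;
    while (!candidates.empty()) {
        // The back of a descending sequence is the smallest remaining file.
        order.push_back(candidates.back().data_size);
        candidates.pop_back();
    }
    return order;
}
```

Popping from the back of a vector is O(1), so sorting in descending order gives smallest-first processing without erasing from the front.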
@@ -724,10 +742,11 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
return a->data_size() > b->data_size();
});
auto compacting = make_lw_shared<compacting_sstable_registration>(this, *sstables);
auto sstables_ptr = sstables.get();
_stats.pending_tasks += sstables->size();
_tasks.push_back(task);
task->compaction_done = do_until([this, sstables_ptr, task] { return sstables_ptr->empty() || !can_proceed(task); },
[this, task, options, sstables_ptr, compacting, can_purge] () mutable {
auto sst = sstables_ptr->back();
@@ -737,8 +756,10 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
column_family& cf = *task->compacting_cf;
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
auto sstable_set_snapshot = can_purge ? std::make_optional(cf.get_sstable_set()) : std::nullopt;
auto descriptor = sstables::compaction_descriptor({ sst }, std::move(sstable_set_snapshot), _maintenance_sg.io,
// FIXME: this compaction should run with maintenance priority.
auto descriptor = sstables::compaction_descriptor({ sst }, std::move(sstable_set_snapshot), service::get_local_compaction_priority(),
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, options);
// Releases reference to cleaned sstable such that respective used disk space can be freed.
@@ -747,15 +768,14 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
};
return with_semaphore(_rewrite_sstables_sem, 1, [this, task, &cf, descriptor = std::move(descriptor), compacting] () mutable {
// Take write lock for cf to serialize cleanup/upgrade sstables/scrub with major compaction/reshape/reshard.
return with_lock(_compaction_locks[&cf].for_write(), [this, task, &cf, descriptor = std::move(descriptor), compacting] () mutable {
return with_lock(_compaction_locks[&cf].for_read(), [this, task, &cf, descriptor = std::move(descriptor), compacting] () mutable {
_stats.pending_tasks--;
_stats.active_tasks++;
task->setup_new_compaction();
task->output_run_identifier = descriptor.run_identifier;
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
return do_with(std::move(user_initiated), [this, &cf, descriptor = std::move(descriptor), task] (compaction_backlog_tracker& bt) mutable {
return with_scheduling_group(_maintenance_sg.cpu, [this, &cf, descriptor = std::move(descriptor), task]() mutable {
return with_scheduling_group(_compaction_controller.sg(), [this, &cf, descriptor = std::move(descriptor), task]() mutable {
return cf.compact_sstables(std::move(descriptor), task->compaction_data);
});
});
@@ -783,7 +803,7 @@ future<> compaction_manager::rewrite_sstables(column_family* cf, sstables::compa
_tasks.remove(task);
});
return task->compaction_done.get_future().then([task] {});
co_return co_await task->compaction_done.get_future();
}
future<> compaction_manager::perform_sstable_scrub_validate_mode(column_family* cf) {
@@ -865,31 +885,29 @@ future<> compaction_manager::perform_cleanup(database& db, column_family* cf) {
return make_exception_future<>(std::runtime_error(format("cleanup request failed: there is an ongoing cleanup on {}.{}",
cf->schema()->ks_name(), cf->schema()->cf_name())));
}
return seastar::async([this, cf, &db] {
// FIXME: indentation
auto sorted_owned_ranges = db.get_keyspace_local_ranges(cf->schema()->ks_name());
auto get_sstables = [this, &db, cf, sorted_owned_ranges] () -> future<std::vector<sstables::shared_sstable>> {
return seastar::async([this, &db, cf, sorted_owned_ranges = std::move(sorted_owned_ranges)] {
auto schema = cf->schema();
auto sorted_owned_ranges = db.get_keyspace_local_ranges(schema->ks_name());
auto sstables = std::vector<sstables::shared_sstable>{};
const auto candidates = get_candidates(*cf);
std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(sstables), [&sorted_owned_ranges, schema] (const sstables::shared_sstable& sst) {
seastar::thread::maybe_yield();
return sorted_owned_ranges.empty() || needs_cleanup(sst, sorted_owned_ranges, schema);
});
return std::tuple<dht::token_range_vector, std::vector<sstables::shared_sstable>>(sorted_owned_ranges, sstables);
}).then_unpack([this, cf, &db] (dht::token_range_vector owned_ranges, std::vector<sstables::shared_sstable> sstables) {
return rewrite_sstables(cf, sstables::compaction_type_options::make_cleanup(std::move(owned_ranges)),
[sstables = std::move(sstables)] (const table&) { return sstables; });
return sstables;
});
};
return rewrite_sstables(cf, sstables::compaction_type_options::make_cleanup(std::move(sorted_owned_ranges)), std::move(get_sstables));
}
// Submit a column family to be upgraded and wait for its termination.
future<> compaction_manager::perform_sstable_upgrade(database& db, column_family* cf, bool exclude_current_version) {
using shared_sstables = std::vector<sstables::shared_sstable>;
return do_with(shared_sstables{}, [this, &db, cf, exclude_current_version](shared_sstables& tables) {
// since we might potentially have ongoing compactions, and we
// must ensure that all sstables created before we run are included
// in the re-write, we need to barrier out any previously running
// compaction.
return cf->run_with_compaction_disabled([this, cf, &tables, exclude_current_version] {
auto get_sstables = [this, &db, cf, exclude_current_version] {
// FIXME: indentation
std::vector<sstables::shared_sstable> tables;
auto last_version = cf->get_sstables_manager().get_highest_supported_format();
for (auto& sst : get_candidates(*cf)) {
@@ -900,21 +918,17 @@ future<> compaction_manager::perform_sstable_upgrade(database& db, column_family
tables.emplace_back(sst);
}
}
return make_ready_future<>();
}).then([&db, cf] {
return db.get_keyspace_local_ranges(cf->schema()->ks_name());
}).then([this, &db, cf, &tables] (dht::token_range_vector owned_ranges) {
// doing a "cleanup" is about as compacting as we need
// to be, provided we get to decide the tables to process,
// and ignoring any existing operations.
// Note that we potentially could be doing multiple
// upgrades here in parallel, but that is really the user's
// problem.
return rewrite_sstables(cf, sstables::compaction_type_options::make_upgrade(std::move(owned_ranges)), [&](auto&) mutable {
return std::exchange(tables, {});
});
});
});
return make_ready_future<std::vector<sstables::shared_sstable>>(tables);
};
// doing a "cleanup" is about as compacting as we need
// to be, provided we get to decide the tables to process,
// and ignoring any existing operations.
// Note that we potentially could be doing multiple
// upgrades here in parallel, but that is really the user's
// problem.
return rewrite_sstables(cf, sstables::compaction_type_options::make_upgrade(db.get_keyspace_local_ranges(cf->schema()->ks_name())), std::move(get_sstables));
}
// Submit a column family to be scrubbed and wait for its termination.
@@ -922,14 +936,10 @@ future<> compaction_manager::perform_sstable_scrub(column_family* cf, sstables::
if (scrub_mode == sstables::compaction_type_options::scrub::mode::validate) {
return perform_sstable_scrub_validate_mode(cf);
}
// since we might potentially have ongoing compactions, and we
// must ensure that all sstables created before we run are scrubbed,
// we need to barrier out any previously running compaction.
return cf->run_with_compaction_disabled([this, cf, scrub_mode] {
return rewrite_sstables(cf, sstables::compaction_type_options::make_scrub(scrub_mode), [this] (const table& cf) {
return get_candidates(cf);
// FIXME: indentation
return rewrite_sstables(cf, sstables::compaction_type_options::make_scrub(scrub_mode), [this, cf] {
return make_ready_future<std::vector<sstables::shared_sstable>>(get_candidates(*cf));
}, can_purge_tombstones::no);
});
}
future<> compaction_manager::remove(column_family* cf) {
@@ -979,7 +989,7 @@ void compaction_manager::stop_compaction(sstring type) {
}
// FIXME: switch to task_stop(), and wait for their termination, so API user can know when compactions actually stopped.
for (auto& task : _tasks) {
if (task->compaction_running && target_type == task->type) {
if (target_type == task->type) {
task->compaction_data.stop("user request");
}
}


@@ -178,7 +178,7 @@ private:
maintenance_scheduling_group _maintenance_sg;
size_t _available_memory;
using get_candidates_func = std::function<std::vector<sstables::shared_sstable>(const column_family&)>;
using get_candidates_func = std::function<future<std::vector<sstables::shared_sstable>>()>;
class can_purge_tombstones_tag;
using can_purge_tombstones = bool_class<can_purge_tombstones_tag>;


@@ -80,7 +80,11 @@ compaction_descriptor leveled_compaction_strategy::get_major_compaction_job(colu
}
void leveled_compaction_strategy::notify_completion(const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added) {
if (removed.empty() || added.empty()) {
// All the update here is only relevant for regular compaction's round-robin picking policy, and if
// last_compacted_keys wasn't generated by regular, it means regular is disabled since last restart,
// therefore we can skip the updates here until regular runs for the first time. Once it runs,
// it will be able to generate last_compacted_keys correctly by looking at metadata of files.
if (removed.empty() || added.empty() || !_last_compacted_keys) {
return;
}
auto min_level = std::numeric_limits<uint32_t>::max();


@@ -225,6 +225,7 @@ time_window_compaction_strategy::get_sstables_for_compaction(column_family& cf,
auto gc_before = gc_clock::now() - cf.schema()->gc_grace_seconds();
if (candidates.empty()) {
_estimated_remaining_tasks = 0;
return compaction_descriptor();
}


@@ -109,9 +109,7 @@ public:
virtual seastar::future<seastar::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor& qp, service::query_state& state, const query_options& options) const = 0;
virtual bool depends_on_keyspace(const seastar::sstring& ks_name) const = 0;
virtual bool depends_on_column_family(const seastar::sstring& cf_name) const = 0;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const = 0;
virtual seastar::shared_ptr<const metadata> get_result_metadata() const = 0;


@@ -117,10 +117,44 @@ managed_bytes_opt get_value(const column_value& col, const column_value_eval_bag
if (!col_type->is_map()) {
throw exceptions::invalid_request_exception(format("subscripting non-map column {}", cdef->name_as_text()));
}
const auto deserialized = cdef->type->deserialize(managed_bytes_view(*data.other_columns[data.sel.index_of(*cdef)]));
int32_t index = data.sel.index_of(*cdef);
if (index == -1) {
throw std::runtime_error(
format("Column definition {} does not match any column in the query selection",
cdef->name_as_text()));
}
const managed_bytes_opt& serialized = data.other_columns[index];
if (!serialized) {
// For null[i] we return null.
return std::nullopt;
}
const auto deserialized = cdef->type->deserialize(managed_bytes_view(*serialized));
const auto& data_map = value_cast<map_type_impl::native_type>(deserialized);
const auto key = evaluate_to_raw_view(col.sub, options);
auto&& key_type = col_type->name_comparator();
if (key.is_null()) {
// For m[null] return null.
// This is different from Cassandra - which treats m[null]
// as an invalid request error. But m[null] -> null is more
// consistent with our usual null treatment (e.g., both
// null[2] and null < 2 return null). It will also allow us
// to support non-constant subscripts (e.g., m[a]) where "a"
// may be null in some rows and non-null in others, and it's
// not an error.
return std::nullopt;
}
if (key.is_unset_value()) {
// An m[?] with ? bound to UNSET_VALUE is an invalid query.
// We could have detected it earlier while binding, but since
// we currently don't, we must protect the following code
// which can't work with an UNSET_VALUE. Note that the
// placement of this check here means that in an empty table,
// where we never need to evaluate the filter expression, this
// error will not be detected.
throw exceptions::invalid_request_exception(
format("Unsupported unset map key for column {}",
cdef->name_as_text()));
}
const auto found = key.with_linearized([&] (bytes_view key_bv) {
using entry = std::pair<data_value, data_value>;
return std::find_if(data_map.cbegin(), data_map.cend(), [&] (const entry& element) {
@@ -135,8 +169,16 @@ managed_bytes_opt get_value(const column_value& col, const column_value_eval_bag
case column_kind::clustering_key:
return managed_bytes(data.clustering_key[cdef->id]);
case column_kind::static_column:
case column_kind::regular_column:
return managed_bytes_opt(data.other_columns[data.sel.index_of(*cdef)]);
[[fallthrough]];
case column_kind::regular_column: {
int32_t index = data.sel.index_of(*cdef);
if (index == -1) {
throw std::runtime_error(
format("Column definition {} does not match any column in the query selection",
cdef->name_as_text()));
}
return managed_bytes_opt(data.other_columns[index]);
}
default:
throw exceptions::unsupported_operation_exception("Unknown column kind");
}
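The null-handling rules spelled out in the comments of the first hunk (`null[i]` yields null, `m[null]` yields null, an unset key is rejected) can be modelled outside the CQL evaluator. The sketch below is an illustration only: `key_state`, `subscript_key`, `map_column` and `subscript` are hypothetical names, and returning null for a missing key is an assumption, since that part of the function is not shown in the diff.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical model: a subscript key is a concrete value, null, or unset.
enum class key_state { value, null, unset };

struct subscript_key {
    key_state state;
    std::string value; // only meaningful when state == key_state::value
};

// Hypothetical model of a nullable map column.
using map_column = std::optional<std::map<std::string, std::string>>;

std::optional<std::string> subscript(const map_column& m, const subscript_key& k) {
    if (!m) {
        return std::nullopt; // null[i] -> null
    }
    if (k.state == key_state::null) {
        return std::nullopt; // m[null] -> null
    }
    if (k.state == key_state::unset) {
        // Mirrors the invalid-request error raised for an unset key.
        throw std::invalid_argument("unsupported unset map key");
    }
    auto it = m->find(k.value);
    if (it == m->end()) {
        return std::nullopt; // assumption: a missing key also yields null
    }
    return it->second;
}
```

The checks run in the same order as in the hunk: the column is tested for null before the key is inspected, so a null column never reaches the unset-key error.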


@@ -970,7 +970,7 @@ bool query_processor::migration_subscriber::should_invalidate(
sstring ks_name,
std::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement) {
return statement->depends_on_keyspace(ks_name) && (!cf_name || statement->depends_on_column_family(*cf_name));
return statement->depends_on(ks_name, cf_name);
}
future<> query_processor::query_internal(
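The hunk above collapses the two-call check into a single `depends_on(ks_name, cf_name)`. A standalone sketch of that predicate follows. `table_statement` and `batch_depends_on` are hypothetical stand-ins, not the cql3 classes. The batch helper uses `std::any_of` where the diff uses `boost::algorithm::any_of`.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a prepared statement bound to one table.
struct table_statement {
    std::string keyspace;
    std::string column_family;

    // An absent cf_name means "any table in this keyspace".
    bool depends_on(std::string_view ks_name,
                    std::optional<std::string_view> cf_name) const {
        return keyspace == ks_name && (!cf_name || column_family == *cf_name);
    }
};

// A batch depends on a table if at least one of its statements does.
bool batch_depends_on(const std::vector<table_statement>& statements,
                      std::string_view ks_name,
                      std::optional<std::string_view> cf_name) {
    return std::any_of(statements.begin(), statements.end(),
                       [&](const table_statement& s) {
                           return s.depends_on(ks_name, cf_name);
                       });
}
```

Taking the keyspace and table together lets a container statement answer per sub-statement. With two separate calls, a batch touching `ks1.t2` and `ks2.t1` could wrongly match `ks1.t1`.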


@@ -528,7 +528,7 @@ statement_restrictions::statement_restrictions(database& db,
}
if (!_nonprimary_key_restrictions->empty()) {
if (_has_queriable_regular_index) {
if (_has_queriable_regular_index && _partition_range_is_simple) {
_uses_secondary_indexing = true;
} else if (!allow_filtering) {
throw exceptions::invalid_request_exception("Cannot execute this query as it might involve data filtering and "


@@ -193,7 +193,7 @@ public:
template<typename RowComparator>
void sort(const RowComparator& cmp) {
std::sort(_rows.begin(), _rows.end(), std::ref(cmp));
std::sort(_rows.begin(), _rows.end(), cmp);
}
metadata& get_metadata();


@@ -46,13 +46,7 @@ uint32_t cql3::statements::authentication_statement::get_bound_terms() const {
return 0;
}
bool cql3::statements::authentication_statement::depends_on_keyspace(
const sstring& ks_name) const {
return false;
}
bool cql3::statements::authentication_statement::depends_on_column_family(
const sstring& cf_name) const {
bool cql3::statements::authentication_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return false;
}


@@ -55,9 +55,7 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;


@@ -48,13 +48,7 @@ uint32_t cql3::statements::authorization_statement::get_bound_terms() const {
return 0;
}
bool cql3::statements::authorization_statement::depends_on_keyspace(
const sstring& ks_name) const {
return false;
}
bool cql3::statements::authorization_statement::depends_on_column_family(
const sstring& cf_name) const {
bool cql3::statements::authorization_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return false;
}


@@ -59,9 +59,7 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;


@@ -98,14 +98,9 @@ batch_statement::batch_statement(type type_,
{
}
bool batch_statement::depends_on_keyspace(const sstring& ks_name) const
bool batch_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}
bool batch_statement::depends_on_column_family(const sstring& cf_name) const
{
return false;
return boost::algorithm::any_of(_statements, [&ks_name, &cf_name] (auto&& s) { return s.statement->depends_on(ks_name, cf_name); });
}
uint32_t batch_statement::get_bound_terms() const


@@ -115,9 +115,7 @@ public:
std::unique_ptr<attributes> attrs,
cql_stats& stats);
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual uint32_t get_bound_terms() const override;


@@ -571,12 +571,8 @@ modification_statement::validate(service::storage_proxy&, const service::client_
}
}
bool modification_statement::depends_on_keyspace(const sstring& ks_name) const {
return keyspace() == ks_name;
}
bool modification_statement::depends_on_column_family(const sstring& cf_name) const {
return column_family() == cf_name;
bool modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return keyspace() == ks_name && (!cf_name || column_family() == *cf_name);
}
void modification_statement::add_operation(::shared_ptr<operation> op) {


@@ -165,9 +165,7 @@ public:
// Validate before execute, using client state and current schema
void validate(service::storage_proxy&, const service::client_state& state) const override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
void add_operation(::shared_ptr<operation> op);


@@ -67,12 +67,7 @@ future<> schema_altering_statement::grant_permissions_to_creator(const service::
return make_ready_future<>();
}
bool schema_altering_statement::depends_on_keyspace(const sstring& ks_name) const
{
return false;
}
bool schema_altering_statement::depends_on_column_family(const sstring& cf_name) const
bool schema_altering_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}


@@ -79,9 +79,7 @@ protected:
*/
virtual future<> grant_permissions_to_creator(const service::client_state&) const;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual uint32_t get_bound_terms() const override;


@@ -194,12 +194,8 @@ void select_statement::validate(service::storage_proxy&, const service::client_s
// Nothing to do, all validation has been done by raw_statement::prepare()
}
bool select_statement::depends_on_keyspace(const sstring& ks_name) const {
return keyspace() == ks_name;
}
bool select_statement::depends_on_column_family(const sstring& cf_name) const {
return column_family() == cf_name;
bool select_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return keyspace() == ks_name && (!cf_name || column_family() == *cf_name);
}
const sstring& select_statement::keyspace() const {
@@ -995,6 +991,7 @@ lw_shared_ptr<const service::pager::paging_state> indexed_table_select_statement
}
auto paging_state_copy = make_lw_shared<service::pager::paging_state>(service::pager::paging_state(*paging_state));
paging_state_copy->set_remaining(internal_paging_size);
paging_state_copy->set_partition_key(std::move(index_pk));
paging_state_copy->set_clustering_key(std::move(index_ck));
return std::move(paging_state_copy);


@@ -127,8 +127,7 @@ public:
virtual uint32_t get_bound_terms() const override;
virtual future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;
virtual void validate(service::storage_proxy&, const service::client_state& state) const override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual future<::shared_ptr<cql_transport::messages::result_message>> execute(query_processor& qp,
service::query_state& state, const query_options& options) const override;


@@ -30,13 +30,7 @@ uint32_t service_level_statement::get_bound_terms() const {
return 0;
}
bool service_level_statement::depends_on_keyspace(
const sstring &ks_name) const {
return false;
}
bool service_level_statement::depends_on_column_family(
const sstring &cf_name) const {
bool service_level_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return false;
}


@@ -56,9 +56,7 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
future<> check_access(service::storage_proxy& sp, const service::client_state& state) const override;


@@ -43,7 +43,7 @@ void sl_prop_defs::validate() {
data_value v = duration_type->deserialize(duration_type->from_string(*repr));
cql_duration duration = static_pointer_cast<const duration_type_impl>(duration_type)->from_value(v);
if (duration.months || duration.days) {
throw exceptions::invalid_request_exception("Timeout values cannot be longer than 24h");
throw exceptions::invalid_request_exception("Timeout values cannot be expressed in days/months");
}
if (duration.nanoseconds % 1'000'000 != 0) {
throw exceptions::invalid_request_exception("Timeout values must be expressed in millisecond granularity");
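The corrected message reflects that the restriction is on units, not on length: a timeout of 48 hours is accepted, while one day is rejected. The sketch below restates the two checks from this hunk on a plain struct. `duration_parts` and `timeout_in_ms` are hypothetical names used for illustration, and `std::invalid_argument` stands in for the CQL invalid-request exception.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical mirror of a CQL duration: three independent components.
struct duration_parts {
    int32_t months;
    int32_t days;
    int64_t nanoseconds;
};

// Returns the timeout in milliseconds, or throws if it uses day/month units
// or is finer than one millisecond.
int64_t timeout_in_ms(const duration_parts& d) {
    if (d.months || d.days) {
        throw std::invalid_argument("Timeout values cannot be expressed in days/months");
    }
    if (d.nanoseconds % 1'000'000 != 0) {
        throw std::invalid_argument("Timeout values must be expressed in millisecond granularity");
    }
    return d.nanoseconds / 1'000'000;
}
```

Here 48 hours expressed purely in nanoseconds passes both checks, whereas `{0, 1, 0}` (one day) fails the first check even though it is shorter.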


@@ -67,12 +67,7 @@ std::unique_ptr<prepared_statement> truncate_statement::prepare(database& db,cql
return std::make_unique<prepared_statement>(::make_shared<truncate_statement>(*this));
}
bool truncate_statement::depends_on_keyspace(const sstring& ks_name) const
{
return false;
}
bool truncate_statement::depends_on_column_family(const sstring& cf_name) const
bool truncate_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}


@@ -58,9 +58,7 @@ public:
virtual std::unique_ptr<prepared_statement> prepare(database& db, cql_stats& stats) override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;


@@ -53,6 +53,7 @@
#include "types/list.hh"
#include "types/user.hh"
#include "concrete_types.hh"
#include "validation.hh"
namespace cql3 {
@@ -251,6 +252,7 @@ insert_prepared_json_statement::build_partition_keys(const query_options& option
exploded.emplace_back(json_value->second);
}
auto pkey = partition_key::from_optional_exploded(*s, std::move(exploded));
validation::validate_cql_key(*s, pkey);
auto k = query::range<query::ring_position>::make_singular(dht::decorate_key(*s, std::move(pkey)));
ranges.emplace_back(std::move(k));
return ranges;


@@ -74,12 +74,7 @@ std::unique_ptr<prepared_statement> use_statement::prepare(database& db, cql_sta
}
bool use_statement::depends_on_keyspace(const sstring& ks_name) const
{
return false;
}
bool use_statement::depends_on_column_family(const sstring& cf_name) const
bool use_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}


@@ -59,9 +59,7 @@ public:
virtual uint32_t get_bound_terms() const override;
virtual bool depends_on_keyspace(const seastar::sstring& ks_name) const override;
virtual bool depends_on_column_family(const seastar::sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual seastar::future<> check_access(service::storage_proxy& proxy, const service::client_state& state) const override;


@@ -926,10 +926,9 @@ bool database::update_column_family(schema_ptr new_schema) {
return columns_changed;
}
future<> database::remove(const column_family& cf) noexcept {
void database::remove(const table& cf) noexcept {
auto s = cf.schema();
auto& ks = find_keyspace(s->ks_name());
co_await _querier_cache.evict_all_for_table(s->id());
_column_families.erase(s->id());
ks.metadata()->remove_column_family(s);
_ks_cf_to_uuid.erase(std::make_pair(s->ks_name(), s->cf_name()));
@@ -946,13 +945,20 @@ future<> database::drop_column_family(const sstring& ks_name, const sstring& cf_
auto& ks = find_keyspace(ks_name);
auto uuid = find_uuid(ks_name, cf_name);
auto cf = _column_families.at(uuid);
co_await remove(*cf);
remove(*cf);
cf->clear_views();
co_return co_await cf->await_pending_ops().then([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return truncate(ks, *cf, std::move(tsf), snapshot).finally([this, cf] {
return cf->stop();
});
}).finally([cf] {});
co_await cf->await_pending_ops();
co_await _querier_cache.evict_all_for_table(cf->schema()->id());
std::exception_ptr ex;
try {
co_await truncate(ks, *cf, std::move(tsf), snapshot);
} catch (...) {
ex = std::current_exception();
}
co_await cf->stop();
if (ex) {
std::rethrow_exception(std::move(ex));
}
}
const utils::UUID& database::find_uuid(std::string_view ks, std::string_view cf) const {
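The rewritten `drop_column_family` replaces the `.finally()` chain with a capture-and-rethrow pattern: the error from `truncate` is stored, `cf->stop()` always runs, and the error is rethrown afterwards. C++ does not allow `co_await` inside a `catch` handler, which is presumably why the cleanup is awaited outside the `try` block. A synchronous sketch of the same shape, with a hypothetical helper name `run_then_cleanup`:

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Runs `work`, then always runs `cleanup`, then rethrows any error that
// `work` raised. The cleanup step sits outside the catch handler, which is
// the position a coroutine would need in order to co_await it.
void run_then_cleanup(const std::function<void()>& work,
                      const std::function<void()>& cleanup) {
    std::exception_ptr ex;
    try {
        work();
    } catch (...) {
        ex = std::current_exception();
    }
    cleanup();
    if (ex) {
        std::rethrow_exception(ex);
    }
}
```

If `cleanup` itself throws, that exception propagates and the stored one is dropped. The same holds for the code in the hunk if `cf->stop()` fails.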
@@ -1348,44 +1354,6 @@ database::existing_index_names(const sstring& ks_name, const sstring& cf_to_excl
return names;
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
std::strong_ordering
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
if (left.timestamp() != right.timestamp()) {
return left.timestamp() <=> right.timestamp();
}
if (left.is_live() != right.is_live()) {
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value()) <=> 0;
if (c != 0) {
return c;
}
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
if (left.is_live_and_has_ttl() && left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
}
} else {
// Both are deleted
if (left.deletion_time() != right.deletion_time()) {
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
<=> (uint64_t) right.deletion_time().time_since_epoch().count();
}
}
return std::strong_ordering::equal;
}
future<std::tuple<lw_shared_ptr<query::result>, cache_temperature>>
database::query(schema_ptr s, const query::read_command& cmd, query::result_options opts, const dht::partition_range_vector& ranges,
tracing::trace_state_ptr trace_state, db::timeout_clock::time_point timeout) {


@@ -1384,6 +1384,7 @@ private:
Future update_write_metrics(Future&& f);
void update_write_metrics_for_timed_out_write();
future<> create_keyspace(const lw_shared_ptr<keyspace_metadata>&, bool is_bootstrap, system_keyspace system);
void remove(const table&) noexcept;
public:
static utils::UUID empty_version;
@@ -1582,7 +1583,6 @@ public:
bool update_column_family(schema_ptr s);
future<> drop_column_family(const sstring& ks_name, const sstring& cf_name, timestamp_func, bool with_snapshot = true);
future<> remove(const column_family&) noexcept;
const logalloc::region_group& dirty_memory_region_group() const {
return _dirty_memory_manager.region_group();


@@ -428,6 +428,8 @@ private:
void abort_recycled_list(std::exception_ptr);
void abort_deletion_promise(std::exception_ptr);
future<> recalculate_footprint();
future<> rename_file(sstring, sstring) const;
size_t max_request_controller_units() const;
segment_id_type _ids = 0;
@@ -444,6 +446,7 @@ private:
seastar::gate _gate;
uint64_t _new_counter = 0;
std::optional<size_t> _disk_write_alignment;
seastar::semaphore _reserve_recalculation_guard;
};
template<typename T>
@@ -512,6 +515,7 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
uint64_t _file_pos = 0;
uint64_t _flush_pos = 0;
uint64_t _size_on_disk = 0;
uint64_t _waste = 0;
size_t _alignment;
@@ -598,7 +602,7 @@ public:
clogger.debug("Segment {} is no longer active and will submitted for delete now", *this);
++_segment_manager->totals.segments_destroyed;
_segment_manager->totals.active_size_on_disk -= file_position();
_segment_manager->totals.wasted_size_on_disk -= (_size_on_disk - file_position());
_segment_manager->totals.wasted_size_on_disk -= _waste;
_segment_manager->add_file_to_delete(_file_name, _desc);
} else if (_segment_manager->cfg.warn_about_segments_left_on_disk_after_shutdown) {
clogger.warn("Segment {} is dirty and is left on disk.", *this);
@@ -725,7 +729,8 @@ public:
auto s = co_await sync();
co_await flush();
co_await terminate();
_segment_manager->totals.wasted_size_on_disk += (_size_on_disk - file_position());
_waste = _size_on_disk - file_position();
_segment_manager->totals.wasted_size_on_disk += _waste;
co_return s;
}
future<sseg_ptr> do_flush(uint64_t pos) {
@@ -1223,6 +1228,7 @@ db::commitlog::segment_manager::segment_manager(config c)
, _recycled_segments(std::numeric_limits<size_t>::max())
, _reserve_replenisher(make_ready_future<>())
, _background_sync(make_ready_future<>())
, _reserve_recalculation_guard(1)
{
assert(max_size > 0);
assert(max_mutation_size < segment::multi_entry_size_magic);
@@ -1248,6 +1254,11 @@ future<> db::commitlog::segment_manager::replenish_reserve() {
}
try {
gate::holder g(_gate);
auto guard = co_await get_units(_reserve_recalculation_guard, 1);
if (_reserve_segments.full()) {
// can happen if we recalculate
continue;
}
// note: if we were strict with disk size, we would refuse to do this
// unless disk footprint is lower than threshold. but we cannot (yet?)
// trust that flush logic will absolutely free up an existing
@@ -1519,7 +1530,7 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
if (cfg.extensions && !cfg.extensions->commitlog_file_extensions().empty()) {
for (auto * ext : cfg.extensions->commitlog_file_extensions()) {
auto nf = co_await ext->wrap_file(std::move(filename), f, flags);
auto nf = co_await ext->wrap_file(filename, f, flags);
if (nf) {
f = std::move(nf);
align = is_overwrite ? f.disk_overwrite_dma_alignment() : f.disk_write_dma_alignment();
@@ -1530,12 +1541,21 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
f = make_checked_file(commit_error_handler, std::move(f));
} catch (...) {
ep = std::current_exception();
commit_error_handler(ep);
}
if (ep) {
// do this early, so iff we are to fast-fail server,
// we do it before anything else can go wrong.
try {
commit_error_handler(ep);
} catch (...) {
ep = std::current_exception();
}
}
if (ep && f) {
co_await f.close();
}
if (ep) {
add_file_to_delete(filename, d);
co_return coroutine::exception(std::move(ep));
}
@@ -1594,6 +1614,8 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
}
future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager::new_segment() {
gate::holder g(_gate);
if (_shutdown) {
co_return coroutine::make_exception(std::runtime_error("Commitlog has been shut down. Cannot add data"));
}
@@ -1628,22 +1650,23 @@ future<db::commitlog::segment_manager::sseg_ptr> db::commitlog::segment_manager:
co_return _segments.back();
}
if (_segment_allocating) {
co_await _segment_allocating->get_future(timeout);
continue;
}
promise<> p;
_segment_allocating.emplace(p.get_future());
auto finally = defer([&] () noexcept { _segment_allocating = std::nullopt; });
try {
gate::holder g(_gate);
auto s = co_await with_timeout(timeout, new_segment());
p.set_value();
} catch (...) {
p.set_exception(std::current_exception());
throw;
// #9896 - we don't want to issue a new_segment call until
// the old one has terminated with either result or exception.
// Do all waiting through the shared_future
if (!_segment_allocating) {
auto f = new_segment();
// must check that we are not already done.
if (f.available()) {
f.get(); // maybe force exception
continue;
}
_segment_allocating.emplace(f.discard_result().finally([this] {
// clear the shared_future _before_ resolving its contents
// (i.e. with result of this finally)
_segment_allocating = std::nullopt;
}));
}
co_await _segment_allocating->get_future(timeout);
}
}
@@ -1865,6 +1888,8 @@ future<> db::commitlog::segment_manager::delete_segments(std::vector<sstring> fi
std::exception_ptr recycle_error;
size_t num_deleted = 0;
bool except = false;
while (!files.empty()) {
auto filename = std::move(files.back());
files.pop_back();
@@ -1914,8 +1939,10 @@ future<> db::commitlog::segment_manager::delete_segments(std::vector<sstring> fi
}
}
co_await delete_file(filename);
++num_deleted;
} catch (...) {
clogger.error("Could not delete segment {}: {}", filename, std::current_exception());
except = true;
}
}
@@ -1928,6 +1955,16 @@ future<> db::commitlog::segment_manager::delete_segments(std::vector<sstring> fi
if (recycle_error && _recycled_segments.empty()) {
abort_recycled_list(recycle_error);
}
// If recycle failed and turned into a delete, we should fake-wakeup waiters
// since we might still have cleaned up disk space.
if (!recycle_error && num_deleted && cfg.reuse_segments && _recycled_segments.empty()) {
abort_recycled_list(std::make_exception_ptr(std::runtime_error("deleted files")));
}
// #9348 - if we had an exception, we can't trust our bookkeeping any more. recalculate.
if (except) {
co_await recalculate_footprint();
}
}
void db::commitlog::segment_manager::abort_recycled_list(std::exception_ptr ep) {
@@ -1942,6 +1979,67 @@ void db::commitlog::segment_manager::abort_deletion_promise(std::exception_ptr e
std::exchange(_disk_deletions, {}).set_exception(ep);
}
future<> db::commitlog::segment_manager::recalculate_footprint() {
try {
co_await do_pending_deletes();
auto guard = co_await get_units(_reserve_recalculation_guard, 1);
auto segments_copy = _segments;
std::vector<sseg_ptr> reserves;
std::vector<sstring> recycles;
// this causes haywire things while we steal stuff, but...
while (!_reserve_segments.empty()) {
reserves.push_back(_reserve_segments.pop());
}
while (!_recycled_segments.empty()) {
recycles.push_back(_recycled_segments.pop());
}
// #9955 - must re-stock the queues before we do anything
// interruptable/continuation. Because both queues are
// used with push/pop eventually which _waits_ for signal
// but does _not_ verify that the condition is true once
// we return. So copy the objects and look at instead.
for (auto& filename : recycles) {
_recycled_segments.push(sstring(filename));
}
for (auto& s : reserves) {
_reserve_segments.push(sseg_ptr(s)); // you can have it back now.
}
// first, guesstimate sizes
uint64_t recycle_size = recycles.size() * max_size;
auto old = totals.total_size_on_disk;
totals.total_size_on_disk = recycle_size;
for (auto& s : _segments) {
totals.total_size_on_disk += s->_size_on_disk;
}
for (auto& s : reserves) {
totals.total_size_on_disk += s->_size_on_disk;
}
// now we need to adjust the actual sizes of recycled files
uint64_t actual_recycled_size = 0;
try {
for (auto& filename : recycles) {
auto s = co_await seastar::file_size(filename);
actual_recycled_size += s;
}
} catch (...) {
clogger.error("Exception reading disk footprint ({}).", std::current_exception());
actual_recycled_size = recycle_size; // best we got
}
totals.total_size_on_disk += actual_recycled_size - recycle_size;
// pushing things to reserve/recycled queues will have resumed any
// waiters, so we should be done.
} catch (...) {
clogger.error("Exception recalculating disk footprint ({}). Values might be off...", std::current_exception());
}
}
future<> db::commitlog::segment_manager::do_pending_deletes() {
auto ftc = std::exchange(_files_to_close, {});
auto ftd = std::exchange(_files_to_delete, {});

@@ -119,8 +119,9 @@ future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<
return check_snapshot_not_exist(ks_name, tag, tables).then([this, ks_name, tables, tag, sf] {
return do_with(std::vector<sstring>(std::move(tables)),[this, ks_name, tag, sf](const std::vector<sstring>& tables) {
return do_for_each(tables, [ks_name, tag, sf, this] (const sstring& table_name) {
if (table_name.find(".") != sstring::npos) {
throw std::invalid_argument("Cannot take a snapshot of a secondary index by itself. Run snapshot on the table that owns the index.");
auto& cf = _db.local().find_column_family(ks_name, table_name);
if (cf.schema()->is_view()) {
throw std::invalid_argument("Do not take a snapshot of a materialized view or a secondary index by itself. Run snapshot on the base table instead.");
}
return _db.invoke_on_all([ks_name, table_name, tag, sf] (database &db) {
auto& cf = db.find_column_family(ks_name, table_name);

@@ -350,7 +350,11 @@ public:
view_filter_checking_visitor(const schema& base, const view_info& view)
: _base(base)
, _view(view)
, _selection(cql3::selection::selection::wildcard(_base.shared_from_this()))
, _selection(cql3::selection::selection::for_columns(_base.shared_from_this(),
boost::copy_range<std::vector<const column_definition*>>(
_base.regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return &cdef; }))
)
)
{}
void accept_new_partition(const partition_key& key, uint64_t row_count) {
@@ -1320,7 +1324,7 @@ future<> mutate_MV(
auto mut_ptr = remote_endpoints.empty() ? std::make_unique<frozen_mutation>(std::move(mut.fm)) : std::make_unique<frozen_mutation>(mut.fm);
tracing::trace(tr_state, "Locally applying view update for {}.{}; base token = {}; view token = {}",
mut.s->ks_name(), mut.s->cf_name(), base_token, view_token);
local_view_update = service::get_local_storage_proxy().mutate_locally(mut.s, *mut_ptr, std::move(tr_state), db::commitlog::force_sync::no).then_wrapped(
local_view_update = service::get_local_storage_proxy().mutate_locally(mut.s, *mut_ptr, tr_state, db::commitlog::force_sync::no).then_wrapped(
[s = mut.s, &stats, &cf_stats, tr_state, base_token, view_token, my_address, mut_ptr = std::move(mut_ptr),
units = sem_units.split(sem_units.count())] (future<>&& f) {
--stats.writes;

@@ -215,6 +215,12 @@ public:
});
}
future<flush_permit> get_all_flush_permits() {
return get_units(_background_work_flush_serializer, _max_background_work).then([this] (auto&& units) {
return this->get_flush_permit(std::move(units));
});
}
bool has_extraneous_flushes_requested() const {
return _extraneous_flushes > 0;
}

@@ -100,6 +100,7 @@ def version_compare(a, b):
def create_uuid_file(fl):
with open(args.uuid_file, 'w') as myfile:
myfile.write(str(uuid.uuid1()) + "\n")
os.chmod(args.uuid_file, 0o644)
def sanitize_version(version):

@@ -127,10 +127,14 @@ WantedBy=multi-user.target
# - Storage: /path/to/file (inaccessible)
# - Storage: /path/to/file
#
# After systemd-v248, available coredump file output changed like this:
# - Storage: /path/to/file (present)
# We need to support both versions.
#
# reference: https://github.com/systemd/systemd/commit/47f50642075a7a215c9f7b600599cbfee81a2913
corefail = False
res = re.findall(r'Storage: (.*)$', coreinfo, flags=re.MULTILINE)
res = re.findall(r'Storage: (\S+)(?: \(.+\))?$', coreinfo, flags=re.MULTILINE)
# v232 or later
if res:
corepath = res[0]
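
The regex change in this hunk can be checked in isolation. The following is a minimal sketch, not part of the patch: the sample `coredumpctl info` text and the helper name `storage_paths` are made up for illustration, and only the two patterns are taken from the diff.

```python
import re

# Patterns copied from the hunk above: the old one captures everything
# after "Storage: ", the new one stops at the first whitespace and
# tolerates an optional trailing "(...)" annotation.
OLD_PATTERN = r'Storage: (.*)$'
NEW_PATTERN = r'Storage: (\S+)(?: \(.+\))?$'


def storage_paths(coreinfo, pattern=NEW_PATTERN):
    """Return every path found on a 'Storage:' line (hypothetical helper)."""
    return re.findall(pattern, coreinfo, flags=re.MULTILINE)


# Hypothetical samples in the pre-v248 and post-v248 output styles.
before_v248 = "  PID: 1234\n  Storage: /var/lib/systemd/coredump/core.scylla\n"
after_v248 = "  PID: 1234\n  Storage: /var/lib/systemd/coredump/core.scylla (present)\n"

print(storage_paths(before_v248))
print(storage_paths(after_v248))
# The old pattern keeps the annotation as part of the captured "path".
print(storage_paths(after_v248, OLD_PATTERN))
```

With the new pattern both output styles yield the bare path, whereas the old pattern would hand the annotated string to whatever code later treats it as a file name.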

@@ -278,6 +278,66 @@ if __name__ == "__main__":
disk_properties["read_bandwidth"] = 2527296683 * nr_disks
disk_properties["write_iops"] = 156326 * nr_disks
disk_properties["write_bandwidth"] = 1063657088 * nr_disks
elif idata.instance() == "im4gn.large":
disk_properties["read_iops"] = 33943
disk_properties["read_bandwidth"] = 288433525
disk_properties["write_iops"] = 27877
disk_properties["write_bandwidth"] = 126864680
elif idata.instance() == "im4gn.xlarge":
disk_properties["read_iops"] = 68122
disk_properties["read_bandwidth"] = 576603520
disk_properties["write_iops"] = 55246
disk_properties["write_bandwidth"] = 254534954
elif idata.instance() == "im4gn.2xlarge":
disk_properties["read_iops"] = 136422
disk_properties["read_bandwidth"] = 1152663765
disk_properties["write_iops"] = 92184
disk_properties["write_bandwidth"] = 508926453
elif idata.instance() == "im4gn.4xlarge":
disk_properties["read_iops"] = 273050
disk_properties["read_bandwidth"] = 1638427264
disk_properties["write_iops"] = 92173
disk_properties["write_bandwidth"] = 1027966826
elif idata.instance() == "im4gn.8xlarge":
disk_properties["read_iops"] = 250241 * nr_disks
disk_properties["read_bandwidth"] = 1163130709 * nr_disks
disk_properties["write_iops"] = 86374 * nr_disks
disk_properties["write_bandwidth"] = 977617664 * nr_disks
elif idata.instance() == "im4gn.16xlarge":
disk_properties["read_iops"] = 273030 * nr_disks
disk_properties["read_bandwidth"] = 1638211413 * nr_disks
disk_properties["write_iops"] = 92607 * nr_disks
disk_properties["write_bandwidth"] = 1028340266 * nr_disks
elif idata.instance() == "is4gen.medium":
disk_properties["read_iops"] = 33965
disk_properties["read_bandwidth"] = 288462506
disk_properties["write_iops"] = 27876
disk_properties["write_bandwidth"] = 126954200
elif idata.instance() == "is4gen.large":
disk_properties["read_iops"] = 68131
disk_properties["read_bandwidth"] = 576654869
disk_properties["write_iops"] = 55257
disk_properties["write_bandwidth"] = 254551002
elif idata.instance() == "is4gen.xlarge":
disk_properties["read_iops"] = 136413
disk_properties["read_bandwidth"] = 1152747904
disk_properties["write_iops"] = 92180
disk_properties["write_bandwidth"] = 508889546
elif idata.instance() == "is4gen.2xlarge":
disk_properties["read_iops"] = 273038
disk_properties["read_bandwidth"] = 1628982613
disk_properties["write_iops"] = 92182
disk_properties["write_bandwidth"] = 1027983530
elif idata.instance() == "is4gen.4xlarge":
disk_properties["read_iops"] = 260493 * nr_disks
disk_properties["read_bandwidth"] = 1217396928 * nr_disks
disk_properties["write_iops"] = 83169 * nr_disks
disk_properties["write_bandwidth"] = 1000390784 * nr_disks
elif idata.instance() == "is4gen.8xlarge":
disk_properties["read_iops"] = 273021 * nr_disks
disk_properties["read_bandwidth"] = 1656354602 * nr_disks
disk_properties["write_iops"] = 92233 * nr_disks
disk_properties["write_bandwidth"] = 1028010325 * nr_disks
properties_file = open(etcdir() + "/scylla.d/io_properties.yaml", "w")
yaml.dump({ "disks": [ disk_properties ] }, properties_file, default_flow_style=False)
ioconf = open(etcdir() + "/scylla.d/io.conf", "w")

@@ -66,18 +66,18 @@ if __name__ == '__main__':
target = None
if os.path.exists('/lib/systemd/systemd-timesyncd'):
if systemd_unit('systemd-timesyncd').is_active():
if systemd_unit('systemd-timesyncd').is_active() == 'active':
print('ntp is already configured, skip setup')
sys.exit(0)
target = 'systemd-timesyncd'
if shutil.which('chronyd'):
if get_chrony_unit().is_active():
if get_chrony_unit().is_active() == 'active':
print('ntp is already configured, skip setup')
sys.exit(0)
if not target:
target = 'chrony'
if shutil.which('ntpd'):
if get_ntp_unit().is_active():
if get_ntp_unit().is_active() == 'active':
print('ntp is already configured, skip setup')
sys.exit(0)
if not target:

@@ -117,10 +117,11 @@ if __name__ == '__main__':
pkg_install('xfsprogs')
if not shutil.which('mdadm'):
pkg_install('mdadm')
try:
md_service = systemd_unit('mdmonitor.service')
except SystemdException:
md_service = systemd_unit('mdadm.service')
if args.raid_level != '0':
try:
md_service = systemd_unit('mdmonitor.service')
except SystemdException:
md_service = systemd_unit('mdadm.service')
print('Creating {type} for scylla using {nr_disk} disk(s): {disks}'.format(type=f'RAID{args.raid_level}' if raid else 'XFS volume', nr_disk=len(disks), disks=args.disks))
procs=[]
@@ -164,14 +165,15 @@ if __name__ == '__main__':
uuid = run(f'blkid -s UUID -o value {fsdev}', shell=True, check=True, capture_output=True, encoding='utf-8').stdout.strip()
after = 'local-fs.target'
if raid:
wants = ''
if raid and args.raid_level != '0':
after += f' {md_service}'
wants = f'\nWants={md_service}'
unit_data = f'''
[Unit]
Description=Scylla data directory
Before=scylla-server.service
After={after}
Wants={md_service}
After={after}{wants}
DefaultDependencies=no
[Mount]
@@ -195,7 +197,8 @@ WantedBy=multi-user.target
f.write(f'RequiresMountsFor={mount_at}\n')
systemd_unit.reload()
md_service.start()
if args.raid_level != '0':
md_service.start()
mount = systemd_unit(mntunit_bn)
mount.start()
if args.enable_on_nextboot:

@@ -370,6 +370,10 @@ if __name__ == '__main__':
version_check = interactive_ask_service('Do you want to enable Scylla to check if there is a newer version of Scylla available?', 'Yes - start the Scylla-housekeeping service to check for a newer version. This check runs periodically. No - skips this step.', version_check)
args.no_version_check = not version_check
if version_check:
cfg = sysconfig_parser(sysconfdir_p() / 'scylla-housekeeping')
repo_files = cfg.get('REPO_FILES')
for f in glob.glob(repo_files):
os.chmod(f, 0o644)
with open('/etc/scylla.d/housekeeping.cfg', 'w') as f:
f.write('[housekeeping]\ncheck-version: True\n')
os.chmod('/etc/scylla.d/housekeeping.cfg', 0o644)

@@ -674,7 +674,7 @@ class aws_instance:
return self._type.split(".")[0]
def is_supported_instance_class(self):
if self.instance_class() in ['i2', 'i3', 'i3en', 'c5d', 'm5d', 'm5ad', 'r5d', 'z1d', 'c6gd', 'm6gd', 'r6gd', 'x2gd']:
if self.instance_class() in ['i2', 'i3', 'i3en', 'c5d', 'm5d', 'm5ad', 'r5d', 'z1d', 'c6gd', 'm6gd', 'r6gd', 'x2gd', 'im4gn', 'is4gen']:
return True
return False
@@ -683,7 +683,7 @@ class aws_instance:
instance_size = self.instance_size()
if instance_class in ['c3', 'c4', 'd2', 'i2', 'r3']:
return 'ixgbevf'
if instance_class in ['a1', 'c5', 'c5a', 'c5d', 'c5n', 'c6g', 'c6gd', 'f1', 'g3', 'g4', 'h1', 'i3', 'i3en', 'inf1', 'm5', 'm5a', 'm5ad', 'm5d', 'm5dn', 'm5n', 'm6g', 'm6gd', 'p2', 'p3', 'r4', 'r5', 'r5a', 'r5ad', 'r5b', 'r5d', 'r5dn', 'r5n', 't3', 't3a', 'u-6tb1', 'u-9tb1', 'u-12tb1', 'u-18tn1', 'u-24tb1', 'x1', 'x1e', 'z1d', 'c6g', 'c6gd', 'm6g', 'm6gd', 't4g', 'r6g', 'r6gd', 'x2gd']:
if instance_class in ['a1', 'c5', 'c5a', 'c5d', 'c5n', 'c6g', 'c6gd', 'f1', 'g3', 'g4', 'h1', 'i3', 'i3en', 'inf1', 'm5', 'm5a', 'm5ad', 'm5d', 'm5dn', 'm5n', 'm6g', 'm6gd', 'p2', 'p3', 'r4', 'r5', 'r5a', 'r5ad', 'r5b', 'r5d', 'r5dn', 'r5n', 't3', 't3a', 'u-6tb1', 'u-9tb1', 'u-12tb1', 'u-18tn1', 'u-24tb1', 'x1', 'x1e', 'z1d', 'c6g', 'c6gd', 'm6g', 'm6gd', 't4g', 'r6g', 'r6gd', 'x2gd', 'im4gn', 'is4gen']:
return 'ena'
if instance_class == 'm4':
if instance_size == '16xlarge':
@@ -1041,7 +1041,7 @@ class systemd_unit:
return run('systemctl {} disable {}'.format(self.ctlparam, self._unit), shell=True, check=True)
def is_active(self):
return True if run('systemctl {} is-active {}'.format(self.ctlparam, self._unit), shell=True, capture_output=True, encoding='utf-8').stdout.strip() == 'active' else False
return run('systemctl {} is-active {}'.format(self.ctlparam, self._unit), shell=True, capture_output=True, encoding='utf-8').stdout.strip()
def mask(self):
return run('systemctl {} mask {}'.format(self.ctlparam, self._unit), shell=True, check=True)
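
Since `is_active()` in this hunk now returns the raw `systemctl is-active` output instead of a boolean, callers have to compare the result against `'active'` explicitly, as the NTP setup script earlier in this diff does. A minimal sketch of why a bare truth test would no longer work; `is_configured` and the list of sample states are illustrative assumptions, not part of the patch.

```python
def is_configured(status):
    """True only when the unit state string is exactly 'active' (hypothetical helper)."""
    return status == 'active'


# Any non-empty state string is truthy, so `if unit.is_active():` would
# treat 'inactive' or 'failed' as success once a string is returned.
for status in ('active', 'inactive', 'failed', 'activating'):
    print(status, bool(status), is_configured(status))
```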

@@ -6,12 +6,16 @@ is_nonroot() {
[ -f "$scylladir"/SCYLLA-NONROOT-FILE ]
}
is_container() {
[ -f "$scylladir"/SCYLLA-CONTAINER-FILE ]
}
is_privileged() {
[ ${EUID:-${UID}} = 0 ]
}
execsudo() {
if is_nonroot; then
if is_nonroot || is_container; then
exec "$@"
else
exec sudo -u scylla -g scylla "$@"

@@ -25,6 +25,10 @@ product="$(<build/SCYLLA-PRODUCT-FILE)"
version="$(<build/SCYLLA-VERSION-FILE)"
release="$(<build/SCYLLA-RELEASE-FILE)"
if [[ "$version" = *rc* ]]; then
version=$(echo $version |sed 's/\(.*\)\.)*/\1~/')
fi
mode="release"
if uname -m | grep x86_64 ; then
@@ -93,12 +97,14 @@ run apt-get -y install hostname supervisor openssh-server openssh-client openjdk
run locale-gen en_US.UTF-8
run bash -ec "dpkg -i packages/*.deb"
run apt-get -y clean all
run bash -ec "cat /scylla_bashrc >> /etc/bashrc"
run bash -ec "cat /scylla_bashrc >> /etc/bash.bashrc"
run mkdir -p /etc/supervisor.conf.d
run mkdir -p /var/log/scylla
run chown -R scylla:scylla /var/lib/scylla
run sed -i -e 's/^SCYLLA_ARGS=".*"$/SCYLLA_ARGS="--log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix"/' /etc/default/scylla-server
run mkdir -p /opt/scylladb/supervisor
run touch /opt/scylladb/SCYLLA-CONTAINER-FILE
bcp dist/common/supervisor/scylla-server.sh /opt/scylladb/supervisor/scylla-server.sh
bcp dist/common/supervisor/scylla-jmx.sh /opt/scylladb/supervisor/scylla-jmx.sh
bcp dist/common/supervisor/scylla-node-exporter.sh /opt/scylladb/supervisor/scylla-node-exporter.sh
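
The `rc` handling near the top of this script replaces the last dot of a release-candidate version with a tilde. A rough Python counterpart of that substitution is sketched below; the function name and the sample version strings are assumptions, and the regex is a simplified equivalent of the `sed` expression rather than a literal port.

```python
import re


def tilde_rc_version(version):
    """Turn e.g. '4.6.rc1' into '4.6~rc1'; leave other versions alone (hypothetical helper)."""
    if 'rc' not in version:
        return version
    # Greedy (.*) makes the following \. match the last dot in the string.
    return re.sub(r'(.*)\.', r'\1~', version, count=1)


print(tilde_rc_version('4.6.rc1'))
print(tilde_rc_version('4.6.7'))
```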

@@ -1,4 +1,4 @@
[program:scylla-server]
[program:scylla]
command=/opt/scylladb/supervisor/scylla-server.sh
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0

@@ -1,41 +0,0 @@
# choose following mode: virtio, dpdk, posix
NETWORK_MODE=posix
# tap device name(virtio)
TAP=tap0
# bridge device name (virtio)
BRIDGE=virbr0
# ethernet device name
IFNAME=eth0
# setup NIC's and disks' interrupts, RPS, XPS, nomerges and I/O scheduler (posix)
SET_NIC_AND_DISKS=no
# ethernet device driver (dpdk)
ETHDRV=
# ethernet device PCI ID (dpdk)
ETHPCIID=
# number of hugepages
NR_HUGEPAGES=64
# user for process (must be root for dpdk)
USER=scylla
# group for process
GROUP=scylla
# scylla home dir
SCYLLA_HOME=/var/lib/scylla
# scylla config dir
SCYLLA_CONF=/etc/scylla
# scylla arguments
SCYLLA_ARGS="--log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix"
# setup as AMI instance
AMI=no

@@ -121,12 +121,13 @@ class ScyllaSetup:
if self._apiAddress is not None:
args += ["--api-address %s" % self._apiAddress]
if self._alternatorPort is not None:
if self._alternatorAddress is not None:
args += ["--alternator-address %s" % self._alternatorAddress]
if self._alternatorPort is not None:
args += ["--alternator-port %s" % self._alternatorPort]
if self._alternatorHttpsPort is not None:
args += ["--alternator-address %s" % self._alternatorAddress]
args += ["--alternator-https-port %s" % self._alternatorHttpsPort]
if self._alternatorWriteIsolation is not None:
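
One reading of the hunk above is that `--alternator-address` now depends only on the address being set and is emitted once, instead of being tied to either port option. A minimal standalone sketch of that behaviour follows; the free function `alternator_args` and its parameters are hypothetical stand-ins for the `ScyllaSetup` attributes.

```python
def alternator_args(address=None, port=None, https_port=None):
    """Build the Alternator command-line fragments (hypothetical helper)."""
    args = []
    # The address is handled on its own, independent of either port.
    if address is not None:
        args += ["--alternator-address %s" % address]
    if port is not None:
        args += ["--alternator-port %s" % port]
    if https_port is not None:
        args += ["--alternator-https-port %s" % https_port]
    return args


print(alternator_args('0.0.0.0', 8000, 8043))
print(alternator_args(port=8000))
```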

@@ -184,14 +184,18 @@ future<> server::do_accepts(int which, bool keepalive, socket_address server_add
_logger.info("exception while advertising new connection: {}", std::current_exception());
}
// Block while monitoring for lifetime/errors.
return conn->process().finally([this, conn] {
return unadvertise_connection(conn);
}).handle_exception([this] (std::exception_ptr ep) {
if (is_broken_pipe_or_connection_reset(ep)) {
// expected if another side closes a connection or we're shutting down
return;
return conn->process().then_wrapped([this, conn] (auto f) {
try {
f.get();
} catch (...) {
auto ep = std::current_exception();
if (!is_broken_pipe_or_connection_reset(ep)) {
// some exceptions are expected if another side closes a connection
// or we're shutting down
_logger.info("exception while processing connection: {}", ep);
}
}
_logger.info("exception while processing connection: {}", ep);
return unadvertise_connection(conn);
});
});
return stop_iteration::no;

@@ -477,49 +477,42 @@ gossiper::handle_get_endpoint_states_msg(gossip_get_endpoint_states_request requ
return make_ready_future<gossip_get_endpoint_states_response>(gossip_get_endpoint_states_response{std::move(map)});
}
rpc::no_wait_type gossiper::background_msg(sstring type, noncopyable_function<future<>(gossiper&)> fn) {
(void)with_gate(_background_msg, [this, type = std::move(type), fn = std::move(fn)] () mutable {
return container().invoke_on(0, std::move(fn)).handle_exception([type = std::move(type)] (auto ep) {
logger.warn("Failed to handle {}: {}", type, ep);
});
});
return messaging_service::no_wait();
}
void gossiper::init_messaging_service_handler() {
_messaging.register_gossip_digest_syn([this] (const rpc::client_info& cinfo, gossip_digest_syn syn_msg) {
auto from = netw::messaging_service::get_source(cinfo);
// In a new fiber.
(void)container().invoke_on(0, [from, syn_msg = std::move(syn_msg)] (gms::gossiper& gossiper) mutable {
return background_msg("GOSSIP_DIGEST_SYN", [from, syn_msg = std::move(syn_msg)] (gms::gossiper& gossiper) mutable {
return gossiper.handle_syn_msg(from, std::move(syn_msg));
}).handle_exception([] (auto ep) {
logger.warn("Fail to handle GOSSIP_DIGEST_SYN: {}", ep);
});
return messaging_service::no_wait();
});
_messaging.register_gossip_digest_ack([this] (const rpc::client_info& cinfo, gossip_digest_ack msg) {
auto from = netw::messaging_service::get_source(cinfo);
// In a new fiber.
(void)container().invoke_on(0, [from, msg = std::move(msg)] (gms::gossiper& gossiper) mutable {
return background_msg("GOSSIP_DIGEST_ACK", [from, msg = std::move(msg)] (gms::gossiper& gossiper) mutable {
return gossiper.handle_ack_msg(from, std::move(msg));
}).handle_exception([] (auto ep) {
logger.warn("Fail to handle GOSSIP_DIGEST_ACK: {}", ep);
});
return messaging_service::no_wait();
});
_messaging.register_gossip_digest_ack2([this] (const rpc::client_info& cinfo, gossip_digest_ack2 msg) {
auto from = netw::messaging_service::get_source(cinfo);
// In a new fiber.
(void)container().invoke_on(0, [from, msg = std::move(msg)] (gms::gossiper& gossiper) mutable {
return background_msg("GOSSIP_DIGEST_ACK2", [from, msg = std::move(msg)] (gms::gossiper& gossiper) mutable {
return gossiper.handle_ack2_msg(from, std::move(msg));
}).handle_exception([] (auto ep) {
logger.warn("Fail to handle GOSSIP_DIGEST_ACK2: {}", ep);
});
return messaging_service::no_wait();
});
_messaging.register_gossip_echo([this] (const rpc::client_info& cinfo, rpc::optional<int64_t> generation_number_opt) {
auto from = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
return handle_echo_msg(from, generation_number_opt);
});
_messaging.register_gossip_shutdown([this] (inet_address from, rpc::optional<int64_t> generation_number_opt) {
// In a new fiber.
(void)container().invoke_on(0, [from, generation_number_opt] (gms::gossiper& gossiper) {
return background_msg("GOSSIP_SHUTDOWN", [from, generation_number_opt] (gms::gossiper& gossiper) {
return gossiper.handle_shutdown_msg(from, generation_number_opt);
}).handle_exception([] (auto ep) {
logger.warn("Fail to handle GOSSIP_SHUTDOWN: {}", ep);
});
return messaging_service::no_wait();
});
_messaging.register_gossip_get_endpoint_states([this] (const rpc::client_info& cinfo, gossip_get_endpoint_states_request request) {
return container().invoke_on(0, [request = std::move(request)] (gms::gossiper& gossiper) mutable {
@@ -1679,6 +1672,10 @@ bool gossiper::is_normal(const inet_address& endpoint) const {
return get_gossip_status(endpoint) == sstring(versioned_value::STATUS_NORMAL);
}
bool gossiper::is_left(const inet_address& endpoint) const {
return get_gossip_status(endpoint) == sstring(versioned_value::STATUS_LEFT);
}
bool gossiper::is_normal_ring_member(const inet_address& endpoint) const {
auto status = get_gossip_status(endpoint);
return status == sstring(versioned_value::STATUS_NORMAL) || status == sstring(versioned_value::SHUTDOWN);
@@ -2178,6 +2175,9 @@ future<> gossiper::start() {
}
future<> gossiper::shutdown() {
if (!_background_msg.is_closed()) {
co_await _background_msg.close();
}
if (this_shard_id() == 0) {
co_await do_stop_gossiping();
}

@@ -41,7 +41,9 @@
#include "unimplemented.hh"
#include <seastar/core/distributed.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/print.hh>
#include <seastar/rpc/rpc_types.hh>
#include "utils/atomic_vector.hh"
#include "utils/UUID.hh"
#include "utils/fb_utilities.hh"
@@ -138,12 +140,16 @@ private:
bool _enabled = false;
semaphore _callback_running{1};
semaphore _apply_state_locally_semaphore{100};
seastar::gate _background_msg;
std::unordered_map<gms::inet_address, syn_msg_pending> _syn_handlers;
std::unordered_map<gms::inet_address, ack_msg_pending> _ack_handlers;
bool _advertise_myself = true;
// Map ip address and generation number
std::unordered_map<gms::inet_address, int32_t> _advertise_to_nodes;
future<> _failure_detector_loop_done{make_ready_future<>()} ;
rpc::no_wait_type background_msg(sstring type, noncopyable_function<future<>(gossiper&)> fn);
public:
// Get current generation number for the given nodes
future<std::unordered_map<gms::inet_address, int32_t>>
@@ -565,6 +571,7 @@ public:
bool is_seed(const inet_address& endpoint) const;
bool is_shutdown(const inet_address& endpoint) const;
bool is_normal(const inet_address& endpoint) const;
bool is_left(const inet_address& endpoint) const;
// Check if a node is in NORMAL or SHUTDOWN status which means the node is
// part of the token ring from the gossip point of view and operates in
// normal status or was in normal status but is shutdown.

@@ -520,8 +520,13 @@ relocate_python3 "$rprefix"/scyllatop tools/scyllatop/scyllatop.py
if $supervisor; then
install -d -m755 `supervisor_dir $retc`
for service in scylla-server scylla-jmx scylla-node-exporter; do
if [ "$service" = "scylla-server" ]; then
program="scylla"
else
program=$service
fi
cat << EOS > `supervisor_conf $retc $service`
[program:$service]
[program:$program]
directory=$rprefix
command=/bin/bash -c './supervisor/$service.sh'
EOS

@@ -61,6 +61,10 @@ azure_snitch::azure_snitch(const sstring& fname, unsigned io_cpuid) : production
}
future<> azure_snitch::load_config() {
if (this_shard_id() != io_cpu_id()) {
co_return;
}
sstring region = co_await azure_api_call(REGION_NAME_QUERY_PATH);
sstring azure_zone = co_await azure_api_call(ZONE_NAME_QUERY_PATH);

main.cc

@@ -377,11 +377,38 @@ static auto defer_verbose_shutdown(const char* what, Func&& func) {
startlog.info("Shutting down {}", what);
try {
func();
startlog.info("Shutting down {} was successful", what);
} catch (...) {
startlog.error("Unexpected error shutting down {}: {}", what, std::current_exception());
throw;
auto ex = std::current_exception();
bool do_abort = true;
try {
std::rethrow_exception(ex);
} catch (const std::system_error& e) {
// System error codes we consider "environmental",
// i.e. not scylla's fault, therefore there is no point in
// aborting and dumping core.
for (int i : {EIO, EACCES, ENOSPC}) {
if (e.code() == std::error_code(i, std::system_category())) {
do_abort = false;
break;
}
}
} catch (...) {
}
auto msg = fmt::format("Unexpected error shutting down {}: {}", what, ex);
if (do_abort) {
startlog.error("{}: aborting", msg);
abort();
} else {
startlog.error("{}: exiting, at {}", msg, current_backtrace());
// Call _exit() rather than exit() to exit immediately
// without calling exit handlers, avoiding
// boost::intrusive::detail::destructor_impl assert failure
// from ~segment_pool exit handler.
_exit(255);
}
}
startlog.info("Shutting down {} was successful", what);
};
auto ret = deferred_action(std::move(vfunc));

@@ -613,7 +613,8 @@ static flat_mutation_reader make_partition_snapshot_flat_reader_from_snp_schema(
schema_ptr rev_snp_schema = snp->schema()->make_reversed();
return make_partition_snapshot_flat_reader<true, partition_snapshot_read_accounter>(std::move(rev_snp_schema), std::move(permit), std::move(dk), std::move(crr), std::move(snp), digest_requested, region, read_section, pointer_to_container, fwd, memtable);
} else {
return make_partition_snapshot_flat_reader<false, partition_snapshot_read_accounter>(snp->schema(), std::move(permit), std::move(dk), std::move(crr), std::move(snp), digest_requested, region, read_section, pointer_to_container, fwd, memtable);
schema_ptr snp_schema = snp->schema();
return make_partition_snapshot_flat_reader<false, partition_snapshot_read_accounter>(std::move(snp_schema), std::move(permit), std::move(dk), std::move(crr), std::move(snp), digest_requested, region, read_section, pointer_to_container, fwd, memtable);
}
}

@@ -628,7 +628,12 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
remove_error_rpc_client(verb, id);
}
auto must_encrypt = [&id, &verb, this] {
auto addr = get_preferred_ip(id.addr);
auto broadcast_address = utils::fb_utilities::get_broadcast_address();
bool listen_to_bc = _cfg.listen_on_broadcast_address && _cfg.ip != broadcast_address;
auto laddr = socket_address(listen_to_bc ? broadcast_address : _cfg.ip, 0);
auto must_encrypt = [&] {
if (_cfg.encrypt == encrypt_what::none) {
return false;
}
@@ -646,13 +651,27 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
auto& snitch_ptr = locator::i_endpoint_snitch::get_local_snitch_ptr();
// either rack/dc need to be in same dc to use non-tls
if (snitch_ptr->get_datacenter(id.addr) != snitch_ptr->get_datacenter(utils::fb_utilities::get_broadcast_address())) {
auto my_dc = snitch_ptr->get_datacenter(broadcast_address);
if (snitch_ptr->get_datacenter(addr) != my_dc) {
return true;
}
// #9653 - if our idea of dc for bind address differs from our official endpoint address,
// we cannot trust downgrading. We need to ensure either (local) bind address is same as
// broadcast or that the dc info we get for it is the same.
if (broadcast_address != laddr && snitch_ptr->get_datacenter(laddr) != my_dc) {
return true;
}
// if cross-rack tls, check rack.
return _cfg.encrypt == encrypt_what::rack &&
snitch_ptr->get_rack(id.addr) != snitch_ptr->get_rack(utils::fb_utilities::get_broadcast_address())
;
if (_cfg.encrypt == encrypt_what::dc) {
return false;
}
auto my_rack = snitch_ptr->get_rack(broadcast_address);
if (snitch_ptr->get_rack(addr) != my_rack) {
return true;
}
// See above: We need to ensure either (local) bind address is same as
// broadcast or that the rack info we get for it is the same.
return broadcast_address != laddr && snitch_ptr->get_rack(laddr) != my_rack;
}();
auto must_compress = [&id, this] {
@@ -681,7 +700,7 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
return true;
}();
auto remote_addr = socket_address(get_preferred_ip(id.addr), must_encrypt ? _cfg.ssl_port : _cfg.port);
auto remote_addr = socket_address(addr, must_encrypt ? _cfg.ssl_port : _cfg.port);
rpc::client_options opts;
// send keepalive messages each minute if connection is idle, drop connection after 10 failures
@@ -691,13 +710,8 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
}
opts.tcp_nodelay = must_tcp_nodelay;
opts.reuseaddr = true;
// We send cookies only for non-default statement tenant clients.
if (idx > 3) {
opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
}
opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
bool listen_to_bc = _cfg.listen_on_broadcast_address && _cfg.ip != utils::fb_utilities::get_broadcast_address();
auto laddr = socket_address(listen_to_bc ? utils::fb_utilities::get_broadcast_address() : _cfg.ip, 0);
auto client = must_encrypt ?
::make_shared<rpc_protocol_client_wrapper>(_rpc->protocol(), std::move(opts),
remote_addr, laddr, _credentials) :
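
The reworked `must_encrypt` decision in the hunks above can be modelled in a few lines. This is only a sketch under stated assumptions: `Location`, the string mode names and the `bind` parameter are hypothetical stand-ins for the snitch lookups, and the branch elided between the two hunks is deliberately not reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for what the snitch reports about an address."""
    dc: str
    rack: str


def must_encrypt(mode: str, peer: Location, me: Location,
                 bind: Optional[Location] = None) -> bool:
    """Model of the decision shown in the diff.

    `me` is the location of the broadcast address. `bind` is the location the
    snitch reports for the local bind address, or None when the bind address
    equals the broadcast address.
    """
    if mode == "none":
        return False
    if mode not in ("dc", "rack"):
        # Other modes are handled in lines the diff does not show.
        raise ValueError(f"mode not covered by this sketch: {mode}")
    if peer.dc != me.dc:
        return True
    # Bind address resolves to a different DC: do not trust downgrading.
    if bind is not None and bind.dc != me.dc:
        return True
    if mode == "dc":
        return False
    if peer.rack != me.rack:
        return True
    # Same check as above, applied to the rack.
    return bind is not None and bind.rack != me.rack
```

The point of the extra `bind` checks is that a connection is only downgraded to plain TCP when both the official endpoint address and the address actually bound locally agree on the DC (and, in rack mode, the rack).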

@@ -694,11 +694,11 @@ future<typename ResultBuilder::result_type> do_query(
ResultBuilder&& result_builder) {
auto ctx = seastar::make_shared<read_context>(db, s, cmd, ranges, trace_state, timeout);
co_await ctx->lookup_readers(timeout);
std::exception_ptr ex;
try {
co_await ctx->lookup_readers(timeout);
auto [last_ckey, result, unconsumed_buffer, compaction_state] = co_await read_page<ResultBuilder>(ctx, s, cmd, ranges, trace_state,
std::move(result_builder));

@@ -1545,18 +1545,20 @@ public:
};
future<> shard_reader::close() noexcept {
// Nothing to do if there was no reader created, nor is there a background
// read ahead in progress which will create one.
if (!_reader && !_read_ahead) {
co_return;
if (_read_ahead) {
try {
co_await *std::exchange(_read_ahead, std::nullopt);
} catch (...) {
mrlog.warn("shard_reader::close(): read_ahead on shard {} failed: {}", _shard, std::current_exception());
}
}
try {
if (_read_ahead) {
co_await *std::exchange(_read_ahead, std::nullopt);
}
co_await smp::submit_to(_shard, [this] {
if (!_reader) {
return make_ready_future<>();
}
auto irh = std::move(*_reader).inactive_read_handle();
return with_closeable(flat_mutation_reader(_reader.release()), [this] (flat_mutation_reader& reader) mutable {
auto permit = reader.permit();

@@ -54,7 +54,7 @@ future<> feed_writer(flat_mutation_reader&& rd_ref, Writer wr) {
auto rd = std::move(rd_ref);
std::exception_ptr ex;
try {
while (!rd.is_end_of_stream()) {
while (!rd.is_end_of_stream() || !rd.is_buffer_empty()) {
co_await rd.fill_buffer();
while (!rd.is_buffer_empty()) {
co_await rd.pop_mutation_fragment().consume(wr);
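
The changed loop condition in `feed_writer` matters when a reader is handed over already at end of stream but with fragments still sitting in its buffer. The sketch below models that with a hypothetical `FakeReader`; it is not the real `flat_mutation_reader` API, only an illustration of the two loop conditions.

```python
from collections import deque


class FakeReader:
    """Hypothetical reader: `buffer` holds ready fragments, `pending` the rest."""

    def __init__(self, buffered, pending, end_of_stream):
        self.buffer = deque(buffered)
        self.pending = deque(pending)
        self.eos = end_of_stream

    def is_end_of_stream(self):
        return self.eos

    def is_buffer_empty(self):
        return not self.buffer

    def fill_buffer(self):
        # Move one pending fragment into the buffer, then flag end of stream
        # once nothing more will ever be produced.
        if self.pending:
            self.buffer.append(self.pending.popleft())
        if not self.pending:
            self.eos = True

    def pop(self):
        return self.buffer.popleft()


def drain(reader, check_buffer):
    """check_buffer=False models the old condition, True the new one."""
    out = []
    while not reader.is_end_of_stream() or (check_buffer and not reader.is_buffer_empty()):
        reader.fill_buffer()
        while not reader.is_buffer_empty():
            out.append(reader.pop())
    return out


old = drain(FakeReader(["a", "b"], [], end_of_stream=True), check_buffer=False)
new = drain(FakeReader(["a", "b"], [], end_of_stream=True), check_buffer=True)
```

With the old condition the outer loop is never entered, so the buffered fragments are dropped; with the new condition they are consumed before the loop exits.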

@@ -305,14 +305,23 @@ class partition_snapshot_flat_reader : public flat_mutation_reader::impl, public
const std::optional<position_in_partition>& last_row,
const std::optional<position_in_partition>& last_rts,
position_in_partition_view pos) {
if (!_rt_stream.empty()) {
return _rt_stream.get_next(std::move(pos));
}
return in_alloc_section([&] () -> mutation_fragment_opt {
maybe_refresh_state(ck_range_snapshot, last_row, last_rts);
position_in_partition::less_compare rt_less(_query_schema);
// The while below moves range tombstones from partition versions
// into _rt_stream, just enough to produce the next range tombstone
// The main goal behind moving to _rt_stream is to deoverlap range tombstones
// which have the same starting position. This is not in order to satisfy
// flat_mutation_reader stream requirements, the reader can emit range tombstones
// which have the same position incrementally. This is to guarantee forward
// progress in the case iterators get invalidated and maybe_refresh_state()
// above needs to restore them. It does so using last_rts, which tracks
// the position of the last emitted range tombstone. All range tombstones
// with positions <= last_rts are skipped on refresh. To make progress,
// we need to make sure that all range tombstones with duplicated positions
// are emitted before maybe_refresh_state().
while (has_more_range_tombstones()
&& !rt_less(pos, peek_range_tombstone().position())
&& (_rt_stream.empty() || !rt_less(_rt_stream.peek_next().position(), peek_range_tombstone().position()))) {

@@ -325,7 +325,7 @@ public:
// When throws, the cursor is invalidated and its position is not changed.
bool advance_to(position_in_partition_view lower_bound) {
prepare_heap(lower_bound);
bool found = no_clustering_row_between(_schema, lower_bound, _heap[0].it->position());
bool found = no_clustering_row_between_weak(_schema, lower_bound, _heap[0].it->position());
recreate_current_row();
return found;
}
@@ -411,11 +411,11 @@ public:
} else {
// Copy row from older version because rows in evictable versions must
// hold values which are independently complete to be consistent on eviction.
auto e = current_allocator().construct<rows_entry>(_schema, *_current_row[0].it);
auto e = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(_schema, *_current_row[0].it));
e->set_continuous(latest_i && latest_i->continuous());
_snp.tracker()->insert(*e);
rows.insert_before(latest_i, *e);
return {*e, true};
auto e_i = rows.insert_before(latest_i, std::move(e));
return ensure_result{*e_i, true};
}
}
@@ -447,11 +447,11 @@ public:
}
auto&& rows = _snp.version()->partition().mutable_clustered_rows();
auto latest_i = get_iterator_in_latest_version();
auto e = current_allocator().construct<rows_entry>(_schema, pos, is_dummy(!pos.is_clustering_row()),
is_continuous(latest_i && latest_i->continuous()));
auto e = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(_schema, pos, is_dummy(!pos.is_clustering_row()),
is_continuous(latest_i && latest_i->continuous())));
_snp.tracker()->insert(*e);
rows.insert_before(latest_i, *e);
return ensure_result{*e, true};
auto e_i = rows.insert_before(latest_i, std::move(e));
return ensure_result{*e_i, true};
}
// Brings the entry pointed to by the cursor to the front of the LRU

@@ -575,6 +575,20 @@ bool no_clustering_row_between(const schema& s, position_in_partition_view a, po
}
}
// Returns true if and only if there can't be any clustering_row with position >= a and < b.
// It is assumed that a <= b.
inline
bool no_clustering_row_between_weak(const schema& s, position_in_partition_view a, position_in_partition_view b) {
clustering_key_prefix::equality eq(s);
if (a.has_key() && b.has_key()) {
return eq(a.key(), b.key())
&& (a.get_bound_weight() == bound_weight::after_all_prefixed
|| b.get_bound_weight() != bound_weight::after_all_prefixed);
} else {
return !a.has_key() && !b.has_key();
}
}
// Includes all position_in_partition objects "p" for which: start <= p < end
// And only those.
class position_range {
@@ -659,3 +673,9 @@ inline
bool position_range::is_all_clustered_rows(const schema& s) const {
return _start.is_before_all_clustered_rows(s) && _end.is_after_all_clustered_rows(s);
}
// Assumes that the bounds of `r` are of 'clustered' type
// and that `r` is non-empty (the left bound is smaller than the right bound).
//
// If `r` does not contain any keys, returns nullopt.
std::optional<query::clustering_range> position_range_to_clustering_range(const position_range& r, const schema&);
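
The predicate `no_clustering_row_between_weak` added in this file can be restated over a simplified model. In this sketch a position is a hypothetical `(key, weight)` tuple, with `key` set to `None` for positions that have no key; the constants mirror the three bound weights but are otherwise assumptions of the example.

```python
BEFORE, EQUAL, AFTER = -1, 0, 1  # stand-ins for the three bound weights


def no_clustering_row_between_weak(a, b):
    """True iff no clustering row can have a position p with a <= p < b.

    Assumes a <= b, as the C++ comment does.
    """
    a_key, a_weight = a
    b_key, b_weight = b
    if a_key is not None and b_key is not None:
        # Same key: a row at that key falls inside [a, b) only when a is at or
        # before the row and b is after it.
        return a_key == b_key and (a_weight == AFTER or b_weight != AFTER)
    # Mirrors the C++ else-branch: true only when neither side has a key.
    return a_key is None and b_key is None
```

For example, `[for 5, after 5)` contains the row at key 5, so the predicate is false there, while `[before 5, for 5)` cannot contain any row and yields true.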

@@ -379,3 +379,52 @@ foreign_ptr<lw_shared_ptr<query::result>> result_merger::get() {
}
}
std::optional<query::clustering_range> position_range_to_clustering_range(const position_range& r, const schema& s) {
assert(r.start().get_type() == partition_region::clustered);
assert(r.end().get_type() == partition_region::clustered);
if (r.start().has_key() && r.end().has_key()
&& clustering_key_prefix::equality(s)(r.start().key(), r.end().key())) {
assert(r.start().get_bound_weight() != r.end().get_bound_weight());
if (r.end().get_bound_weight() == bound_weight::after_all_prefixed
&& r.start().get_bound_weight() != bound_weight::after_all_prefixed) {
// [before x, after x) and [for x, after x) get converted to [x, x].
return query::clustering_range::make_singular(r.start().key());
}
// [before x, for x) does not contain any keys.
return std::nullopt;
}
// position_range -> clustering_range
// (recall that position_ranges are always left-closed, right-open):
// [before x, ...), [for x, ...) -> [x, ...
// [after x, ...) -> (x, ...
// [..., before x), [..., for x) -> ..., x)
// [..., after x) -> ..., x]
auto to_bound = [&s] (const position_in_partition& p, bool left) -> std::optional<query::clustering_range::bound> {
if (p.is_before_all_clustered_rows(s)) {
assert(left);
return {};
}
if (p.is_after_all_clustered_rows(s)) {
assert(!left);
return {};
}
assert(p.has_key());
auto bw = p.get_bound_weight();
bool inclusive = left
? bw != bound_weight::after_all_prefixed
: bw == bound_weight::after_all_prefixed;
return query::clustering_range::bound{p.key(), inclusive};
};
return query::clustering_range{to_bound(r.start(), true), to_bound(r.end(), false)};
}
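
The conversion table in the comments of `position_range_to_clustering_range` can be checked against a small model. This is a sketch, not the real types: positions are hypothetical `(key, weight)` tuples (`key` is `None` for before/after all clustered rows), and the result is a tagged tuple instead of `query::clustering_range`.

```python
BEFORE, EQUAL, AFTER = -1, 0, 1  # stand-ins for the three bound weights


def to_bound(pos, left):
    """Map one side of a position range to (key, inclusive), or None if unbounded."""
    key, weight = pos
    if key is None:
        return None
    # Left side: [before x, [for x -> inclusive; [after x -> exclusive.
    # Right side: after x) -> inclusive; before x), for x) -> exclusive.
    inclusive = (weight != AFTER) if left else (weight == AFTER)
    return (key, inclusive)


def position_range_to_clustering_range(start, end):
    """Return None when the range holds no keys, else a tagged tuple."""
    s_key, s_weight = start
    e_key, e_weight = end
    if s_key is not None and e_key is not None and s_key == e_key:
        if e_weight == AFTER and s_weight != AFTER:
            # [before x, after x) and [for x, after x) become the singular [x, x].
            return ("singular", s_key)
        # [before x, for x) contains no keys.
        return None
    return ("range", to_bound(start, True), to_bound(end, False))
```

The model keeps the two special cases for equal keys separate from the general mapping, which is the same split the C++ function makes before calling its `to_bound` lambda.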

@@ -42,28 +42,34 @@ static auto construct_range_tombstone_entry(Args&&... args) {
}
void range_tombstone_list::apply_reversibly(const schema& s,
clustering_key_prefix start, bound_kind start_kind,
clustering_key_prefix end,
clustering_key_prefix start_key, bound_kind start_kind,
clustering_key_prefix end_key,
bound_kind end_kind,
tombstone tomb,
reverter& rev)
{
position_in_partition::less_compare less(s);
position_in_partition start(position_in_partition::range_tag_t(), bound_view(std::move(start_key), start_kind));
position_in_partition end(position_in_partition::range_tag_t(), bound_view(std::move(end_key), end_kind));
if (!less(start, end)) {
return;
}
if (!_tombstones.empty()) {
bound_view::compare less(s);
bound_view start_bound(start, start_kind);
auto last = --_tombstones.end();
range_tombstones_type::iterator it;
if (less(start_bound, last->end_bound())) {
it = _tombstones.upper_bound(start_bound, [less](auto&& sb, auto&& rt) {
return less(sb, rt.end_bound());
if (less(start, last->end_position())) {
it = _tombstones.upper_bound(start, [less](auto&& sb, auto&& rt) {
return less(sb, rt.end_position());
});
} else {
it = _tombstones.end();
}
insert_from(s, std::move(it), std::move(start), start_kind, std::move(end), end_kind, std::move(tomb), rev);
insert_from(s, std::move(it), std::move(start), std::move(end), std::move(tomb), rev);
return;
}
auto rt = construct_range_tombstone_entry(std::move(start), start_kind, std::move(end), end_kind, std::move(tomb));
auto rt = construct_range_tombstone_entry(std::move(start), std::move(end), std::move(tomb));
rev.insert(_tombstones.end(), *rt);
rt.release();
}
@@ -81,35 +87,31 @@ void range_tombstone_list::apply_reversibly(const schema& s,
*/
void range_tombstone_list::insert_from(const schema& s,
range_tombstones_type::iterator it,
clustering_key_prefix start,
bound_kind start_kind,
clustering_key_prefix end,
bound_kind end_kind,
position_in_partition start,
position_in_partition end,
tombstone tomb,
reverter& rev)
{
bound_view::compare less(s);
bound_view end_bound(end, end_kind);
position_in_partition::tri_compare cmp(s);
if (it != _tombstones.begin()) {
auto prev = std::prev(it);
if (prev->tombstone().tomb == tomb && prev->end_bound().adjacent(s, bound_view(start, start_kind))) {
start = prev->tombstone().start;
start_kind = prev->tombstone().start_kind;
if (prev->tombstone().tomb == tomb && cmp(prev->end_position(), start) == 0) {
start = prev->position();
rev.erase(prev);
}
}
while (it != _tombstones.end()) {
bound_view start_bound(start, start_kind);
if (less(end_bound, start_bound)) {
if (cmp(end, start) <= 0) {
return;
}
if (less(end_bound, it->start_bound())) {
if (cmp(end, it->position()) < 0) {
// not overlapping
if (it->tombstone().tomb == tomb && end_bound.adjacent(s, it->start_bound())) {
rev.update(it, {std::move(start), start_kind, it->tombstone().end, it->tombstone().end_kind, tomb});
if (it->tombstone().tomb == tomb && cmp(end, it->position()) == 0) {
rev.update(it, {std::move(start), std::move(start), tomb});
} else {
auto rt = construct_range_tombstone_entry(std::move(start), start_kind, std::move(end), end_kind, tomb);
auto rt = construct_range_tombstone_entry(std::move(start), std::move(end), tomb);
rev.insert(it, *rt);
rt.release();
}
@@ -119,34 +121,29 @@ void range_tombstone_list::insert_from(const schema& s,
auto c = tomb <=> it->tombstone().tomb;
if (c == 0) {
// same timestamp, overlapping or adjacent, so merge.
if (less(it->start_bound(), start_bound)) {
start = it->tombstone().start;
start_kind = it->tombstone().start_kind;
if (cmp(it->position(), start) < 0) {
start = it->position();
}
if (less(end_bound, it->end_bound())) {
end = it->tombstone().end;
end_kind = it->tombstone().end_kind;
end_bound = bound_view(end, end_kind);
if (cmp(end, it->end_position()) < 0) {
end = it->end_position();
}
it = rev.erase(it);
} else if (c > 0) {
// We overwrite the current tombstone.
if (less(it->start_bound(), start_bound)) {
auto new_end = bound_view(start, invert_kind(start_kind));
if (!less(new_end, it->start_bound())) {
// Here it->start < start
auto rt = construct_range_tombstone_entry(it->start_bound(), new_end, it->tombstone().tomb);
rev.update(it, {start_bound, it->end_bound(), it->tombstone().tomb});
if (cmp(it->position(), start) < 0) {
{
auto rt = construct_range_tombstone_entry(it->position(), start, it->tombstone().tomb);
rev.update(it, {start, it->end_position(), it->tombstone().tomb});
rev.insert(it, *rt);
rt.release();
}
}
if (less(end_bound, it->end_bound())) {
if (cmp(end, it->end_position()) < 0) {
// Here start <= it->start and end < it->end.
auto rt = construct_range_tombstone_entry(std::move(start), start_kind, end, end_kind, std::move(tomb));
rev.update(it, {std::move(end), invert_kind(end_kind), it->tombstone().end, it->tombstone().end_kind, it->tombstone().tomb});
auto rt = construct_range_tombstone_entry(std::move(start), end, std::move(tomb));
rev.update(it, {std::move(end), it->end_position(), it->tombstone().tomb});
rev.insert(it, *rt);
rt.release();
return;
@@ -157,30 +154,28 @@ void range_tombstone_list::insert_from(const schema& s,
} else {
// We don't overwrite the current tombstone.
if (less(start_bound, it->start_bound())) {
if (cmp(start, it->position()) < 0) {
// The new tombstone starts before the current one.
if (less(it->start_bound(), end_bound)) {
if (cmp(it->position(), end) < 0) {
// Here start < it->start and it->start < end.
auto new_end_kind = invert_kind(it->tombstone().start_kind);
if (!less(bound_view(it->tombstone().start, new_end_kind), start_bound)) {
auto rt = construct_range_tombstone_entry(std::move(start), start_kind, it->tombstone().start, new_end_kind, tomb);
{
auto rt = construct_range_tombstone_entry(std::move(start), it->position(), tomb);
it = rev.insert(it, *rt);
rt.release();
++it;
}
} else {
// Here start < it->start and end <= it->start, so just insert the new tombstone.
auto rt = construct_range_tombstone_entry(std::move(start), start_kind, std::move(end), end_kind, std::move(tomb));
auto rt = construct_range_tombstone_entry(std::move(start), std::move(end), std::move(tomb));
rev.insert(it, *rt);
rt.release();
return;
}
}
if (less(it->end_bound(), end_bound)) {
if (cmp(it->end_position(), end) < 0) {
// Here the current tombstone overwrites a range of the new one.
start = it->tombstone().end;
start_kind = invert_kind(it->tombstone().end_kind);
start = it->end_position();
++it;
} else {
// Here the current tombstone completely overwrites the new one.
@@ -190,7 +185,7 @@ void range_tombstone_list::insert_from(const schema& s,
}
// If we got here, then just insert the remainder at the end.
auto rt = construct_range_tombstone_entry(std::move(start), start_kind, std::move(end), end_kind, std::move(tomb));
auto rt = construct_range_tombstone_entry(std::move(start), std::move(end), std::move(tomb));
rev.insert(it, *rt);
rt.release();
}

@@ -297,7 +297,13 @@ public:
private:
void apply_reversibly(const schema& s, clustering_key_prefix start, bound_kind start_kind,
clustering_key_prefix end, bound_kind end_kind, tombstone tomb, reverter& rev);
void insert_from(const schema& s, range_tombstones_type::iterator it, clustering_key_prefix start,
bound_kind start_kind, clustering_key_prefix end, bound_kind end_kind, tombstone tomb, reverter& rev);
void insert_from(const schema& s,
range_tombstones_type::iterator it,
position_in_partition start,
position_in_partition end,
tombstone tomb,
reverter& rev);
range_tombstones_type::iterator find(const schema& s, const range_tombstone_entry& rt);
};

@@ -249,6 +249,14 @@ public:
return _base_resources;
}
void release_base_resources() noexcept {
if (_base_resources_consumed) {
_resources -= _base_resources;
_base_resources_consumed = false;
}
_semaphore.signal(std::exchange(_base_resources, {}));
}
sstring description() const {
return format("{}.{}:{}",
_schema ? _schema->ks_name() : "*",
@@ -394,6 +402,10 @@ reader_resources reader_permit::base_resources() const {
return _impl->base_resources();
}
void reader_permit::release_base_resources() noexcept {
return _impl->release_base_resources();
}
sstring reader_permit::description() const {
return _impl->description();
}

@@ -161,6 +161,8 @@ public:
reader_resources base_resources() const;
void release_base_resources() noexcept;
sstring description() const;
db::timeout_clock::time_point timeout() const noexcept;

@@ -407,6 +407,10 @@ public:
{},
mutation_reader::forwarding::no);
} else {
// We can't have two permits with count resource for 1 repair.
// So we release the one on _permit so the only one is the one the
// shard reader will obtain.
_permit.release_base_resources();
_reader = make_multishard_streaming_reader(db, _schema, _permit, [this] {
auto shard_range = _sharder.next();
if (shard_range) {

Submodule seastar updated: a189cdc45d...6217d6ff4e

@@ -635,16 +635,16 @@ void storage_service::bootstrap() {
// Update pending ranges now, so we correctly count ourselves as a pending replica
// when inserting the new CDC generation.
if (!bootstrap_rbno) {
// When is_repair_based_node_ops_enabled is true, the bootstrap node
// will use node_ops_cmd to bootstrap, node_ops_cmd will update the pending ranges.
slogger.debug("bootstrap: update pending ranges: endpoint={} bootstrap_tokens={}", get_broadcast_address(), _bootstrap_tokens);
mutate_token_metadata([this] (mutable_token_metadata_ptr tmptr) {
auto endpoint = get_broadcast_address();
tmptr->add_bootstrap_tokens(_bootstrap_tokens, endpoint);
return update_pending_ranges(std::move(tmptr), format("bootstrapping node {}", endpoint));
}).get();
}
if (!bootstrap_rbno) {
// When is_repair_based_node_ops_enabled is true, the bootstrap node
// will use node_ops_cmd to bootstrap, node_ops_cmd will update the pending ranges.
slogger.debug("bootstrap: update pending ranges: endpoint={} bootstrap_tokens={}", get_broadcast_address(), _bootstrap_tokens);
mutate_token_metadata([this] (mutable_token_metadata_ptr tmptr) {
auto endpoint = get_broadcast_address();
tmptr->add_bootstrap_tokens(_bootstrap_tokens, endpoint);
return update_pending_ranges(std::move(tmptr), format("bootstrapping node {}", endpoint));
}).get();
}
// After we pick a generation timestamp, we start gossiping it, and we stick with it.
// We don't do any other generation switches (unless we crash before completing bootstrap).
@@ -652,19 +652,23 @@ void storage_service::bootstrap() {
_cdc_gen_id = _cdc_gen_service.local().make_new_generation(_bootstrap_tokens, !is_first_node()).get0();
if (!bootstrap_rbno) {
// When is_repair_based_node_ops_enabled is true, the bootstrap node
// will use node_ops_cmd to bootstrap, bootstrapping gossip status is not needed for bootstrap.
_gossiper.add_local_application_state({
// Order is important: both the CDC streams timestamp and tokens must be known when a node handles our status.
{ gms::application_state::TOKENS, versioned_value::tokens(_bootstrap_tokens) },
{ gms::application_state::CDC_GENERATION_ID, versioned_value::cdc_generation_id(_cdc_gen_id) },
{ gms::application_state::STATUS, versioned_value::bootstrapping(_bootstrap_tokens) },
}).get();
if (!bootstrap_rbno) {
// When is_repair_based_node_ops_enabled is true, the bootstrap node
// will use node_ops_cmd to bootstrap, bootstrapping gossip status is not needed for bootstrap.
_gossiper.add_local_application_state({
{ gms::application_state::TOKENS, versioned_value::tokens(_bootstrap_tokens) },
{ gms::application_state::CDC_GENERATION_ID, versioned_value::cdc_generation_id(_cdc_gen_id) },
{ gms::application_state::STATUS, versioned_value::bootstrapping(_bootstrap_tokens) },
}).get();
set_mode(mode::JOINING, format("sleeping {} ms for pending range setup", get_ring_delay().count()), true);
_gossiper.wait_for_range_setup().get();
}
set_mode(mode::JOINING, format("sleeping {} ms for pending range setup", get_ring_delay().count()), true);
_gossiper.wait_for_range_setup().get();
} else {
// Even with RBNO bootstrap we need to announce the new CDC generation immediately after it's created.
_gossiper.add_local_application_state({
{ gms::application_state::CDC_GENERATION_ID, versioned_value::cdc_generation_id(_cdc_gen_id) },
}).get();
}
} else {
// Wait until we know tokens of existing node before announcing replacing status.
set_mode(mode::JOINING, fmt::format("Wait until local node knows tokens of peer nodes"), true);
@@ -3670,7 +3674,7 @@ shared_ptr<abort_source> node_ops_meta_data::get_abort_source() {
void storage_service::node_ops_update_heartbeat(utils::UUID ops_uuid) {
slogger.debug("node_ops_update_heartbeat: ops_uuid={}", ops_uuid);
auto permit = seastar::get_units(_node_ops_abort_sem, 1);
auto permit = seastar::get_units(_node_ops_abort_sem, 1).get0();
auto it = _node_ops.find(ops_uuid);
if (it != _node_ops.end()) {
node_ops_meta_data& meta = it->second;
@@ -3680,7 +3684,7 @@ void storage_service::node_ops_update_heartbeat(utils::UUID ops_uuid) {
void storage_service::node_ops_done(utils::UUID ops_uuid) {
slogger.debug("node_ops_done: ops_uuid={}", ops_uuid);
auto permit = seastar::get_units(_node_ops_abort_sem, 1);
auto permit = seastar::get_units(_node_ops_abort_sem, 1).get0();
auto it = _node_ops.find(ops_uuid);
if (it != _node_ops.end()) {
node_ops_meta_data& meta = it->second;
@@ -3691,7 +3695,7 @@ void storage_service::node_ops_done(utils::UUID ops_uuid) {
void storage_service::node_ops_abort(utils::UUID ops_uuid) {
slogger.debug("node_ops_abort: ops_uuid={}", ops_uuid);
auto permit = seastar::get_units(_node_ops_abort_sem, 1);
auto permit = seastar::get_units(_node_ops_abort_sem, 1).get0();
auto it = _node_ops.find(ops_uuid);
if (it != _node_ops.end()) {
node_ops_meta_data& meta = it->second;

@@ -49,12 +49,13 @@ private:
public:
partition_index_cache* _parent;
key_type _key;
std::variant<shared_promise<>, partition_index_page> _page;
std::variant<lw_shared_ptr<shared_promise<>>, partition_index_page> _page;
size_t _size_in_allocator = 0;
public:
entry(partition_index_cache* parent, key_type key)
: _parent(parent)
, _key(key)
, _page(make_lw_shared<shared_promise<>>())
{ }
void set_page(partition_index_page&& page) noexcept {
@@ -67,7 +68,12 @@ private:
entry(entry&&) noexcept = default;
~entry() {
assert(!is_referenced());
if (is_referenced()) {
// Live entry_ptr should keep the entry alive, except when the entry failed on loading.
// In that case, entry_ptr holders are not supposed to use the pointer, so it's safe
// to nullify those entry_ptrs.
assert(!ready());
}
}
void on_evicted() noexcept override;
@@ -76,7 +82,7 @@ private:
// Always returns the same value for a given state of _page.
size_t size_in_allocator() const { return _size_in_allocator; }
shared_promise<>& promise() { return std::get<shared_promise<>>(_page); }
lw_shared_ptr<shared_promise<>> promise() { return std::get<lw_shared_ptr<shared_promise<>>>(_page); }
bool ready() const { return std::holds_alternative<partition_index_page>(_page); }
partition_index_page& page() { return std::get<partition_index_page>(_page); }
const partition_index_page& page() const { return std::get<partition_index_page>(_page); }
@@ -207,9 +213,7 @@ public:
return make_ready_future<entry_ptr>(std::move(ptr));
} else {
++_shard_stats.blocks;
return _as(_region, [ptr] () mutable {
return ptr.get_entry().promise().get_shared_future();
}).then([ptr] () mutable {
return ptr.get_entry().promise()->get_shared_future().then([ptr] () mutable {
return std::move(ptr);
});
}
@@ -234,23 +238,23 @@ public:
// No exceptions before then_wrapped() is installed so that ptr will be eventually populated.
return futurize_invoke(loader, key).then_wrapped([this, key, ptr] (auto&& f) mutable {
return futurize_invoke(loader, key).then_wrapped([this, key, ptr = std::move(ptr)] (auto&& f) mutable {
entry& e = ptr.get_entry();
try {
partition_index_page&& page = f.get0();
e.promise().set_value();
e.promise()->set_value();
e.set_page(std::move(page));
_shard_stats.used_bytes += e.size_in_allocator();
++_shard_stats.populations;
return ptr;
} catch (...) {
e.promise().set_exception(std::current_exception());
e.promise()->set_exception(std::current_exception());
ptr = {};
with_allocator(_region.allocator(), [&] {
_cache.erase(key);
});
throw;
}
}).then([ptr] {
return ptr;
});
}

@@ -400,10 +400,15 @@ void time_series_sstable_set::for_each_sstable(std::function<void(const shared_s
// O(log n)
void time_series_sstable_set::insert(shared_sstable sst) {
try {
auto min_pos = sst->min_position();
auto max_pos_reversed = sst->max_position().reversed();
_sstables->emplace(std::move(min_pos), sst);
_sstables_reversed->emplace(std::move(max_pos_reversed), std::move(sst));
} catch (...) {
erase(sst);
throw;
}
}
// O(n) worst case, but should be close to O(log n) most of the time
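
The `try`/`catch` added to `time_series_sstable_set::insert` keeps two indexes consistent when the second insertion fails. A minimal sketch of that pattern follows; `TwoIndexSet`, its dictionaries keyed by name, and the `fail_on_second` switch are all hypothetical and only stand in for the two position-keyed containers.

```python
class TwoIndexSet:
    """Hypothetical container that must keep two indexes in step."""

    def __init__(self, fail_on_second=False):
        self.by_min = {}
        self.by_max = {}
        self.fail_on_second = fail_on_second  # simulates a failing allocation

    def erase(self, name):
        # Safe to call even if only one index holds the entry.
        self.by_min.pop(name, None)
        self.by_max.pop(name, None)

    def insert(self, name, min_pos, max_pos):
        try:
            self.by_min[name] = min_pos
            if self.fail_on_second:
                raise MemoryError("simulated allocation failure")
            self.by_max[name] = max_pos
        except Exception:
            # Undo the partial insert so neither index keeps a stray entry.
            self.erase(name)
            raise
```

Without the rollback, a failure between the two insertions would leave the entry visible through one index but not the other.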

@@ -1493,13 +1493,14 @@ bool table::can_flush() const {
}
future<> table::clear() {
auto permits = co_await _config.dirty_memory_manager->get_all_flush_permits();
if (_commitlog) {
for (auto& t : *_memtables) {
_commitlog->discard_completed_segments(_schema->id(), t->get_and_discard_rp_set());
}
}
_memtables->clear_and_add();
return _cache.invalidate(row_cache::external_updater([] { /* There is no underlying mutation source */ }));
co_await _cache.invalidate(row_cache::external_updater([] { /* There is no underlying mutation source */ }));
}
// NOTE: does not need to be futurized, but might eventually, depending on

test.py

@@ -291,6 +291,8 @@ class Test:
def print_summary(self):
pass
def get_junit_etree(self):
return None
def check_log(self, trim):
"""Check and trim logs and xml output for tests which have it"""
@@ -338,9 +340,36 @@ class BoostTest(UnitTest):
boost_args += ['--color_output=false']
boost_args += ['--']
self.args = boost_args + self.args
self.casename = casename
self.__junit_etree = None
def get_junit_etree(self):
def adjust_suite_name(name):
# Normalize "path/to/file.cc" to "path.to.file" to conform to
# Jenkins expectations that the suite name is a class name. ".cc"
# doesn't add any information. Add the mode, otherwise failures
# in different modes are indistinguishable. The "test/" prefix adds
# no information, so remove it.
import re
name = re.sub(r'^test/', '', name)
name = re.sub(r'\.cc$', '', name)
name = re.sub(r'/', '.', name)
name = f'{name}.{self.mode}'
return name
if self.__junit_etree is None:
self.__junit_etree = ET.parse(self.xmlout)
root = self.__junit_etree.getroot()
suites = root.findall('.//TestSuite')
for suite in suites:
suite.attrib['name'] = adjust_suite_name(suite.attrib['name'])
skipped = suite.findall('./TestCase[@reason="disabled"]')
for e in skipped:
suite.remove(e)
os.unlink(self.xmlout)
return self.__junit_etree
def check_log(self, trim):
ET.parse(self.xmlout)
self.get_junit_etree()
super().check_log(trim)
@@ -800,6 +829,17 @@ def write_junit_report(tmpdir, mode):
with open(junit_filename, "w") as f:
ET.ElementTree(xml_results).write(f, encoding="unicode")
def write_consolidated_boost_junit_xml(tmpdir, mode):
xml = ET.Element("TestLog")
for suite in TestSuite.suites.values():
for test in suite.tests:
if test.mode != mode:
continue
test_xml = test.get_junit_etree()
if test_xml is not None:
xml.extend(test_xml.getroot().findall('.//TestSuite'))
et = ET.ElementTree(xml)
et.write(f'{tmpdir}/{mode}/xml/boost.xunit.xml', encoding='unicode')
def open_log(tmpdir):
pathlib.Path(tmpdir).mkdir(parents=True, exist_ok=True)
@@ -839,6 +879,7 @@ async def main():
for mode in options.modes:
write_junit_report(options.tmpdir, mode)
write_consolidated_boost_junit_xml(options.tmpdir, mode)
if 'coverage' in options.modes:
coverage.generate_coverage_report("build/coverage", "tests")
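
The suite-name normalization done by `adjust_suite_name` in the `BoostTest` hunk can be tried in isolation. The sketch below lifts it into a free function that takes the mode as a parameter (in the diff it reads `self.mode`); the sample path in the usage line is a made-up example.

```python
import re


def adjust_suite_name(name, mode):
    """Turn "test/path/to/file.cc" into "path.to.file.<mode>"."""
    name = re.sub(r'^test/', '', name)   # the "test/" prefix carries no information
    name = re.sub(r'\.cc$', '', name)    # neither does the ".cc" suffix
    name = re.sub(r'/', '.', name)       # make it look like a class name
    return f'{name}.{mode}'              # keep failures in different modes apart


suite = adjust_suite_name('test/boost/example_test.cc', 'dev')
```

Appending the mode is what lets the consolidated `boost.xunit.xml` report distinguish the same test file run under different build modes.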

@@ -374,6 +374,14 @@ def test_getitem_attributes_to_get_duplicate(dynamodb, test_table):
with pytest.raises(ClientError, match='ValidationException.*Duplicate'):
test_table.get_item(Key={'p': p, 'c': c}, AttributesToGet=['a', 'a'], ConsistentRead=True)
# Verify that it is forbidden to ask for an empty AttributesToGet
# Reproduces issue #10332.
def test_getitem_attributes_to_get_empty(dynamodb, test_table):
p = random_string()
c = random_string()
with pytest.raises(ClientError, match='ValidationException'):
test_table.get_item(Key={'p': p, 'c': c}, AttributesToGet=[], ConsistentRead=True)
# Basic test for DeleteItem, with hash key only
def test_delete_item_hash(test_table_s):
p = random_string()

@@ -170,6 +170,13 @@ def test_query_attributes_to_get(dynamodb, test_table):
expected_items = [{k: x[k] for k in wanted if k in x} for x in items]
assert multiset(expected_items) == multiset(got_items)
# Verify that it is forbidden to ask for an empty AttributesToGet
# Reproduces issue #10332.
def test_query_attributes_to_get_empty(dynamodb, test_table):
p = random_string()
with pytest.raises(ClientError, match='ValidationException'):
full_query(test_table, KeyConditions={'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}}, AttributesToGet=[])
# Test that in a table with both hash key and sort key, which keys we can
# Query by: We can Query by the hash key, by a combination of both hash and
# sort keys, but *cannot* query by just the sort key, and obviously not

@@ -16,6 +16,9 @@
# along with Scylla. If not, see <http://www.gnu.org/licenses/>.
# Tests for basic table operations: CreateTable, DeleteTable, ListTables.
# Also some basic tests for UpdateTable - although UpdateTable usually
# enables more elaborate features (such as GSI or Streams) and those are
# tested elsewhere.
import pytest
from botocore.exceptions import ClientError
@@ -311,3 +314,17 @@ def test_table_sse_off(dynamodb):
KeySchema=[{ 'AttributeName': 'p', 'KeyType': 'HASH' }],
AttributeDefinitions=[{ 'AttributeName': 'p', 'AttributeType': 'S' }]);
table.delete();
# Test that trying to delete a table that doesn't exist fails in the
# appropriate way (ResourceNotFoundException)
def test_delete_table_non_existent(dynamodb, test_table):
client = dynamodb.meta.client
with pytest.raises(ClientError, match='ResourceNotFoundException'):
client.delete_table(TableName=random_string(20))
# Test that trying to update a table that doesn't exist fails in the
# appropriate way (ResourceNotFoundException)
def test_update_table_non_existent(dynamodb, test_table):
client = dynamodb.meta.client
with pytest.raises(ClientError, match='ResourceNotFoundException'):
client.update_table(TableName=random_string(20), BillingMode='PAY_PER_REQUEST')


@@ -1043,6 +1043,20 @@ def test_nested_attribute_remove_from_missing_item(test_table_s):
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE x.y')
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE x[0]')
# In an earlier test (test_nested_attribute_update_bad_path_dot) we showed
# that DynamoDB rejects REMOVE x.y with a ValidationException if attribute
# x doesn't exist. However, if x *does* exist but y doesn't, the removal is
# allowed and should just be silently ignored.
def test_nested_attribute_remove_missing_leaf(test_table_s):
p = random_string()
item = {'p': p, 'a': {'x': 3}, 'b': ['hi']}
test_table_s.put_item(Item=item)
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE a.y')
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE b[7]')
test_table_s.update_item(Key={'p': p}, UpdateExpression='REMOVE c')
# The above UpdateItem calls didn't change anything...
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item'] == item
# Similarly for other types of bad paths - using [0] on something which
# doesn't exist or isn't an array.
def test_nested_attribute_update_bad_path_array(test_table_s):


@@ -19,6 +19,7 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <boost/range/irange.hpp>
#include <seastar/testing/test_case.hh>
#include <seastar/testing/thread_test_case.hh>
#include <seastar/core/iostream.hh>
@@ -49,6 +50,15 @@ static sstring read_to_string(cached_file::stream& s, size_t limit = std::numeri
return b.substr(0, limit);
}
static void read_to_void(cached_file::stream& s, size_t limit = std::numeric_limits<size_t>::max()) {
while (auto buf = s.next().get0()) {
if (buf.size() >= limit) {
break;
}
limit -= buf.size();
}
}
static sstring read_to_string(file& f, size_t start, size_t len) {
file_input_stream_options opt;
auto in = make_file_input_stream(f, start, len, opt);
@@ -61,6 +71,12 @@ static sstring read_to_string(cached_file& cf, size_t off, size_t limit = std::n
return read_to_string(s, limit);
}
[[gnu::unused]]
static void read_to_void(cached_file& cf, size_t off, size_t limit = std::numeric_limits<size_t>::max()) {
auto s = cf.read(off, default_priority_class(), std::nullopt);
read_to_void(s, limit);
}
struct test_file {
tmpdir dir;
file f;
@@ -204,7 +220,9 @@ SEASTAR_THREAD_TEST_CASE(test_eviction_via_lru) {
}
{
cf_lru.evict_all();
with_allocator(region.allocator(), [] {
cf_lru.evict_all();
});
BOOST_REQUIRE_EQUAL(0, metrics.cached_bytes); // change here
BOOST_REQUIRE_EQUAL(0, cf.cached_bytes()); // change here
@@ -212,6 +230,8 @@ SEASTAR_THREAD_TEST_CASE(test_eviction_via_lru) {
BOOST_REQUIRE_EQUAL(3, metrics.page_evictions); // change here
BOOST_REQUIRE_EQUAL(0, metrics.page_hits);
BOOST_REQUIRE_EQUAL(3, metrics.page_populations);
BOOST_REQUIRE_EQUAL(region.occupancy().used_space(), 0);
}
{
@@ -255,6 +275,88 @@ SEASTAR_THREAD_TEST_CASE(test_eviction_via_lru) {
}
}
// A file which serves garbage but is very fast.
class garbage_file_impl : public file_impl {
private:
[[noreturn]] void unsupported() {
throw_with_backtrace<std::logic_error>("unsupported operation");
}
public:
// unsupported
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override { unsupported(); }
virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override { unsupported(); }
virtual future<> flush(void) override { unsupported(); }
virtual future<> truncate(uint64_t length) override { unsupported(); }
virtual future<> discard(uint64_t offset, uint64_t length) override { unsupported(); }
virtual future<> allocate(uint64_t position, uint64_t length) override { unsupported(); }
virtual subscription<directory_entry> list_directory(std::function<future<>(directory_entry)>) override { unsupported(); }
virtual future<struct stat> stat(void) override { unsupported(); }
virtual future<uint64_t> size(void) override { unsupported(); }
virtual std::unique_ptr<seastar::file_handle_impl> dup() override { unsupported(); }
virtual future<> close() override { return make_ready_future<>(); }
virtual future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t size, const io_priority_class& pc) override {
return make_ready_future<temporary_buffer<uint8_t>>(temporary_buffer<uint8_t>(size));
}
virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override {
unsupported(); // FIXME
}
virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
unsupported(); // FIXME
}
};
#ifndef SEASTAR_DEFAULT_ALLOCATOR // Eviction works only with the seastar allocator
SEASTAR_THREAD_TEST_CASE(test_stress_eviction) {
auto page_size = cached_file::page_size;
auto n_pages = 8'000'000 / page_size;
auto file_size = page_size * n_pages;
auto cached_size = 4'000'000;
cached_file::metrics metrics;
logalloc::region region;
auto f = file(make_shared<garbage_file_impl>());
cached_file cf(f, metrics, cf_lru, region, file_size);
region.make_evictable([&] {
testlog.trace("Evicting");
cf.invalidate_at_most_front(file_size / 2);
return cf_lru.evict();
});
for (int i = 0; i < (cached_size / page_size); ++i) {
read_to_string(cf, page_size * i, page_size);
}
testlog.debug("Saturating memory...");
// Disable background reclaiming, which would otherwise prevent bugs from reproducing.
// We want reclamation to happen synchronously with page cache population in read_to_void().
seastar::memory::set_min_free_pages(0);
// Saturate std memory
chunked_fifo<bytes> blobs;
auto rc = region.reclaim_counter();
while (region.reclaim_counter() == rc) {
blobs.emplace_back(bytes(bytes::initialized_later(), 1024));
}
testlog.debug("Memory: allocated={}, free={}", seastar::memory::stats().allocated_memory(), seastar::memory::stats().free_memory());
testlog.debug("Starting test...");
for (int j = 0; j < n_pages * 16; ++j) {
testlog.trace("Allocating");
auto stride = tests::random::get_int(1, 20);
auto page_idx = tests::random::get_int(n_pages - stride);
read_to_void(cf, page_idx * page_size, page_size * stride);
}
}
#endif
SEASTAR_THREAD_TEST_CASE(test_invalidation) {
auto page_size = cached_file::page_size;
test_file tf = make_test_file(page_size * 2);


@@ -25,6 +25,8 @@
#include <deque>
#include <random>
#include "utils/lsa/chunked_managed_vector.hh"
#include "utils/managed_ref.hh"
#include "test/lib/log.hh"
#include <boost/range/algorithm/sort.hpp>
#include <boost/range/algorithm/equal.hpp>
@@ -216,3 +218,106 @@ SEASTAR_TEST_CASE(tests_reserve_partial) {
});
return make_ready_future<>();
}
SEASTAR_TEST_CASE(test_clear_and_release) {
region region;
allocating_section as;
with_allocator(region.allocator(), [&] {
lsa::chunked_managed_vector<managed_ref<uint64_t>> v;
for (uint64_t i = 1; i < 4000; ++i) {
as(region, [&] {
v.emplace_back(make_managed<uint64_t>(i));
});
}
v.clear_and_release();
});
return make_ready_future<>();
}
SEASTAR_TEST_CASE(test_chunk_reserve) {
region region;
allocating_section as;
for (auto conf :
{ // std::make_pair(reserve size, push count)
std::make_pair(0, 4000),
std::make_pair(100, 4000),
std::make_pair(200, 4000),
std::make_pair(1000, 4000),
std::make_pair(2000, 4000),
std::make_pair(3000, 4000),
std::make_pair(5000, 4000),
std::make_pair(500, 8000),
std::make_pair(1000, 8000),
std::make_pair(2000, 8000),
std::make_pair(8000, 500),
})
{
with_allocator(region.allocator(), [&] {
auto [reserve_size, push_count] = conf;
testlog.info("Testing reserve({}), {}x emplace_back()", reserve_size, push_count);
lsa::chunked_managed_vector<managed_ref<uint64_t>> v;
v.reserve(reserve_size);
uint64_t seed = rand();
for (uint64_t i = 0; i < push_count; ++i) {
as(region, [&] {
v.emplace_back(make_managed<uint64_t>(seed + i));
BOOST_REQUIRE(**v.begin() == seed);
});
}
auto v_it = v.begin();
for (uint64_t i = 0; i < push_count; ++i) {
BOOST_REQUIRE(**v_it++ == seed + i);
}
v.clear_and_release();
});
}
return make_ready_future<>();
}
// Tests the case of make_room() invoked with last_chunk_capacity_deficit but _size not in
// the last reserved chunk.
SEASTAR_TEST_CASE(test_shrinking_and_expansion_involving_chunk_boundary) {
region region;
allocating_section as;
with_allocator(region.allocator(), [&] {
lsa::chunked_managed_vector<managed_ref<uint64_t>> v;
// Fill two chunks
v.reserve(2000);
for (uint64_t i = 0; i < 2000; ++i) {
as(region, [&] {
v.emplace_back(make_managed<uint64_t>(i));
});
}
// Make the last chunk smaller than max size to trigger the last_chunk_capacity_deficit path in make_room()
v.shrink_to_fit();
// Leave the last chunk reserved but empty
for (uint64_t i = 0; i < 1000; ++i) {
v.pop_back();
}
// Try to reserve more than the currently reserved capacity and trigger last_chunk_capacity_deficit path
// with _size not in the last chunk. Should not sigsegv.
v.reserve(8000);
for (uint64_t i = 0; i < 2000; ++i) {
as(region, [&] {
v.emplace_back(make_managed<uint64_t>(i));
});
}
v.clear_and_release();
});
return make_ready_future<>();
}


@@ -191,3 +191,32 @@ BOOST_AUTO_TEST_CASE(tests_reserve_partial) {
BOOST_REQUIRE_EQUAL(v.capacity(), orig_size);
}
}
// Tests the case of make_room() invoked with last_chunk_capacity_deficit but _size not in
// the last reserved chunk.
BOOST_AUTO_TEST_CASE(test_shrinking_and_expansion_involving_chunk_boundary) {
using vector_type = utils::chunked_vector<std::unique_ptr<uint64_t>>;
vector_type v;
// Fill two chunks
v.reserve(vector_type::max_chunk_capacity() * 3 / 2);
for (uint64_t i = 0; i < vector_type::max_chunk_capacity() * 3 / 2; ++i) {
v.emplace_back(std::make_unique<uint64_t>(i));
}
// Make the last chunk smaller than max size to trigger the last_chunk_capacity_deficit path in make_room()
v.shrink_to_fit();
// Leave the last chunk reserved but empty
for (uint64_t i = 0; i < vector_type::max_chunk_capacity(); ++i) {
v.pop_back();
}
// Try to reserve more than the currently reserved capacity and trigger last_chunk_capacity_deficit path
// with _size not in the last chunk. Should not sigsegv.
v.reserve(vector_type::max_chunk_capacity() * 4);
for (uint64_t i = 0; i < vector_type::max_chunk_capacity() * 2; ++i) {
v.emplace_back(std::make_unique<uint64_t>(i));
}
}


@@ -44,7 +44,9 @@
#include "test/lib/tmpdir.hh"
#include "db/commitlog/commitlog.hh"
#include "db/commitlog/commitlog_replayer.hh"
#include "db/commitlog/commitlog_extensions.hh"
#include "db/commitlog/rp_set.hh"
#include "db/extensions.hh"
#include "log.hh"
#include "service/priority_manager.hh"
#include "test/lib/exception_utils.hh"
@@ -947,3 +949,113 @@ SEASTAR_TEST_CASE(test_commitlog_deadlock_with_flush_threshold) {
co_await log.clear();
}
}
static future<> do_test_exception_in_allocate_ex(bool do_file_delete, bool reuse = true) {
commitlog::config cfg;
constexpr auto max_size_mb = 1;
cfg.commitlog_segment_size_in_mb = max_size_mb;
cfg.commitlog_total_space_in_mb = 2 * max_size_mb * smp::count;
cfg.commitlog_sync_period_in_ms = 10;
cfg.reuse_segments = reuse;
cfg.allow_going_over_size_limit = false; // #9348 - now can enforce size limit always
cfg.use_o_dsync = true; // make sure we pre-allocate.
// not using cl_test, because we need to be able to abandon
// the log.
tmpdir tmp;
cfg.commit_log_location = tmp.path().string();
class myfail : public std::exception {
public:
using std::exception::exception;
};
struct myext: public db::commitlog_file_extension {
public:
bool fail = false;
bool thrown = false;
bool do_file_delete;
myext(bool dd)
: do_file_delete(dd)
{}
seastar::future<seastar::file> wrap_file(const seastar::sstring& filename, seastar::file f, seastar::open_flags flags) override {
if (fail && !thrown) {
thrown = true;
if (do_file_delete) {
co_await f.close();
co_await seastar::remove_file(filename);
}
throw myfail{};
}
co_return f;
}
seastar::future<> before_delete(const seastar::sstring&) override {
co_return;
}
};
auto ep = std::make_unique<myext>(do_file_delete);
auto& mx = *ep;
db::extensions myexts;
myexts.add_commitlog_file_extension("hufflepuff", std::move(ep));
cfg.extensions = &myexts;
auto log = co_await commitlog::create_commitlog(cfg);
rp_set rps;
// uncomment for verbosity
// logging::logger_registry().set_logger_level("commitlog", logging::log_level::debug);
auto uuid = utils::UUID_gen::get_time_UUID();
auto size = log.max_record_size();
auto r = log.add_flush_handler([&](cf_id_type id, replay_position pos) {
log.discard_completed_segments(id, rps);
mx.fail = true;
});
try {
while (!mx.thrown) {
rp_handle h = co_await log.add_mutation(uuid, size, db::commitlog::force_sync::no, [&](db::commitlog::output& dst) {
dst.fill('1', size);
});
rps.put(std::move(h));
}
} catch (...) {
BOOST_FAIL("log write timed out. maybe it is deadlocked... Will not free log. ASAN errors and leaks will follow...");
}
co_await log.shutdown();
co_await log.clear();
}
/**
* Test generating an exception in segment file allocation
*/
SEASTAR_TEST_CASE(test_commitlog_exceptions_in_allocate_ex) {
co_await do_test_exception_in_allocate_ex(false);
}
SEASTAR_TEST_CASE(test_commitlog_exceptions_in_allocate_ex_no_recycle) {
co_await do_test_exception_in_allocate_ex(false, false);
}
/**
* Test generating an exception in segment file allocation, but also
* delete the file, which in turn should cause follow-up exceptions
* in cleanup delete, which the commitlog should handle.
*/
SEASTAR_TEST_CASE(test_commitlog_exceptions_in_allocate_ex_deleted_file) {
co_await do_test_exception_in_allocate_ex(true, false);
}
SEASTAR_TEST_CASE(test_commitlog_exceptions_in_allocate_ex_deleted_file_no_recycle) {
co_await do_test_exception_in_allocate_ex(true);
}


@@ -784,3 +784,38 @@ SEASTAR_TEST_CASE(upgrade_sstables) {
}).get();
});
}
SEASTAR_TEST_CASE(database_drop_column_family_clears_querier_cache) {
return do_with_cql_env_thread([] (cql_test_env& e) {
e.execute_cql("create table ks.cf (k text, v int, primary key (k));").get();
auto& db = e.local_db();
const auto ts = db_clock::now();
auto& tbl = db.find_column_family("ks", "cf");
auto op = std::optional(tbl.read_in_progress());
auto s = tbl.schema();
auto q = query::data_querier(
tbl.as_mutation_source(),
tbl.schema(),
database_test(db).get_user_read_concurrency_semaphore().make_tracking_only_permit(s.get(), "test", db::no_timeout),
query::full_partition_range,
s->full_slice(),
default_priority_class(),
nullptr);
auto f = e.db().invoke_on_all([ts] (database& db) {
return db.drop_column_family("ks", "cf", [ts] { return make_ready_future<db_clock::time_point>(ts); });
});
// we add a querier to the querier cache while the drop is ongoing
auto& qc = db.get_querier_cache();
qc.insert(utils::make_random_uuid(), std::move(q), nullptr);
BOOST_REQUIRE_EQUAL(qc.get_stats().population, 1);
op.reset(); // this should allow the drop to finish
f.get();
// the drop should have cleaned up all entries belonging to that table
BOOST_REQUIRE_EQUAL(qc.get_stats().population, 0);
});
}


@@ -22,6 +22,8 @@
#include <seastar/testing/test_case.hh>
#include "test/lib/cql_test_env.hh"
#include "test/lib/cql_assertions.hh"
#include "cql3/untyped_result_set.hh"
#include "cql3/query_processor.hh"
#include "transport/messages/result_message.hh"
SEASTAR_TEST_CASE(test_index_with_paging) {
@@ -56,3 +58,51 @@ SEASTAR_TEST_CASE(test_index_with_paging) {
});
});
}
SEASTAR_TEST_CASE(test_index_with_paging_with_base_short_read) {
return do_with_cql_env_thread([] (auto& e) {
e.execute_cql("CREATE TABLE tab (pk int, ck text, v int, v2 int, v3 text, PRIMARY KEY (pk, ck))").get();
e.execute_cql("CREATE INDEX ON tab (v)").get();
// Enough to trigger a short read on the base table during scan
sstring big_string(2 * query::result_memory_limiter::maximum_result_size, 'j');
const int row_count = 67;
for (int i = 0; i < row_count; ++i) {
e.execute_cql(format("INSERT INTO tab (pk, ck, v, v2, v3) VALUES ({}, 'hello{}', 1, {}, '{}')", i % 3, i, i, big_string)).get();
}
eventually([&] {
uint64_t count = 0;
e.qp().local().query_internal("SELECT * FROM ks.tab WHERE v = 1", [&] (const cql3::untyped_result_set_row&) {
++count;
return make_ready_future<stop_iteration>(stop_iteration::no);
}).get();
BOOST_REQUIRE_EQUAL(count, row_count);
});
});
}
SEASTAR_TEST_CASE(test_index_with_paging_with_base_short_read_no_ck) {
return do_with_cql_env_thread([] (auto& e) {
e.execute_cql("CREATE TABLE tab (pk int, v int, v2 int, v3 text, PRIMARY KEY (pk))").get();
e.execute_cql("CREATE INDEX ON tab (v)").get();
// Enough to trigger a short read on the base table during scan
sstring big_string(2 * query::result_memory_limiter::maximum_result_size, 'j');
const int row_count = 67;
for (int i = 0; i < row_count; ++i) {
e.execute_cql(format("INSERT INTO tab (pk, v, v2, v3) VALUES ({}, 1, {}, '{}')", i, i, big_string)).get();
}
eventually([&] {
uint64_t count = 0;
e.qp().local().query_internal("SELECT * FROM ks.tab WHERE v = 1", [&] (const cql3::untyped_result_set_row&) {
++count;
return make_ready_future<stop_iteration>(stop_iteration::no);
}).get();
BOOST_REQUIRE_EQUAL(count, row_count);
});
});
}


@@ -391,3 +391,87 @@ SEASTAR_TEST_CASE(test_loading_cache_reload_during_eviction) {
BOOST_REQUIRE_EQUAL(loading_cache.size(), 1);
});
}
SEASTAR_THREAD_TEST_CASE(test_loading_cache_remove_leaves_no_old_entries_behind) {
using namespace std::chrono;
load_count = 0;
auto load_v1 = [] (auto key) { return make_ready_future<sstring>("v1"); };
auto load_v2 = [] (auto key) { return make_ready_future<sstring>("v2"); };
auto load_v3 = [] (auto key) { return make_ready_future<sstring>("v3"); };
{
utils::loading_cache<int, sstring> loading_cache(num_loaders, 100s, testlog);
auto stop_cache_reload = seastar::defer([&loading_cache] { loading_cache.stop().get(); });
//
// Test remove() concurrent with loading
//
auto f = loading_cache.get_ptr(0, [&](auto key) {
return later().then([&] {
return load_v1(key);
});
});
loading_cache.remove(0);
BOOST_REQUIRE_EQUAL(loading_cache.find(0), nullptr);
BOOST_REQUIRE_EQUAL(loading_cache.size(), 0);
auto ptr1 = f.get0();
BOOST_REQUIRE_EQUAL(*ptr1, "v1");
BOOST_REQUIRE_EQUAL(loading_cache.find(0), nullptr);
BOOST_REQUIRE_EQUAL(loading_cache.size(), 0);
ptr1 = loading_cache.get_ptr(0, load_v2).get0();
loading_cache.remove(0);
BOOST_REQUIRE_EQUAL(*ptr1, "v2");
//
// Test that live ptr1, removed from cache, does not prevent reload of new value
//
auto ptr2 = loading_cache.get_ptr(0, load_v3).get0();
ptr1 = nullptr;
BOOST_REQUIRE_EQUAL(*ptr2, "v3");
}
// Test remove_if()
{
utils::loading_cache<int, sstring> loading_cache(num_loaders, 100s, testlog);
auto stop_cache_reload = seastar::defer([&loading_cache] { loading_cache.stop().get(); });
//
// Test remove_if() concurrent with loading
//
auto f = loading_cache.get_ptr(0, [&](auto key) {
return later().then([&] {
return load_v1(key);
});
});
loading_cache.remove_if([] (auto&& v) { return v == "v1"; });
BOOST_REQUIRE_EQUAL(loading_cache.find(0), nullptr);
BOOST_REQUIRE_EQUAL(loading_cache.size(), 0);
auto ptr1 = f.get0();
BOOST_REQUIRE_EQUAL(*ptr1, "v1");
BOOST_REQUIRE_EQUAL(loading_cache.find(0), nullptr);
BOOST_REQUIRE_EQUAL(loading_cache.size(), 0);
ptr1 = loading_cache.get_ptr(0, load_v2).get0();
loading_cache.remove_if([] (auto&& v) { return v == "v2"; });
BOOST_REQUIRE_EQUAL(*ptr1, "v2");
//
// Test that live ptr1, removed from cache, does not prevent reload of new value
//
auto ptr2 = loading_cache.get_ptr(0, load_v3).get0();
ptr1 = nullptr;
BOOST_REQUIRE_EQUAL(*ptr2, "v3");
ptr2 = nullptr;
}
}


@@ -39,6 +39,9 @@
#include "test/lib/random_utils.hh"
#include "test/lib/log.hh"
#include "test/lib/reader_concurrency_semaphore.hh"
#include "test/lib/simple_schema.hh"
#include "test/lib/make_random_string.hh"
#include "utils/error_injection.hh"
static api::timestamp_type next_timestamp() {
static thread_local api::timestamp_type next_timestamp = 1;
@@ -528,6 +531,74 @@ SEASTAR_TEST_CASE(test_exception_safety_of_single_partition_reads) {
});
}
SEASTAR_THREAD_TEST_CASE(test_tombstone_merging_with_multiple_versions) {
tests::reader_concurrency_semaphore_wrapper semaphore;
simple_schema ss;
auto s = ss.schema();
auto mt = make_lw_shared<memtable>(ss.schema());
auto pk = ss.make_pkey(0);
auto pr = dht::partition_range::make_singular(pk);
auto t0 = ss.new_tombstone();
auto t1 = ss.new_tombstone();
auto t2 = ss.new_tombstone();
auto t3 = ss.new_tombstone();
mutation m1(s, pk);
ss.delete_range(m1, *position_range_to_clustering_range(position_range(
position_in_partition::before_key(ss.make_ckey(0)),
position_in_partition::for_key(ss.make_ckey(3))), *s), t1);
ss.add_row(m1, ss.make_ckey(0), "v");
ss.add_row(m1, ss.make_ckey(1), "v");
// Fill so that rd1 stays in the partition snapshot
int n_rows = 1000;
auto v = make_random_string(512);
for (int i = 0; i < n_rows; ++i) {
ss.add_row(m1, ss.make_ckey(i), v);
}
mutation m2(s, pk);
ss.delete_range(m2, *position_range_to_clustering_range(position_range(
position_in_partition::before_key(ss.make_ckey(0)),
position_in_partition::before_key(ss.make_ckey(1))), *s), t2);
ss.delete_range(m2, *position_range_to_clustering_range(position_range(
position_in_partition::before_key(ss.make_ckey(1)),
position_in_partition::for_key(ss.make_ckey(3))), *s), t3);
mutation m3(s, pk);
ss.delete_range(m3, *position_range_to_clustering_range(position_range(
position_in_partition::before_key(ss.make_ckey(0)),
position_in_partition::for_key(ss.make_ckey(4))), *s), t0);
mt->apply(m1);
auto rd1 = mt->make_flat_reader(s, semaphore.make_permit(), pr, s->full_slice(), default_priority_class(),
nullptr, streamed_mutation::forwarding::no, mutation_reader::forwarding::no);
auto close_rd1 = defer([&] { rd1.close().get(); });
rd1.fill_buffer().get();
BOOST_REQUIRE(!rd1.is_end_of_stream()); // rd1 must keep the m1 version alive
mt->apply(m2);
auto rd2 = mt->make_flat_reader(s, semaphore.make_permit(), pr, s->full_slice(), default_priority_class(),
nullptr, streamed_mutation::forwarding::no, mutation_reader::forwarding::no);
auto close_r2 = defer([&] { rd2.close().get(); });
rd2.fill_buffer().get();
BOOST_REQUIRE(!rd2.is_end_of_stream()); // rd2 must keep the m1 version alive
mt->apply(m3);
assert_that(mt->make_flat_reader(s, semaphore.make_permit(), pr))
.has_monotonic_positions();
assert_that(mt->make_flat_reader(s, semaphore.make_permit(), pr))
.produces(m1 + m2 + m3);
}
SEASTAR_TEST_CASE(test_hash_is_cached) {
return seastar::async([] {
auto s = schema_builder("ks", "cf")


@@ -702,6 +702,7 @@ SEASTAR_TEST_CASE(test_cell_ordering) {
};
auto assert_equal = [] (atomic_cell_view c1, atomic_cell_view c2) {
testlog.trace("Expected {} == {}", c1, c2);
BOOST_REQUIRE(compare_atomic_cell_for_merge(c1, c2) == 0);
BOOST_REQUIRE(compare_atomic_cell_for_merge(c2, c1) == 0);
};
@@ -723,9 +724,11 @@ SEASTAR_TEST_CASE(test_cell_ordering) {
atomic_cell::make_live(*bytes_type, 1, bytes(), expiry_2, ttl_2));
// Origin doesn't compare ttl (is it wise?)
assert_equal(
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_1, ttl_1),
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_1, ttl_2));
// But we do. See https://github.com/scylladb/scylla/issues/10156
// and https://github.com/scylladb/scylla/issues/10173
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_1, ttl_2),
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_1, ttl_1));
assert_order(
atomic_cell::make_live(*bytes_type, 0, bytes("value1")),


@@ -560,7 +560,7 @@ SEASTAR_TEST_CASE(test_apply_to_incomplete_respects_continuity) {
static mutation_partition read_using_cursor(partition_snapshot& snap) {
tests::reader_concurrency_semaphore_wrapper semaphore;
partition_snapshot_row_cursor cur(*snap.schema(), snap);
cur.maybe_refresh();
cur.advance_to(position_in_partition::before_all_clustered_rows());
auto mp = read_partition_from(*snap.schema(), cur);
for (auto&& rt : snap.range_tombstones()) {
mp.apply_delete(*snap.schema(), rt);


@@ -210,6 +210,35 @@ BOOST_AUTO_TEST_CASE(test_overlapping_addition) {
BOOST_REQUIRE(it == l.end());
}
BOOST_AUTO_TEST_CASE(test_adjacent_empty_range_tombstone) {
range_tombstone_list l(*s);
l.apply(*s, rtie(1, 1, 2));
l.apply(*s, rt(1, 2, 3));
l.apply(*s, rtei(2, 2, 2));
l.apply(*s, rtei(2, 4, 3));
auto it = l.begin();
assert_rt(rt(1, 4, 3), *it++);
BOOST_REQUIRE(it == l.end());
}
BOOST_AUTO_TEST_CASE(test_empty_range_tombstones_are_dropped) {
range_tombstone_list l(*s);
l.apply(*s, rtei(0, 0, 1));
l.apply(*s, rtie(0, 0, 1));
l.apply(*s, rt(1, 2, 1));
l.apply(*s, rtei(4, 4, 1));
l.apply(*s, rtie(5, 5, 1));
l.apply(*s, rt(7, 8, 1));
auto it = l.begin();
assert_rt(rt(1, 2, 1), *it++);
assert_rt(rt(7, 8, 1), *it++);
BOOST_REQUIRE(it == l.end());
}
BOOST_AUTO_TEST_CASE(test_simple_overlap) {
range_tombstone_list l1(*s);
@@ -473,6 +502,23 @@ static std::vector<range_tombstone> make_random() {
rts.emplace_back(std::move(start_b), std::move(end_b), tombstone(dist(gen), gc_now));
}
int32_t size_empty = dist(gen) / 2;
for (int32_t i = 0; i < size_empty; ++i) {
clustering_key_prefix key = make_random_ckey();
bool start_incl = dist(gen) > 25;
if (start_incl) {
rts.emplace_back(
position_in_partition::before_key(key),
position_in_partition::before_key(key),
tombstone(dist(gen), gc_now));
} else {
rts.emplace_back(
position_in_partition::after_key(key),
position_in_partition::after_key(key),
tombstone(dist(gen), gc_now));
}
}
return rts;
}


@@ -1242,9 +1242,13 @@ SEASTAR_TEST_CASE(test_update_failure) {
class throttle {
unsigned _block_counter = 0;
promise<> _p; // valid when _block_counter != 0, resolves when goes down to 0
std::optional<promise<>> _entered;
bool _one_shot;
public:
// If one_shot is true, only the first enter() after block() will block.
throttle(bool one_shot = false) : _one_shot(one_shot) {}
future<> enter() {
if (_block_counter) {
if (_block_counter && (!_one_shot || _entered)) {
promise<> p1;
promise<> p2;
@@ -1256,16 +1260,21 @@ public:
p3.set_value();
});
_p = std::move(p2);
if (_entered) {
_entered->set_value();
_entered.reset();
}
return f1;
} else {
return make_ready_future<>();
}
}
void block() {
future<> block() {
++_block_counter;
_p = promise<>();
_entered = promise<>();
return _entered->get_future();
}
void unblock() {
@@ -1410,7 +1419,7 @@ SEASTAR_TEST_CASE(test_cache_population_and_update_race) {
mt2->apply(m);
}
thr.block();
auto f = thr.block();
auto m0_range = dht::partition_range::make_singular(ring[0].ring_position());
auto rd1 = cache.make_reader(s, semaphore.make_permit(), m0_range);
@@ -1421,6 +1430,7 @@ SEASTAR_TEST_CASE(test_cache_population_and_update_race) {
rd2.set_max_buffer_size(1);
auto rd2_fill_buffer = rd2.fill_buffer();
f.get();
sleep(10ms).get();
// This update should miss on all partitions
@@ -1548,12 +1558,13 @@ SEASTAR_TEST_CASE(test_cache_population_and_clear_race) {
mt2->apply(m);
}
thr.block();
auto f = thr.block();
auto rd1 = cache.make_reader(s, semaphore.make_permit());
rd1.set_max_buffer_size(1);
auto rd1_fill_buffer = rd1.fill_buffer();
f.get();
sleep(10ms).get();
// This update should miss on all partitions
@@ -3777,3 +3788,81 @@ SEASTAR_TEST_CASE(test_scans_erase_dummies) {
BOOST_REQUIRE_EQUAL(tracker.get_stats().rows, 2);
});
}
SEASTAR_TEST_CASE(test_eviction_of_upper_bound_of_population_range) {
return seastar::async([] {
simple_schema s;
tests::reader_concurrency_semaphore_wrapper semaphore;
auto cache_mt = make_lw_shared<memtable>(s.schema());
auto pkey = s.make_pkey("pk");
mutation m1(s.schema(), pkey);
s.add_row(m1, s.make_ckey(1), "v1");
s.add_row(m1, s.make_ckey(2), "v2");
cache_mt->apply(m1);
cache_tracker tracker;
throttle thr(true);
auto cache_source = make_decorated_snapshot_source(snapshot_source([&] { return cache_mt->as_data_source(); }),
[&] (mutation_source src) {
return throttled_mutation_source(thr, std::move(src));
});
row_cache cache(s.schema(), cache_source, tracker);
auto pr = dht::partition_range::make_singular(pkey);
auto read = [&] (int start, int end) {
auto slice = partition_slice_builder(*s.schema())
.with_range(query::clustering_range::make(s.make_ckey(start), s.make_ckey(end)))
.build();
auto rd = cache.make_reader(s.schema(), semaphore.make_permit(), pr, slice);
auto close_rd = deferred_close(rd);
auto m_cache = read_mutation_from_flat_mutation_reader(rd).get0();
close_rd.close_now();
rd = cache_mt->make_flat_reader(s.schema(), semaphore.make_permit(), pr, slice);
auto close_rd2 = deferred_close(rd);
auto m_mt = read_mutation_from_flat_mutation_reader(rd).get0();
BOOST_REQUIRE(m_mt);
assert_that(m_cache).has_mutation().is_equal_to(*m_mt);
};
// populate [2]
{
auto slice = partition_slice_builder(*s.schema())
.with_range(query::clustering_range::make_singular(s.make_ckey(2)))
.build();
assert_that(cache.make_reader(s.schema(), semaphore.make_permit(), pr, slice))
.has_monotonic_positions();
}
auto arrived = thr.block();
// Read [0, 2]
auto f = seastar::async([&] {
read(0, 2);
});
arrived.get();
// populate (2, 3]
{
auto slice = partition_slice_builder(*s.schema())
.with_range(query::clustering_range::make(query::clustering_range::bound(s.make_ckey(2), false),
query::clustering_range::bound(s.make_ckey(3), true)))
.build();
assert_that(cache.make_reader(s.schema(), semaphore.make_permit(), pr, slice))
.has_monotonic_positions();
}
testlog.trace("Evicting");
evict_one_row(tracker); // Evicts before(0)
evict_one_row(tracker); // Evicts ck(2)
testlog.trace("Unblocking");
thr.unblock();
f.get();
read(0, 3);
});
}


@@ -37,20 +37,30 @@ static void add_entry(logalloc::region& r,
{
logalloc::allocating_section as;
as(r, [&] {
sstables::key sst_key = sstables::key::from_partition_key(s, key);
page._entries.push_back(make_managed<index_entry>(
managed_bytes(sst_key.get_bytes()),
position,
managed_ref<promoted_index>()));
with_allocator(r.allocator(), [&] {
sstables::key sst_key = sstables::key::from_partition_key(s, key);
page._entries.push_back(make_managed<index_entry>(
managed_bytes(sst_key.get_bytes()),
position,
managed_ref<promoted_index>()));
});
});
}
static partition_index_page make_page0(logalloc::region& r, simple_schema& s) {
partition_index_page page;
auto destroy_page = defer([&] {
with_allocator(r.allocator(), [&] {
auto p = std::move(page);
});
});
add_entry(r, *s.schema(), page, s.make_pkey(0).key(), 0);
add_entry(r, *s.schema(), page, s.make_pkey(1).key(), 1);
add_entry(r, *s.schema(), page, s.make_pkey(2).key(), 2);
add_entry(r, *s.schema(), page, s.make_pkey(3).key(), 3);
destroy_page.cancel();
return page;
}
@@ -141,6 +151,47 @@ SEASTAR_THREAD_TEST_CASE(test_caching) {
}
}
template <typename T>
static future<> ignore_result(future<T>&& f) {
return f.then_wrapped([] (auto&& f) {
try {
f.get();
} catch (...) {
// expected, silence warnings about ignored failed futures
}
});
}
SEASTAR_THREAD_TEST_CASE(test_exception_while_loading) {
::lru lru;
simple_schema s;
logalloc::region r;
partition_index_cache cache(lru, r);
auto clear_lru = defer([&] {
with_allocator(r.allocator(), [&] {
lru.evict_all();
});
});
auto page0_loader = [&] (partition_index_cache::key_type k) {
return later().then([&] {
return make_page0(r, s);
});
};
memory::with_allocation_failures([&] {
cache.evict_gently().get();
auto f0 = ignore_result(cache.get_or_load(0, page0_loader));
auto f1 = ignore_result(cache.get_or_load(0, page0_loader));
f0.get();
f1.get();
});
auto ptr = cache.get_or_load(0, page0_loader).get0();
has_page0(ptr);
}
SEASTAR_THREAD_TEST_CASE(test_auto_clear) {
::lru lru;
simple_schema s;


@@ -19,6 +19,7 @@ from cassandra.cluster import ConsistencyLevel
from cassandra.query import SimpleStatement
from util import new_test_table
from nodetool import flush
def test_cdc_log_entries_use_cdc_streams(scylla_only, cql, test_keyspace):
'''Test that the stream IDs chosen for CDC log entries come from the CDC generation
@@ -44,3 +45,16 @@ def test_cdc_log_entries_use_cdc_streams(scylla_only, cql, test_keyspace):
assert(log_stream_ids.issubset(stream_ids))
# Test for #10473 - reading logs (from sstable) after dropping
# column in base.
def test_cdc_alter_table_drop_column(scylla_only, cql, test_keyspace):
schema = "pk int primary key, v int"
extra = " with cdc = {'enabled': true}"
with new_test_table(cql, test_keyspace, schema, extra) as table:
cql.execute(f"insert into {table} (pk, v) values (0, 0)")
cql.execute(f"insert into {table} (pk, v) values (1, null)")
flush(cql, table)
flush(cql, table + "_scylla_cdc_log")
cql.execute(f"alter table {table} drop v")
cql.execute(f"select * from {table}_scylla_cdc_log")

Some files were not shown because too many files have changed in this diff.