Compare commits

...

144 Commits

Author SHA1 Message Date
Anna Mikhlin
0354e13718 release: prepare for 5.0.7 2022-12-07 14:57:09 +02:00
Nadav Har'El
2750d2e94b Merge 'alternator: fix wrong 'where' condition for GSI range key' from Marcin Maliszkiewicz
Contains the fixes requested in the issue (and some tiny extras), together with an analysis of why they don't affect users (see the commit messages).

Fixes [#11800](https://github.com/scylladb/scylladb/issues/11800)

Closes #11926

* github.com:scylladb/scylladb:
  alternator: add maybe_quote to secondary indexes 'where' condition
  test/alternator: correct xfail reason for test_gsi_backfill_empty_string
  test/alternator: correct indentation in test_lsi_describe
  alternator: fix wrong 'where' condition for GSI range key

(cherry picked from commit ce7c1a6c52)
2022-12-05 20:53:19 +02:00
Benny Halevy
b4383a389b repair_reader: construct _reader_handle before _reader
Currently, the `_reader` member is explicitly
initialized with the result of the call to `make_reader`.
And `make_reader`, as a side effect, assigns a value
to the `_reader_handle` member.

Since C++ initializes class members sequentially,
in the order they are defined, the assignment to `_reader_handle`
in `make_reader()` happens before `_reader_handle` is initialized.

This patch fixes that by changing the definition order,
and consequently, the member initialization order
in the constructor so that `_reader_handle` will be (default-)initialized
before the call to `make_reader()`, avoiding the undefined behavior.
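
A minimal self-contained sketch of the pattern described above; the type names are illustrative stand-ins, not Scylla's real reader types. Because C++ initializes non-static members in declaration order, the handle must be declared before the member whose initializer produces the side effect.

#include <string>

// Illustrative stand-ins; these are not Scylla's real reader types.
struct reader_handle { int id = 0; };
struct reader { std::string name; };

class repair_reader_like {
    // Declared (and therefore initialized) before _reader, because
    // make_reader() assigns to it as a side effect.
    reader_handle _reader_handle;
    reader _reader;

    reader make_reader() {
        _reader_handle.id = 42;       // side effect on another member
        return reader{"underlying"};
    }

public:
    repair_reader_like()
        : _reader(make_reader())      // with the old order (_reader declared first),
                                      // this wrote to a not-yet-constructed member:
                                      // undefined behavior
    {}
};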

Fixes #10882

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10883

(cherry picked from commit 9c231ad0ce)
2022-12-05 20:33:58 +02:00
Nadav Har'El
f667c5923a materialized views: fix view writes after base table schema change
When we write to a materialized view, we need to know some information
defined in the base table such as the columns in its schema. We have
a "view_info" object that tracks each view and its base.

This view_info object has a couple of mutable attributes which are
used to lazily-calculate and cache the SELECT statement needed to
read from the base table. If the base-table schema ever changes -
and the code calls set_base_info() at that point - we need to forget
this cached statement. If we don't (as before this patch), the SELECT
will use the wrong schema and writes will no longer work.

This patch also includes a reproducing test that failed before this
patch, and passes afterwords. The test creates a base table with a
view that has a non-trivial SELECT (it has a filter on one of the
base-regular columns), makes a benign modification to the base table
(just a silly addition of a comment), and then tries to write to the
view - and before this patch it fails.

Fixes #10026
Fixes #11542

(cherry picked from commit 2f2f01b045)
2022-12-05 20:09:36 +02:00
Botond Dénes
e4ba0c56df db/view/view_builder: don't drop partition and range tombstones when resuming
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This, however, only
included the partition key, crucially missing any active tombstones: the
partition tombstone or -- since the v2 transition -- an active range
tombstone. This can result in base rows covered by either of these being
resurrected and the view builder generating view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.

Fixes: #11668

Closes #11671

(cherry picked from commit 5621cdd7f9)
2022-12-05 15:01:21 +02:00
Benny Halevy
329d55cc4f configure: add --perf-tests-debuginfo option
Provides separate control over debuginfo for perf tests
since enabling --tests-debuginfo affects both today
causing the Jenkins archives of perf tests binaries to
inflate considerably.

Refs https://github.com/scylladb/scylla-pkg/issues/3060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 48021f3ceb)

Fixes #12191
2022-12-04 17:20:33 +02:00
Petr Gusev
b956293f47 modification_statement: fix LWT insert crash if clustering key is null
PR #9314 fixed a similar issue with regular insert statements
but missed the LWT code path.

It is expected behaviour for
modification_statement::create_clustering_ranges to return an
empty range in this case, since the possible_lhs_values it
uses explicitly returns empty_value_set if it evaluates the rhs
to null, and it has a comment about it ("All NULL
comparisons fail; no column values match."). On the other hand,
all components of the primary key are required to be set;
this is checked at the prepare phase, in
modification_statement::process_where_clause. So the only
problem was that modification_statement::execute_with_condition
was not expecting an empty clustering_range in the case of
a null clustering key.
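
A hedged sketch of the kind of guard this implies; the function names mirror the commit message but the bodies are illustrative, not the actual code. The LWT path must tolerate an empty clustering-range set instead of assuming at least one range exists.

#include <vector>

struct clustering_range {};
struct result { bool applied = false; };

// Illustrative: possible_lhs_values() returns an empty value set for NULL,
// so create_clustering_ranges() can legitimately produce no ranges at all.
std::vector<clustering_range> create_clustering_ranges(bool ck_is_null) {
    return ck_is_null ? std::vector<clustering_range>{}
                      : std::vector<clustering_range>{clustering_range{}};
}

result execute_with_condition(bool ck_is_null) {
    auto ranges = create_clustering_ranges(ck_is_null);
    if (ranges.empty()) {
        // NULL clustering key: no row can match. The real fix may instead
        // reject the request; the point is simply not to dereference
        // ranges.front() when there is nothing there.
        return result{false};
    }
    // ... build the CAS request from ranges.front(), etc. ...
    return result{true};
}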

Fixes: #11954
(cherry picked from commit 0d443dfd16)
2022-12-04 15:00:27 +02:00
Nadav Har'El
6a8c2d3f56 Merge 'cql3: don't ignore other restrictions when a multi column restriction is present during filtering' from Jan Ciołek
When filtering with a multi-column restriction present, all other restrictions were ignored.
So a query like:
`SELECT * FROM WHERE pk = 0 AND (ck1, ck2) < (0, 0) AND regular_col = 0 ALLOW FILTERING;`
would ignore the restriction `regular_col = 0`.

This was caused by a bug in the filtering code:
2779a171fc/cql3/selection/selection.cc (L433-L449)

When multi-column restrictions were detected, the code checked whether they were satisfied and returned immediately.
This is fixed by returning early only when these restrictions are not satisfied; when they are satisfied, the other restrictions are checked as well to ensure all of them hold.
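
A hedged sketch of the control-flow change; the function and parameter names are illustrative, not the actual selection.cc code. The early return must only happen when the multi-column restriction fails; otherwise the remaining restrictions still have to be evaluated.

#include <functional>
#include <vector>

struct row {};

bool passes_filter(const row& r,
                   const std::function<bool(const row&)>& multi_column_ok,
                   const std::vector<std::function<bool(const row&)>>& other_restrictions) {
    if (multi_column_ok) {
        if (!multi_column_ok(r)) {
            return false;                  // multi-column restriction fails: reject the row
        }
        // The old code effectively returned here even on success,
        // silently ignoring everything else.
    }
    for (const auto& restriction : other_restrictions) {
        if (!restriction(r)) {
            return false;                  // e.g. regular_col = 0 must hold too
        }
    }
    return true;
}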

This code was introduced back in 2019, when fixing #3574.
Perhaps back then it was impossible to mix multi-column and regular-column restrictions, and this approach was correct.

Fixes: #6200
Fixes: #12014

Closes #12031

* github.com:scylladb/scylladb:
  cql-pytest: add a reproducer for #12014, verify that filtering multi column and regular restrictions works
  boost/restrictions-test: uncomment part of the test that passes now
  cql-pytest: enable test for filtering combined multi column and regular column restrictions
  cql3: don't ignore other restrictions when a multi column restriction is present during filtering

(cherry picked from commit 2d2034ea28)

Closes #12086
2022-11-26 14:24:08 +02:00
Piotr Grabowski
27a35c7f98 Update tools/jmx submodule (jackson dependency update)
* tools/jmx 53f7f55...fe351e8 (1):
  > Update jackson dependency

(cherry picked from commit 41b098f54e)

Refs #11929

Closes #11931
2022-11-20 20:10:14 +02:00
Pavel Emelyanov
d83134a245 Merge '[branch-5.0] multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed, but it is
pointless: if the partition has no more data, just drop the header; we
won't need it on the next page.

The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.

Fixes: https://github.com/scylladb/scylladb/issues/9482

Closes #11912

* github.com:scylladb/scylladb:
  test/cql-pytest: add regression test for "IDL frame truncated" error
  mutation_compactor: detach_state(): make it no-op if partition was exhausted
2022-11-16 11:50:50 +03:00
Anna Mikhlin
b844d14829 release: prepare for 5.0.6 2022-11-13 16:39:30 +02:00
Eliran Sinvani
184df0393e cql: Fix crash upon use of the word empty for service level name
Wrong access to an uninitialized token instead of the actual
generated string caused the parser to crash. This wasn't
detected by the ANTLR3 compiler because all the temporary
variables defined in the ANTLR3 statements are global in the
generated code. This essentially caused a null dereference.

Tests: 1. The fixed issue scenario from github.
       2. Unit tests in release mode.

Fixes #11774

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190612133151.20609-1-eliransin@scylladb.com>

Closes #11777

(cherry picked from commit ab7429b77d)
2022-11-10 20:43:21 +02:00
Nadav Har'El
1b550dd301 cql3: fix cql3::util::maybe_quote() for keywords
cql3::util::maybe_quote() is a utility function formatting an identifier
name (table name, column name, etc.) that needs to be embedded in a CQL
statement - and might require quoting if it contains non-alphanumeric
characters, uppercase characters, or a CQL keyword.

maybe_quote() made an effort to only quote the identifier name if necessary,
e.g., a lowercase name usually does not need quoting. But lowercase names
that are CQL keywords - e.g., to or where - cannot be used as identifiers
without quoting. This can cause problems for code that wants to generate
CQL statements, such as the materialized-view problem in issue #9450 - where
a user had a column called "to" and wanted to create a materialized view
for it.

So in this patch we fix maybe_quote() to recognize invalid identifiers by
using the CQL parser, and quote them. This will quote reserved keywords,
but not so-called unreserved keywords, which *are* allowed as identifiers
and don't need quoting. This addition slows down maybe_quote(), but
maybe_quote() is anyway only used in heavy operations which need to
generate CQL.
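
A minimal sketch of the decision. It is not the real implementation: the actual fix asks the CQL parser to classify identifiers, whereas this sketch uses a tiny hard-coded keyword set. A name is left bare only if it is a valid unquoted identifier and not a reserved keyword; otherwise it is wrapped in double quotes with embedded quotes doubled.

#include <cctype>
#include <set>
#include <string>

// Unconditional quoting, doubling any embedded double quote (CQL's escape rule).
std::string quote(const std::string& name) {
    std::string out = "\"";
    for (char c : name) {
        if (c == '"') {
            out += '"';
        }
        out += c;
    }
    return out + "\"";
}

std::string maybe_quote(const std::string& name) {
    // Tiny illustrative keyword set; the real code asks the CQL parser, which
    // lets unreserved keywords (e.g. "int") through without quotes.
    static const std::set<std::string> reserved = {"to", "where", "select", "from"};
    bool plain = !name.empty() && !reserved.count(name);
    for (char c : name) {
        unsigned char uc = static_cast<unsigned char>(c);
        if (!std::islower(uc) && !std::isdigit(uc) && c != '_') {
            plain = false;                 // uppercase or non-alphanumeric: must quote
        }
    }
    return plain ? name : quote(name);
}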

This patch also adds two tests that reproduce the bug and verify its
fix:

1. Add to the low-level maybe_quote() test (a C++ unit test) also tests
   that maybe_quote() quotes reserved keywords like "to", but doesn't
   quote unreserved keywords like "int".

2. Add a test reproducing issue #9450 - creating a materialized view
   whose key column is a keyword. This new test passes on Cassandra,
   failed on Scylla before this patch, and passes after this patch.

It is worth noting that maybe_quote() now has a "forward compatibility"
problem: If we save CQL statements generated by maybe_quote(), and a
future version introduces a new reserved keyword, the parser of the
future version may not be able to parse the saved CQL statement that
was generated with the old maybe_quote() and didn't quote what is now
a keyword. This problem can be solved in two ways:

1. Try hard not to introduce new reserved keywords. Instead, introduce
   unreserved keywords. We've been doing this even before recognizing
   this maybe_quote() future-compatibility problem.

2. In the next patch we will introduce quote() - which unconditionally
   quotes identifier names, even if lowercase. These quoted names will
   be uglier for lowercase names - but will be safe from future
   introduction of new keywords. So we can consider switching some or
   all uses of maybe_quote() to quote().

Fixes #9450

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-1-nyh@scylladb.com>
(cherry picked from commit 5d2f694a90)
2022-11-07 17:01:32 +02:00
Alexander Turetskiy
01ce53d7fb Alternator: Projection field added to return from DescribeTable which describes GSIs and LSIs.
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.

Fixes #11470

Closes #11693

(cherry picked from commit 636e14cc77)
2022-11-07 17:01:32 +02:00
Jadw1
e9c7f89b32 CQL3: fromJson accepts string as bool
The problem was an incompatibility with Cassandra, which accepts bool
as a string in the `fromJson()` UDF. The remaining difference between Cassandra
and Scylla is that Scylla accepts whitespace around the word in the string,
while Cassandra doesn't. Both are case insensitive.

Fixes: #7915
(cherry picked from commit 1902dbc9ff)
2022-11-07 17:01:32 +02:00
Takuya ASADA
93f468c12c locator::ec2_snitch: Retry HTTP request to EC2 instance metadata service
The EC2 instance metadata service can be busy; let's retry the connection with an
interval, just like we do in scylla-machine-image.
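
A minimal sketch of such a retry loop in seastar, not the actual ec2_snitch code: the http_get helper and the retry count/interval are assumptions made for illustration.

#include <chrono>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/sstring.hh>

// Hypothetical helper performing the actual HTTP GET against the metadata service.
seastar::future<seastar::sstring> http_get(seastar::sstring url);

seastar::future<seastar::sstring> read_ec2_metadata(seastar::sstring url) {
    constexpr int max_attempts = 5;
    for (int attempt = 1;; ++attempt) {
        try {
            co_return co_await http_get(url);
        } catch (...) {
            if (attempt == max_attempts) {
                throw;                               // out of retries, propagate the error
            }
        }
        // The metadata service was busy or unreachable; wait and try again.
        co_await seastar::sleep(std::chrono::seconds(5));
    }
}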

Fixes #10250

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #11688

(cherry picked from commit 6b246dc119)
(cherry picked from commit e2809674d2)
2022-11-07 17:01:32 +02:00
Botond Dénes
e54ae9efd9 test/cql-pytest: add regression test for "IDL frame truncated" error
(cherry picked from commit 11af489e84)
2022-11-07 13:43:53 +02:00
Botond Dénes
ef40e59c0e mutation_compactor: detach_state(): make it no-op if partition was exhausted
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was in at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.

This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
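
A hedged sketch of the interface change with the types simplified to stand-ins: returning a disengaged optional when the partition was exhausted makes it impossible for the caller to re-feed a stale partition header.

#include <optional>
#include <vector>

struct mutation_fragment {};                 // stand-in for the real fragment type

struct detached_compaction_state {
    mutation_fragment partition_start;
    std::vector<mutation_fragment> rest;     // static row, active range tombstone, ...
};

struct compactor_like {
    bool _inside_partition = false;

    std::optional<detached_compaction_state> detach_state() {
        if (!_inside_partition) {
            // The partition was fully consumed: there is no header to hand
            // back, so return nothing instead of a state the caller would
            // wrongly push back into the reader.
            return std::nullopt;
        }
        return detached_compaction_state{};
    }
};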

(cherry picked from commit 70b4158ce0)
2022-11-07 13:42:43 +02:00
Botond Dénes
8c56b0b268 Merge 'Alternator, MV: fix bug in some view updates which set the view key to its existing value' from Nadav Har'El
As described in issue #11801, we saw in Alternator, when a GSI has both its partition and sort keys as non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.

In this series we fix this bug (it was a bug in our materialized views implementation) and add a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).

Fixes #11801

Closes #11808

* github.com:scylladb/scylladb:
  test/alternator: add test for issue 11801
  MV: fix handling of view update which reassign the same key value
  materialized views: inline used-once and confusing function, replace_entry()

(cherry picked from commit e981bd4f21)
2022-11-01 13:25:22 +02:00
Kamil Braun
fc78d88783 service: raft: raft_group0: don't call _abort_source.request_abort()
`raft_group0` does not own the source and is not responsible for calling
`request_abort`. The source comes from top-level `stop_signal` (see
main.cc) and that's where it's aborted.

Fixes #10668.

Closes #10678

(cherry picked from commit ef7643d504)
2022-10-16 11:42:22 +03:00
Pavel Emelyanov
31a20c4c54 compaction_manager: Swallow ENOSPCs in ::stop()
When being stopped, the compaction manager may run into ENOSPC. This is not a
reason to fail the stopping process with an abort; it is better to warn about
this in the logs and proceed as if nothing happened.

refs: #11245

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 15:56:53 +03:00
Pavel Emelyanov
7e42bcfd61 exceptions: Mark storage_io_error::code() with noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 15:56:03 +03:00
Pavel Emelyanov
2107ffe2d2 compaction_manager: Shuffle really_do_stop()
Make it a future-returning method and set up the _stop_future in its
only caller. This makes the next patch much simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 15:56:02 +03:00
Beni Peled
5a97a1060e release: prepare for 5.0.5 2022-10-09 08:44:14 +03:00
Nadav Har'El
2b0487c900 cql: validate bloom_filter_fp_chance up-front
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When bloom_filter_fp_chance is set any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.

Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.
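
A hedged sketch of the up-front check; the exception type is an assumption for illustration (the real code presumably reports an invalid-request/configuration error), while the minimum supported rate comes from the commit message above. The point is that validation happens at CREATE/ALTER TABLE time, not at memtable flush time.

#include <stdexcept>
#include <string>

// Smallest false-positive rate the Bloom filter implementation can honor,
// per the commit message above.
constexpr double min_supported_bloom_filter_fp_chance = 6.71e-5;

void validate_bloom_filter_fp_chance(double fp_chance) {
    if (fp_chance < min_supported_bloom_filter_fp_chance) {
        // Rejecting here turns what used to be a crash during memtable flush
        // into an ordinary invalid-request error at CREATE/ALTER TABLE time.
        throw std::invalid_argument(
            "bloom_filter_fp_chance " + std::to_string(fp_chance) +
            " is smaller than the minimum supported value " +
            std::to_string(min_supported_bloom_filter_fp_chance));
    }
}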

This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).

Fixes #11524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11576

(cherry picked from commit 4c93a694b7)
2022-10-04 16:22:50 +03:00
Pavel Emelyanov
d3b3c53d9f system_keyspace/config: Swallow string->value cast exception
When updating an updateable value via CQL, the new value comes as a
string that is then boost::lexical_cast-ed to the desired type. If the
cast throws, the resulting exception is printed in the logs, which is very
likely uncalled for.
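
A hedged sketch of the behavior, not the actual system_keyspace code; the logging call and helper name are illustrative. A bad string leaves the option unchanged and produces one warning, instead of letting boost::lexical_cast's exception escape into the logs.

#include <boost/lexical_cast.hpp>
#include <iostream>
#include <string>

template <typename T>
bool try_update_option(const std::string& raw, T& out) {
    try {
        out = boost::lexical_cast<T>(raw);
        return true;
    } catch (const boost::bad_lexical_cast&) {
        // Swallow the cast failure: warn and keep the previous value rather
        // than letting the exception be printed as if it were an internal error.
        std::cerr << "invalid value '" << raw << "' for live-updateable option\n";
        return false;
    }
}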

fixes: #10394
tests: manual

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220503142942.8145-1-xemul@scylladb.com>
(cherry picked from commit 063d26bc9e)
2022-10-04 16:19:46 +03:00
Nadav Har'El
50c2c1b1d4 alternator: return ProvisionedThroughput in DescribeTable
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some of the DynamoDB documentation but is explicitly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also, empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.

The missing ProvisionedThroughput structure was a problem for
applications like DynamoDB connectors for Spark, which implicitly
assume that ProvisionedThroughput is returned by DescribeTable and
fail (as described in issue #11222) if it's outright missing.

So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.

Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.

Fixes #11222

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11298

(cherry picked from commit 941c719a23)
2022-10-03 14:28:16 +03:00
Tomasz Grabiec
aa647a637a test: lib: random_mutation_generator: Don't generate mutations with marker uncompacted with shadowable tombstone
The generator first set the marker and then applied tombstones.

The marker was set like this:

  row.marker() = random_row_marker();

Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.

However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.

This could generate rows with markers uncompacted with shadowable tombstones.

This broke row_cache_test::test_concurrent_reads_and_eviction on the
comparison between expected and read mutations. The latter were
compacted because they went through an extra merge path, which compacts
the row.

Fix by making sure there are no key clashes.

Closes #11663

(cherry picked from commit 5268f0f837)
2022-10-02 16:45:07 +03:00
Michael Livshin
2c0040fcb3 allow pre-scrub snapshots of materialized views and secondary indices
Previously, any attempt to take a materialized view or secondary index
snapshot was considered a mistake and caused the snapshot operation to
abort, with a suggestion to snapshot the base table instead.

But an automatic pre-scrub snapshot of a view cannot be attributed to
user error, so the operation should not be aborted in that case.

(It is an open question whether the more correct thing to do during
pre-scrub snapshot would be to silently ignore views.  Or perhaps they
should be ignored in all cases except when the user explicitly asks to
snapshot them, by name)

Closes #10760.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
(cherry picked from commit aab4cd850c)

Fixes #10760.
2022-10-02 14:04:11 +03:00
Nadav Har'El
54564adb7c alternator: forbid duplicate index (LSI and GSI) names
Adding an LSI and a GSI with the same name to the same Alternator table
should be forbidden - because if both exist, only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate names.

So in this patch we add a test for this issue, and fix it.

Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.

Fixes #10789

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8866c326de)
2022-10-02 13:00:03 +03:00
Tomasz Grabiec
839876e8f2 db: range_tombstone_list: Avoid quadratic behavior when applying
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).

This can cause reactor stalls and availability issues when replicas
apply such deletions.

This patch avoids the problem by reserving an exponentially increasing
amount of space. Also, to avoid large allocations, it switches the
container to chunked_vector.
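
A minimal sketch of the reservation change with the container and names simplified; it is not the real undo-log code. Reserving size()+1 before every push makes N pushes O(N^2) in a vector that grows exactly to fit, while growing the reservation geometrically keeps the whole sequence amortized O(N). The actual patch additionally switches to chunked_vector to avoid single large allocations.

#include <algorithm>
#include <cstddef>
#include <vector>

struct undo_entry {};

// The quadratic pattern described above: when the capacity is exhausted,
// reserve(size() + 1) reallocates to an exactly-fitting buffer, so every
// subsequent push pays an O(N) copy.
void push_quadratic(std::vector<undo_entry>& log, undo_entry e) {
    log.reserve(log.size() + 1);
    log.push_back(std::move(e));
}

// Amortized O(1) variant: grow the reservation geometrically, so a burst of
// N pushes triggers only O(log N) reallocations.
void push_amortized(std::vector<undo_entry>& log, undo_entry e) {
    if (log.size() == log.capacity()) {
        log.reserve(std::max<std::size_t>(16, log.capacity() * 2));
    }
    log.push_back(std::move(e));
}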

Fixes #11211

Closes #11215

(cherry picked from commit 7f80602b01)
2022-09-30 17:55:23 +03:00
Botond Dénes
36002e2b7c sstables: crawling mx-reader: make on_out_of_clustering_range() no-op
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op; there is nothing to do there.

Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.

Fixes: #11421

Closes #11422

(cherry picked from commit be9d1c4df4)
2022-09-30 17:55:14 +03:00
Botond Dénes
91a8f9e09b test/lib/random_schema: add a simpler overload for fixed partition count
Some tests want to generate a fixed number of random partitions; make
their life easier.

(cherry picked from commit 98f3d516a2)

Ref #11421 (prerequisite)
2022-09-30 17:54:55 +03:00
Michael Livshin
bc29f350dd batchlog_manager: warn when a batch fails to replay
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.

(Would info be more appropriate here than warning?)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10556

Fixes #10636

(cherry picked from commit 00ed4ac74c)
2022-09-29 12:14:56 +03:00
Asias He
4fe571f470 streaming: Allow drop table during streaming
Currently, if a table is dropped during streaming, the streaming would
fail with a no_such_column_family error.

Since the table is dropped anyway, it makes more sense to ignore the
streaming result of the dropped table, whether it is successful or
failed.

This allows users to drop tables during node operations, e.g., bootstrap
or decommission a node.

This is especially useful for the cloud users where it is hard to
coordinate between a node operation by admin and user cql change.

This patch also fixes a possible use-after-free issue by not passing
the table reference object around.

Fixes #10395

Closes #10396

(cherry picked from commit 953af38281)
2022-09-21 10:26:22 +03:00
Michał Radwański
ebf38eaead flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO.
In functions such as upgrade_to_v2 (excerpt below), if the constructor
of transforming_reader throws, r needs to be destroyed, however it
hasn't been closed. However, if a reader didn't start any operations, it
is safe to destruct such a reader. This issue can potentially manifest
itself in many more readers and might be hard to track down. This commit
adds a bool indicating whether a close is anticipated, thus avoiding
errors in the destructor.

Code excerpt:
flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) {
    class transforming_reader : public flat_mutation_reader_v2::impl {
        // ...
    };
    return make_flat_mutation_reader_v2<transforming_reader>(std::move(r));
}

Fixes #9065.
Fixes #11491

(cherry picked from commit 9ada63a9cb)
2022-09-21 10:25:18 +03:00
Beni Peled
1c82766f33 release: prepare for 5.0.4 2022-09-21 09:16:13 +03:00
Piotr Sarna
e1f78c33b4 Merge 'Fix mutation commutativity with shadowable tombstone' from Tomasz Grabiec

This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.

No known production impact.

Refs https://github.com/scylladb/scylladb/issues/11307

Closes #11312

* github.com:scylladb/scylladb:
  test: mutation_test: Add explicit test for mutation commutativity
  test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
  db: mutation_partition: Drop unnecessary maybe_shadow()
  db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
  mutation_partition: row: make row marker shadowing symmetric

(cherry picked from commit 484004e766)
2022-09-20 23:21:06 +02:00
Tomasz Grabiec
0634b5f734 test: row_cache: Use more narrow key range to stress overlapping reads more
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.

Closes #11260

(cherry picked from commit 8ee5b69f80)
2022-09-20 23:20:43 +02:00
Avi Kivity
6f020b26e1 Merge 'Backport 3 fixes for the evictable reader v2' from Botond Dénes
This pull request backports 3 important fixes from adc08d0ab9. Said 3 commits fixed important bugs in the v2 variant of the evictable reader, but were not backported because they were part of a large series doing v2 conversion in general. This means that 5.0 was left with a buggy evictable reader v2, which is used by repair. So far in the wild we've seen one bug manifest itself: the evictable reader getting stuck, spinning in a tight loop in `evictable_reader_v2::do_fill_buffer()`, in turn making repair get stuck too.

Fixes: #11223

Closes #11540

* github.com:scylladb/scylladb:
  test/boost/mutation_reader_test: add v2 specific evictable reader tests
  evictable_reader_v2: terminate active range tombstones on reader recreation
  evictable_reader_v2: restore handling of non-monotonically increasing positions
  evictable_reader_v2: simplify handling of reader recreation
2022-09-20 13:42:10 +03:00
Pavel Emelyanov
7f8dcc5657 messaging_service: Fix gossiper verb group
When configuring tcp-nodelay unconditionally, the messaging service thinks
the gossiper uses group index 1, though that changed some time ago and
those verbs now belong to group 0.

fixes: #11465

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2c74062962)
2022-09-19 10:31:58 +03:00
Botond Dénes
20451760fe tools/scylla-sstable: fix description template
Quote the '{' and '}' used in the CQL example, so format doesn't try to
interpret them.

Fixes: #11571

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220221140652.173015-1-bdenes@scylladb.com>
(cherry picked from commit 10880fb0a7)
2022-09-19 06:54:25 +03:00
Michał Chojnowski
51b031d04e sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202

(cherry picked from commit cdb3e71045)
2022-09-18 13:29:35 +03:00
Botond Dénes
82d1446ca9 test/boost/mutation_reader_test: add v2 specific evictable reader tests
One is a reincarnation of the recently removed
test_multishard_combining_reader_non_strictly_monotonic_positions. The
latter was actually targeting the evictable reader but through the
multishard reader, probably for historic reasons (evictable reader was
part of the multishard reader family).
The other one checks that active range tombstone changes are properly
terminated when the partition ends abruptly after recreating the reader.

(cherry picked from commit 014a23bf2a)
2022-09-15 13:51:13 +03:00
Botond Dénes
e0acb0766d evictable_reader_v2: terminate active range tombstones on reader recreation
Reader recreation messes with the continuity of the mutation fragment
stream because it breaks snapshot isolation. We cannot guarantee that a
range tombstone, or even the partition, started before the recreation will
continue after it. So we have to make sure to wrap up all loose ends when
recreating the reader. We already close uncontinued partitions. This
commit also takes care of closing any range tombstone started by
unconditionally emitting a null range tombstone. This is legal to do,
even if no range tombstone was in effect.

(cherry picked from commit 9e48237b86)
2022-09-14 19:15:50 +03:00
Botond Dénes
4f26d489a0 evictable_reader_v2: restore handling of non-monotonically increasing positions
We thought that, unlike v1, v2 would not need this. But it does.
Handled similarly to how v1 did it: we ensure each buffer represents
forward progress, when the last fragment in the buffer is a range
tombstone change:
* Ensure the content of the buffer represents progress w.r.t.
  _next_position_in_partition, thus ensuring the next time we recreate
  the reader it will continue from a later position.
* Continue reading until the next (peeked) fragment has a strictly
  larger position.

The code is just much nicer because it uses coroutines.

(cherry picked from commit 6db08ddeb2)
2022-09-14 19:15:49 +03:00
Botond Dénes
43cbc5c836 evictable_reader_v2: simplify handling of reader recreation
The evictable reader has a handful of flags dictating what to do after
the reader is recreated: what to validate, what to drop, etc. We
actually need a single flag telling us if the reader was recreated or
not; all other things can be derived from existing fields.
This patch does exactly that. Furthermore it folds do_fill_buffer() into
fill_buffer() and replaces the awkward-to-use `should_drop_fragment()`
with `examine_first_fragments()`, which does a much better job of
encapsulating all validation and fragment dropping logic.
This code reorganization also fixes two bugs introduced by the v2
conversion:
* The loop in `do_fill_buffer()` could become infinite in certain
  circumstances due to a difference between the v1 and v2 versions of
  `is_end_of_stream()`.
* The position of the first non-dropped fragment was not validated
  (this was integrated into the range tombstone trimming which was
  thrown out by the conversion).

(cherry picked from commit 498d03836b)
2022-09-14 19:15:49 +03:00
Nadav Har'El
f0c521efdf alternator: clean error shutdown in case of TLS misconfiguration
The way our boot-time service "controllers" are written, if a
controller's start_server() finds an error and throws, it cannot expect
the caller (main.cc) to call stop_server(), and must clean up
resources already created (e.g., sharded services) before returning
or risk crashes on assertion failures.

This patch fixes such a mistake in Alternator's initialization.
As noted in issue #10025, if the Alternator TLS configuration is
broken - especially the certificate or key files are missing -
Scylla would crash on an assertion failure, instead of reporting
the error as expected. Before this patch such a misconfiguration
will result in the unintelligible:

<alternator::server>::~sharded() [Service = alternator::server]:
Assertion `_instances.empty()' failed. Aborting on shard 0.

After this patch we get the right error message:

ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed:
std::_Nested_exception<std::runtime_error> (Failed to set up Alternator
TLS credentials): std::_Nested_exception<std::runtime_error> (Could not
read certificate file conf/scylla.crt): std::filesystem::__cxx11::
filesystem_error (error system:2, filesystem error: open failed:
No such file or directory [conf/scylla.crt])

Arguably this error message is a bit ugly, so I opened
https://github.com/scylladb/seastar/issues/1029, but at least it says
exactly what the error is.

Fixes #10025
Fixes #11520

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>
(cherry picked from commit 7f89c8b3e3)
2022-09-11 14:43:18 +03:00
Beni Peled
b9a61c8e9a release: prepare for 5.0.3 2022-09-07 11:16:52 +03:00
Karol Baryła
32aa1e5287 transport/server.cc: Return correct size of decompressed lz4 buffer
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.

Fixes #11476

(cherry picked from commit 1c2eef384d)
2022-09-07 10:58:42 +03:00
Nadav Har'El
da6a126d79 cross-tree: fix header file self-sufficiency
Scylla's coding standard requires that each header is self-sufficient,
i.e., it includes whatever other headers it needs - so it can be included
without having to include any other header before it.

We have a test for this, "ninja dev-headers", but it isn't run very
frequently, and it turns out our code deviated from this requirement
in a few places. This patch fixes those places, and after it
"ninja dev-headers" succeeds again.

This is needed because our CI runs "ninja dev-headers".

Fixes #10995

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11457
2022-09-06 15:45:34 +03:00
Avi Kivity
d07e902983 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore will not end up accessing a dangling table reference when destroyed later.

Fixes: https://github.com/scylladb/scylladb/issues/11264

Closes #11273

* github.com:scylladb/scylladb:
  querier: querier_cache: remove now unused evict_all_for_table()
  database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
  reader_concurrency_semaphore: add evict_inactive_reads_for_table()

(cherry picked from commit afa7960926)
2022-09-02 11:39:43 +03:00
Piotr Sarna
3c0fc42f84 cql3: fix misleading error message for service level timeouts
The error message incorrectly stated that the timeout value cannot
be longer than 24h, but it can - the actual restriction is that the
value cannot be expressed in units like days or months, which was done
in order to significantly simplify the parsing routines (and the fact
that timeouts counted in days are not expected to be common).

Fixes #10286

Closes #10294

(cherry picked from commit 85e95a8cc3)
2022-09-01 20:34:12 +03:00
Piotr Grabowski
964ccf9192 type_json: support integers in scientific format
Add support for specifying integers in scientific format (for example
1.234e8) in INSERT JSON statement:

INSERT INTO table JSON '{"int_column": 1e7}';

Inserting a floating-point number ending with .0 is allowed, as
the fractional part is zero. Non-zero fractional part (for example
12.34) is disallowed. A new test is added to test all those behaviors.

Before the JSON parsing library was switched to RapidJSON from JsonCpp,
this statement used to work correctly, because JsonCpp transparently
casts double to integer value.

This behavior differs from Cassandra, which disallows those types of
numbers (1e7, 123.0 and 12.34).

Fix a typo in the if condition: "if (value.GetUint64())" should be
"if (value.IsUint64())".

Fixes #10100

(cherry picked from commit efe7456f0a)
2022-09-01 16:03:49 +03:00
Avi Kivity
dfdc128faf Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec
Scenario:

cache = [
    row(pos=2, continuous=false),
    row(pos=after(2), dummy=true)
]

Scanning read starts, starts populating [-inf, before(2)] from sstables.

row(pos=2) is evicted.

cache = [
    row(pos=after(2), dummy=true)
]

Scanning read finishes reading from sstables.

Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.

The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.

Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).

Fixes #11239

Closes #11240

* github.com:scylladb/scylladb:
  test: mvcc: Fix illegal use of maybe_refresh()
  tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
  tests: row_cache_test: Introduce one_shot mode to throttle
  row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
2022-08-11 18:36:44 +02:00
Yaron Kaikov
299122e78d release: prepare for 5.0.2 2022-08-07 16:15:02 +03:00
Avi Kivity
23a34d7e42 Merge 'Backport: Fix map subscript crashes when map or subscript is null' from Nadav Har'El
This is a backport of https://github.com/scylladb/scylla/pull/10420 to branch 5.0.
Branch 5.0 had somewhat different code in this expression area, so the backport was not automatic, but it was nevertheless fairly straightforward - just copying the exact same checking code to its right place, and keeping the exact same tests to see that we indeed fixed the bug.

Refs #10535.

The original cover letter from https://github.com/scylladb/scylla/pull/10420:

In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues https://github.com/scylladb/scylla/issues/10361, https://github.com/scylladb/scylla/issues/10399, https://github.com/scylladb/scylla/pull/10401. The existing test test_null.py::test_map_subscript_null reproduced all these bugs sporadically.

In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then define both m[NULL] and NULL[2] to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., NULL < 2 is also defined as resulting in NULL.

However, this decision differs from Cassandra, where m[NULL] is considered an error but NULL[2] is allowed. We believe that making m[NULL] be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicated expressions like m[a], where the column a can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.

Fixes https://github.com/scylladb/scylla/issues/10361
Fixes https://github.com/scylladb/scylla/issues/10399
Fixes https://github.com/scylladb/scylla/pull/10401

Closes #11142

* github.com:scylladb/scylla:
  test/cql-pytest: reproducer for CONTAINS NULL bug
  expressions: don't dereference invalid map subscript in filter
  expressions: fix invalid dereference in map subscript evaluation
  test/cql-pytest: improve tests for map subscripts and nulls
2022-07-28 15:31:28 +03:00
Nadav Har'El
67a2f3aa67 test/cql-pytest: reproducer for CONTAINS NULL bug
This is a reproducer for issue #10359: "CONTAINS NULL" and
"CONTAINS KEY NULL" restrictions should not match any set, but currently
do match non-empty or all sets.

The tests currently fail on Scylla, so they are marked xfail. They also fail on
Cassandra because Cassandra considers such a request an error, which
we consider a mistake (see #4776) - so the tests are marked "cassandra_bug".

Refs #10359.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220412130914.823646-1-nyh@scylladb.com>
(cherry picked from commit ae0e1574dc)
2022-07-27 20:03:30 +03:00
Nadav Har'El
66e8cf8cea expressions: don't dereference invalid map subscript in filter
If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).

Cassandra returns an invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:

 * For NULL, we consider m[NULL] to result in NULL - instead of an error.
   This behavior is more consistent with other expressions that contain
   null - for example NULL[2] and NULL<2 both result in NULL as well.
   Moreover, if in the future we allow more complex expressions, such
   as m[a] (where a is a column), we can find the subscript to be null
   for some rows and non-null for other rows - and throwing an "invalid
   query" in the middle of the filtering doesn't make sense.

 * For UNSET_VALUE, we do consider this an error like Cassandra, and use
   the same error message as Cassandra. However, the current implementation
   checks for this error only when the expression is evaluated - not
   before. It means that if the scan is empty before the filtering, the
   error will not be reported and we'll silently return an empty result
   set. We currently consider this ok, but we can also change this in the
   future by binding the expression only once (today we do it on every
   evaluation) and validating it once after this binding.

Fixes #10361
Fixes #10399

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit fbb2a41246)
2022-07-27 19:56:17 +03:00
Nadav Har'El
35b66c844c expressions: fix invalid dereference in map subscript evaluation
When we have a filter such as "WHERE m[2] = 3" (where m is a map
column), if a row had a null value for m, our expression evaluation
code incorrectly dereferenced an unset optional and continued
processing the result of this dereference, which resulted in undefined
behavior - sometimes we were lucky enough to get a "marshaling error",
but other times Scylla crashed.

The fix is trivial - just check before dereferencing the optional value
of the map. We return null in that case, which means that we consider
the result of null[2] to be null. I think this is a reasonable approach
and fits our overall approach of making null dominate expressions (e.g.,
the value of "null < 2" is also null).

The test test_filtering.py::test_filtering_null_map_with_subscript,
which used to frequently fail with marshaling errors or crashes, now
passes every time so its "xfail" mark is removed.

Fixes #10417

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 808a93d29b)
2022-07-27 19:50:24 +03:00
Nadav Har'El
9e7a1340b9 test/cql-pytest: improve tests for map subscripts and nulls
The test test_null.py::test_map_subscript_null turned out to reproduce
multiple bugs related to using map subscripts in filtering expressions.
One was issue #10361 (m[null] resulted in a bizarre error) or #10399
(m[null] resulted in a crash), and a different issue was #10401 (m[2]
resulted in a bizarre error or a crash if m itself was null). Moreover,
the same test uncovered different bugs depending on how it was run - alone
or with other tests - because it was using a shared table.

In this patch we introduce two separate tests in test_filtering.py
which are designed to reproduce these separate bugs instead of mixing
them into one test. The new tests also cover a few more corners which
the previous test (which focused on nulls) missed - such as UNSET_VALUE.

The two new tests (and the old test_map_subscript_null) pass on
Cassandra so still assume that the Cassandra behavior - that m[null]
should be an error - is the correct behavior. We may want to change
the desired behavior (e.g., to decide that m[null] be null, not an
error), and change the tests accordingly later - but for now the
tests follow Cassandra's behavior exactly, and pass on Cassandra
and fail on Scylla (so are marked xfail).

The bugs reproduced by these tests involve randomness or reading
uninitialized memory, so these tests sometimes pass, sometimes fail,
and sometimes even crash (as reported in #10399 and #10401). So to
reproduce these bugs run the tests multiple times. For example:

    test/cql-pytest/run --count 100 --runxfail
        test_filtering.py::test_filtering_null_map_with_subscript

Refs #10361
Refs #10399
Refs #10401

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 189b8845fe)
2022-07-27 19:28:17 +03:00
Benny Halevy
d5a0750ef3 multishard_mutation_query: do_query: stop ctx if lookup_readers fails
lookup_readers might fail after populating some readers
and those must be closed before returning the exception.

Fixes #10351

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10425

(cherry picked from commit 055141fc2e)
2022-07-25 14:52:44 +03:00
Benny Halevy
618c483c73 sstables: time_series_sstable_set: insert: make exception safe
We need to erase the shared sstable from _sstables
if insertion into _sstables_reversed fails.
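
A hedged sketch of the exception-safety pattern with the container types and member names simplified, not the actual time_series_sstable_set code: if the second insertion throws, the first one is rolled back so the two containers stay consistent.

#include <set>

struct sstable_like { int generation = 0; };

struct by_gen {
    bool operator()(const sstable_like& a, const sstable_like& b) const {
        return a.generation < b.generation;
    }
};
struct by_gen_rev {
    bool operator()(const sstable_like& a, const sstable_like& b) const {
        return b.generation < a.generation;
    }
};

struct dual_index {
    std::set<sstable_like, by_gen> _sstables;
    std::set<sstable_like, by_gen_rev> _sstables_reversed;

    void insert(const sstable_like& sst) {
        auto [it, inserted] = _sstables.insert(sst);
        try {
            _sstables_reversed.insert(sst);
        } catch (...) {
            if (inserted) {
                // Undo the first insertion so the two containers stay consistent.
                _sstables.erase(it);
            }
            throw;
        }
    }
};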

Fixes #10787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit cd68b04fbf)
2022-07-25 14:21:45 +03:00
Tomasz Grabiec
f10fd1bc12 test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificially set the soft-pressure
condition, hoping that the background flusher would flush the
memtable. That won't happen if, by the time the background flusher runs,
the LSA region is updated and the soft pressure (which is not really
there) is lifted. Once apply() becomes preemptible, background partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.

Fixes #10801
Refs #10793

(cherry picked from commit 0e78ad50ea)

Closes #10802

(cherry picked from commit 3bec1cc19f)
2022-07-25 14:19:48 +03:00
Tomasz Grabiec
1891f10141 memtable: Fix missing range tombstones during reads under certain rare conditions
There is a bug introduced in e74c3c8 (4.6.0) which makes the memtable
reader skip a range tombstone for a certain pattern of deletions
and under a certain sequence of events.

_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.

The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.

For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:

[a, c] @ t0
[a, b) @ t1, [b, d] @ t2

c > b

The proper sequence for such versions is (assuming d > c):

[a, b) @ t1,
[b, d] @ t2

Due to the bug, the reader will produce:

[a, b) @ t1,
[b, c] @ t0

The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).

The problem goes away once MVCC versions merge.

Fixes #10913
Fixes #10830

Closes #10914

(cherry picked from commit a6aef60b93)
2022-07-19 19:33:51 +03:00
Pavel Emelyanov
b177dacd36 Update seastar submodule (auto-increase latency goal fixes)
* seastar dbf79189...9a7ba6d5 (3):
  > io: Adjust IO latency goal on fair-queue level
  > reactor: Check IOPS/bandwidth and increase latency goal
  > Revert "io_queue: Auto-increase the io-latency goal"

refs: #10927

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 13:06:43 +03:00
Yaron Kaikov
283a722923 release: prepare for 5.0.1 2022-07-19 06:39:11 +03:00
Pavel Emelyanov
522d0a81e7 azure_snitch: Do nothing on non-io-cpu
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across the others. All but azure really do so.
The azure one gets dc/rack on all shards, which is excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.

fixes: #10494

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c6d0bc87d0)
2022-07-17 14:13:25 +03:00
Pavel Emelyanov
cd13911db4 Merge 'Scrub compaction: prevent mishandling of range tombstone changes' from Botond
With v2 having individual bounds of range tombstone as separate
fragments, out-of-order fragments become more difficult to handle,
especially in the presence of an active range tombstone.
Scrub in both SKIP and SEGREGATE modes closes the partition on
seeing the first invalid fragment (SEGREGATE re-opens it immediately).
If there is an active range tombstone, scrub now also has to take care
of closing said tombstone when closing the partition. In a normal stream
it could just use the last position-in-partition to create a closing
bound. But when out-of-order fragments are on the table this is not
possible: the closing bound may be found later in the stream, with a
position smaller than that of the current position-in-partition.
To prevent extending range tombstone changes like that, Scrub now aborts
the compaction on the first invalid fragment seen *inside* an active
range tombstone.
Fixing a v2 stream with range tombstone changes is definitely possible,
but non-trivial, so we defer it until there is demand for it.

This series also makes the mutation fragment stream validator check for
open range tombstones on partition-end and adds a comprehensive
test-suite for the validator.

Fixes: #10168

Tests: unit(dev)

* scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git:
  compaction/compaction: abort scrub when attempting to rectify stream with active tombstone
  test/boost/mutation_test: add test for mutation_fragment_stream_validator
  mutation_fragment_stream_validator: validate range tombstone changes

(cherry picked from commit edd0481b38)
2022-07-14 18:49:13 +03:00
Nadav Har'El
32423ebc38 Merge 'Handle errors during snapshot' from Benny Halevy
This series refactors `table::snapshot` and moves the responsibility
to flush the table before taking the snapshot to the caller.

`flush_on_all` and `snapshot_on_all` helpers are added to replica::database
(by making it a peering_sharded_service) and upper layers,
including api and snapshot-ctl now call it instead of calling cf.snapshot directly.

With that, errors are handled in table::snapshot and propagated
back to the callers.

Failure to allocate the `snapshot_manager` object is fatal,
similar to failure to allocate a continuation, since we can't
coordinate across the shards without it.

Test: unit(dev), rest_api(debug)

* github.com:scylladb/scylla:
  table: snapshot: handle errors
  table: snapshot: get rid of skip_flush param
  database: truncate: skip flush when taking snapshot
  test: rest_api: storage_service: verify_snapshot_details: add truncate
  database: snapshot_on_all: flush before snapshot if needed
  table: make snapshot method private
  database: add snapshot_on_all
  snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
  snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
  api: storage_service: increase visibility of snapshot ops in the log
  api: storage_service: coroutinize take_snapshot and del_snapshot
  api: storage_service: take_snapshot: improve api help messages
  test: rest_api: storage_service: add test_storage_service_snapshot
  database: add flush_on_all variants
  test: rest_api: add test_storage_service_flush

(cherry picked from commit 2c39c4c284)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10975
2022-07-12 15:24:24 +03:00
Pavel Emelyanov
97054ee691 view: Fix trace-state pointer use after move
It's moved into .mutate_locally() but it is also captured and used in its
continuation. This happens to work only because the moved-from pointer looks
like nullptr and all the tracing code checks that it is non-null.
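
For illustration, a minimal standalone sketch of the safe pattern (plain C++ with hypothetical stand-ins, not the actual Scylla code or fix):

```
#include <functional>
#include <iostream>
#include <memory>
#include <string>

// Stand-in for tracing::trace_state_ptr; the real type is a Seastar smart pointer.
using trace_state_ptr = std::shared_ptr<std::string>;

// Stand-in for mutate_locally(): consumes the pointer, then runs the continuation.
void mutate_locally(trace_state_ptr tr, const std::function<void()>& continuation) {
    continuation();
}

int main() {
    trace_state_ptr tr = std::make_shared<std::string>("trace session");

    // Safe pattern: copy the pointer for the continuation *before* moving it
    // into the call. The buggy pattern captured `tr` and read the moved-from
    // pointer, which merely happens to look like nullptr.
    auto tr_for_continuation = tr;
    mutate_locally(std::move(tr), [tr_for_continuation] {
        if (tr_for_continuation) {
            std::cout << *tr_for_continuation << "\n";
        }
    });
}
```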

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/
       (CI job failed on post-actions thus it's red)

Fixes #11015

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220711134152.30346-1-xemul@scylladb.com>
(cherry picked from commit 5526738794)
2022-07-12 14:20:57 +03:00
Piotr Sarna
34085c364f view: exclude using static columns in the view filter
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).

Fixes #10851

Closes #10855

(cherry picked from commit bc3a635c42)
2022-07-11 17:06:55 +03:00
Takuya ASADA
323521f4c8 install.sh: install files with correct permission in strict umask setting
To avoid failing to run scripts as a non-root user, we need to set
permissions explicitly on executables.

Fixes #10752

Closes #10840

(cherry picked from commit 13caac7ae6)
2022-07-10 16:46:30 +03:00
Asias He
1ad59d6a7b repair: Do not flush hints and batchlog if tombstone_gc_mode is not repair
The flush of hints and batchlog is needed only for tables with
tombstone_gc_mode set to repair mode. We should skip the flush if the
tombstone_gc_mode is not repair mode.

Fixes #10004

Closes #10124

(cherry picked from commit ec59f7a079)
2022-07-04 10:31:51 +03:00
Nadav Har'El
d3045df9c9 Merge 'types: fix is_string for reversed types' from Piotr Sarna
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
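
A minimal sketch of the shape of the fix, using hypothetical stand-in types rather than Scylla's abstract_type hierarchy: classify the underlying type, not the reversed wrapper.

```
#include <cassert>

// Hypothetical stand-ins for the type system; only the shape of the fix matters.
struct abstract_type {
    virtual ~abstract_type() = default;
    virtual bool is_string_kind() const { return false; }             // this node only
    virtual const abstract_type& underlying_type() const { return *this; }
};

struct utf8_type final : abstract_type {
    bool is_string_kind() const override { return true; }
};

struct reversed_type final : abstract_type {
    const abstract_type& inner;
    explicit reversed_type(const abstract_type& t) : inner(t) {}
    const abstract_type& underlying_type() const override { return inner.underlying_type(); }
};

// The gist of the fix: look through the reversed wrapper before classifying.
bool is_string(const abstract_type& t) {
    return t.underlying_type().is_string_kind();
}

int main() {
    utf8_type text;
    reversed_type desc_text(text);   // a column declared with DESC order
    assert(is_string(text));
    assert(is_string(desc_text));    // failed before the fix
}
```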

Fixes #10183

Closes #10181

* github.com:scylladb/scylla:
  test: add a case for LIKE operator on a descending order column
  types: fix is_string for reversed types

(cherry picked from commit 733672fc54)
2022-07-03 17:59:33 +03:00
Benny Halevy
be48b7aa8b compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.

However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.

Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.

The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.

Fixes #10151

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
(cherry picked from commit 0764e511bb)
2022-07-03 14:28:47 +03:00
Takuya ASADA
3c4688bcfa scylla_coredump_setup: support new format of Storage field
The Storage field of "coredumpctl info" changed in systemd v248: it now appends
"(present)" at the end of the line when the coredump file is available.

Fixes #10669

Closes #10714

(cherry picked from commit ad2344a864)
2022-07-03 13:55:18 +03:00
Nadav Har'El
cc22021876 alternator: forbid empty AttributesToGet
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression parameters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.

Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax

However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.

So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).
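
A sketch of the added check in plain C++, with a hypothetical request struct standing in for the parsed rjson request:

```
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical request representation; the real code inspects the rjson request.
struct read_request {
    std::optional<std::vector<std::string>> attributes_to_get;
};

void validate_attributes_to_get(const read_request& req) {
    // An empty AttributesToGet is rejected; "no attributes" must be expressed
    // with Select=COUNT instead.
    if (req.attributes_to_get && req.attributes_to_get->empty()) {
        throw std::invalid_argument("ValidationException: AttributesToGet must not be empty");
    }
}

int main() {
    read_request absent{};
    read_request filled{std::vector<std::string>{"a", "b"}};
    read_request empty{std::vector<std::string>{}};

    validate_attributes_to_get(absent);    // no AttributesToGet at all: fine
    validate_attributes_to_get(filled);    // non-empty list: fine
    try {
        validate_attributes_to_get(empty); // empty list: rejected
    } catch (const std::invalid_argument&) {
    }
}
```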

Fixes #10332

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
(cherry picked from commit 9c1ebdceea)
2022-07-03 13:35:50 +03:00
Yaron Kaikov
c9e79cb4a3 release: prepare for 5.0.0 2022-06-28 15:51:29 +03:00
Yaron Kaikov
f28542a71e release: prepare for 5.0.rc8 2022-06-12 14:44:47 +03:00
Pavel Emelyanov
527a75a4c0 Update seastar submodule (Calculate max IO lengths as lengths)
* seastar 8b2c13b3...dbf79189 (1):
  > Merge 'Calculate max IO lengths as lengths'
     io_queue: Type alias for internal::io_direction_and_length
     io_queue, fair_group: Throw instead of assert
     io_queue: Keep max lengths on board
     io_queue: Toss request_fq_ticket()
     io_queue: Introduce make_ticket() helper
     io_queue: Remove max_ticket_size
     io_queue: Make make_ticket() non-brancy
     io_queue: Add devid to group creation log

tests: cstress(release)
fixes: #10704
2022-06-09 21:15:21 +03:00
Avi Kivity
df00f8fcfb Update seastar submodule (json crash in describe_ring)
* seastar 7a430a0830...8b2c13b346 (1):
  > Merge 'stream_range_as_array: always close output stream' from Benny Halevy

Fixes #10592.
2022-06-08 16:48:28 +03:00
Yaron Kaikov
41a00c744f release: prepare for 5.0.rc7 2022-06-02 15:13:59 +03:00
Avi Kivity
2d7b6cd702 messaging: do isolate default tenants
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.

In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:

 scheduling_group
 messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
     return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
     if (must_compress) {
         opts.compressor_factory = &compressor_factory;
     }
     opts.tcp_nodelay = must_tcp_nodelay;
     opts.reuseaddr = true;
-    opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    // We send cookies only for non-default statement tenant clients.
+    if (idx > 3) {
+        opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    }

This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was changed to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.

Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load sees its 99th percentile latency increase by just
3ms during repair, compared to 10ms+ before.

Fixes #9505.

Closes #10673

(cherry picked from commit c83393e819)
2022-06-01 17:20:30 +03:00
Avi Kivity
ff79228178 Merge 'Allow trigger off strategy compaction early for node operations' from Asias He
This patch set adds two commits to allow triggering off-strategy compaction early for node operations.

*) repair: Repair table by table internally

This patch changes the way a repair job walks through tables and ranges
if multiple tables and ranges are requested by users.

Before:

```
for range in ranges
   for table in tables
       repair(range, table)
```

After:

```
for table in tables
    for range in ranges
       repair(range, table)
```

The motivation for this change is to allow off-strategy compaction to trigger
early, as soon as a table is finished. This reduces the number of
temporary sstables on disk. For example, if there are 50 tables and 256 ranges
to repair, each range will generate one sstable. Before this change, there will
be 50 * 256 sstables on disk before off-strategy compaction triggers. After this
change, once a table is finished, off-strategy compaction can compact the 256
sstables. As a result, this would reduce the number of sstables by 50X.

This is very useful for repair based node operations since multiple ranges and
tables can be requested in a single repair job.

Refs: #10462

*) repair: Trigger off strategy compaction after all ranges of a table is repaired

When the repair reason is not repair, i.e. the repair was triggered by a
node operation (bootstrap, replace and so on), a single repair job contains all
the ranges of a table that need to be repaired.

To trigger off strategy compaction early and reduce the number of
temporary sstable files on disk, we can trigger the compaction as soon
as a table is finished.

Refs: #10462

Closes #10551

* github.com:scylladb/scylla:
  repair: Trigger off strategy compaction after all ranges of a table is repaired
  repair: Repair table by table internally

(cherry picked from commit e65b3ed50a)
2022-06-01 14:17:01 +03:00
Nadav Har'El
1803124cc6 alternator: allow DescribeTimeToLive even without TTL enabled
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!

This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.

The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.

After this patch, the following test now works on Scylla without
experimental flags turned on:

    test/alternator/run test_ttl.py::test_describe_ttl_without_ttl

Refs #10660

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 8ecf1e306f)
2022-05-30 20:08:41 +03:00
Takuya ASADA
6fcbf66bfb scylla_sysconfig_setup: handle >=32CPUs correctly
Seems like 59adf05 has a bug: the regex pattern only handles the first
32 CPUs of the cpuset and ignores the rest.
We should extend the regex pattern to handle all CPUs.

Fixes #10523

Closes #10524

(cherry picked from commit a9dfe5a8f4)
2022-05-30 14:27:27 +03:00
Takuya ASADA
e9a3dee234 scylla_sysconfig_setup: avoid parse error on perftune.py --get-cpu-mask
Currently, we just pass the entire output of perftune.py when getting the CPU
mask from the script, but this may cause a parse error since the script may
also print warning messages.

To avoid that, we need to extract the CPU mask from the output.

Fixes #10082

Closes #10107

(cherry picked from commit 59adf05951)
2022-05-30 14:25:21 +03:00
Avi Kivity
279cd44c7f Update seastar submodule (xfs project attribute zeroed)
* seastar 6745a43c10...7a430a0830 (1):
  > file: don't trample on xfs flags when setting xfs size hint

Fixes #10667.
2022-05-29 17:43:43 +03:00
Avi Kivity
c99f768381 Merge 'Rework off strategy compaction locking for branch 5.0' from Raphael "Raph" Carvalho
The first patch removes an incorrect usage of the rwlock, which should be restricted to minor and major compaction tasks.

The second patch revives a semaphore which was lost in 6737c88045, as we want major compaction not to wait on off-strategy completion before deciding whether or not it should proceed with execution. It wouldn't proceed with execution if the user asked major to stop while it was waiting for a chance to run.

For master, we're going to rely on abortable variant of get_units() to allow major to be quickly aborted.

Fixes #10485.

Closes #10582

* github.com:scylladb/scylla:
  compaction_manager: Revive custom job semaphore
  compaction_manager: Remove rwlock usage in run_custom_job()
2022-05-29 17:38:01 +03:00
Tomasz Grabiec
89a540d54a sstable: partition_index_cache: Fix abort on bad_alloc during page loading
When entry loading fails and there is another request blocked on the
same page, an attempt to erase the failed entry will abort because that
would violate the entry_ptr guarantees, which are supposed to keep the
entry alive.

The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch takes a more
general approach by relaxing the assert.

The assert manifested like this:

scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.

Fixes #10617

Closes #10653

(cherry picked from commit f87274f66a)
2022-05-27 09:50:32 +03:00
Yaron Kaikov
338edcc02e release: prepare for 5.0.rc6 2022-05-23 11:37:37 +03:00
Avi Kivity
a8eb5164b2 Update seastar submodule (io_queue delay metrics in 25ms granularity)
* seastar 4a30c44c4c...6745a43c10 (1):
  > metrics: Report IO total times as real numbers

Ref #10392
2022-05-19 18:20:15 +03:00
Raphael S. Carvalho
9accb44f9c compaction_manager: Revive custom job semaphore
In commit 6737c88045, we started using a single semaphore for
maintenance operations, which is a good change.

However, after introduction of off-strategy, major cannot proceed
until off-strategy is done reshaping all its input files.

If user requests major to abort, the command will only return
once off-strategy is done, and that can take lots of time.

In master, we'll allow pending major to be quickly aborted, but
that's not possible here as abortable variant of get_units()
is not available yet.

Here, we'll allow major to proceed in parallel to off-strategy,
so major can decide whether or not it should run in parallel.

Fixes #10485.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-05-16 20:46:31 -03:00
Raphael S. Carvalho
8878007106 compaction_manager: Remove rwlock usage in run_custom_job()
The rwlock usage was introduced in 2017 commit 10eaa2339e.

Resharding was online back then and we wanted to serialize it with
major.

Rwlock usage should be restricted to major and minor, as clearly
stated in the documentation, but we're still using it in
run_custom_job().

It gains us nothing, it only prevents off-strategy and other
custom jobs from running concurrently to major.

Let's kill this as we want to allow off-strategy to not prevent
a major from happening in parallel, as the former works only
on the maintenance sstable set and won't interfere with
the latter.

Refs #10485.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-05-16 20:45:54 -03:00
Yaron Kaikov
9da666e778 release: prepare for 5.0.rc5 2022-05-15 22:09:16 +03:00
Benny Halevy
aca355dec1 table: clear: serialize with ongoing flush
Get all flush permits to serialize with any
ongoing flushes and to prevent further flushes
during table::clear, in particular while calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.

Fixes #10423

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
2022-05-15 13:39:03 +03:00
Raphael S. Carvalho
efbb2efd3f compaction: LCS: don't write to disengaged optional on compaction completion
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup

Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().

_last_compacted_keys is used by regular compaction for its round-robin file
picking policy. It stores the last compacted key for each level, meaning it's
irrelevant for any other compaction type.

Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.

To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
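
A tiny sketch of the guard, with std::optional standing in for the strategy's _last_compacted_keys state (illustrative, not the actual class):

```
#include <map>
#include <optional>
#include <string>

struct lcs_state {
    std::optional<std::map<int, std::string>> last_compacted_keys;

    void notify_completion(int level, std::string last_key) {
        if (!last_compacted_keys) {
            // Regular compaction has not initialized the round-robin state yet
            // (e.g. only a maintenance compaction such as cleanup ran), so there
            // is nothing to update; dereferencing would throw bad_optional_access.
            return;
        }
        (*last_compacted_keys)[level] = std::move(last_key);
    }
};

int main() {
    lcs_state s;
    s.notify_completion(1, "pk17"); // safe no-op before regular compaction initializes the state
}
```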

compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.

Fixes #10378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10508

(cherry picked from commit 8e99d3912e)
2022-05-15 13:20:11 +03:00
Eliran Sinvani
44dc5c4a1d Revert "table: disable_auto_compaction: stop ongoing compactions"
This reverts commit 4affa801a5.
In issue #10146 a write throughput drop of ~50% was reported. After
bisecting, it was found that the change that caused it was adding some
code to table::disable_auto_compaction which stops ongoing
compactions and returns a future that resolves once all the compaction
tasks for a table, if any, have terminated. It turns out that this function
is used only at startup (and in REST API calls, which are not used in the test)
in the distributed loader just before resharding and loading of
the sstable data. Auto compaction is then re-enabled after the resharding
and loading is done.
For a still unknown reason, adding the extra logic of stopping ongoing
compactions made the write throughput drop to 50%.
Strangely enough this extra logic **should** (still unvalidated) not
have any side effects since no compactions for a table are supposed to
be running prior to loading it.
This regains the performance but also undoes a change which eventually
should get in once we find the actual culprit.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #10559

Reopens #9313.

(cherry picked from commit 8e8dc2c930)
2022-05-15 08:50:38 +03:00
Juliusz Stasiewicz
6b34ba3a4f CQL: Replace assert by exception on invalid auth opcode
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.

Fixes #10487

Closes #10503

(cherry picked from commit 603dd72f9e)
2022-05-10 14:04:52 +02:00
Yaron Kaikov
f1e25cb4a6 release: prepare for 5.0.rc4 2022-05-10 07:35:53 +03:00
Benny Halevy
c9798746ae compaction: time_window_compaction_strategy: reset estimated_remaining_tasks when running out of candidates
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.

Refs #10418
(to be closed when the fix reaches branch-4.6)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10419

(cherry picked from commit 01f41630a5)
2022-05-09 09:35:53 +03:00
Eliran Sinvani
7f70ffc5ce prepared_statements: Invalidate batch statement too
It seems that batch prepared statements always return false for
depends_on; this in turn renders the removal criteria of the
prepared statements cache always false, which results in these
queries never being evicted.
Here we change the function to return the true state, meaning
it will return true if one of the sub-queries is dependent
upon the keyspace and/or column family.
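
A minimal sketch of the change in shape (hypothetical classes, not the actual cql3 statement hierarchy):

```
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

// Illustrative shapes only; the real classes live in cql3/statements.
struct statement {
    virtual ~statement() = default;
    virtual bool depends_on(const std::string& ks, const std::string& cf) const = 0;
};

struct modification_statement final : statement {
    std::string keyspace, column_family;
    bool depends_on(const std::string& ks, const std::string& cf) const override {
        return keyspace == ks && (cf.empty() || column_family == cf);
    }
};

struct batch_statement final : statement {
    std::vector<std::unique_ptr<statement>> statements;
    // Before the fix this returned false unconditionally, so batches were never
    // evicted from the prepared-statement cache on keyspace/table changes.
    bool depends_on(const std::string& ks, const std::string& cf) const override {
        return std::any_of(statements.begin(), statements.end(),
                           [&] (const auto& s) { return s->depends_on(ks, cf); });
    }
};

int main() {
    auto m = std::make_unique<modification_statement>();
    m->keyspace = "ks1";
    m->column_family = "table1";

    batch_statement batch;
    batch.statements.push_back(std::move(m));
    return batch.depends_on("ks1", "table1") ? 0 : 1; // true after the fix
}
```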

Fixes #10129

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
2022-05-08 12:31:42 +03:00
Eliran Sinvani
551636ec89 cql3 statements: Change dependency test API to better express its
purpose

CQL statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The former took as a parameter only a table
name, which makes no sense: there could be multiple tables with the same
name, each in a different keyspace, and it doesn't make sense to
generalize the test - i.e. to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls into one - depends_on - that takes a
keyspace name and optionally also a table name; that way every logical
dependency test that makes sense is supported by a single API call.

(cherry picked from commit bf50dbd35b)

Ref #10129
2022-05-08 12:31:02 +03:00
Raphael S. Carvalho
e1130a01e7 table: Close reader if flush fails to peek into fragment
An OOM failure while peeking into a fragment, to determine if the reader will
produce any fragments, causes Scylla to abort, as flat_mutation_reader
expects the reader to be closed before it is destroyed. Let's close it if
peek() fails, to handle the scenario more gracefully.

Fixes #10027.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220204031553.124848-1-raphaelsc@scylladb.com>
(cherry picked from commit 755cec1199)
2022-05-08 12:16:15 +03:00
Calle Wilund
b0233cb7c5 cdc: Ensure columns removed from log table are registered as dropped
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.

Should probably be backported to all CDC enabled versions.

Fixes #10473
Closes #10474

(cherry picked from commit 78350a7e1b)
2022-05-05 11:38:18 +02:00
Avi Kivity
e480c5bf4d Merge 'loading_cache: force minimum size of unprivileged ' from Piotr Grabowski
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.

When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.

This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.

To correctly check if the unprivileged section might get too small after
dropping an entry, the `_current_size` variable, which tracked the overall size
of the cache, is split into two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking the section sizes separately.
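
A toy model of the shrink policy, tracking only the two section sizes (not the real loading_cache, and the threshold rule is simplified):

```
#include <cassert>
#include <cstddef>

struct two_section_cache {
    std::size_t max_size;
    std::size_t privileged_size = 0;
    std::size_t unprivileged_size = 0;

    // Drop one entry, but never starve the unprivileged section below
    // max_size / 2 while the privileged section still has entries to give up.
    void drop_one() {
        bool protect_unprivileged = unprivileged_size <= max_size / 2;
        if (privileged_size > 0 && (unprivileged_size == 0 || protect_unprivileged)) {
            --privileged_size;
        } else {
            --unprivileged_size;
        }
    }

    void shrink() {
        while (privileged_size + unprivileged_size > max_size) {
            drop_one();
        }
    }
};

int main() {
    // The example from the description: 49 privileged entries, 5 new ones.
    two_section_cache c{.max_size = 50, .privileged_size = 49, .unprivileged_size = 5};
    c.shrink();
    // With the minimum-size rule, the 5 freshly inserted (unprivileged) entries
    // survive and the privileged section shrinks instead.
    assert(c.unprivileged_size == 5);
    assert(c.privileged_size + c.unprivileged_size <= c.max_size);
}
```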

New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added that sets up a CQL environment with a very small
prepared statement cache, reproduces the issue in #10440, and stresses the cache.

Fixes #10440.

Closes #10456

* github.com:scylladb/scylla:
  loading_cache_test: test prepared stmts cache
  loading_cache: force minimum size of unprivileged
  loading_cache: extract dropping entries to lambdas
  loading_cache: separately track size of sections
  loading_cache: fix typo in 'privileged'

(cherry picked from commit 5169ce40ef)
2022-05-04 14:35:53 +03:00
Tomasz Grabiec
7d90f7e93f loading_cache: Make invalidation take immediate effect
There are two issues with current implementation of remove/remove_if:

  1) If it happens concurrently with get_ptr(), the latter may still
  populate the cache using value obtained from before remove() was
  called. remove() is used to invalidate caches, e.g. the prepared
  statements cache, and the expected semantic is that values
  calculated from before remove() should not be present in the cache
  after invalidation.

  2) As long as there is any active pointer to the cached value
  (obtained by get_ptr()), the old value from before remove() will be
  still accessible and returned by get_ptr(). This can make remove()
  have no effect indefinitely if there is persistent use of the cache.

One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using the old set of
fields. If the driver uses the new schema, the coordinator will fail
to process the value with the following exception:

  User Defined Type value contained too many fields (expected 5, got 6)

The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.

The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
evaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.

Fixes #10117

Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
2022-05-04 14:35:37 +03:00
Avi Kivity
3e6e8579c6 loading_cache: fix indentation of timestamped_val and two nested type aliases
timestamped_val (and two other type aliases) are nested inside loading_cache,
but indented as if they were top-level names. Adjust the indent to
avoid confusion.

Closes #10118

(cherry picked from commit d1a394fd97)

Ref #10117 - backport prerequisite
2022-05-04 14:35:15 +03:00
Avi Kivity
3e98e17d18 Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes
Said method has to evict all querier cache entries belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed: the entries are evicted after the method has waited for all ongoing operations on said table to stop.

Fixes: #10450

Closes #10451

* github.com:scylladb/scylla:
  replica/database: drop_column_family(): drop querier cache entries after waiting for ops
  replica/database: finish coroutinizing drop_column_family()
  replica/database: make remove(const column_family&) private

(cherry picked from commit 7f1e368e92)
2022-05-01 17:22:57 +03:00
Avi Kivity
a214f8cf6e Update tools/java submodule (bad IPv6 addresses in nodetool)
* tools/java b1e09c8b8f...2241a63bda (1):
  > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index

Fixes #10442
2022-04-28 11:33:15 +03:00
Benny Halevy
e8b92fe34d replica: distributed_database: populate_column_family: trigger offstrategy compaction only for the base directory
In https://github.com/scylladb/scylla/issues/10218
we see off-strategy compaction happening on a table
during the initial phases of
`distributed_loader::populate_column_family`.

It is caused by triggering offstrategy compaction
too early, when sstables are populated from the staging
directory in a144d30162.

We need to trigger offstrategy compaction only of the base
table directory, never the staging or quarantine dirs.

Fixes #10218

Test: unit(dev)
DTest: materialized_views_test.py::TestInterruptBuildProcess

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220316152812.3344634-1-bhalevy@scylladb.com>
(cherry picked from commit a1d0f089c8)
2022-04-24 17:38:53 +03:00
Nadav Har'El
fa479c84ac config: fix some types in system.config virtual table
The system.config virtual table prints each configuration variable of
type T based on the JSON printer specified in config_type_for<T>
in db/config.cc.

For two variable types - experimental_features and tri_mode_restriction,
the specified converter was wrong: We used value_to_json<string> or
value_to_json<vector<string>> on something which was *not* a string.
Unfortunately, value_to_json silently cast the given objects into
strings, and the result was garbage: for example, as noted in #10047,
for experimental_features instead of printing a list of features *names*,
e.g., "raft", we got a bizarre list of one-byte strings with each feature's
number (which isn't documented or even guaranteed to not change) as well
as carriage-return characters (!?).

The solution is a new printable_to_json<T> which works on a type T that
can be printed with operator<< - as in fact the above two types can be -
and converts the value into a string or vector of strings using this
operator<<, not a cast.
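
A minimal sketch of the approach (illustrative names; the real printable_to_json also handles vectors of strings and proper JSON escaping):

```
#include <iostream>
#include <sstream>
#include <string>

// Render the value with operator<< and wrap it as a JSON string, instead of
// casting the object itself to a string type.
template <typename T>
std::string printable_to_json_string(const T& value) {
    std::ostringstream os;
    os << value;
    return "\"" + os.str() + "\""; // the real code would also escape the contents
}

// A hypothetical option type that knows how to print itself.
enum class example_mode { quick, thorough };

std::ostream& operator<<(std::ostream& os, example_mode m) {
    return os << (m == example_mode::quick ? "quick" : "thorough");
}

int main() {
    std::cout << printable_to_json_string(example_mode::thorough) << "\n"; // "thorough"
}
```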

Also added a cql-pytest test for reading system.config and in particular
options of the above two types - checking that they contain sensible
strings and not "garbage" like before this patch.

Fixes #10047.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209090421.298849-1-nyh@scylladb.com>
(cherry picked from commit fef7934a2d)
2022-04-14 19:29:08 +03:00
Tomasz Grabiec
40c26dd2c5 utils/chunked_managed_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no user impact.

Fixes #10364.

Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
(cherry picked from commit 0c365818c3)
2022-04-13 09:48:34 +03:00
Tomasz Grabiec
2c6f069fd1 utils/chunked_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no known user impact.

Fixes #10363.

Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)
2022-04-13 09:47:24 +03:00
Avi Kivity
e27dff0c50 transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.

Fix by updating the error codes.
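
The gist of the fix, sketched as standalone C++ with the error codes defined by the CQL native protocol (an illustration, not the Scylla transport code):

```
#include <cstdint>

// Error codes as defined by the CQL native protocol specification.
enum class error_code : int32_t {
    write_timeout = 0x1100,
    read_timeout  = 0x1200,
    read_failure  = 0x1300,
    write_failure = 0x1500,
};

// When talking protocol v3, map the v4-only failure codes onto their timeout
// counterparts, including the numeric code sent on the wire (previously only
// the exception type changed, not the code).
error_code downgrade_for_v3(error_code ec) {
    switch (ec) {
    case error_code::write_failure: return error_code::write_timeout;
    case error_code::read_failure:  return error_code::read_timeout;
    default:                        return ec;
    }
}

int main() {
    return downgrade_for_v3(error_code::write_failure) == error_code::write_timeout ? 0 : 1;
}
```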

A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.

Fixes #5610.

Closes #10362

(cherry picked from commit 987e6533d2)
2022-04-13 09:47:24 +03:00
Tomasz Grabiec
3f03260ffb utils/chunked_managed_vector: Fix corruption in case there is more than one chunk
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.

Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.

Currently, the container is only used in the sstable partition index
cache.

Manifests as crashes in the sstable reader when it touches sstables that have
partition index pages with more than 1638 partition entries.
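
A toy illustration of the invariant (not the real chunked_vector, which has more bookkeeping): after reserve() pre-allocates several chunks, push_back() must address the chunk derived from the element count, not the last allocated chunk.

```
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t ChunkSize = 4>
class toy_chunked_vector {
    std::vector<std::unique_ptr<T[]>> _chunks;
    std::size_t _size = 0;
public:
    void reserve(std::size_t n) {
        while (_chunks.size() * ChunkSize < n) {
            _chunks.push_back(std::make_unique<T[]>(ChunkSize));
        }
    }
    void push_back(T v) {
        reserve(_size + 1);
        // Index by element count, not _chunks.back(): back() may be a chunk
        // that reserve() allocated far ahead of the current write position.
        _chunks[_size / ChunkSize][_size % ChunkSize] = std::move(v);
        ++_size;
    }
    T& operator[](std::size_t i) { return _chunks[i / ChunkSize][i % ChunkSize]; }
    std::size_t size() const { return _size; }
};

int main() {
    toy_chunked_vector<int> v;
    v.reserve(16);               // pre-allocates 4 chunks; only chunk 0 is "current"
    for (int i = 0; i < 10; ++i) {
        v.push_back(i);
    }
    for (int i = 0; i < 10; ++i) {
        assert(v[i] == i);       // would fail if push_back wrote to the last chunk
    }
}
```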

Introduced in 78e5b9fd85 (4.6.0)

Fixes #10290

Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
(cherry picked from commit 41fe01ecff)
2022-04-08 10:53:33 +03:00
Takuya ASADA
1315135fca docker: enable --log-to-stdout which was mistakenly disabled
Since our Docker image moved to Ubuntu, we mistakenly copy
dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not
used in Ubuntu (it should be /etc/default).
So /etc/default/scylla-server is just the default configuration of the
scylla-server .deb package, where --log-to-stdout is 0, same as in a normal installation.

We don't want to keep the duplicated configuration file anyway,
so let's drop dist/docker/etc/sysconfig/scylla-server and configure
/etc/default/scylla-server in build_docker.sh.

Fixes #10270

Closes #10280

(cherry picked from commit bdefea7c82)
2022-04-07 12:13:19 +03:00
Yaron Kaikov
f92622e0de release: prepare for 5.0.rc3 2022-04-06 14:31:03 +03:00
Takuya ASADA
3bca608db5 docker: run scylla as root
Previous versions of the Docker image run scylla as root, but cb19048
accidentally changed it to the scylla user.
To keep compatibility we need to revert this to root.

Fixes #10261

Closes #10325

(cherry picked from commit f95a531407)
2022-04-05 12:46:25 +03:00
Takuya ASADA
a93b72d5dd docker: revert scylla-server.conf service name change
We changed the supervisor service name in cb19048, but this breaks
compatibility with scylla-operator.
To fix the issue we need to revert the service name to the previous one.

Fixes #10269

Closes #10323

(cherry picked from commit 41edc045d9)
2022-04-05 12:40:59 +03:00
Benny Halevy
d58ca2edbd range_tombstone_list: insert_from: correct rev.update range_tombstone in not overlapping case
2nd std::move(start) looks like a typo
in fe2fa3f20d.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220404124741.1775076-1-bhalevy@scylladb.com>
(cherry picked from commit 2d80057617)

Fixes #10326
2022-04-05 12:39:13 +03:00
Alexey Kartashov
75740ace2a dist/docker: fix incorrect locale value
The Docker build script contains an incorrect locale specification for the LC_ALL setting;
this commit fixes that.

Fixes #10310

Closes #10321

(cherry picked from commit d86c3a8061)
2022-04-04 12:51:02 +03:00
Piotr Sarna
d7a1bf6331 cql3: fix qualifying restrictions with IN for indexing
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which led to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.

Fixes #10300

Closes #10302

(cherry picked from commit c0fd53a9d7)
2022-04-03 11:20:49 +03:00
Avi Kivity
bbd7d657cc Update seastar submodule (pidof command not installed)
* seastar 1c0d622ba0...4a30c44c4c (1):
  > seastar-cpu-map.sh: switch from pidof to pgrep
Fixes #10238.
2022-03-29 12:36:06 +03:00
Avi Kivity
f5bf4c81d1 Merge 'replica/database: truncate: temporarily disable compaction on table and views before flush' from Benny Halevy
Flushing the base table triggers view building
and corresponding compactions on the view tables.

Temporarily disable compaction on both the base
table and all its views before flush and snapshot
since those flushed sstables are about to be truncated
anyway right after the snapshot is taken.

This should make truncate go faster.

In the process, this series also embeds `database::truncate_views`
into `truncate` and coroutinizes both

Refs #6309

Test: unit(dev)

Closes #10203

* github.com:scylladb/scylla:
  replica/database: truncate: fixup indentation
  replica/database: truncate: temporarily disable compaction on table and views before flush
  replica/database: truncate: coroutinize per-view logic
  replica/database: open-code truncate_view in truncate
  replica/database: truncate: coroutinize run_with_compaction_disabled lambda
  replica/database: coroutinize truncate
  compaction_manager: add disable_compaction method

(cherry picked from commit aab052c0d5)
2022-03-28 15:40:40 +03:00
Benny Halevy
02e8336659 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger one, because it was written
at a later (wall-clock) time and so would remain longer after it
expires, until purged after gc_grace seconds.
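
A simplified sketch of the tie-break order (illustrative fields only; the real compare_atomic_cell_for_merge also considers liveness and deletion time):

```
#include <compare>
#include <cstdint>

struct cell {
    int64_t timestamp;
    int64_t expiry;   // absolute expiration point
    int64_t ttl;      // original time-to-live
};

// Compare timestamp, then expiry, then ttl in reverse, so that of two
// otherwise-identical cells the one with the smaller ttl (written later in
// wall-clock time) wins the merge.
std::strong_ordering compare_for_merge(const cell& left, const cell& right) {
    if (auto c = left.timestamp <=> right.timestamp; c != 0) {
        return c;
    }
    if (auto c = left.expiry <=> right.expiry; c != 0) {
        return c;
    }
    return right.ttl <=> left.ttl; // reversed: smaller ttl compares greater
}

int main() {
    cell a{.timestamp = 10, .expiry = 1000, .ttl = 300};
    cell b{.timestamp = 10, .expiry = 1000, .ttl = 100};
    // b was written later (same expiry, smaller ttl), so it should win.
    return compare_for_merge(b, a) > 0 ? 0 : 1;
}
```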

Fixes #10173

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
2022-03-24 18:00:11 +02:00
Benny Halevy
601812e11b atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttls.

The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.

This may be triggered by e.g. the spark migrator, which computes the ttl
from the expiry time by subtracting the current time from the expiry
time to produce a respective ttl.

If the cell is migrated multiple times at different times, it will generate
cells that have the same expiry (by design) but different ttl values.

Fixes #10156

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
2022-03-24 18:00:11 +02:00
Benny Halevy
ea466320d2 atomic_cell: compare_atomic_cell_for_merge: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-2-bhalevy@scylladb.com>
(cherry picked from commit d43da5d6dc)
2022-03-24 18:00:11 +02:00
Benny Halevy
25ea831a15 atomic_cell: compare_atomic_cell_for_merge: simplify expiry/deletion_time comparison
No need to first check that the cells' expiry is different
or that deletion_time is different before comparing them
with `<=>`.

If they are the same the function returns std::strong_ordering::equal
anyhow and that is the same as `<=>` comparing identical values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com>
(cherry picked from commit be865a29b8)
2022-03-24 18:00:11 +02:00
Benny Halevy
8648c79c9e main: shutdown: do not abort on certain system errors
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.

The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.

This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind. Otherwise, it calls abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.
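
A standalone sketch of the policy; the specific errno values below are illustrative examples of "environmental" errors, not the exact list in the patch:

```
#include <cerrno>
#include <cstdlib>
#include <exception>
#include <system_error>

[[noreturn]] void handle_shutdown_error(std::exception_ptr ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const std::system_error& e) {
        switch (e.code().value()) {
        case ENOSPC:   // no space left on device
        case EIO:      // I/O error
        case EDQUOT:   // disk quota exceeded
            std::_Exit(255); // environmental problem: exit without a core dump
        default:
            break;
        }
    } catch (...) {
    }
    std::abort(); // a real bug: abort explicitly instead of letting terminate() run
}

int main() {
    try {
        throw std::system_error(std::error_code(ENOSPC, std::generic_category()), "flush failed");
    } catch (...) {
        handle_shutdown_error(std::current_exception()); // exits with status 255
    }
}
```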

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
(cherry picked from commit 132c9d5933)
2022-03-24 14:48:52 +02:00
Nadav Har'El
7ae4d0e6f8 Seastar: backport Seastar fix for missing string escape in JSON output
Backported Seastar fix:
  > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz

Fixes #9061

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-23 20:29:50 +02:00
Piotr Sarna
f3564db941 expression: fix get_value for mismatched column definitions
As observed in #10026, after schema changes it somehow happened
that a column definition that does not match any of the base table
columns was passed to expression verification code.
The function that looks up the index of a column happens to return
-1 when it doesn't find anything, so using this returned index
without checking if it's nonnegative results in accessing invalid
vector data, and a segfault or silent memory corruption.
Therefore, an explicit check is added to see if the column was actually
found. This serves two purposes:
 - avoiding segfaults/memory corruption
 - making it easier to investigate the root cause of #10026
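
An illustration of the added guard in plain C++ (the real code deals with column_definition objects and the base-table schema):

```
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

std::optional<std::size_t> find_column_index(const std::vector<std::string>& columns,
                                             std::string_view name) {
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (columns[i] == name) {
            return i;
        }
    }
    return std::nullopt; // previously: a bare -1 that callers forgot to check
}

int main() {
    std::vector<std::string> base_columns{"pk", "v"};
    std::vector<int> values{1, 42};
    // A column left over from an older schema version is simply not found,
    // instead of indexing the values vector out of bounds.
    if (auto idx = find_column_index(base_columns, "dropped_col")) {
        return values[*idx];
    }
    return 0;
}
```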

Closes #10039

(cherry picked from commit 7b364fec9849e9a342af1c240e3a7185bf5401ef)
2022-03-21 10:37:48 +01:00
Pavel Emelyanov
97caf12836 Update seastar submodule (IO preemption overlap)
* seastar 47573503...8ef87d48 (3):
  > io_queue: Don't let preemption overlap requests
  > io_queue: Pending needs to keep capacity instead of ticket
  > io_queue: Extend grab_capacity() return codes

Fixes #10233
2022-03-17 11:26:38 +03:00
Yaron Kaikov
839d9ef41a release: prepare for 5.0.rc2 2022-03-16 14:35:52 +02:00
Benny Halevy
782bd50f92 compaction_manager: rewrite_sstables: do not acquire table write lock
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88045.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in #10060, and
possibly #10166.

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177

(cherry picked from commit 11ea2ffc3c)
2022-03-14 13:13:48 +02:00
Avi Kivity
0a4d971b4a Merge 'utils: cached_file: Fix alloc-dealloc mismatch during eviction' from Tomasz Grabiec
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.

The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution; it is currently only invoked in the
standard allocator context.

The series also adds two safety checks to LSA to catch such problems earlier.

Fixes #10056

\cc @slivne @bhalevy

Closes #10130

* github.com:scylladb/scylla:
  lsa: Abort when trying to free a standard allocator object not allocated through the region
  lsa: Abort when _non_lsa_memory_in_use goes negative
  tests: utils: cached_file: Validate occupancy after eviction
  test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
  utils: cached_file: Fix alloc-dealloc mismatch during eviction

(cherry picked from commit ff2cd72766)
2022-02-26 11:28:36 +02:00
Benny Halevy
22562f767f cql3: result_set: remove std::ref from comparator&
Applying std::ref on `RowComparator& cmp` hits the
following compilation error on Fedora 34 with
libstdc++-devel-11.2.1-9.fc34.x86_64

```
FAILED: build/dev/cql3/statements/select_statement.o
clang++ -MD -MT build/dev/cql3/statements/select_statement.o -MF build/dev/cql3/statements/select_statement.o.d -I/home/bhalevy/dev/scylla/seastar/include -I/home/bhalevy/dev/scylla/build/dev/seastar/gen/include -std=gnu++20 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=6 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_TYPE_ERASE_MORE -DFMT_LOCALE -DFMT_SHARED -I/usr/include/p11-kit-1  -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -O2 -DSCYLLA_ENABLE_WASMTIME -iquote. -iquote build/dev/gen --std=gnu++20  -ffile-prefix-map=/home/bhalevy/dev/scylla=.  -march=westmere -DBOOST_TEST_DYN_LINK   -Iabseil -fvisibility=hidden  -Wall -Werror -Wno-mismatched-tags -Wno-tautological-compare -Wno-parentheses-equality -Wno-c++11-narrowing -Wno-sometimes-uninitialized -Wno-return-stack-address -Wno-missing-braces -Wno-unused-lambda-capture -Wno-overflow -Wno-noexcept-type -Wno-error=cpp -Wno-ignored-attributes -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-defaulted-function-deleted -Wno-redeclared-class-member -Wno-unsupported-friend -Wno-unused-variable -Wno-delete-non-abstract-non-virtual-dtor -Wno-braced-scalar-init -Wno-implicit-int-float-conversion -Wno-delete-abstract-non-virtual-dtor -Wno-uninitialized-const-reference -Wno-psabi -Wno-narrowing -Wno-array-bounds -Wno-nonnull -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DHAVE_LZ4_COMPRESS_DEFAULT  -c -o build/dev/cql3/statements/select_statement.o cql3/statements/select_statement.cc
In file included from cql3/statements/select_statement.cc:14:
In file included from ./cql3/statements/select_statement.hh:16:
In file included from ./cql3/statements/raw/select_statement.hh:16:
In file included from ./cql3/statements/raw/cf_statement.hh:16:
In file included from ./cql3/cf_name.hh:16:
In file included from ./cql3/keyspace_element_name.hh:16:
In file included from /home/bhalevy/dev/scylla/seastar/include/seastar/core/sstring.hh:25:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/algorithm:74:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/pstl/glue_algorithm_defs.h:13:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/functional:58:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: error: exception specification of 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' uses itself
                = decltype(reference_wrapper::_S_fun(std::declval<_Up>()))>
                                                     ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: note: in instantiation of exception specification for 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' requested here
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:321:2: note: in instantiation of default argument for 'reference_wrapper<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' required here
        reference_wrapper(_Up&& __uref)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1017:57: note: while substituting deduced template arguments into function template 'reference_wrapper' [with _Up = __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, $1 = (no value), $2 = (no value)]
      = __bool_constant<__is_nothrow_constructible(_Tp, _Args...)>;
                                                        ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1023:14: note: in instantiation of template type alias '__is_nothrow_constructible_impl' requested here
    : public __is_nothrow_constructible_impl<_Tp, _Args...>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:153:14: note: in instantiation of template class 'std::is_nothrow_constructible<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
    : public conditional<_B1::value, _B2, _B1>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:298:11: note: (skipping 8 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
          return __and_<typename _Base::_Local_storage,
                 ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1933:13: note: in instantiation of function template specialization 'std::__partial_sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
              std::__partial_sort(__first, __last, __last, __comp);
                   ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1954:9: note: in instantiation of function template specialization 'std::__introsort_loop<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, long, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
          std::__introsort_loop(__first, __last,
               ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:4875:12: note: in instantiation of function template specialization 'std::__sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
      std::__sort(__first, __last, __gnu_cxx::__ops::__iter_comp_iter(__comp));
           ^
./cql3/result_set.hh:168:14: note: in instantiation of function template specialization 'std::sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>' requested here
        std::sort(_rows.begin(), _rows.end(), std::ref(cmp));
             ^
cql3/statements/select_statement.cc:773:21: note: in instantiation of function template specialization 'cql3::result_set::sort<std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>' requested here
                rs->sort(_ordering_comparator);
                    ^
1 error generated.
ninja: build stopped: subcommand failed.
```

Fixes #10079.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215071955.316895-3-bhalevy@scylladb.com>
(cherry picked from commit 3e20fee070)

[avi: backport for developer quality-of-life rather than as a bug fix]
2022-02-16 10:07:11 +02:00
Raphael S. Carvalho
eb80dd1db5 Revert "sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME"
This reverts commit 4c05e5f966.

Moving cleanup to maintenance group made its operation time up to
10x slower than previous release. It's a blocker to 4.6 release,
so let's revert it until we figure this all out.

Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.

Fixes #10060.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>
(cherry picked from commit a9427f150a)
2022-02-14 18:05:43 +02:00
Avi Kivity
51d699ee21 Update seastar submodule (overzealous log silencer)
* seastar 0d250d15ac...47573503cd (1):
  > log: Fix silencer to be shard-local and logger-global
Fixes #9784.
2022-02-14 17:54:54 +02:00
Avi Kivity
83a33bff8c Point seastar submodule at scylla-seastar.git
This allows us to backport Seastar fixes to this branch.
2022-02-14 17:54:16 +02:00
Nadav Har'El
273563b9ad alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
2022-02-08 11:37:31 +02:00
Yaron Kaikov
891990ec09 release: prepare for 5.0.rc1 2022-02-06 16:41:05 +02:00
Yaron Kaikov
da0cd2b107 release: prepare for 5.0.rc0 2022-02-03 08:10:30 +02:00
178 changed files with 4086 additions and 899 deletions

.gitmodules vendored
View File

@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
-	url = ../seastar
+	url = ../scylla-seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui

View File

@@ -60,7 +60,7 @@ fi
 # Default scylla product/version tags
 PRODUCT=scylla
-VERSION=5.0.dev
+VERSION=5.0.7
 if test -f version
 then

View File

@@ -78,6 +78,11 @@ future<> controller::start_server() {
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks), sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value()).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper)).get();
// Note: from this point on, if start_server() throws for any reason,
// it must first call stop_server() to stop the executor and server
// services we just started - or Scylla will cause an assertion
// failure when the controller object is destroyed in the exception
// unwinding.
std::optional<uint16_t> alternator_port;
if (_config.alternator_port()) {
alternator_port = _config.alternator_port();
@@ -104,7 +109,13 @@ future<> controller::start_server() {
}
opts.erase("require_client_auth");
opts.erase("truststore");
utils::configure_tls_creds_builder(creds.value(), std::move(opts)).get();
try {
utils::configure_tls_creds_builder(creds.value(), std::move(opts)).get();
} catch(...) {
logger.error("Failed to set up Alternator TLS credentials: {}", std::current_exception());
stop_server().get();
std::throw_with_nested(std::runtime_error("Failed to set up Alternator TLS credentials"));
}
}
bool alternator_enforce_authorization = _config.alternator_enforce_authorization();
_server.invoke_on_all(

View File

@@ -34,6 +34,7 @@
#include "expressions.hh"
#include "conditions.hh"
#include "cql3/constants.hh"
#include "cql3/util.hh"
#include <optional>
#include "utils/overloaded_functor.hh"
#include "seastar/json/json_elements.hh"
@@ -46,6 +47,7 @@
#include <seastar/core/coroutine.hh>
#include <boost/range/adaptors.hpp>
#include <boost/range/algorithm/find_end.hpp>
#include <unordered_set>
#include "service/storage_proxy.hh"
#include "gms/gossiper.hh"
#include "schema_registry.hh"
@@ -148,16 +150,16 @@ static void validate_table_name(const std::string& name) {
// instead of each component individually as DynamoDB does.
// The view_name() function assumes the table_name has already been validated
// but validates the legality of index_name and the combination of both.
static std::string view_name(const std::string& table_name, const std::string& index_name, const std::string& delim = ":") {
static std::string view_name(const std::string& table_name, std::string_view index_name, const std::string& delim = ":") {
static const std::regex valid_index_name_chars ("[a-zA-Z0-9_.-]*");
if (index_name.length() < 3) {
throw api_error::validation("IndexName must be at least 3 characters long");
}
if (!std::regex_match(index_name.c_str(), valid_index_name_chars)) {
if (!std::regex_match(index_name.data(), valid_index_name_chars)) {
throw api_error::validation(
format("IndexName '{}' must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", index_name));
}
std::string ret = table_name + delim + index_name;
std::string ret = table_name + delim + std::string(index_name);
if (ret.length() > max_table_name_length) {
throw api_error::validation(
format("The total length of TableName ('{}') and IndexName ('{}') cannot exceed {} characters",
@@ -166,7 +168,7 @@ static std::string view_name(const std::string& table_name, const std::string& i
return ret;
}
static std::string lsi_name(const std::string& table_name, const std::string& index_name) {
static std::string lsi_name(const std::string& table_name, std::string_view index_name) {
return view_name(table_name, index_name, "!:");
}
@@ -273,16 +275,16 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
if (index_name) {
if (index_name->IsString()) {
orig_table_name = std::move(table_name);
table_name = view_name(orig_table_name, index_name->GetString());
table_name = view_name(orig_table_name, rjson::to_string_view(*index_name));
type = table_or_view_type::gsi;
} else {
throw api_error::validation(
format("Non-string IndexName '{}'", index_name->GetString()));
format("Non-string IndexName '{}'", rjson::to_string_view(*index_name)));
}
// If no tables for global indexes were found, the index may be local
if (!proxy.data_dictionary().has_schema(keyspace_name, table_name)) {
type = table_or_view_type::lsi;
table_name = lsi_name(orig_table_name, index_name->GetString());
table_name = lsi_name(orig_table_name, rjson::to_string_view(*index_name));
}
}
@@ -432,6 +434,11 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
rjson::add(table_description, "BillingModeSummary", rjson::empty_object());
rjson::add(table_description["BillingModeSummary"], "BillingMode", "PAY_PER_REQUEST");
rjson::add(table_description["BillingModeSummary"], "LastUpdateToPayPerRequestDateTime", rjson::value(creation_date_seconds));
// In PAY_PER_REQUEST billing mode, provisioned capacity should return 0
rjson::add(table_description, "ProvisionedThroughput", rjson::empty_object());
rjson::add(table_description["ProvisionedThroughput"], "ReadCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "WriteCapacityUnits", 0);
rjson::add(table_description["ProvisionedThroughput"], "NumberOfDecreasesToday", 0);
std::unordered_map<std::string,std::string> key_attribute_types;
// Add base table's KeySchema and collect types for AttributeDefinitions:
@@ -453,6 +460,11 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
rjson::add(view_entry, "IndexName", rjson::from_string(index_name));
// Add indexes's KeySchema and collect types for AttributeDefinitions:
describe_key_schema(view_entry, *vptr, key_attribute_types);
// Add projection type
rjson::value projection = rjson::empty_object();
rjson::add(projection, "ProjectionType", "ALL");
// FIXME: we have to get ProjectionType from the schema when it is added
rjson::add(view_entry, "Projection", std::move(projection));
// Local secondary indexes are marked by an extra '!' sign occurring before the ':' delimiter
rjson::value& index_array = (delim_it > 1 && cf_name[delim_it-1] == '!') ? lsi_array : gsi_array;
rjson::push_back(index_array, std::move(view_entry));
@@ -884,17 +896,23 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
const rjson::value* gsi = rjson::find(request, "GlobalSecondaryIndexes");
std::vector<schema_builder> view_builders;
std::vector<sstring> where_clauses;
std::unordered_set<std::string> index_names;
if (gsi) {
if (!gsi->IsArray()) {
co_return api_error::validation("GlobalSecondaryIndexes must be an array.");
}
for (const rjson::value& g : gsi->GetArray()) {
const rjson::value* index_name = rjson::find(g, "IndexName");
if (!index_name || !index_name->IsString()) {
const rjson::value* index_name_v = rjson::find(g, "IndexName");
if (!index_name_v || !index_name_v->IsString()) {
co_return api_error::validation("GlobalSecondaryIndexes IndexName must be a string.");
}
std::string vname(view_name(table_name, index_name->GetString()));
elogger.trace("Adding GSI {}", index_name->GetString());
std::string_view index_name = rjson::to_string_view(*index_name_v);
auto [it, added] = index_names.emplace(index_name);
if (!added) {
co_return api_error::validation(format("Duplicate IndexName '{}', ", index_name));
}
std::string vname(view_name(table_name, index_name));
elogger.trace("Adding GSI {}", index_name);
// FIXME: read and handle "Projection" parameter. This will
// require the MV code to copy just parts of the attrs map.
schema_builder view_builder(keyspace_name, vname);
@@ -927,9 +945,10 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
if (!range_key.empty() && range_key != view_hash_key && range_key != view_range_key) {
add_column(view_builder, range_key, attribute_definitions, column_kind::clustering_key);
}
sstring where_clause = "\"" + view_hash_key + "\" IS NOT NULL";
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = where_clause + " AND \"" + view_hash_key + "\" IS NOT NULL";
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
view_builders.emplace_back(std::move(view_builder));
@@ -942,12 +961,17 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
throw api_error::validation("LocalSecondaryIndexes must be an array.");
}
for (const rjson::value& l : lsi->GetArray()) {
const rjson::value* index_name = rjson::find(l, "IndexName");
if (!index_name || !index_name->IsString()) {
const rjson::value* index_name_v = rjson::find(l, "IndexName");
if (!index_name_v || !index_name_v->IsString()) {
throw api_error::validation("LocalSecondaryIndexes IndexName must be a string.");
}
std::string vname(lsi_name(table_name, index_name->GetString()));
elogger.trace("Adding LSI {}", index_name->GetString());
std::string_view index_name = rjson::to_string_view(*index_name_v);
auto [it, added] = index_names.emplace(index_name);
if (!added) {
co_return api_error::validation(format("Duplicate IndexName '{}', ", index_name));
}
std::string vname(lsi_name(table_name, index_name));
elogger.trace("Adding LSI {}", index_name);
if (range_key.empty()) {
co_return api_error::validation("LocalSecondaryIndex requires that the base table have a range key");
}
@@ -979,9 +1003,10 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
// Note above we don't need to add virtual columns, as all
// base columns were copied to view. TODO: reconsider the need
// for virtual columns when we support Projection.
sstring where_clause = "\"" + view_hash_key + "\" IS NOT NULL";
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = where_clause + " AND \"" + view_range_key + "\" IS NOT NULL";
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
view_builders.emplace_back(std::move(view_builder));
@@ -2173,6 +2198,9 @@ static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unorder
for (auto it = attributes_to_get.Begin(); it != attributes_to_get.End(); ++it) {
attribute_path_map_add("AttributesToGet", ret, it->GetString());
}
if (ret.empty()) {
throw api_error::validation("Empty AttributesToGet is not allowed. Consider using Select=COUNT instead.");
}
return ret;
} else if (has_projection_expression) {
const rjson::value& projection_expression = req["ProjectionExpression"];
@@ -2577,8 +2605,8 @@ static bool hierarchy_actions(
// attr member so we can use add()
rjson::add_with_string_name(v, attr, std::move(*newv));
} else {
throw api_error::validation(format("Can't remove document path {} - not present in item",
subh.get_value()._path));
// Removing a.b when a is a map but a.b doesn't exist
// is silently ignored. It's not considered an error.
}
} else {
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));

View File

@@ -116,9 +116,6 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.describe_time_to_live++;
if (!_proxy.data_dictionary().features().cluster_supports_alternator_ttl()) {
co_return api_error::unknown_operation("DescribeTimeToLive not yet supported. Experimental support is available if the 'alternator_ttl' experimental feature is enabled on all nodes.");
}
schema_ptr schema = get_table(_proxy, request);
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
rjson::value desc = rjson::empty_object();

View File

@@ -12,6 +12,7 @@
#include <seastar/core/sharded.hh>
#include <seastar/core/abort_source.hh>
#include <seastar/core/semaphore.hh>
#include "data_dictionary/data_dictionary.hh"
namespace replica {
class database;

View File

@@ -624,7 +624,7 @@
},
{
"name":"kn",
"description":"Comma seperated keyspaces name to snapshot",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -632,7 +632,7 @@
},
{
"name":"cf",
"description":"the column family to snapshot",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
"required":false,
"allowMultiple":false,
"type":"string",

View File

@@ -669,19 +669,16 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
}));
ss::force_keyspace_flush.set(r, [&ctx](std::unique_ptr<request> req) {
ss::force_keyspace_flush.set(r, [&ctx](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
auto &db = ctx.db.local();
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
co_await db.flush_on_all(keyspace);
} else {
co_await db.flush_on_all(keyspace, std::move(column_families));
}
return ctx.db.invoke_on_all([keyspace, column_families] (replica::database& db) {
return parallel_for_each(column_families, [&db, keyspace](const sstring& cf) mutable {
return db.find_column_family(keyspace, cf).flush();
});
}).then([]{
return make_ready_future<json::json_return_type>(json_void());
});
co_return json_void();
});
@@ -1284,40 +1281,46 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
});
});
ss::take_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) {
apilog.debug("take_snapshot: {}", req->query_parameters);
ss::take_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) -> future<json::json_return_type> {
apilog.info("take_snapshot: {}", req->query_parameters);
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
auto sfopt = req->get_query_param("sf");
auto sf = db::snapshot_ctl::skip_flush(strcasecmp(sfopt.c_str(), "true") == 0);
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
auto resp = make_ready_future<>();
if (column_families.empty()) {
resp = snap_ctl.local().take_snapshot(tag, keynames, sf);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
try {
if (column_families.empty()) {
co_await snap_ctl.local().take_snapshot(tag, keynames, sf);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
}
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
}
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
resp = snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
co_return json_void();
} catch (...) {
apilog.error("take_snapshot failed: {}", std::current_exception());
throw;
}
return resp.then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::del_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) {
ss::del_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) -> future<json::json_return_type> {
apilog.info("del_snapshot: {}", req->query_parameters);
auto tag = req->get_query_param("tag");
auto column_family = req->get_query_param("cf");
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
return snap_ctl.local().clear_snapshot(tag, keynames, column_family).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
try {
co_await snap_ctl.local().clear_snapshot(tag, keynames, column_family);
co_return json_void();
} catch (...) {
apilog.error("del_snapshot failed: {}", std::current_exception());
throw;
}
});
ss::true_snapshots_size.set(r, [&snap_ctl](std::unique_ptr<request> req) {
@@ -1354,7 +1357,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (!req_param<bool>(*req, "disable_snapshot", false)) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
f = parallel_for_each(column_families, [&snap_ctl, keyspace, tag](sstring cf) {
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag);
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::skip_flush::no, db::snapshot_ctl::allow_view_snapshots::yes);
});
}

View File

@@ -87,19 +87,24 @@ compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
if (left.is_live_and_has_ttl() && left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
if (left.is_live_and_has_ttl()) {
if (left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
return right.ttl() <=> left.ttl();
}
}
} else {
// Both are deleted
if (left.deletion_time() != right.deletion_time()) {
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
<=> (uint64_t) right.deletion_time().time_since_epoch().count();
}
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
<=> (uint64_t) right.deletion_time().time_since_epoch().count();
}
return std::strong_ordering::equal;
}
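
A toy model of the new tie-break for two live TTL'd cells with identical expiry (write time + TTL): the cell written later, which therefore has the smaller TTL, wins the reconciliation so it survives longer after it expires, until purged. The struct below is illustrative, not the real atomic_cell API:

#include <compare>
#include <cstdio>

struct cell {
    long write_time;
    long ttl;                                  // seconds
    long expiry() const { return write_time + ttl; }
};

// Returns greater if 'left' should win the merge.
std::strong_ordering compare_live_ttl_cells(const cell& left, const cell& right) {
    if (left.expiry() != right.expiry()) {
        return left.expiry() <=> right.expiry();   // later expiry wins
    }
    // Same expiry: prefer the later write, i.e. the smaller TTL.
    return right.ttl <=> left.ttl;
}

int main() {
    cell a{.write_time = 100, .ttl = 10};  // expires at 110
    cell b{.write_time = 105, .ttl = 5};   // also expires at 110, written later
    std::puts(compare_live_ttl_cells(a, b) < 0 ? "b wins" : "a wins");  // prints "b wins"
}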

View File

@@ -59,7 +59,7 @@ using namespace std::chrono_literals;
logging::logger cdc_log("cdc");
namespace cdc {
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {});
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {}, schema_ptr = nullptr);
}
static constexpr auto cdc_group_name = "cdc";
@@ -206,7 +206,7 @@ public:
return;
}
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt);
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);
auto log_mut = log_schema
? db::schema_tables::make_update_table_mutations(db, keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
@@ -484,7 +484,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
return to_bytes(cdc_deleted_elements_column_prefix) + column_name;
}
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid) {
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid, schema_ptr old) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner("com.scylladb.dht.CDCPartitioner");
b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
@@ -571,6 +571,20 @@ static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID>
b.set_uuid(*uuid);
}
/**
* #10473 - if we are redefining the log table, we need to ensure any dropped
* columns are registered in "dropped_columns" table, otherwise clients will not
* be able to read data older than now.
*/
if (old) {
// not super efficient, but we don't do this often.
for (auto& col : old->all_columns()) {
if (!b.has_column({col.name(), col.name_as_text() })) {
b.without_column(col.name_as_text(), col.type, api::new_timestamp());
}
}
}
return b.build();
}

View File

@@ -1281,6 +1281,13 @@ private:
const auto& key = _validator.previous_partition_key();
if (_validator.current_tombstone()) {
throw compaction_aborted_exception(
_schema->ks_name(),
_schema->cf_name(),
"scrub compaction cannot handle invalid fragments with an active range tombstone change");
}
// If the unexpected fragment is a partition end, we just drop it.
// The only case a partition end is invalid is when it comes after
// another partition end, and we can just drop it in that case.

View File

@@ -317,9 +317,9 @@ future<> compaction_manager::run_custom_job(replica::table* t, sstables::compact
auto job_ptr = std::make_unique<noncopyable_function<future<>(sstables::compaction_data&)>>(std::move(job));
task->compaction_done = with_semaphore(_maintenance_ops_sem, 1, [this, task, &job = *job_ptr] () mutable {
// take read lock for table, so major compaction and resharding can't proceed in parallel.
return with_lock(task->compaction_state.lock.for_read(), [this, task, &job] () mutable {
task->compaction_done = with_semaphore(_custom_jobs_sem, 1, [this, task, &job = *job_ptr] () mutable {
// We don't need to take task->compaction_state.lock.for_read() as it only serializes minor and major
// Allow caller to know that task (e.g. reshape) was asked to stop while waiting for a chance to run.
if (task->stopping) {
throw sstables::compaction_stopped_exception(task->compacting_table->schema()->ks_name(), task->compacting_table->schema()->cf_name(),
@@ -335,7 +335,6 @@ future<> compaction_manager::run_custom_job(replica::table* t, sstables::compact
// no need to register shared sstables because they're excluded from non-resharding
// compaction and some of them may not even belong to current shard.
return job(task->compaction_data);
});
}).then_wrapped([this, task, job_ptr = std::move(job_ptr), type] (future<> f) {
_stats.active_tasks--;
_tasks.remove(task);
@@ -353,32 +352,50 @@ future<> compaction_manager::run_custom_job(replica::table* t, sstables::compact
return task->compaction_done.get_future().then([task] {});
}
compaction_manager::compaction_reenabler::compaction_reenabler(compaction_manager& cm, replica::table* t)
: _cm(cm)
, _table(t)
, _compaction_state(cm.get_compaction_state(_table))
, _holder(_compaction_state.gate.hold())
{
_compaction_state.compaction_disabled_counter++;
cmlog.debug("Temporarily disabled compaction for {}.{}. compaction_disabled_counter={}",
_table->schema()->ks_name(), _table->schema()->cf_name(), _compaction_state.compaction_disabled_counter);
}
compaction_manager::compaction_reenabler::compaction_reenabler(compaction_reenabler&& o) noexcept
: _cm(o._cm)
, _table(std::exchange(o._table, nullptr))
, _compaction_state(o._compaction_state)
, _holder(std::move(o._holder))
{}
compaction_manager::compaction_reenabler::~compaction_reenabler() {
// submit compaction request if we're the last holder of the gate which is still opened.
if (_table && --_compaction_state.compaction_disabled_counter == 0 && !_compaction_state.gate.is_closed()) {
cmlog.debug("Reenabling compaction for {}.{}",
_table->schema()->ks_name(), _table->schema()->cf_name());
try {
_cm.submit(_table);
} catch (...) {
cmlog.warn("compaction_reenabler could not reenable compaction for {}.{}: {}",
_table->schema()->ks_name(), _table->schema()->cf_name(), std::current_exception());
}
}
}
future<compaction_manager::compaction_reenabler>
compaction_manager::stop_and_disable_compaction(replica::table* t) {
compaction_reenabler cre(*this, t);
co_await stop_ongoing_compactions("user-triggered operation", t);
co_return cre;
}
future<>
compaction_manager::run_with_compaction_disabled(replica::table* t, std::function<future<> ()> func) {
auto& c_state = _compaction_state[t];
auto holder = c_state.gate.hold();
compaction_reenabler cre = co_await stop_and_disable_compaction(t);
c_state.compaction_disabled_counter++;
std::exception_ptr err;
try {
co_await stop_ongoing_compactions("user-triggered operation", t);
co_await func();
} catch (...) {
err = std::current_exception();
}
#ifdef DEBUG
assert(_compaction_state.contains(t));
#endif
// submit compaction request if we're the last holder of the gate which is still opened.
if (--c_state.compaction_disabled_counter == 0 && !c_state.gate.is_closed()) {
submit(t);
}
if (err) {
std::rethrow_exception(err);
}
co_return;
co_await func();
}
void compaction_manager::task::setup_new_compaction() {
@@ -584,16 +601,11 @@ future<> compaction_manager::stop() {
}
}
void compaction_manager::really_do_stop() {
if (_state == state::none || _state == state::stopped) {
return;
}
_state = state::stopped;
future<> compaction_manager::really_do_stop() {
cmlog.info("Asked to stop");
// Reset the metrics registry
_metrics.clear();
_stop_future.emplace(stop_ongoing_compactions("shutdown").then([this] () mutable {
return stop_ongoing_compactions("shutdown").then([this] () mutable {
reevaluate_postponed_compactions();
return std::move(_waiting_reevalution);
}).then([this] {
@@ -601,12 +613,34 @@ void compaction_manager::really_do_stop() {
_compaction_submission_timer.cancel();
cmlog.info("Stopped");
return _compaction_controller.shutdown();
}));
});
}
template <typename Ex>
requires std::is_base_of_v<std::exception, Ex> &&
requires (const Ex& ex) {
{ ex.code() } noexcept -> std::same_as<const std::error_code&>;
}
auto swallow_enospc(const Ex& ex) noexcept {
if (ex.code().value() != ENOSPC) {
return make_exception_future<>(std::make_exception_ptr(ex));
}
cmlog.warn("Got ENOSPC on stop, ignoring...");
return make_ready_future<>();
}
void compaction_manager::do_stop() noexcept {
if (_state == state::none || _state == state::stopped) {
return;
}
try {
really_do_stop();
_state = state::stopped;
_stop_future = really_do_stop()
.handle_exception_type([] (const std::system_error& ex) { return swallow_enospc(ex); })
.handle_exception_type([] (const storage_io_error& ex) { return swallow_enospc(ex); })
;
} catch (...) {
try {
cmlog.error("Failed to stop the manager: {}", std::current_exception());
@@ -742,6 +776,7 @@ future<> compaction_manager::perform_offstrategy(replica::table* t) {
_stats.active_tasks++;
task->setup_new_compaction();
return with_scheduling_group(_maintenance_sg.cpu, [this, task, t] {
return t->run_offstrategy_compaction(task->compaction_data).then_wrapped([this, task, schema = t->schema()] (future<> f) mutable {
_stats.active_tasks--;
task->finish_compaction();
@@ -763,6 +798,7 @@ future<> compaction_manager::perform_offstrategy(replica::table* t) {
}
return make_ready_future<stop_iteration>(stop_iteration::yes);
});
});
});
});
}).finally([this, task] {
@@ -810,7 +846,8 @@ future<> compaction_manager::rewrite_sstables(replica::table* t, sstables::compa
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
auto sstable_set_snapshot = can_purge ? std::make_optional(t.get_sstable_set()) : std::nullopt;
auto descriptor = sstables::compaction_descriptor({ sst }, std::move(sstable_set_snapshot), _maintenance_sg.io,
// FIXME: this compaction should run with maintenance priority.
auto descriptor = sstables::compaction_descriptor({ sst }, std::move(sstable_set_snapshot), service::get_local_compaction_priority(),
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, options);
// Releases reference to cleaned sstable such that respective used disk space can be freed.
@@ -819,8 +856,9 @@ future<> compaction_manager::rewrite_sstables(replica::table* t, sstables::compa
};
auto maintenance_permit = co_await seastar::get_units(_maintenance_ops_sem, 1);
// Take write lock for table to serialize cleanup/upgrade sstables/scrub with major compaction/reshape/reshard.
auto write_lock_holder = co_await _compaction_state[&t].lock.hold_write_lock();
// FIXME: acquiring the read lock is not needed after acquiring the _maintenance_ops_sem
// only major compaction needs to acquire the write lock to synchronize with regular compaction.
auto lock_holder = co_await _compaction_state[&t].lock.hold_read_lock();
_stats.pending_tasks--;
_stats.active_tasks++;
@@ -852,7 +890,7 @@ future<> compaction_manager::rewrite_sstables(replica::table* t, sstables::compa
};
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_compaction_controller.backlog_of_shares(200), _available_memory));
completed = co_await with_scheduling_group(_maintenance_sg.cpu, std::ref(perform_rewrite));
completed = co_await with_scheduling_group(_compaction_controller.sg(), std::ref(perform_rewrite));
} while (!completed);
};
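
The compaction_reenabler introduced above is an RAII guard: its constructor bumps a per-table "compaction disabled" counter (while holding the table's gate) and its destructor decrements it, resubmitting the table for compaction when the counter drops back to zero. A minimal, Scylla-independent sketch of the same idiom (names and types are illustrative only):

#include <functional>
#include <cstdio>

struct disable_guard {
    int& counter;
    std::function<void()> resubmit;

    disable_guard(int& c, std::function<void()> r) : counter(c), resubmit(std::move(r)) {
        ++counter;                 // compaction stays disabled while any guard is alive
    }
    ~disable_guard() {
        if (--counter == 0) {
            resubmit();            // the last guard re-enables and resubmits the table
        }
    }
};

int main() {
    int disabled = 0;
    {
        disable_guard outer(disabled, [] { std::puts("resubmitting compaction"); });
        disable_guard inner(disabled, [] { std::puts("resubmitting compaction"); });
        // ... run the maintenance operation with compaction disabled ...
    }   // prints "resubmitting compaction" exactly once, from the last guard destroyed
}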

View File

@@ -147,6 +147,8 @@ private:
// If the operation must be serialized with regular, then the per-table write lock must be taken.
seastar::named_semaphore _maintenance_ops_sem = {1, named_semaphore_exception_factory{"maintenance operation"}};
seastar::named_semaphore _custom_jobs_sem = {1, named_semaphore_exception_factory{"custom jobs"}};
std::function<void()> compaction_submission_callback();
// all registered tables are reevaluated at a constant interval.
// Submission is a NO-OP when there's nothing to do, so it's fine to call it regularly.
@@ -233,7 +235,7 @@ public:
// Stop all fibers, without waiting. Safe to be called multiple times.
void do_stop() noexcept;
void really_do_stop();
future<> really_do_stop();
// Submit a table to be compacted.
void submit(replica::table* t);
@@ -269,6 +271,31 @@ public:
// parameter job is a function that will carry the operation
future<> run_custom_job(replica::table* t, sstables::compaction_type type, noncopyable_function<future<>(sstables::compaction_data&)> job);
class compaction_reenabler {
compaction_manager& _cm;
replica::table* _table;
compaction_state& _compaction_state;
gate::holder _holder;
public:
compaction_reenabler(compaction_manager&, replica::table*);
compaction_reenabler(compaction_reenabler&&) noexcept;
~compaction_reenabler();
replica::table* compacting_table() const noexcept {
return _table;
}
const compaction_state& compaction_state() const noexcept {
return _compaction_state;
}
};
// Disable compaction temporarily for a table t.
// Caller should call the compaction_reenabler::reenable
future<compaction_reenabler> stop_and_disable_compaction(replica::table* t);
// Run a function with compaction temporarily disabled for a table T.
future<> run_with_compaction_disabled(replica::table* t, std::function<future<> ()> func);

View File

@@ -69,7 +69,11 @@ compaction_descriptor leveled_compaction_strategy::get_major_compaction_job(tabl
}
void leveled_compaction_strategy::notify_completion(const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added) {
if (removed.empty() || added.empty()) {
// All the update here is only relevant for regular compaction's round-robin picking policy, and if
// last_compacted_keys wasn't generated by regular, it means regular is disabled since last restart,
// therefore we can skip the updates here until regular runs for the first time. Once it runs,
// it will be able to generate last_compacted_keys correctly by looking at metadata of files.
if (removed.empty() || added.empty() || !_last_compacted_keys) {
return;
}
auto min_level = std::numeric_limits<uint32_t>::max();

View File

@@ -217,6 +217,7 @@ time_window_compaction_strategy::get_sstables_for_compaction(table_state& table_
auto compaction_time = gc_clock::now();
if (candidates.empty()) {
_estimated_remaining_tasks = 0;
return compaction_descriptor();
}

View File

@@ -615,6 +615,8 @@ arg_parser.add_argument('--static-yaml-cpp', dest='staticyamlcpp', action='store
help='Link libyaml-cpp statically')
arg_parser.add_argument('--tests-debuginfo', action='store', dest='tests_debuginfo', type=int, default=0,
help='Enable(1)/disable(0)compiler debug information generation for tests')
arg_parser.add_argument('--perf-tests-debuginfo', action='store', dest='perf_tests_debuginfo', type=int, default=0,
help='Enable(1)/disable(0)compiler debug information generation for perf tests')
arg_parser.add_argument('--python', action='store', dest='python', default='python3',
help='Python3 path')
arg_parser.add_argument('--split-dwarf', dest='split_dwarf', action='store_true', default=False,
@@ -1377,6 +1379,7 @@ linker_flags = linker_flags(compiler=args.cxx)
dbgflag = '-g -gz' if args.debuginfo else ''
tests_link_rule = 'link' if args.tests_debuginfo else 'link_stripped'
perf_tests_link_rule = 'link' if args.perf_tests_debuginfo else 'link_stripped'
# Strip if debuginfo is disabled, otherwise we end up with partial
# debug info from the libraries we static link with
@@ -1901,7 +1904,8 @@ with open(buildfile_tmp, 'w') as f:
# So we strip the tests by default; The user can very
# quickly re-link the test unstripped by adding a "_g"
# to the test name, e.g., "ninja build/release/testname_g"
f.write('build $builddir/{}/{}: {}.{} {} | {} {}\n'.format(mode, binary, tests_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
link_rule = perf_tests_link_rule if binary.startswith('test/perf/') else tests_link_rule
f.write('build $builddir/{}/{}: {}.{} {} | {} {}\n'.format(mode, binary, link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: {}.{} {} | {} {}\n'.format(mode, binary, regular_link_rule, mode, str.join(' ', objs), seastar_dep, seastar_testing_dep))
f.write(' libs = {}\n'.format(local_libs))

View File

@@ -1386,7 +1386,7 @@ serviceLevelOrRoleName returns [sstring name]
std::transform($name.begin(), $name.end(), $name.begin(), ::tolower); }
| t=STRING_LITERAL { $name = sstring($t.text); }
| t=QUOTED_NAME { $name = sstring($t.text); }
| k=unreserved_keyword { $name = sstring($t.text);
| k=unreserved_keyword { $name = k;
std::transform($name.begin(), $name.end(), $name.begin(), ::tolower);}
| QMARK {add_recognition_error("Bind variables cannot be used for service levels or role names");}
;

View File

@@ -12,6 +12,7 @@
#include "cql3_type.hh"
#include "cql3/util.hh"
#include "exceptions/exceptions.hh"
#include "ut_name.hh"
#include "data_dictionary/data_dictionary.hh"
#include "data_dictionary/user_types_metadata.hh"
@@ -436,7 +437,20 @@ sstring maybe_quote(const sstring& identifier) {
}
if (!need_quotes) {
return identifier;
// A seemingly valid identifier matching [a-z][a-z0-9_]* may still
// need quoting if it is a CQL keyword, e.g., "to" (see issue #9450).
// While our parser Cql.g has different production rules for different
// types of identifiers (column names, table names, etc.), all of
// these behave identically for alphanumeric strings: they exclude
// many keywords but allow keywords listed as "unreserved keywords".
// So we can use any of them, for example cident.
try {
cql3::util::do_with_parser(identifier, std::mem_fn(&cql3_parser::CqlParser::cident));
return identifier;
} catch(exceptions::syntax_exception&) {
// This alphanumeric string is not a valid identifier, so fall
// through to have it quoted:
}
}
if (num_quotes == 0) {
return make_sstring("\"", identifier, "\"");
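
The effect of the keyword check added above: an all-lowercase identifier that happens to be a reserved CQL keyword now gets quoted, while unreserved keywords and ordinary lowercase names are left alone. A simplified standalone sketch (the keyword set here is a tiny illustrative subset, not the parser's real list, and the real code also doubles embedded quotes):

#include <set>
#include <string>
#include <cstdio>

std::string maybe_quote_sketch(const std::string& id) {
    static const std::set<std::string> reserved = {"to", "where", "select", "from"}; // tiny subset
    bool plain = !id.empty() && id[0] >= 'a' && id[0] <= 'z';
    for (char c : id) {
        plain = plain && ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_');
    }
    if (plain && !reserved.count(id)) {
        return id;                    // no quoting needed
    }
    return "\"" + id + "\"";
}

int main() {
    std::printf("%s %s %s\n",
                maybe_quote_sketch("ttl").c_str(),    // ttl   (unreserved keyword, kept as-is)
                maybe_quote_sketch("to").c_str(),     // "to"  (reserved keyword, issue #9450)
                maybe_quote_sketch("MyCol").c_str()); // "MyCol"
}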

View File

@@ -81,9 +81,7 @@ public:
virtual seastar::future<seastar::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor& qp, service::query_state& state, const query_options& options) const = 0;
virtual bool depends_on_keyspace(const seastar::sstring& ks_name) const = 0;
virtual bool depends_on_column_family(const seastar::sstring& cf_name) const = 0;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const = 0;
virtual seastar::shared_ptr<const metadata> get_result_metadata() const = 0;

View File

@@ -103,10 +103,50 @@ managed_bytes_opt get_value(const column_value& col, const column_value_eval_bag
if (!col_type->is_map()) {
throw exceptions::invalid_request_exception(format("subscripting non-map column {}", cdef->name_as_text()));
}
const auto deserialized = cdef->type->deserialize(managed_bytes_view(*data.other_columns[data.sel.index_of(*cdef)]));
int32_t index = data.sel.index_of(*cdef);
if (index == -1) {
throw std::runtime_error(
format("Column definition {} does not match any column in the query selection",
cdef->name_as_text()));
}
const managed_bytes_opt& serialized = data.other_columns[index];
if (!serialized) {
// For null[i] we return null.
return std::nullopt;
}
const auto deserialized = cdef->type->deserialize(managed_bytes_view(*serialized));
const auto& data_map = value_cast<map_type_impl::native_type>(deserialized);
const auto key = evaluate(*col.sub, options);
auto&& key_type = col_type->name_comparator();
if (key.is_null()) {
// For m[null] return null.
// This is different from Cassandra - which treats m[null]
// as an invalid request error. But m[null] -> null is more
// consistent with our usual null treatement (e.g., both
// null[2] and null < 2 return null). It will also allow us
// to support non-constant subscripts (e.g., m[a]) where "a"
// may be null in some rows and non-null in others, and it's
// not an error.
return std::nullopt;
}
if (key.is_unset_value()) {
// An m[?] with ? bound to UNSET_VALUE is a invalid query.
// We could have detected it earlier while binding, but since
// we currently don't, we must protect the following code
// which can't work with an UNSET_VALUE. Note that the
// placement of this check here means that in an empty table,
// where we never need to evaluate the filter expression, this
// error will not be detected.
throw exceptions::invalid_request_exception(
format("Unsupported unset map key for column {}",
cdef->name_as_text()));
}
if (key.type != key_type) {
// This can't happen, we always verify the index type earlier.
throw std::logic_error(
format("Tried to evaluate expression with wrong type for subscript of {}",
cdef->name_as_text()));
}
const auto found = key.view().with_linearized([&] (bytes_view key_bv) {
using entry = std::pair<data_value, data_value>;
return std::find_if(data_map.cbegin(), data_map.cend(), [&] (const entry& element) {
@@ -121,8 +161,16 @@ managed_bytes_opt get_value(const column_value& col, const column_value_eval_bag
case column_kind::clustering_key:
return managed_bytes(data.clustering_key[cdef->id]);
case column_kind::static_column:
case column_kind::regular_column:
return managed_bytes_opt(data.other_columns[data.sel.index_of(*cdef)]);
[[fallthrough]];
case column_kind::regular_column: {
int32_t index = data.sel.index_of(*cdef);
if (index == -1) {
throw std::runtime_error(
format("Column definition {} does not match any column in the query selection",
cdef->name_as_text()));
}
return managed_bytes_opt(data.other_columns[index]);
}
default:
throw exceptions::unsupported_operation_exception("Unknown column kind");
}
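
The subscript handling above makes m[x] null-tolerant: a null map or a null key yields null rather than an error, while an unset bind marker is rejected as an invalid request. A small sketch of that lookup rule using plain standard-library types (illustrative only, not the expression-evaluation code itself):

#include <map>
#include <optional>
#include <string>
#include <cstdio>

using opt_str = std::optional<std::string>;
using opt_map = std::optional<std::map<std::string, std::string>>;

// m[key]: null map -> null, null key -> null, missing key -> null.
opt_str subscript(const opt_map& m, const opt_str& key) {
    if (!m || !key) {
        return std::nullopt;
    }
    auto it = m->find(*key);
    return it == m->end() ? opt_str{} : opt_str{it->second};
}

int main() {
    opt_map m = std::map<std::string, std::string>{{"a", "1"}};
    std::puts(subscript(m, "a").value_or("null").c_str());            // 1
    std::puts(subscript(m, std::nullopt).value_or("null").c_str());   // null
    std::puts(subscript(std::nullopt, "a").value_or("null").c_str()); // null
}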

View File

@@ -953,7 +953,7 @@ bool query_processor::migration_subscriber::should_invalidate(
sstring ks_name,
std::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement) {
return statement->depends_on_keyspace(ks_name) && (!cf_name || statement->depends_on_column_family(*cf_name));
return statement->depends_on(ks_name, cf_name);
}
future<> query_processor::query_internal(

View File

@@ -514,7 +514,7 @@ statement_restrictions::statement_restrictions(data_dictionary::database db,
}
if (!_nonprimary_key_restrictions->empty()) {
if (_has_queriable_regular_index) {
if (_has_queriable_regular_index && _partition_range_is_simple) {
_uses_secondary_indexing = true;
} else if (!allow_filtering) {
throw exceptions::invalid_request_exception("Cannot execute this query as it might involve data filtering and "

View File

@@ -165,7 +165,7 @@ public:
template<typename RowComparator>
void sort(const RowComparator& cmp) {
std::sort(_rows.begin(), _rows.end(), std::ref(cmp));
std::sort(_rows.begin(), _rows.end(), cmp);
}
metadata& get_metadata();

View File

@@ -422,11 +422,16 @@ bool result_set_builder::restrictions_filter::do_filter(const selection& selecti
}
auto clustering_columns_restrictions = _restrictions->get_clustering_columns_restrictions();
if (dynamic_pointer_cast<cql3::restrictions::multi_column_restriction>(clustering_columns_restrictions)) {
bool has_multi_col_clustering_restrictions =
dynamic_pointer_cast<cql3::restrictions::multi_column_restriction>(clustering_columns_restrictions) != nullptr;
if (has_multi_col_clustering_restrictions) {
clustering_key_prefix ckey = clustering_key_prefix::from_exploded(clustering_key);
return expr::is_satisfied_by(
bool multi_col_clustering_satisfied = expr::is_satisfied_by(
clustering_columns_restrictions->expression,
partition_key, clustering_key, static_row, row, selection, _options);
if (!multi_col_clustering_satisfied) {
return false;
}
}
auto static_row_iterator = static_row.iterator();
@@ -474,6 +479,13 @@ bool result_set_builder::restrictions_filter::do_filter(const selection& selecti
if (_skip_ck_restrictions) {
continue;
}
if (has_multi_col_clustering_restrictions) {
// Mixing multi column and single column restrictions on clustering
// key columns is forbidden.
// Since there are multi column restrictions we have to skip
// evaluating single column restrictions or we will get an error.
continue;
}
auto clustering_key_restrictions_map = _restrictions->get_single_column_clustering_key_restrictions();
auto restr_it = clustering_key_restrictions_map.find(cdef);
if (restr_it == clustering_key_restrictions_map.end()) {

View File

@@ -18,13 +18,7 @@ uint32_t cql3::statements::authentication_statement::get_bound_terms() const {
return 0;
}
bool cql3::statements::authentication_statement::depends_on_keyspace(
const sstring& ks_name) const {
return false;
}
bool cql3::statements::authentication_statement::depends_on_column_family(
const sstring& cf_name) const {
bool cql3::statements::authentication_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return false;
}

View File

@@ -27,9 +27,7 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
future<> check_access(query_processor& qp, const service::client_state& state) const override;

View File

@@ -20,13 +20,7 @@ uint32_t cql3::statements::authorization_statement::get_bound_terms() const {
return 0;
}
bool cql3::statements::authorization_statement::depends_on_keyspace(
const sstring& ks_name) const {
return false;
}
bool cql3::statements::authorization_statement::depends_on_column_family(
const sstring& cf_name) const {
bool cql3::statements::authorization_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return false;
}

View File

@@ -31,9 +31,7 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
future<> check_access(query_processor& qp, const service::client_state& state) const override;

View File

@@ -70,14 +70,9 @@ batch_statement::batch_statement(type type_,
{
}
bool batch_statement::depends_on_keyspace(const sstring& ks_name) const
bool batch_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}
bool batch_statement::depends_on_column_family(const sstring& cf_name) const
{
return false;
return boost::algorithm::any_of(_statements, [&ks_name, &cf_name] (auto&& s) { return s.statement->depends_on(ks_name, cf_name); });
}
uint32_t batch_statement::get_bound_terms() const

View File

@@ -88,9 +88,7 @@ public:
std::unique_ptr<attributes> attrs,
cql_stats& stats);
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual uint32_t get_bound_terms() const override;

View File

@@ -20,6 +20,7 @@
#include "gms/feature_service.hh"
#include "tombstone_gc_extension.hh"
#include "tombstone_gc.hh"
#include "utils/bloom_calculations.hh"
#include <boost/algorithm/string/predicate.hpp>
@@ -145,6 +146,16 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
throw exceptions::configuration_exception(KW_MAX_INDEX_INTERVAL + " must be greater than " + KW_MIN_INDEX_INTERVAL);
}
if (get_simple(KW_BF_FP_CHANCE)) {
double bloom_filter_fp_chance = get_double(KW_BF_FP_CHANCE, 0/*not used*/);
double min_bloom_filter_fp_chance = utils::bloom_calculations::min_supported_bloom_filter_fp_chance();
if (bloom_filter_fp_chance <= min_bloom_filter_fp_chance || bloom_filter_fp_chance > 1.0) {
throw exceptions::configuration_exception(format(
"{} must be larger than {} and less than or equal to 1.0 (got {})",
KW_BF_FP_CHANCE, min_bloom_filter_fp_chance, bloom_filter_fp_chance));
}
}
speculative_retry::from_sstring(get_string(KW_SPECULATIVE_RETRY, speculative_retry(speculative_retry::type::NONE, 0).to_sstring()));
}

View File

@@ -13,6 +13,7 @@
#include "cql3/statements/cf_prop_defs.hh"
#include "cql3/column_identifier.hh"
#include "data_dictionary/data_dictionary.hh"
namespace cql3 {

View File

@@ -110,9 +110,6 @@ future<> modification_statement::check_access(query_processor& qp, const service
future<std::vector<mutation>>
modification_statement::get_mutations(query_processor& qp, const query_options& options, db::timeout_clock::time_point timeout, bool local, int64_t now, service::query_state& qs) const {
if (_restrictions->range_or_slice_eq_null(options)) { // See #7852 and #9290.
throw exceptions::invalid_request_exception("Invalid null value in condition for a key column");
}
auto cl = options.get_consistency();
auto json_cache = maybe_prepare_json_cache(options);
auto keys = build_partition_keys(options, json_cache);
@@ -255,6 +252,10 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
inc_cql_stats(qs.get_client_state().is_internal());
if (_restrictions->range_or_slice_eq_null(options)) { // See #7852 and #9290.
throw exceptions::invalid_request_exception("Invalid null value in condition for a key column");
}
if (has_conditions()) {
return execute_with_condition(qp, qs, options);
}
@@ -539,12 +540,8 @@ modification_statement::validate(query_processor&, const service::client_state&
}
}
bool modification_statement::depends_on_keyspace(const sstring& ks_name) const {
return keyspace() == ks_name;
}
bool modification_statement::depends_on_column_family(const sstring& cf_name) const {
return column_family() == cf_name;
bool modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return keyspace() == ks_name && (!cf_name || column_family() == *cf_name);
}
void modification_statement::add_operation(::shared_ptr<operation> op) {

View File

@@ -137,9 +137,7 @@ public:
// Validate before execute, using client state and current schema
void validate(query_processor&, const service::client_state& state) const override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
void add_operation(::shared_ptr<operation> op);

View File

@@ -45,12 +45,7 @@ future<> schema_altering_statement::grant_permissions_to_creator(const service::
return make_ready_future<>();
}
bool schema_altering_statement::depends_on_keyspace(const sstring& ks_name) const
{
return false;
}
bool schema_altering_statement::depends_on_column_family(const sstring& cf_name) const
bool schema_altering_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}

View File

@@ -53,9 +53,7 @@ protected:
*/
virtual future<> grant_permissions_to_creator(const service::client_state&) const;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual uint32_t get_bound_terms() const override;

View File

@@ -167,12 +167,8 @@ void select_statement::validate(query_processor&, const service::client_state& s
// Nothing to do, all validation has been done by raw_statemet::prepare()
}
bool select_statement::depends_on_keyspace(const sstring& ks_name) const {
return keyspace() == ks_name;
}
bool select_statement::depends_on_column_family(const sstring& cf_name) const {
return column_family() == cf_name;
bool select_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return keyspace() == ks_name && (!cf_name || column_family() == *cf_name);
}
const sstring& select_statement::keyspace() const {

View File

@@ -100,8 +100,7 @@ public:
virtual uint32_t get_bound_terms() const override;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;
virtual void validate(query_processor&, const service::client_state& state) const override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual future<::shared_ptr<cql_transport::messages::result_message>> execute(query_processor& qp,
service::query_state& state, const query_options& options) const override;

View File

@@ -17,13 +17,7 @@ uint32_t service_level_statement::get_bound_terms() const {
return 0;
}
bool service_level_statement::depends_on_keyspace(
const sstring &ks_name) const {
return false;
}
bool service_level_statement::depends_on_column_family(
const sstring &cf_name) const {
bool service_level_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return false;
}

View File

@@ -43,9 +43,7 @@ public:
uint32_t get_bound_terms() const override;
bool depends_on_keyspace(const sstring& ks_name) const override;
bool depends_on_column_family(const sstring& cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
future<> check_access(query_processor& qp, const service::client_state& state) const override;

View File

@@ -30,7 +30,7 @@ void sl_prop_defs::validate() {
data_value v = duration_type->deserialize(duration_type->from_string(*repr));
cql_duration duration = static_pointer_cast<const duration_type_impl>(duration_type)->from_value(v);
if (duration.months || duration.days) {
throw exceptions::invalid_request_exception("Timeout values cannot be longer than 24h");
throw exceptions::invalid_request_exception("Timeout values cannot be expressed in days/months");
}
if (duration.nanoseconds % 1'000'000 != 0) {
throw exceptions::invalid_request_exception("Timeout values must be expressed in millisecond granularity");

View File

@@ -39,12 +39,7 @@ std::unique_ptr<prepared_statement> truncate_statement::prepare(data_dictionary:
return std::make_unique<prepared_statement>(::make_shared<truncate_statement>(*this));
}
bool truncate_statement::depends_on_keyspace(const sstring& ks_name) const
{
return false;
}
bool truncate_statement::depends_on_column_family(const sstring& cf_name) const
bool truncate_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}

View File

@@ -30,9 +30,7 @@ public:
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
virtual bool depends_on_keyspace(const sstring& ks_name) const override;
virtual bool depends_on_column_family(const sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;

View File

@@ -46,12 +46,7 @@ std::unique_ptr<prepared_statement> use_statement::prepare(data_dictionary::data
}
bool use_statement::depends_on_keyspace(const sstring& ks_name) const
{
return false;
}
bool use_statement::depends_on_column_family(const sstring& cf_name) const
bool use_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const
{
return false;
}

View File

@@ -31,9 +31,7 @@ public:
virtual uint32_t get_bound_terms() const override;
virtual bool depends_on_keyspace(const seastar::sstring& ks_name) const override;
virtual bool depends_on_column_family(const seastar::sstring& cf_name) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual seastar::future<> check_access(query_processor& qp, const service::client_state& state) const override;

View File

@@ -18,6 +18,8 @@
#include "types/listlike_partial_deserializing_iterator.hh"
#include "utils/managed_bytes.hh"
#include "exceptions/exceptions.hh"
#include <boost/algorithm/string/trim_all.hpp>
#include <boost/algorithm/string.hpp>
static inline bool is_control_char(char c) {
return c >= 0 && c <= 0x1F;
@@ -78,8 +80,35 @@ static int64_t to_int64_t(const rjson::value& value) {
return value.GetInt();
} else if (value.IsUint()) {
return value.GetUint();
} else if (value.GetUint64()) {
} else if (value.IsUint64()) {
return value.GetUint64(); //NOTICE: large uint64_t values will get overflown
} else if (value.IsDouble()) {
// We allow specifing integer constants
// using scientific notation (for example 1.3e8)
// and floating-point numbers ending with .0 (for example 12.0),
// but not floating-point numbers with fractional part (12.34).
//
// The reason is that JSON standard does not have separate
// types for integers and floating-point numbers, only
// a single "number" type. Some serializers may
// produce an integer in that floating-point format.
double double_value = value.GetDouble();
// Check if the value contains disallowed fractional part (.34 from 12.34).
// With RapidJSON and an integer value in range [-(2^53)+1, (2^53)-1],
// the fractional part will be zero as the entire value
// fits in 53-bit significand. RapidJSON's parsing code does not lose accuracy:
// when parsing a number like 12.34e8, it accumulates 1234 to a int64_t number,
// then converts it to double and multiples by power of 10, never having any
// digit in fractional part.
double integral;
double fractional = std::modf(double_value, &integral);
if (fractional != 0.0 && fractional != -0.0) {
throw marshal_exception(format("Incorrect JSON floating-point value "
"for int64 type: {} (it should not contain fractional part {})", value, fractional));
}
return double_value;
}
throw marshal_exception(format("Incorrect JSON value for int64 type: {}", value));
}
@@ -197,6 +226,17 @@ struct from_json_object_visitor {
}
bytes operator()(const boolean_type_impl& t) {
if (!value.IsBool()) {
if (value.IsString()) {
std::string str(rjson::to_string_view(value));
boost::trim_all(str);
boost::to_lower(str);
if (str == "true") {
return t.decompose(true);
} else if (str == "false") {
return t.decompose(false);
}
}
throw marshal_exception(format("Invalid JSON object {}", value));
}
return t.decompose(value.GetBool());
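
The int64 parsing change earlier in this file accepts floating-point spellings of whole numbers (1.3e8, 12.0) and rejects values with a fractional part (12.34), using std::modf to test the fraction. A standalone sketch of that check, taking a plain double instead of an rjson value:

#include <cmath>
#include <cstdint>
#include <optional>
#include <cstdio>

// Returns the integer if 'v' is a whole number, std::nullopt otherwise.
std::optional<int64_t> whole_number_to_int64(double v) {
    double integral;
    double fractional = std::modf(v, &integral);
    if (fractional != 0.0) {          // 12.34 -> fractional 0.34 -> rejected
        return std::nullopt;
    }
    return static_cast<int64_t>(v);   // 1.3e8 and 12.0 are accepted
}

int main() {
    std::printf("%lld\n", (long long)*whole_number_to_int64(1.3e8));  // 130000000
    std::printf("%d\n", whole_number_to_int64(12.34).has_value());    // 0
}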

View File

@@ -74,6 +74,13 @@ std::unique_ptr<cql3::statements::raw::select_statement> build_select_statement(
/// forbids non-alpha-numeric characters in identifier names.
/// Quoting involves wrapping the string in double-quotes ("). A double-quote
/// character itself is quoted by doubling it.
/// maybe_quote() also quotes reserved CQL keywords (e.g., "to", "where")
/// but doesn't quote *unreserved* keywords (like ttl, int or as).
/// Note that this means that if new reserved keywords are added to the
/// parser, a saved output of maybe_quote() may no longer be parsable by
/// parser. To avoid this forward-compatibility issue, use quote() instead
/// of maybe_quote() - to unconditionally quote an identifier even if it is
/// lowercase and not (yet) a keyword.
sstring maybe_quote(const sstring& s);
// Check whether timestamp is not too far in the future as this probably

View File

@@ -11,6 +11,7 @@
*/
#include <chrono>
#include <exception>
#include <seastar/core/future-util.hh>
#include <seastar/core/do_with.hh>
#include <seastar/core/semaphore.hh>
@@ -247,6 +248,7 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
} catch (data_dictionary::no_such_keyspace& ex) {
// should probably ignore and drop the batch
} catch (...) {
blogger.warn("Replay failed (will retry): {}", std::current_exception());
// timeout, overload etc.
// Do _not_ remove the batch, assuning we got a node write error.
// Since we don't have hints (which origin is satisfied with),

View File

@@ -65,6 +65,25 @@ hinted_handoff_enabled_to_json(const db::config::hinted_handoff_enabled_type& h)
return value_to_json(h.to_configuration_string());
}
// Convert a value that can be printed with operator<<, or a vector of
// such values, to JSON. An example is enum_option<T>, because enum_option<T>
// has a operator<<.
template <typename T>
static json::json_return_type
printable_to_json(const T& e) {
return value_to_json(format("{}", e));
}
template <typename T>
static json::json_return_type
printable_vector_to_json(const std::vector<T>& e) {
std::vector<sstring> converted;
converted.reserve(e.size());
for (const auto& option : e) {
converted.push_back(format("{}", option));
}
return value_to_json(converted);
}
template <>
const config_type config_type_for<bool> = config_type("bool", value_to_json<bool>);
@@ -109,11 +128,11 @@ const config_type config_type_for<db::seed_provider_type> = config_type("seed pr
template <>
const config_type config_type_for<std::vector<enum_option<db::experimental_features_t>>> = config_type(
"experimental features", value_to_json<std::vector<sstring>>);
"experimental features", printable_vector_to_json<enum_option<db::experimental_features_t>>);
template <>
const config_type config_type_for<enum_option<db::tri_mode_restriction_t>> = config_type(
"restriction mode", value_to_json<sstring>);
"restriction mode", printable_to_json<enum_option<db::tri_mode_restriction_t>>);
template <>
const config_type config_type_for<db::config::hinted_handoff_enabled_type> = config_type("hinted handoff enabled", hinted_handoff_enabled_to_json);
@@ -862,6 +881,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Flush tables in the system_schema keyspace after schema modification. This is required for crash recovery, but slows down tests and can be disabled for them")
, restrict_replication_simplestrategy(this, "restrict_replication_simplestrategy", liveness::LiveUpdate, value_status::Used, db::tri_mode_restriction_t::mode::FALSE, "Controls whether to disable SimpleStrategy replication. Can be true, false, or warn.")
, restrict_dtcs(this, "restrict_dtcs", liveness::LiveUpdate, value_status::Used, db::tri_mode_restriction_t::mode::WARN, "Controls whether to prevent setting DateTieredCompactionStrategy. Can be true, false, or warn.")
, cache_index_pages(this, "cache_index_pages", liveness::LiveUpdate, value_status::Used, true,
"Keep SSTable index pages in the global cache after a SSTable read. Expected to improve performance for workloads with big partitions, but may degrade performance for workloads with small partitions.")
, default_log_level(this, "default_log_level", value_status::Used)
, logger_log_level(this, "logger_log_level", value_status::Used)
, log_to_stdout(this, "log_to_stdout", value_status::Used)

View File

@@ -365,6 +365,9 @@ public:
named_value<tri_mode_restriction> restrict_replication_simplestrategy;
named_value<tri_mode_restriction> restrict_dtcs;
named_value<bool> cache_index_pages;
seastar::logging_settings logging_settings(const log_cli::options&) const;
const db::extensions& extensions() const;

View File

@@ -574,12 +574,8 @@ public:
}
future<> flush_schemas() {
return _qp.proxy().get_db().invoke_on_all([this] (replica::database& db) {
return parallel_for_each(db::schema_tables::all_table_names(schema_features::full()), [this, &db](const sstring& cf_name) {
auto& cf = db.find_column_family(db::schema_tables::NAME, cf_name);
return cf.flush();
});
});
auto& db = _qp.db().real_database();
return db.flush_on_all(db::schema_tables::NAME, db::schema_tables::all_table_names(schema_features::full()));
}
future<> migrate() {

View File

@@ -1042,12 +1042,9 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std:
co_await proxy.local().mutate_locally(std::move(mutations), tracing::trace_state_ptr());
if (do_flush) {
co_await proxy.local().get_db().invoke_on_all([&] (replica::database& db) -> future<> {
auto& cfs = column_families;
co_await parallel_for_each(cfs.begin(), cfs.end(), [&] (const utils::UUID& id) -> future<> {
auto& cf = db.find_column_family(id);
co_await cf.flush();
});
auto& db = proxy.local().local_db();
co_await parallel_for_each(column_families, [&db] (const utils::UUID& id) -> future<> {
return db.flush_on_all(id);
});
}

View File

@@ -11,6 +11,8 @@
*/
#include <boost/range/adaptors.hpp>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include "db/snapshot-ctl.hh"
#include "replica/database.hh"
@@ -59,24 +61,21 @@ future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_
boost::copy(_db.local().get_keyspaces() | boost::adaptors::map_keys, std::back_inserter(keyspace_names));
};
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), sf, this] {
return parallel_for_each(keyspace_names, [tag, this] (auto& ks_name) {
return check_snapshot_not_exist(ks_name, tag);
}).then([this, tag, keyspace_names, sf] {
return _db.invoke_on_all([tag = std::move(tag), keyspace_names, sf] (replica::database& db) {
return parallel_for_each(keyspace_names, [&db, tag = std::move(tag), sf] (auto& ks_name) {
auto& ks = db.find_keyspace(ks_name);
return parallel_for_each(ks.metadata()->cf_meta_data(), [&db, tag = std::move(tag), sf] (auto& pair) {
auto& cf = db.find_column_family(pair.second);
return cf.snapshot(db, tag, bool(sf));
});
});
});
});
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), sf, this] () mutable {
return do_take_snapshot(std::move(tag), std::move(keyspace_names), sf);
});
}
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf) {
future<> snapshot_ctl::do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf) {
co_await parallel_for_each(keyspace_names, [tag, this] (const auto& ks_name) {
return check_snapshot_not_exist(ks_name, tag);
});
co_await parallel_for_each(keyspace_names, [this, tag = std::move(tag), sf] (const auto& ks_name) {
return _db.local().snapshot_on_all(ks_name, tag, bool(sf));
});
}
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf, allow_view_snapshots av) {
if (ks_name.empty()) {
throw std::runtime_error("You must supply a keyspace name");
}
@@ -87,25 +86,25 @@ future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<
throw std::runtime_error("You must supply a snapshot name.");
}
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), sf] {
return check_snapshot_not_exist(ks_name, tag, tables).then([this, ks_name, tables, tag, sf] {
return do_with(std::vector<sstring>(std::move(tables)),[this, ks_name, tag, sf](const std::vector<sstring>& tables) {
return do_for_each(tables, [ks_name, tag, sf, this] (const sstring& table_name) {
if (table_name.find(".") != sstring::npos) {
throw std::invalid_argument("Cannot take a snapshot of a secondary index by itself. Run snapshot on the table that owns the index.");
}
return _db.invoke_on_all([ks_name, table_name, tag, sf] (replica::database &db) {
auto& cf = db.find_column_family(ks_name, table_name);
return cf.snapshot(db, tag, bool(sf));
});
});
});
});
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), sf, av] () mutable {
return do_take_column_family_snapshot(std::move(ks_name), std::move(tables), std::move(tag), sf, av);
});
}
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, sstring cf_name, sstring tag, skip_flush sf) {
return take_column_family_snapshot(ks_name, std::vector<sstring>{cf_name}, tag, sf);
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf, allow_view_snapshots av) {
co_await check_snapshot_not_exist(ks_name, tag, tables);
for (const auto& table_name : tables) {
auto& cf = _db.local().find_column_family(ks_name, table_name);
if (cf.schema()->is_view() && !av) {
throw std::invalid_argument("Do not take a snapshot of a materialized view or a secondary index by itself. Run snapshot on the base table instead.");
}
}
co_await _db.local().snapshot_on_all(ks_name, std::move(tables), std::move(tag), bool(sf));
}
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, sstring cf_name, sstring tag, skip_flush sf, allow_view_snapshots av) {
return take_column_family_snapshot(ks_name, std::vector<sstring>{cf_name}, tag, sf, av);
}
future<> snapshot_ctl::clear_snapshot(sstring tag, std::vector<sstring> keyspace_names, sstring cf_name) {

View File

@@ -27,6 +27,7 @@ namespace db {
class snapshot_ctl : public peering_sharded_service<snapshot_ctl> {
public:
using skip_flush = bool_class<class skip_flush_tag>;
using allow_view_snapshots = bool_class<class allow_view_snapsots_tag>;
struct snapshot_details {
int64_t live;
@@ -64,7 +65,7 @@ public:
* @param tables a vector of tables names to snapshot
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no);
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no, allow_view_snapshots av = allow_view_snapshots::no);
/**
* Takes the snapshot of a specific column family. A snapshot name must be specified.
@@ -73,7 +74,7 @@ public:
* @param columnFamilyName the column family to snapshot
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_column_family_snapshot(sstring ks_name, sstring cf_name, sstring tag, skip_flush sf = skip_flush::no);
future<> take_column_family_snapshot(sstring ks_name, sstring cf_name, sstring tag, skip_flush sf = skip_flush::no, allow_view_snapshots av = allow_view_snapshots::no);
/**
* Remove the snapshot with the given name from the given keyspaces.
@@ -97,6 +98,9 @@ private:
template <typename Func>
std::result_of_t<Func()> run_snapshot_list_operation(Func&&);
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf = skip_flush::no);
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no, allow_view_snapshots av = allow_view_snapshots::no);
};
}
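For readers unfamiliar with the bool_class idiom used for the new allow_view_snapshots flag: it is seastar's strongly typed boolean, so the flag cannot be confused with skip_flush at a call site. A small self-contained sketch of the idiom (the function below is hypothetical, not part of snapshot_ctl):

// Standalone sketch of seastar::bool_class strong booleans.
#include <seastar/util/bool_class.hh>

using skip_flush = seastar::bool_class<class skip_flush_tag>;
using allow_view_snapshots = seastar::bool_class<class allow_view_snapshots_tag>;

void take_snapshot_sketch(skip_flush sf, allow_view_snapshots av) {
    if (bool(av)) { /* view snapshot explicitly allowed */ }
    if (!bool(sf)) { /* flush before snapshotting */ }
}

int main() {
    // Each flag is its own type with ::yes / ::no; swapping the argument
    // order or passing a bare `true` would not compile.
    take_snapshot_sketch(skip_flush::no, allow_view_snapshots::yes);
}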

View File

@@ -2482,10 +2482,14 @@ class db_config_table final : public streaming_virtual_table {
for (auto& c_ref : cfg.values()) {
auto& c = c_ref.get();
if (c.name() == name) {
if (c.set_value(value, utils::config_file::config_source::CQL)) {
return cfg.broadcast_to_all_shards();
} else {
return make_exception_future<>(virtual_table_update_exception("option is not live-updateable"));
try {
if (c.set_value(value, utils::config_file::config_source::CQL)) {
return cfg.broadcast_to_all_shards();
} else {
return make_exception_future<>(virtual_table_update_exception("option is not live-updateable"));
}
} catch (boost::bad_lexical_cast&) {
return make_exception_future<>(virtual_table_update_exception("cannot parse option value"));
}
}
}
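The new try/catch matters because option parsing goes through Boost lexical casts: an UPDATE that writes, say, a non-numeric string into a numeric option would otherwise let boost::bad_lexical_cast escape the virtual table. A minimal standalone illustration of that failure mode (not the Scylla code path itself):

#include <boost/lexical_cast.hpp>
#include <iostream>
#include <string>

int main() {
    try {
        // Mirrors what a numeric config option does internally with a bad value.
        auto v = boost::lexical_cast<int>(std::string("not-a-number"));
        std::cout << v << '\n';
    } catch (const boost::bad_lexical_cast&) {
        // The patched code maps this to virtual_table_update_exception
        // ("cannot parse option value") instead of failing the query path.
        std::cout << "cannot parse option value\n";
    }
}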

View File

@@ -10,6 +10,7 @@
#include <seastar/core/seastar.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/reactor.hh>
#include <utility>
#include <optional>
#include "dht/token.hh"

View File

@@ -121,6 +121,9 @@ const column_definition* view_info::view_column(const column_definition& base_de
void view_info::set_base_info(db::view::base_info_ptr base_info) {
_base_info = std::move(base_info);
// Forget the cached objects which may refer to the base schema.
_select_statement = nullptr;
_partition_slice = std::nullopt;
}
// A constructor for a base info that can facilitate reads and writes from the materialized view.
@@ -322,7 +325,11 @@ public:
view_filter_checking_visitor(const schema& base, const view_info& view)
: _base(base)
, _view(view)
, _selection(cql3::selection::selection::wildcard(_base.shared_from_this()))
, _selection(cql3::selection::selection::for_columns(_base.shared_from_this(),
boost::copy_range<std::vector<const column_definition*>>(
_base.regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return &cdef; }))
)
)
{}
void accept_new_partition(const partition_key& key, uint64_t row_count) {
@@ -859,13 +866,18 @@ void view_updates::generate_update(
bool same_row = true;
for (auto col_id : col_ids) {
auto* after = update.cells().find_cell(col_id);
// Note: multi-cell columns can't be part of the primary key.
auto& cdef = _base->regular_column_at(col_id);
if (existing) {
auto* before = existing->cells().find_cell(col_id);
// Note that this cell is necessarily atomic, because col_ids are
// view key columns, and keys must be atomic.
if (before && before->as_atomic_cell(cdef).is_live()) {
if (after && after->as_atomic_cell(cdef).is_live()) {
auto cmp = compare_atomic_cell_for_merge(before->as_atomic_cell(cdef), after->as_atomic_cell(cdef));
// We need to compare just the values of the keys, not
// metadata like the timestamp. This is because below,
// if the old and new view row have the same key, we need
// to be sure to reach the update_entry() case.
auto cmp = compare_unsigned(before->as_atomic_cell(cdef).value(), after->as_atomic_cell(cdef).value());
if (cmp != 0) {
same_row = false;
}
@@ -885,7 +897,13 @@ void view_updates::generate_update(
if (same_row) {
update_entry(base_key, update, *existing, now);
} else {
replace_entry(base_key, update, *existing, now);
// This code doesn't work if the old and new view row have the
// same key, because if they do we get both data and tombstone
// for the same timestamp (now) and the tombstone wins. This
// is why we need the "same_row" case above - it's not just a
// performance optimization.
delete_old_entry(base_key, *existing, update, now);
create_entry(base_key, update, now);
}
} else {
delete_old_entry(base_key, *existing, update, now);
@@ -1293,7 +1311,7 @@ future<> mutate_MV(
auto mut_ptr = remote_endpoints.empty() ? std::make_unique<frozen_mutation>(std::move(mut.fm)) : std::make_unique<frozen_mutation>(mut.fm);
tracing::trace(tr_state, "Locally applying view update for {}.{}; base token = {}; view token = {}",
mut.s->ks_name(), mut.s->cf_name(), base_token, view_token);
local_view_update = service::get_local_storage_proxy().mutate_locally(mut.s, *mut_ptr, std::move(tr_state), db::commitlog::force_sync::no).then_wrapped(
local_view_update = service::get_local_storage_proxy().mutate_locally(mut.s, *mut_ptr, tr_state, db::commitlog::force_sync::no).then_wrapped(
[s = mut.s, &stats, &cf_stats, tr_state, base_token, view_token, my_address, mut_ptr = std::move(mut_ptr),
units = sem_units.split(sem_units.count())] (future<>&& f) {
--stats.writes;
@@ -2031,15 +2049,21 @@ public:
// Called in the context of a seastar::thread.
void view_builder::execute(build_step& step, exponential_backoff_retry r) {
gc_clock::time_point now = gc_clock::now();
auto consumer = compact_for_query<emit_only_live_rows::yes, view_builder::consumer>(
auto compaction_state = make_lw_shared<compact_for_query_state<emit_only_live_rows::yes>>(
*step.reader.schema(),
now,
step.pslice,
batch_size,
query::max_partitions,
view_builder::consumer{*this, step, now});
consumer.consume_new_partition(step.current_key); // Initialize the state in case we're resuming a partition
query::max_partitions);
auto consumer = compact_for_query<emit_only_live_rows::yes, view_builder::consumer>(compaction_state, view_builder::consumer{*this, step, now});
auto built = step.reader.consume_in_thread(std::move(consumer));
if (auto ds = std::move(*compaction_state).detach_state()) {
auto& range_tombstones = std::get<std::deque<range_tombstone>>(ds->range_tombstones);
for (auto& rt : range_tombstones) {
step.reader.unpop_mutation_fragment(mutation_fragment(*step.reader.schema(), step.reader.permit(), std::move(rt)));
}
step.reader.unpop_mutation_fragment(mutation_fragment(*step.reader.schema(), step.reader.permit(), std::move(ds->partition_start)));
}
_as.check();

View File

@@ -154,10 +154,7 @@ private:
void delete_old_entry(const partition_key& base_key, const clustering_row& existing, const clustering_row& update, gc_clock::time_point now);
void do_delete_old_entry(const partition_key& base_key, const clustering_row& existing, const clustering_row& update, gc_clock::time_point now);
void update_entry(const partition_key& base_key, const clustering_row& update, const clustering_row& existing, gc_clock::time_point now);
void replace_entry(const partition_key& base_key, const clustering_row& update, const clustering_row& existing, gc_clock::time_point now) {
create_entry(base_key, update, now);
delete_old_entry(base_key, existing, update, now);
}
void update_entry_for_computed_column(const partition_key& base_key, const clustering_row& update, const std::optional<clustering_row>& existing, gc_clock::time_point now);
};
class view_update_builder {

View File

@@ -16,6 +16,7 @@
#include "db/view/row_locking.hh"
#include <seastar/core/abort_source.hh>
#include "mutation.hh"
#include <seastar/core/circular_buffer.hh>
class evictable_reader_handle;

View File

@@ -202,6 +202,12 @@ public:
});
}
future<flush_permit> get_all_flush_permits() {
return get_units(_background_work_flush_serializer, _max_background_work).then([this] (auto&& units) {
return this->get_flush_permit(std::move(units));
});
}
bool has_extraneous_flushes_requested() const {
return _extraneous_flushes > 0;
}

View File

@@ -123,10 +123,14 @@ WantedBy=multi-user.target
# - Storage: /path/to/file (inaccessible)
# - Storage: /path/to/file
#
# After systemd-v248, the coredump file output changed to:
# - Storage: /path/to/file (present)
# We need to support both versions.
#
# reference: https://github.com/systemd/systemd/commit/47f50642075a7a215c9f7b600599cbfee81a2913
corefail = False
res = re.findall(r'Storage: (.*)$', coreinfo, flags=re.MULTILINE)
res = re.findall(r'Storage: (\S+)(?: \(.+\))?$', coreinfo, flags=re.MULTILINE)
# v232 or later
if res:
corepath = res[0]

View File

@@ -70,7 +70,17 @@ if __name__ == '__main__':
network_mode = args.mode if args.mode else cfg.get('NETWORK_MODE')
if args.setup_nic_and_disks:
rps_cpus = run('{} --tune net --nic {} --get-cpu-mask'.format(perftune_base_command(), ifname), shell=True, check=True, capture_output=True, encoding='utf-8').stdout.strip()
res = run('{} --tune net --nic {} --get-cpu-mask'.format(perftune_base_command(), ifname), shell=True, check=True, capture_output=True, encoding='utf-8').stdout
# we need to extract CPU mask from output, since perftune.py may also print warning messages (#10082)
match = re.match('(.*\n)?(0x[0-9a-f]+(?:,0x[0-9a-f]+)*)', res, re.DOTALL)
try:
warning = match.group(1)
rps_cpus = match.group(2)
except:
raise Exception(f'Failed to retrieve CPU mask: {res}')
# print warning message if available
if warning:
print(warning.strip())
if len(rps_cpus) > 0:
cpuset = hex2list(rps_cpus)
run('/opt/scylladb/scripts/scylla_cpuset_setup --cpuset {}'.format(cpuset), shell=True, check=True)

View File

@@ -6,12 +6,16 @@ is_nonroot() {
[ -f "$scylladir"/SCYLLA-NONROOT-FILE ]
}
is_container() {
[ -f "$scylladir"/SCYLLA-CONTAINER-FILE ]
}
is_privileged() {
[ ${EUID:-${UID}} = 0 ]
}
execsudo() {
if is_nonroot; then
if is_nonroot || is_container; then
exec "$@"
else
exec sudo -u scylla -g scylla "$@"

View File

@@ -82,15 +82,17 @@ run bash -ec "echo 'debconf debconf/frontend select Noninteractive' | debconf-se
run bash -ec "rm -rf /etc/rsyslog.conf"
run apt-get -y install hostname supervisor openssh-server openssh-client openjdk-11-jre-headless python python-yaml curl rsyslog locales sudo
run locale-gen en_US.UTF-8
run update-locale LANG=en_US.UTF-8 LANGUAGE=en_US:en LC_ALL=en_US.UTF_8
run update-locale LANG=en_US.UTF-8 LANGUAGE=en_US:en LC_ALL=en_US.UTF-8
run bash -ec "dpkg -i packages/*.deb"
run apt-get -y clean all
run bash -ec "cat /scylla_bashrc >> /etc/bash.bashrc"
run mkdir -p /etc/supervisor.conf.d
run mkdir -p /var/log/scylla
run chown -R scylla:scylla /var/lib/scylla
run sed -i -e 's/^SCYLLA_ARGS=".*"$/SCYLLA_ARGS="--log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix"/' /etc/default/scylla-server
run mkdir -p /opt/scylladb/supervisor
run touch /opt/scylladb/SCYLLA-CONTAINER-FILE
bcp dist/common/supervisor/scylla-server.sh /opt/scylladb/supervisor/scylla-server.sh
bcp dist/common/supervisor/scylla-jmx.sh /opt/scylladb/supervisor/scylla-jmx.sh
bcp dist/common/supervisor/scylla-node-exporter.sh /opt/scylladb/supervisor/scylla-node-exporter.sh

View File

@@ -1,4 +1,4 @@
[program:scylla-server]
[program:scylla]
command=/opt/scylladb/supervisor/scylla-server.sh
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0

View File

@@ -1,41 +0,0 @@
# choose following mode: virtio, dpdk, posix
NETWORK_MODE=posix
# tap device name(virtio)
TAP=tap0
# bridge device name (virtio)
BRIDGE=virbr0
# ethernet device name
IFNAME=eth0
# setup NIC's and disks' interrupts, RPS, XPS, nomerges and I/O scheduler (posix)
SET_NIC_AND_DISKS=no
# ethernet device driver (dpdk)
ETHDRV=
# ethernet device PCI ID (dpdk)
ETHPCIID=
# number of hugepages
NR_HUGEPAGES=64
# user for process (must be root for dpdk)
USER=scylla
# group for process
GROUP=scylla
# scylla home dir
SCYLLA_HOME=/var/lib/scylla
# scylla config dir
SCYLLA_CONF=/etc/scylla
# scylla arguments
SCYLLA_ARGS="--log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix"
# setup as AMI instance
AMI=no

View File

@@ -32,7 +32,7 @@
logging::logger fmr_logger("flat_mutation_reader");
flat_mutation_reader& flat_mutation_reader::operator=(flat_mutation_reader&& o) noexcept {
if (_impl) {
if (_impl && _impl->is_close_required()) {
impl* ip = _impl.get();
// Abort to enforce calling close() before readers are closed
// to prevent leaks and potential use-after-free due to background
@@ -45,7 +45,7 @@ flat_mutation_reader& flat_mutation_reader::operator=(flat_mutation_reader&& o)
}
flat_mutation_reader::~flat_mutation_reader() {
if (_impl) {
if (_impl && _impl->is_close_required()) {
impl* ip = _impl.get();
// Abort to enforce calling close() before readers are closed
// to prevent leaks and potential use-after-free due to background
@@ -1580,6 +1580,9 @@ bool mutation_fragment_stream_validator::operator()(dht::token t) {
}
bool mutation_fragment_stream_validator::operator()(mutation_fragment_v2::kind kind, position_in_partition_view pos) {
if (kind == mutation_fragment_v2::kind::partition_end && _current_tombstone) {
return false;
}
if (_prev_kind == mutation_fragment_v2::kind::partition_end) {
const bool valid = (kind == mutation_fragment_v2::kind::partition_start);
if (valid) {
@@ -1607,7 +1610,11 @@ bool mutation_fragment_stream_validator::operator()(mutation_fragment::kind kind
}
bool mutation_fragment_stream_validator::operator()(const mutation_fragment_v2& mf) {
return (*this)(mf.mutation_fragment_kind(), mf.position());
const auto valid = (*this)(mf.mutation_fragment_kind(), mf.position());
if (valid && mf.is_range_tombstone_change()) {
_current_tombstone = mf.as_range_tombstone_change().tombstone();
}
return valid;
}
bool mutation_fragment_stream_validator::operator()(const mutation_fragment& mf) {
return (*this)(to_mutation_fragment_kind_v2(mf.mutation_fragment_kind()), mf.position());
@@ -1646,11 +1653,17 @@ void mutation_fragment_stream_validator::reset(dht::decorated_key dk) {
_prev_partition_key = dk;
_prev_pos = position_in_partition::for_partition_start();
_prev_kind = mutation_fragment_v2::kind::partition_start;
_current_tombstone = {};
}
void mutation_fragment_stream_validator::reset(const mutation_fragment_v2& mf) {
_prev_pos = mf.position();
_prev_kind = mf.mutation_fragment_kind();
if (mf.is_range_tombstone_change()) {
_current_tombstone = mf.as_range_tombstone_change().tombstone();
} else {
_current_tombstone = {};
}
}
void mutation_fragment_stream_validator::reset(const mutation_fragment& mf) {
_prev_pos = mf.position();
@@ -1719,6 +1732,11 @@ bool mutation_fragment_stream_validating_filter::operator()(mutation_fragment_v2
fmr_logger.debug("[validator {}] {}:{}", static_cast<void*>(this), kind, pos);
if (kind == mutation_fragment_v2::kind::partition_end && _current_tombstone) {
on_validation_error(fmr_logger, format("[validator {} for {}] Unexpected active tombstone at partition-end: partition key {}: tombstone {}",
static_cast<void*>(this), _name, _validator.previous_partition_key(), _current_tombstone));
}
if (_validation_level >= mutation_fragment_stream_validation_level::clustering_key) {
valid = _validator(kind, pos);
} else {
@@ -1745,7 +1763,11 @@ bool mutation_fragment_stream_validating_filter::operator()(mutation_fragment::k
}
bool mutation_fragment_stream_validating_filter::operator()(const mutation_fragment_v2& mv) {
return (*this)(mv.mutation_fragment_kind(), mv.position());
auto valid = (*this)(mv.mutation_fragment_kind(), mv.position());
if (valid && mv.is_range_tombstone_change()) {
_current_tombstone = mv.as_range_tombstone_change().tombstone();
}
return valid;
}
bool mutation_fragment_stream_validating_filter::operator()(const mutation_fragment& mv) {
return (*this)(to_mutation_fragment_kind_v2(mv.mutation_fragment_kind()), mv.position());
@@ -1764,7 +1786,7 @@ void mutation_fragment_stream_validating_filter::on_end_of_stream() {
}
flat_mutation_reader_v2& flat_mutation_reader_v2::operator=(flat_mutation_reader_v2&& o) noexcept {
if (_impl) {
if (_impl && _impl->is_close_required()) {
impl* ip = _impl.get();
// Abort to enforce calling close() before readers are closed
// to prevent leaks and potential use-after-free due to background
@@ -1777,7 +1799,7 @@ flat_mutation_reader_v2& flat_mutation_reader_v2::operator=(flat_mutation_reader
}
flat_mutation_reader_v2::~flat_mutation_reader_v2() {
if (_impl) {
if (_impl && _impl->is_close_required()) {
impl* ip = _impl.get();
// Abort to enforce calling close() before readers are closed
// to prevent leaks and potential use-after-free due to background

View File

@@ -132,6 +132,7 @@ public:
private:
tracked_buffer _buffer;
size_t _buffer_size = 0;
bool _close_required = false;
protected:
size_t max_buffer_size_in_bytes = default_max_buffer_size_in_bytes();
bool _end_of_stream = false;
@@ -167,6 +168,8 @@ public:
bool is_end_of_stream() const { return _end_of_stream; }
bool is_buffer_empty() const { return _buffer.empty(); }
bool is_buffer_full() const { return _buffer_size >= max_buffer_size_in_bytes; }
bool is_close_required() const { return _close_required; }
void set_close_required() { _close_required = true; }
static constexpr size_t default_max_buffer_size_in_bytes() { return 8 * 1024; }
mutation_fragment pop_mutation_fragment() {
@@ -504,9 +507,15 @@ public:
//
// Can be used to skip over entire partitions if interleaved with
// `operator()()` calls.
future<> next_partition() { return _impl->next_partition(); }
future<> next_partition() {
_impl->set_close_required();
return _impl->next_partition();
}
future<> fill_buffer() { return _impl->fill_buffer(); }
future<> fill_buffer() {
_impl->set_close_required();
return _impl->fill_buffer();
}
// Changes the range of partitions to pr. The range can only be moved
// forwards. pr.begin() needs to be larger than pr.end() of the previously
@@ -515,6 +524,7 @@ public:
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
future<> fast_forward_to(const dht::partition_range& pr) {
_impl->set_close_required();
return _impl->fast_forward_to(pr);
}
// Skips to a later range of rows.
@@ -544,6 +554,7 @@ public:
// In particular one must first enter a partition by fetching a `partition_start`
// fragment before calling `fast_forward_to`.
future<> fast_forward_to(position_range cr) {
_impl->set_close_required();
return _impl->fast_forward_to(std::move(cr));
}
// Closes the reader.
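The intent of the new _close_required flag: a reader that never had a buffer-affecting entry point run on it (fill_buffer, next_partition, fast_forward_to) has nothing in flight, so destroying it without close() is harmless; once one of those operations runs, dropping the reader without close() aborts. A tiny standalone sketch of the same guard pattern (not the reader class itself):

#include <cstdlib>
#include <iostream>

// Hedged sketch: a resource that only insists on close() after it was used.
class guarded_resource {
    bool _close_required = false;
    bool _closed = false;
public:
    void use() { _close_required = true; /* background work may start here */ }
    void close() { _closed = true; /* background work is drained here */ }
    ~guarded_resource() {
        if (_close_required && !_closed) {
            std::cerr << "destroyed without close() after use -- aborting\n";
            std::abort();   // mirrors the reader's abort-on-missing-close policy
        }
    }
};

int main() {
    { guarded_resource r; }                        // never used: fine to drop silently
    { guarded_resource r; r.use(); r.close(); }    // used and closed: fine
    // { guarded_resource r; r.use(); }            // would abort, like a leaked reader
}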

View File

@@ -164,6 +164,7 @@ public:
private:
tracked_buffer _buffer;
size_t _buffer_size = 0;
bool _close_required = false;
protected:
size_t max_buffer_size_in_bytes = default_max_buffer_size_in_bytes();
@@ -205,6 +206,8 @@ public:
bool is_end_of_stream() const { return _end_of_stream; }
bool is_buffer_empty() const { return _buffer.empty(); }
bool is_buffer_full() const { return _buffer_size >= max_buffer_size_in_bytes; }
bool is_close_required() const { return _close_required; }
void set_close_required() { _close_required = true; }
static constexpr size_t default_max_buffer_size_in_bytes() { return 8 * 1024; }
mutation_fragment_v2 pop_mutation_fragment() {
@@ -542,9 +545,15 @@ public:
//
// Can be used to skip over entire partitions if interleaved with
// `operator()()` calls.
future<> next_partition() { return _impl->next_partition(); }
future<> next_partition() {
_impl->set_close_required();
return _impl->next_partition();
}
future<> fill_buffer() { return _impl->fill_buffer(); }
future<> fill_buffer() {
_impl->set_close_required();
return _impl->fill_buffer();
}
// Changes the range of partitions to pr. The range can only be moved
// forwards. pr.begin() needs to be larger than pr.end() of the previously
@@ -553,6 +562,7 @@ public:
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
future<> fast_forward_to(const dht::partition_range& pr) {
_impl->set_close_required();
return _impl->fast_forward_to(pr);
}
// Skips to a later range of rows.
@@ -582,6 +592,7 @@ public:
// In particular one must first enter a partition by fetching a `partition_start`
// fragment before calling `fast_forward_to`.
future<> fast_forward_to(position_range cr) {
_impl->set_close_required();
return _impl->fast_forward_to(std::move(cr));
}
// Closes the reader.

View File

@@ -143,7 +143,7 @@ export LD_LIBRARY_PATH="$prefix/libreloc"
export UBSAN_OPTIONS="${UBSAN_OPTIONS:+$UBSAN_OPTIONS:}suppressions=$prefix/libexec/ubsan-suppressions.supp"
exec -a "\$0" "$prefix/libexec/$bin" "\$@"
EOF
chmod +x "$root/$prefix/bin/$bin"
chmod 755 "$root/$prefix/bin/$bin"
}
relocate_python3() {
@@ -156,11 +156,11 @@ relocate_python3() {
local pythonpath="$(dirname "$pythoncmd")"
if [ ! -x "$script" ]; then
cp "$script" "$install"
install -m755 "$script" "$install"
return
fi
mkdir -p "$relocateddir"
cp "$script" "$relocateddir"
install -d -m755 "$relocateddir"
install -m755 "$script" "$relocateddir"
cat > "$install"<<EOF
#!/usr/bin/env bash
[[ -z "\$LD_PRELOAD" ]] || { echo "\$0: not compatible with LD_PRELOAD" >&2; exit 110; }
@@ -178,7 +178,7 @@ if [ -f "\${DEBIAN_SSL_CERT_FILE}" ]; then
fi
PYTHONPATH="\${d}:\${d}/libexec:\$PYTHONPATH" PATH="\${d}/../bin:\${d}/$pythonpath:\${PATH}" SSL_CERT_FILE="\${c}" exec -a "\$0" "\${d}/libexec/\${b}" "\$@"
EOF
chmod +x "$install"
chmod 755 "$install"
}
install() {
@@ -392,6 +392,7 @@ install -d -m755 -d "$rprefix"/scyllatop
cp -r tools/scyllatop/* "$rprefix"/scyllatop
install -d -m755 -d "$rprefix"/scripts
cp -r dist/common/scripts/* "$rprefix"/scripts
chmod 755 "$rprefix"/scripts/*
ln -srf "$rprefix/scyllatop/scyllatop.py" "$rprefix/bin/scyllatop"
if $supervisor; then
install -d -m755 "$rprefix"/supervisor
@@ -508,8 +509,13 @@ relocate_python3 "$rprefix"/scripts fix_system_distributed_tables.py
if $supervisor; then
install -d -m755 `supervisor_dir $retc`
for service in scylla-server scylla-jmx scylla-node-exporter; do
if [ "$service" = "scylla-server" ]; then
program="scylla"
else
program=$service
fi
cat << EOS > `supervisor_conf $retc $service`
[program:$service]
[program:$program]
directory=$rprefix
command=/bin/bash -c './supervisor/$service.sh'
EOS

View File

@@ -34,6 +34,10 @@ azure_snitch::azure_snitch(const sstring& fname, unsigned io_cpuid) : production
}
future<> azure_snitch::load_config() {
if (this_shard_id() != io_cpu_id()) {
co_return;
}
sstring region = co_await azure_api_call(REGION_NAME_QUERY_PATH);
sstring azure_zone = co_await azure_api_call(ZONE_NAME_QUERY_PATH);

View File

@@ -1,5 +1,7 @@
#include "locator/ec2_snitch.hh"
#include <seastar/core/seastar.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/do_with.hh>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>
@@ -67,6 +69,30 @@ future<> ec2_snitch::start() {
}
future<sstring> ec2_snitch::aws_api_call(sstring addr, uint16_t port, sstring cmd) {
return do_with(int(0), [this, addr, port, cmd] (int& i) {
return repeat_until_value([this, addr, port, cmd, &i]() -> future<std::optional<sstring>> {
++i;
return aws_api_call_once(addr, port, cmd).then([] (auto res) {
return make_ready_future<std::optional<sstring>>(std::move(res));
}).handle_exception([&i] (auto ep) {
try {
std::rethrow_exception(ep);
} catch (const std::system_error &e) {
logger().error(e.what());
if (i >= AWS_API_CALL_RETRIES - 1) {
logger().error("Maximum number of retries exceeded");
throw e;
}
}
return sleep(AWS_API_CALL_RETRY_INTERVAL).then([] {
return make_ready_future<std::optional<sstring>>(std::nullopt);
});
});
});
});
}
future<sstring> ec2_snitch::aws_api_call_once(sstring addr, uint16_t port, sstring cmd) {
return connect(socket_address(inet_address{addr}, port))
.then([this, addr, cmd] (connected_socket fd) {
_sd = std::move(fd);
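The retry loop above is written with seastar's repeat_until_value() and sleep(); stripped of the asynchrony, the control flow is a bounded retry with a fixed pause. A synchronous sketch of the same policy, with a hypothetical fetch_once() standing in for aws_api_call_once():

#include <chrono>
#include <stdexcept>
#include <string>
#include <system_error>
#include <thread>

constexpr int retries = 5;                                  // cf. AWS_API_CALL_RETRIES
constexpr auto retry_interval = std::chrono::seconds{5};    // cf. AWS_API_CALL_RETRY_INTERVAL

// Hypothetical stand-in for aws_api_call_once(); fails the first two calls.
std::string fetch_once(const std::string& path) {
    static int calls = 0;
    if (++calls < 3) {
        throw std::system_error(std::make_error_code(std::errc::connection_refused));
    }
    return "us-east-1a";   // pretend metadata response
}

std::string fetch_with_retries(const std::string& path) {
    for (int i = 0; i < retries; ++i) {
        try {
            return fetch_once(path);
        } catch (const std::system_error&) {
            if (i == retries - 1) {
                throw;                  // "Maximum number of retries exceeded"
            }
        }
        std::this_thread::sleep_for(retry_interval);
    }
    throw std::logic_error("unreachable");
}

int main() {
    return fetch_with_retries("/latest/meta-data/placement/availability-zone").empty();
}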

View File

@@ -16,6 +16,8 @@ public:
static constexpr const char* ZONE_NAME_QUERY_REQ = "/latest/meta-data/placement/availability-zone";
static constexpr const char* AWS_QUERY_SERVER_ADDR = "169.254.169.254";
static constexpr uint16_t AWS_QUERY_SERVER_PORT = 80;
static constexpr int AWS_API_CALL_RETRIES = 5;
static constexpr auto AWS_API_CALL_RETRY_INTERVAL = std::chrono::seconds{5};
ec2_snitch(const sstring& fname = "", unsigned io_cpu_id = 0);
virtual future<> start() override;
@@ -32,5 +34,6 @@ private:
output_stream<char> _out;
http_response_parser _parser;
sstring _zone_req;
future<sstring> aws_api_call_once(sstring addr, uint16_t port, const sstring cmd);
};
} // namespace locator

main.cc
View File

@@ -367,11 +367,38 @@ static auto defer_verbose_shutdown(const char* what, Func&& func) {
startlog.info("Shutting down {}", what);
try {
func();
startlog.info("Shutting down {} was successful", what);
} catch (...) {
startlog.error("Unexpected error shutting down {}: {}", what, std::current_exception());
throw;
auto ex = std::current_exception();
bool do_abort = true;
try {
std::rethrow_exception(ex);
} catch (const std::system_error& e) {
// System error codes we consider "environmental",
// i.e. not scylla's fault, therefore there is no point in
// aborting and dumping core.
for (int i : {EIO, EACCES, ENOSPC}) {
if (e.code() == std::error_code(i, std::system_category())) {
do_abort = false;
break;
}
}
} catch (...) {
}
auto msg = fmt::format("Unexpected error shutting down {}: {}", what, ex);
if (do_abort) {
startlog.error("{}: aborting", msg);
abort();
} else {
startlog.error("{}: exiting, at {}", msg, current_backtrace());
// Call _exit() rather than exit() to exit immediately
// without calling exit handlers, avoiding
// boost::intrusive::detail::destructor_impl assert failure
// from ~segment_pool exit handler.
_exit(255);
}
}
startlog.info("Shutting down {} was successful", what);
};
auto ret = deferred_action(std::move(vfunc));
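The new error handling distinguishes "environmental" failures (disk I/O errors, permission problems, lack of space) from everything else: only the latter justify abort() and a core dump. A standalone sketch of just the classification step, using only <system_error> (the surrounding shutdown machinery is omitted):

#include <cerrno>
#include <exception>
#include <system_error>

// Hedged sketch: returns true for errors that are the environment's fault,
// mirroring the EIO/EACCES/ENOSPC whitelist in the hunk above.
static bool is_environmental(const std::exception_ptr& ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const std::system_error& e) {
        for (int err : {EIO, EACCES, ENOSPC}) {
            if (e.code() == std::error_code(err, std::system_category())) {
                return true;    // exit cleanly instead of dumping core
            }
        }
    } catch (...) {
        // Unknown exception type: treat as unexpected.
    }
    return false;               // caller aborts to capture a core dump
}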
@@ -547,6 +574,12 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
cfg->broadcast_to_all_shards().get();
// We pass this piece of config through a global as a temporary hack.
// See the comment at the definition of sstables::global_cache_index_pages.
smp::invoke_on_all([&cfg] {
sstables::global_cache_index_pages = cfg->cache_index_pages.operator utils::updateable_value<bool>();
}).get();
::sighup_handler sighup_handler(opts, *cfg);
auto stop_sighup_handler = defer_verbose_shutdown("sighup", [&] {
sighup_handler.stop().get();

View File

@@ -15,6 +15,7 @@
#include "sstables/shared_sstable.hh"
#include <seastar/core/future.hh>
#include <seastar/core/io_priority_class.hh>
#include "reader_permit.hh"
class memtable;
class flat_mutation_reader;

View File

@@ -438,6 +438,8 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
// should not be blocked by any data requests.
case messaging_verb::GROUP0_PEER_EXCHANGE:
case messaging_verb::GROUP0_MODIFY_CONFIG:
// ATTN -- if moving GOSSIP_ verbs elsewhere, mind updating the tcp_nodelay
// setting in get_rpc_client(), which assumes gossiper verbs live in idx 0
return 0;
case messaging_verb::PREPARE_MESSAGE:
case messaging_verb::PREPARE_DONE_MESSAGE:
@@ -695,7 +697,7 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
}();
auto must_tcp_nodelay = [&] {
if (idx == 1) {
if (idx == 0) {
return true; // gossip
}
if (_cfg.tcp_nodelay == tcp_nodelay_what::local) {
@@ -716,10 +718,7 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
}
opts.tcp_nodelay = must_tcp_nodelay;
opts.reuseaddr = true;
// We send cookies only for non-default statement tenant clients.
if (idx > 3) {
opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
}
opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
auto client = must_encrypt ?
::make_shared<rpc_protocol_client_wrapper>(_rpc->protocol(), std::move(opts),

View File

@@ -272,8 +272,8 @@ public:
future<> lookup_readers(db::timeout_clock::time_point timeout);
future<> save_readers(flat_mutation_reader::tracked_buffer unconsumed_buffer, detached_compaction_state compaction_state,
std::optional<clustering_key_prefix> last_ckey);
future<> save_readers(flat_mutation_reader::tracked_buffer unconsumed_buffer, std::optional<detached_compaction_state> compaction_state,
dht::decorated_key last_pkey, std::optional<clustering_key_prefix> last_ckey);
future<> stop();
};
@@ -580,19 +580,22 @@ future<> read_context::lookup_readers(db::timeout_clock::time_point timeout) {
});
}
future<> read_context::save_readers(flat_mutation_reader::tracked_buffer unconsumed_buffer, detached_compaction_state compaction_state,
std::optional<clustering_key_prefix> last_ckey) {
future<> read_context::save_readers(flat_mutation_reader::tracked_buffer unconsumed_buffer, std::optional<detached_compaction_state> compaction_state,
dht::decorated_key last_pkey, std::optional<clustering_key_prefix> last_ckey) {
if (_cmd.query_uuid == utils::UUID{}) {
return make_ready_future<>();
}
auto last_pkey = compaction_state.partition_start.key();
const auto cb_stats = dismantle_combined_buffer(std::move(unconsumed_buffer), last_pkey);
tracing::trace(_trace_state, "Dismantled combined buffer: {}", cb_stats);
const auto cs_stats = dismantle_compaction_state(std::move(compaction_state));
tracing::trace(_trace_state, "Dismantled compaction state: {}", cs_stats);
auto cs_stats = dismantle_buffer_stats{};
if (compaction_state) {
cs_stats = dismantle_compaction_state(std::move(*compaction_state));
tracing::trace(_trace_state, "Dismantled compaction state: {}", cs_stats);
} else {
tracing::trace(_trace_state, "No compaction state to dismantle, partition exhausted", cs_stats);
}
return do_with(std::move(last_pkey), std::move(last_ckey), [this] (const dht::decorated_key& last_pkey,
const std::optional<clustering_key_prefix>& last_ckey) {
@@ -745,16 +748,18 @@ future<typename ResultBuilder::result_type> do_query(
ResultBuilder&& result_builder) {
auto ctx = seastar::make_shared<read_context>(db, s, cmd, ranges, trace_state, timeout);
co_await ctx->lookup_readers(timeout);
std::exception_ptr ex;
try {
co_await ctx->lookup_readers(timeout);
auto [last_ckey, result, unconsumed_buffer, compaction_state] = co_await read_page<ResultBuilder>(ctx, s, cmd, ranges, trace_state,
std::move(result_builder));
if (compaction_state->are_limits_reached() || result.is_short_read()) {
co_await ctx->save_readers(std::move(unconsumed_buffer), std::move(*compaction_state).detach_state(), std::move(last_ckey));
// Must be called before calling `detach_state()`.
auto last_pkey = *compaction_state->current_partition();
co_await ctx->save_readers(std::move(unconsumed_buffer), std::move(*compaction_state).detach_state(), std::move(last_pkey), std::move(last_ckey));
}
co_await ctx->stop();

View File

@@ -167,6 +167,9 @@ class compact_mutation_state {
std::unique_ptr<mutation_compactor_garbage_collector> _collector;
compaction_stats _stats;
// Remember if we requested to stop mid-partition.
stop_iteration _stop = stop_iteration::no;
private:
template <typename Consumer, typename GCConsumer>
requires CompactedFragmentsConsumer<Consumer> && CompactedFragmentsConsumer<GCConsumer>
@@ -304,6 +307,7 @@ public:
}
void consume_new_partition(const dht::decorated_key& dk) {
_stop = stop_iteration::no;
auto& pk = dk.key();
_dk = &dk;
_return_static_content_on_partition_with_no_rows =
@@ -370,9 +374,9 @@ public:
_static_row_live = is_live;
if (is_live || (!only_live() && !sr.empty())) {
partition_is_not_empty(consumer);
return consumer.consume(std::move(sr), current_tombstone, is_live);
_stop = consumer.consume(std::move(sr), current_tombstone, is_live);
}
return stop_iteration::no;
return _stop;
}
template <typename Consumer, typename GCConsumer>
@@ -424,22 +428,21 @@ public:
};
if (only_live() && is_live) {
auto stop = consume_row();
_stop = consume_row();
if (++_rows_in_current_partition == _current_partition_limit) {
return stop_iteration::yes;
_stop = stop_iteration::yes;
}
return stop;
return _stop;
} else if (!only_live()) {
auto stop = stop_iteration::no;
if (!cr.empty()) {
stop = consume_row();
_stop = consume_row();
}
if (!sstable_compaction() && is_live && ++_rows_in_current_partition == _current_partition_limit) {
return stop_iteration::yes;
_stop = stop_iteration::yes;
}
return stop;
return _stop;
}
return stop_iteration::no;
return _stop;
}
template <typename Consumer, typename GCConsumer>
@@ -448,7 +451,8 @@ public:
++_stats.range_tombstones;
_range_tombstones.apply(rt);
// FIXME: drop tombstone if it is fully covered by other range tombstones
return do_consume(std::move(rt), consumer, gc_consumer);
_stop = do_consume(std::move(rt), consumer, gc_consumer);
return _stop;
}
template <typename Consumer, typename GCConsumer>
@@ -459,9 +463,9 @@ public:
_rt_assembler.emplace();
}
if (auto rt_opt = _rt_assembler->consume(_schema, std::move(rtc))) {
return do_consume(std::move(*rt_opt), consumer, gc_consumer);
_stop = do_consume(std::move(*rt_opt), consumer, gc_consumer);
}
return stop_iteration::no;
return _stop;
}
template <typename Consumer, typename GCConsumer>
@@ -562,16 +566,31 @@ public:
/// compactor will result in the new compactor being in the same state *this
/// is (given the same outside parameters of course). Practically this
/// allows the compaction state to be stored in the compacted reader.
detached_compaction_state detach_state() && {
/// If the currently compacted partition is exhausted a disengaged optional
/// is returned -- in this case there is no state to detach.
std::optional<detached_compaction_state> detach_state() && {
// If we exhausted the partition, there is no need to detach-restore the
// compaction state.
// We exhausted the partition if `consume_partition_end()` was called
// without us requesting the consumption to stop (remembered in _stop)
// from one of the consume() overloads.
// The consume algorithm calls `consume_partition_end()` in two cases:
// * on a partition-end fragment
// * consume() requested to stop
// In the latter case, the partition is not exhausted. Even if the next
// fragment to process is a partition-end, it will not be consumed.
if (!_stop) {
return {};
}
partition_start ps(std::move(_last_dk), _range_tombstones.get_partition_tombstone());
if (_rt_assembler) {
if (_current_tombstone) {
return {std::move(ps), std::move(_last_static_row), range_tombstone_change(position_in_partition_view::after_key(_last_clustering_pos), _current_tombstone)};
return detached_compaction_state{std::move(ps), std::move(_last_static_row), range_tombstone_change(position_in_partition_view::after_key(_last_clustering_pos), _current_tombstone)};
} else {
return {std::move(ps), std::move(_last_static_row), std::optional<range_tombstone_change>{}};
return detached_compaction_state{std::move(ps), std::move(_last_static_row), std::optional<range_tombstone_change>{}};
}
}
return {std::move(ps), std::move(_last_static_row), std::move(_range_tombstones).range_tombstones()};
return detached_compaction_state{std::move(ps), std::move(_last_static_row), std::move(_range_tombstones).range_tombstones()};
}
const compaction_stats& stats() const { return _stats; }

View File

@@ -28,6 +28,7 @@ class mutation_fragment_stream_validator {
mutation_fragment_v2::kind _prev_kind;
position_in_partition _prev_pos;
dht::decorated_key _prev_partition_key;
tombstone _current_tombstone;
public:
explicit mutation_fragment_stream_validator(const schema& s);
@@ -122,6 +123,12 @@ public:
const position_in_partition& previous_position() const {
return _prev_pos;
}
/// Get the current effective tombstone
///
/// Not meaningful when operator()(mutation_fragment_v2) is not used.
tombstone current_tombstone() const {
return _current_tombstone;
}
/// The previous valid partition key.
///
/// Only valid if `operator()(const dht::decorated_key&)` or
@@ -151,6 +158,7 @@ class mutation_fragment_stream_validating_filter {
mutation_fragment_stream_validator _validator;
sstring _name;
mutation_fragment_stream_validation_level _validation_level;
tombstone _current_tombstone;
public:
/// Constructor.

View File

@@ -826,6 +826,7 @@ public:
void apply(tombstone deleted_at) {
_deleted_at.apply(deleted_at);
maybe_shadow();
}
void apply(shadowable_tombstone deleted_at) {

View File

@@ -1581,11 +1581,7 @@ private:
tracing::global_trace_state_ptr _trace_state;
const mutation_reader::forwarding _fwd_mr;
reader_concurrency_semaphore::inactive_read_handle _irh;
bool _drop_partition_start = false;
bool _drop_static_row = false;
// Validate the partition key of the first emitted partition, set after the
// reader was recreated.
bool _validate_partition_key = false;
bool _reader_recreated = false; // set if reader was recreated since last operation
position_in_partition::tri_compare _tri_cmp;
std::optional<dht::decorated_key> _last_pkey;
@@ -1606,10 +1602,9 @@ private:
void adjust_partition_slice();
flat_mutation_reader_v2 recreate_reader();
future<flat_mutation_reader_v2> resume_or_create_reader();
void maybe_validate_partition_start(const flat_mutation_reader_v2::tracked_buffer& buffer);
void validate_partition_start(const partition_start& ps);
void validate_position_in_partition(position_in_partition_view pos) const;
bool should_drop_fragment(const mutation_fragment_v2& mf);
future<> do_fill_buffer();
void examine_first_fragments(mutation_fragment_v2_opt& mf1, mutation_fragment_v2_opt& mf2, mutation_fragment_v2_opt& mf3);
public:
evictable_reader_v2(
@@ -1725,9 +1720,6 @@ flat_mutation_reader_v2 evictable_reader_v2::recreate_reader() {
_range_override.reset();
_slice_override.reset();
_drop_partition_start = false;
_drop_static_row = false;
if (_last_pkey) {
bool partition_range_is_inclusive = true;
@@ -1736,11 +1728,8 @@ flat_mutation_reader_v2 evictable_reader_v2::recreate_reader() {
partition_range_is_inclusive = false;
break;
case partition_region::static_row:
_drop_partition_start = true;
break;
case partition_region::clustered:
_drop_partition_start = true;
_drop_static_row = true;
adjust_partition_slice();
slice = &*_slice_override;
break;
@@ -1763,7 +1752,7 @@ flat_mutation_reader_v2 evictable_reader_v2::recreate_reader() {
_range_override = dht::partition_range({dht::partition_range::bound(*_last_pkey, partition_range_is_inclusive)}, _pr->end());
range = &*_range_override;
_validate_partition_key = true;
_reader_recreated = true;
}
return _ms.make_reader_v2(
@@ -1788,41 +1777,33 @@ future<flat_mutation_reader_v2> evictable_reader_v2::resume_or_create_reader() {
co_return recreate_reader();
}
void evictable_reader_v2::maybe_validate_partition_start(const flat_mutation_reader_v2::tracked_buffer& buffer) {
if (!_validate_partition_key || buffer.empty()) {
return;
}
// If this is set we can assume the first fragment is a partition-start.
const auto& ps = buffer.front().as_partition_start();
void evictable_reader_v2::validate_partition_start(const partition_start& ps) {
const auto tri_cmp = dht::ring_position_comparator(*_schema);
// If we recreated the reader after fast-forwarding it we won't have
// _last_pkey set. In this case it is enough to check if the partition
// is in range.
if (_last_pkey) {
const auto cmp_res = tri_cmp(*_last_pkey, ps.key());
if (_drop_partition_start) { // we expect to continue from the same partition
if (_next_position_in_partition.region() != partition_region::partition_start) { // we expect to continue from the same partition
// We cannot assume the partition we stopped the read at is still alive
// when we recreate the reader. It might have been compacted away in the
// meanwhile, so allow for a larger partition too.
require(
cmp_res <= 0,
"{}(): validation failed, expected partition with key larger or equal to _last_pkey {} due to _drop_partition_start being set, but got {}",
"{}(): validation failed, expected partition with key larger or equal to _last_pkey {}, but got {}",
__FUNCTION__,
*_last_pkey,
ps.key());
// Reset drop flags and next pos if we are not continuing from the same partition
// Reset next pos if we are not continuing from the same partition
if (cmp_res < 0) {
// Close previous partition, we are not going to continue it.
push_mutation_fragment(*_schema, _permit, partition_end{});
_drop_partition_start = false;
_drop_static_row = false;
_next_position_in_partition = position_in_partition::for_partition_start();
}
} else { // should be a larger partition
require(
cmp_res < 0,
"{}(): validation failed, expected partition with key larger than _last_pkey {} due to _drop_partition_start being unset, but got {}",
"{}(): validation failed, expected partition with key larger than _last_pkey {}, but got {}",
__FUNCTION__,
*_last_pkey,
ps.key());
@@ -1836,8 +1817,6 @@ void evictable_reader_v2::maybe_validate_partition_start(const flat_mutation_rea
__FUNCTION__,
prange,
ps.key());
_validate_partition_key = false;
}
void evictable_reader_v2::validate_position_in_partition(position_in_partition_view pos) const {
@@ -1860,7 +1839,12 @@ void evictable_reader_v2::validate_position_in_partition(position_in_partition_v
const bool any_contains = std::any_of(ranges.begin(), ranges.end(), [this, &pos] (const query::clustering_range& cr) {
// TODO: somehow avoid this copy
auto range = position_range(cr);
return range.contains(*_schema, pos);
// We cannot use range.contains() because that treats range as a
// [a, b) range, meaning a range tombstone change with position
// after_key(b) will be considered outside of it. Such range
// tombstone changes can be emitted however when recreating the
// reader on clustering range edge.
return _tri_cmp(range.start(), pos) <= 0 && _tri_cmp(pos, range.end()) <= 0;
});
require(
any_contains,
@@ -1871,42 +1855,40 @@ void evictable_reader_v2::validate_position_in_partition(position_in_partition_v
}
}
bool evictable_reader_v2::should_drop_fragment(const mutation_fragment_v2& mf) {
if (_drop_partition_start && mf.is_partition_start()) {
_drop_partition_start = false;
return true;
void evictable_reader_v2::examine_first_fragments(mutation_fragment_v2_opt& mf1, mutation_fragment_v2_opt& mf2, mutation_fragment_v2_opt& mf3) {
if (!mf1) {
return; // the reader is at EOS
}
// Unlike partition-start above, a partition is not guaranteed to have a
// static row fragment. So reset the flag regardless of whether we could
// drop one or not.
// We are guaranteed to get here only right after dropping a partition-start,
// so if we are not seeing a static row here, the partition doesn't have one.
if (_drop_static_row) {
_drop_static_row = false;
return mf.is_static_row();
}
return false;
}
future<> evictable_reader_v2::do_fill_buffer() {
if (!_drop_partition_start && !_drop_static_row) {
auto fill_buf_fut = _reader->fill_buffer();
if (_validate_partition_key) {
fill_buf_fut = fill_buf_fut.then([this] {
maybe_validate_partition_start(_reader->buffer());
});
}
return fill_buf_fut;
// If engaged, the first fragment is always a partition-start.
validate_partition_start(mf1->as_partition_start());
if (_tri_cmp(mf1->position(), _next_position_in_partition) < 0) {
mf1 = {}; // drop mf1
}
const auto continue_same_partition = _next_position_in_partition.region() != partition_region::partition_start;
// If we have a first fragment, we are guaranteed to have a second one -- if nothing else, a partition-end.
if (mf2->is_end_of_partition()) {
return; // no further fragments, nothing to do
}
// We want to validate the position of the first non-dropped fragment.
// If mf2 is a static row and we need to drop it, this will be mf3.
if (mf2->is_static_row() && _tri_cmp(mf2->position(), _next_position_in_partition) < 0) {
mf2 = {}; // drop mf2
} else {
if (continue_same_partition) {
validate_position_in_partition(mf2->position());
}
return;
}
if (mf3->is_end_of_partition()) {
return; // no further fragments, nothing to do
} else if (continue_same_partition) {
validate_position_in_partition(mf3->position());
}
return repeat([this] {
return _reader->fill_buffer().then([this] {
maybe_validate_partition_start(_reader->buffer());
while (!_reader->is_buffer_empty() && should_drop_fragment(_reader->peek_buffer())) {
_reader->pop_mutation_fragment();
}
return stop_iteration(_reader->is_buffer_full() || _reader->is_end_of_stream());
});
});
}
evictable_reader_v2::evictable_reader_v2(
@@ -1935,10 +1917,62 @@ future<> evictable_reader_v2::fill_buffer() {
co_return;
}
_reader = co_await resume_or_create_reader();
co_await do_fill_buffer();
if (_reader_recreated) {
// Recreating the reader breaks snapshot isolation and creates all sorts
// of complications around the continuity of range tombstone changes,
// e.g. a range tombstone started by the previous reader object
// might not exist anymore with the new reader object.
// To avoid complications we reset the tombstone state on each reader
// recreation by emitting a null tombstone change, if we read at least
// one clustering fragment from the partition.
if (_next_position_in_partition.region() == partition_region::clustered
&& _tri_cmp(_next_position_in_partition, position_in_partition::before_all_clustered_rows()) > 0) {
push_mutation_fragment(*_schema, _permit, range_tombstone_change{position_in_partition_view::before_key(_next_position_in_partition), {}});
}
auto mf1 = co_await (*_reader)();
auto mf2 = co_await (*_reader)();
auto mf3 = co_await (*_reader)();
examine_first_fragments(mf1, mf2, mf3);
if (mf3) {
_reader->unpop_mutation_fragment(std::move(*mf3));
}
if (mf2) {
_reader->unpop_mutation_fragment(std::move(*mf2));
}
if (mf1) {
_reader->unpop_mutation_fragment(std::move(*mf1));
}
_reader_recreated = false;
} else {
co_await _reader->fill_buffer();
}
_reader->move_buffer_content_to(*this);
// Ensure that each buffer represents forward progress. Only a concern when
// the last fragment in the buffer is range tombstone change. In this case
// ensure that:
// * buffer().back().position() > _next_position_in_partition;
// * _reader.peek()->position() > buffer().back().position();
if (!is_buffer_empty() && buffer().back().is_range_tombstone_change()) {
auto* next_mf = co_await _reader->peek();
// First make sure we've made progress w.r.t. _next_position_in_partition.
while (next_mf && _tri_cmp(_next_position_in_partition, buffer().back().position()) <= 0) {
push_mutation_fragment(_reader->pop_mutation_fragment());
next_mf = co_await _reader->peek();
}
const auto last_pos = position_in_partition(buffer().back().position());
while (next_mf && _tri_cmp(last_pos, next_mf->position()) == 0) {
push_mutation_fragment(_reader->pop_mutation_fragment());
next_mf = co_await _reader->peek();
}
}
update_next_position();
_end_of_stream = _reader->is_end_of_stream() && _reader->is_buffer_empty();
_end_of_stream = _reader->is_end_of_stream();
maybe_pause(std::move(*_reader));
}

View File

@@ -292,14 +292,23 @@ class partition_snapshot_flat_reader : public flat_mutation_reader::impl, public
const std::optional<position_in_partition>& last_row,
const std::optional<position_in_partition>& last_rts,
position_in_partition_view pos) {
if (!_rt_stream.empty()) {
return _rt_stream.get_next(std::move(pos));
}
return in_alloc_section([&] () -> mutation_fragment_opt {
maybe_refresh_state(ck_range_snapshot, last_row, last_rts);
position_in_partition::less_compare rt_less(_query_schema);
// The while below moves range tombstones from partition versions
// into _rt_stream, just enough to produce the next range tombstone
// The main goal behind moving to _rt_stream is to deoverlap range tombstones
// which have the same starting position. This is not done to satisfy
// flat_mutation_reader stream requirements; the reader can emit range tombstones
// which have the same position incrementally. It is to guarantee forward
// progress in case iterators get invalidated and maybe_refresh_state()
// above needs to restore them. It does so using last_rts, which tracks
// the position of the last emitted range tombstone. All range tombstones
// with positions <= last_rts are skipped on refresh. To make progress,
// we need to make sure that all range tombstones with duplicated positions
// are emitted before maybe_refresh_state().
while (has_more_range_tombstones()
&& !rt_less(pos, peek_range_tombstone().position())
&& (_rt_stream.empty() || !rt_less(_rt_stream.peek_next().position(), peek_range_tombstone().position()))) {

View File

@@ -444,7 +444,7 @@ public:
// When throws, the cursor is invalidated and its position is not changed.
bool advance_to(position_in_partition_view lower_bound) {
maybe_advance_to(lower_bound);
return no_clustering_row_between(_schema, lower_bound, position());
return no_clustering_row_between_weak(_schema, lower_bound, position());
}
// Call only when valid.

View File

@@ -567,6 +567,20 @@ bool no_clustering_row_between(const schema& s, position_in_partition_view a, po
}
}
// Returns true if and only if there can't be any clustering_row with position >= a and < b.
// It is assumed that a <= b.
inline
bool no_clustering_row_between_weak(const schema& s, position_in_partition_view a, position_in_partition_view b) {
clustering_key_prefix::equality eq(s);
if (a.has_key() && b.has_key()) {
return eq(a.key(), b.key())
&& (a.get_bound_weight() == bound_weight::after_all_prefixed
|| b.get_bound_weight() != bound_weight::after_all_prefixed);
} else {
return !a.has_key() && !b.has_key();
}
}
// Includes all position_in_partition objects "p" for which: start <= p < end
// And only those.
class position_range {

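Worked examples for no_clustering_row_between_weak() above, assuming a <= b and a single clustering key k (this only restates the truth table implied by the code):

  * a and b both name k and a is an after_all_prefixed bound: true, since a already lies past every row prefixed by k.
  * a and b both name k and neither is an after_all_prefixed bound (say, a is the start bound of k and b is the row position of k): true.
  * a and b both name k, a is not an after_all_prefixed bound but b is: false, since the clustering row k itself falls in [a, b).
  * the keys differ, or exactly one side carries a key: false; only when neither side carries a key (for example, two partition-level bounds) does the function return true.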

@@ -10,6 +10,7 @@
#include "seastarx.hh"
#include <seastar/core/future.hh>
#include <seastar/net/socket_defs.hh>
#include <vector>
// Abstraction for a server serving some kind of user-facing protocol.


@@ -414,25 +414,6 @@ future<bool> querier_cache::evict_one() noexcept {
co_return false;
}
future<> querier_cache::evict_all_for_table(const utils::UUID& schema_id) noexcept {
for (auto ip : {&_data_querier_index, &_mutation_querier_index, &_shard_mutation_querier_index}) {
auto& idx = *ip;
for (auto it = idx.begin(); it != idx.end();) {
if (it->second->schema().id() == schema_id) {
auto reader_opt = it->second->permit().semaphore().unregister_inactive_read(querier_utils::get_inactive_read_handle(*it->second));
it = idx.erase(it);
--_stats.population;
if (reader_opt) {
co_await reader_opt->close();
}
} else {
++it;
}
}
}
co_return;
}
future<> querier_cache::stop() noexcept {
co_await _closing_gate.close();


@@ -476,11 +476,6 @@ public:
/// is empty).
future<bool> evict_one() noexcept;
/// Evict all queriers that belong to a table.
///
/// Should be used when dropping a table.
future<> evict_all_for_table(const utils::UUID& schema_id) noexcept;
/// Close all queriers and wait on background work.
///
/// Should be used before destroying the querier_cache.


@@ -9,6 +9,7 @@
#include <boost/range/adaptor/reversed.hpp>
#include "range_tombstone_list.hh"
#include "utils/allocation_strategy.hh"
#include "utils/amortized_reserve.hh"
#include <seastar/util/variant_utils.hh>
range_tombstone_list::range_tombstone_list(const range_tombstone_list& x)
@@ -96,7 +97,7 @@ void range_tombstone_list::insert_from(const schema& s,
if (cmp(end, it->position()) < 0) {
// not overlapping
if (it->tombstone().tomb == tomb && cmp(end, it->position()) == 0) {
rev.update(it, {std::move(start), std::move(start), tomb});
rev.update(it, {std::move(start), std::move(end), tomb});
} else {
auto rt = construct_range_tombstone_entry(std::move(start), std::move(end), tomb);
rev.insert(it, *rt);
@@ -375,13 +376,13 @@ range_tombstone_list::reverter::insert(range_tombstones_type::iterator it, range
range_tombstone_list::range_tombstones_type::iterator
range_tombstone_list::reverter::erase(range_tombstones_type::iterator it) {
_ops.reserve(_ops.size() + 1);
amortized_reserve(_ops, _ops.size() + 1);
_ops.emplace_back(erase_undo_op(*it));
return _dst._tombstones.erase(it);
}
void range_tombstone_list::reverter::update(range_tombstones_type::iterator it, range_tombstone&& new_rt) {
_ops.reserve(_ops.size() + 1);
amortized_reserve(_ops, _ops.size() + 1);
swap(it->tombstone(), new_rt);
_ops.emplace_back(update_undo_op(std::move(new_rt), *it));
}
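The two hunks above replace an exact _ops.reserve(_ops.size() + 1) before every undo-op append with amortized_reserve(). Reserving exactly one more element each time can force a reallocation per append and make N appends cost O(N^2) copying; an amortized reserve grows capacity geometrically instead. A minimal sketch of what such a helper can look like (hypothetical stand-in; the real utils/amortized_reserve.hh may differ):

    #include <algorithm>
    #include <cstddef>

    // Grow geometrically rather than to the exact requested size, so repeated
    // single-element growth keeps appends amortized O(1).
    template <typename Container>
    void amortized_reserve(Container& c, std::size_t n) {
        if (n <= c.capacity()) {
            return;  // already enough room
        }
        c.reserve(std::max<std::size_t>(n, c.capacity() * 2));
    }

With this shape, the reverter's _ops container roughly doubles its capacity whenever it runs out of room instead of reallocating on every push.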


@@ -12,6 +12,7 @@
#include "range_tombstone.hh"
#include "query-request.hh"
#include "utils/preempt.hh"
#include "utils/chunked_vector.hh"
#include <iosfwd>
#include <variant>
@@ -106,7 +107,7 @@ class range_tombstone_list final {
class reverter {
private:
using op = std::variant<erase_undo_op, insert_undo_op, update_undo_op>;
std::vector<op> _ops;
utils::chunked_vector<op> _ops;
const schema& _s;
protected:
range_tombstone_list& _dst;


@@ -743,6 +743,25 @@ void reader_concurrency_semaphore::clear_inactive_reads() {
}
}
future<> reader_concurrency_semaphore::evict_inactive_reads_for_table(utils::UUID id) noexcept {
inactive_reads_type evicted_readers;
auto it = _inactive_reads.begin();
while (it != _inactive_reads.end()) {
auto& ir = *it;
++it;
if (ir.reader.schema()->id() == id) {
do_detach_inactive_reader(ir, evict_reason::manual);
ir.ttl_timer.cancel();
ir.unlink();
evicted_readers.push_back(ir);
}
}
while (!evicted_readers.empty()) {
std::unique_ptr<inactive_read> irp(&evicted_readers.front());
co_await irp->reader.close();
}
}
std::runtime_error reader_concurrency_semaphore::stopped_exception() {
return std::runtime_error(format("{} was stopped", _name));
}
@@ -765,11 +784,9 @@ future<> reader_concurrency_semaphore::stop() noexcept {
co_return;
}
flat_mutation_reader_v2 reader_concurrency_semaphore::detach_inactive_reader(inactive_read& ir, evict_reason reason) noexcept {
auto reader = std::move(ir.reader);
void reader_concurrency_semaphore::do_detach_inactive_reader(inactive_read& ir, evict_reason reason) noexcept {
ir.detach();
reader.permit()._impl->on_evicted();
std::unique_ptr<inactive_read> irp(&ir);
ir.reader.permit()._impl->on_evicted();
try {
if (ir.notify_handler) {
ir.notify_handler(reason);
@@ -788,7 +805,12 @@ flat_mutation_reader_v2 reader_concurrency_semaphore::detach_inactive_reader(ina
break;
}
--_stats.inactive_reads;
return reader;
}
flat_mutation_reader_v2 reader_concurrency_semaphore::detach_inactive_reader(inactive_read& ir, evict_reason reason) noexcept {
std::unique_ptr<inactive_read> irp(&ir);
do_detach_inactive_reader(ir, reason);
return std::move(irp->reader);
}
void reader_concurrency_semaphore::evict(inactive_read& ir, evict_reason reason) noexcept {

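The new evict_inactive_reads_for_table() above works in two phases: it first unlinks every matching entry from the semaphore's inactive-read list in one pass with no suspension points, and only then closes the detached readers one by one, where each close may suspend. A minimal non-coroutine sketch of that shape with standard containers (illustrative only; the real code uses intrusive lists and seastar futures):

    #include <iostream>
    #include <iterator>
    #include <list>
    #include <string>

    struct inactive_read {
        std::string table;
    };

    int main() {
        std::list<inactive_read> inactive{{"t1"}, {"t2"}, {"t1"}};
        std::list<inactive_read> evicted;

        // Phase 1: move matching entries out in a single pass, so later
        // suspension points can never observe the list mid-iteration.
        for (auto it = inactive.begin(); it != inactive.end();) {
            auto next = std::next(it);
            if (it->table == "t1") {
                evicted.splice(evicted.end(), inactive, it);
            }
            it = next;
        }

        // Phase 2: "close" each detached reader; in the coroutine this is
        // where co_await reader.close() happens, outside the main-list walk.
        for (const auto& ir : evicted) {
            std::cout << "closing reader for table " << ir.table << '\n';
        }
    }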

@@ -186,6 +186,7 @@ private:
std::optional<future<>> _execution_loop_future;
private:
void do_detach_inactive_reader(inactive_read&, evict_reason reason) noexcept;
[[nodiscard]] flat_mutation_reader_v2 detach_inactive_reader(inactive_read&, evict_reason reason) noexcept;
void evict(inactive_read&, evict_reason reason) noexcept;
@@ -301,6 +302,9 @@ public:
/// Clear all inactive reads.
void clear_inactive_reads();
/// Evict all inactive reads that belong to the table designated by the id.
future<> evict_inactive_reads_for_table(utils::UUID id) noexcept;
private:
// The following two functions are extension points for
// future inheriting classes that need to run some stop


@@ -25,6 +25,7 @@
#include "utils/bit_cast.hh"
#include "service/migration_manager.hh"
#include "partition_range_compat.hh"
#include "gms/feature_service.hh"
#include <boost/algorithm/string/predicate.hpp>
#include <boost/algorithm/string/split.hpp>
@@ -41,6 +42,7 @@
#include <seastar/core/sleep.hh>
#include <cfloat>
#include <algorithm>
#include "idl/partition_checksum.dist.hh"
@@ -118,6 +120,13 @@ std::ostream& operator<<(std::ostream& out, row_level_diff_detect_algorithm algo
return out << "unknown";
}
static size_t get_nr_tables(const replica::database& db, const sstring& keyspace) {
auto& m = db.get_column_families_mapping();
return std::count_if(m.begin(), m.end(), [&keyspace] (auto& e) {
return e.first.first == keyspace;
});
}
static std::vector<sstring> list_column_families(const replica::database& db, const sstring& keyspace) {
std::vector<sstring> ret;
for (auto &&e : db.get_column_families_mapping()) {
@@ -443,7 +452,7 @@ float tracker::report_progress(streaming::stream_reason reason) {
for (auto& x : _repairs) {
auto& ri = x.second;
if (ri->reason == reason) {
nr_ranges_total += ri->nr_ranges_total;
nr_ranges_total += ri->ranges_size();
nr_ranges_finished += ri->nr_ranges_finished;
}
}
@@ -555,8 +564,8 @@ void repair_info::check_failed_ranges() {
rlogger.info("repair id {} on shard {} stats: repair_reason={}, keyspace={}, tables={}, ranges_nr={}, {}",
id, shard, reason, keyspace, table_names(), ranges.size(), _stats.get_stats());
if (nr_failed_ranges) {
rlogger.warn("repair id {} on shard {} failed - {} out of {} ranges failed", id, shard, nr_failed_ranges, ranges.size());
throw std::runtime_error(format("repair id {} on shard {} failed to repair {} out of {} ranges", id, shard, nr_failed_ranges, ranges.size()));
rlogger.warn("repair id {} on shard {} failed - {} out of {} ranges failed", id, shard, nr_failed_ranges, ranges_size());
throw std::runtime_error(format("repair id {} on shard {} failed to repair {} out of {} ranges", id, shard, nr_failed_ranges, ranges_size()));
} else {
if (dropped_tables.size()) {
rlogger.warn("repair id {} on shard {} completed successfully, keyspace={}, ignoring dropped tables={}", id, shard, keyspace, dropped_tables);
@@ -582,14 +591,18 @@ repair_neighbors repair_info::get_repair_neighbors(const dht::token_range& range
neighbors[range];
}
size_t repair_info::ranges_size() {
return ranges.size() * table_ids.size();
}
// Repair a single local range, multiple column families.
// Comparable to RepairSession in Origin
future<> repair_info::repair_range(const dht::token_range& range) {
future<> repair_info::repair_range(const dht::token_range& range, utils::UUID table_id) {
check_in_shutdown();
check_in_abort();
ranges_index++;
repair_neighbors neighbors = get_repair_neighbors(range);
return do_with(std::move(neighbors.all), std::move(neighbors.mandatory), [this, range] (auto& neighbors, auto& mandatory_neighbors) {
return do_with(std::move(neighbors.all), std::move(neighbors.mandatory), [this, range, table_id] (auto& neighbors, auto& mandatory_neighbors) {
auto live_neighbors = boost::copy_range<std::vector<gms::inet_address>>(neighbors |
boost::adaptors::filtered([this] (const gms::inet_address& node) { return gossiper.is_alive(node); }));
for (auto& node : mandatory_neighbors) {
@@ -598,7 +611,7 @@ future<> repair_info::repair_range(const dht::token_range& range) {
nr_failed_ranges++;
auto status = format("failed: mandatory neighbor={} is not alive", node);
rlogger.error("Repair {} out of {} ranges, id={}, shard={}, keyspace={}, table={}, range={}, peers={}, live_peers={}, status={}",
ranges_index, ranges.size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors, status);
ranges_index, ranges_size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors, status);
abort();
return make_exception_future<>(std::runtime_error(format("Repair mandatory neighbor={} is not alive, keyspace={}, mandatory_neighbors={}",
node, keyspace, mandatory_neighbors)));
@@ -608,7 +621,7 @@ future<> repair_info::repair_range(const dht::token_range& range) {
nr_failed_ranges++;
auto status = live_neighbors.empty() ? "skipped" : "partial";
rlogger.warn("Repair {} out of {} ranges, id={}, shard={}, keyspace={}, table={}, range={}, peers={}, live_peers={}, status={}",
ranges_index, ranges.size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors, status);
ranges_index, ranges_size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors, status);
if (live_neighbors.empty()) {
return make_ready_future<>();
}
@@ -617,13 +630,12 @@ future<> repair_info::repair_range(const dht::token_range& range) {
if (neighbors.empty()) {
auto status = "skipped_no_followers";
rlogger.warn("Repair {} out of {} ranges, id={}, shard={}, keyspace={}, table={}, range={}, peers={}, live_peers={}, status={}",
ranges_index, ranges.size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors, status);
ranges_index, ranges_size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors, status);
return make_ready_future<>();
}
rlogger.info("Repair {} out of {} ranges, id={}, shard={}, keyspace={}, table={}, range={}, peers={}, live_peers={}",
ranges_index, ranges.size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors);
return mm.sync_schema(db.local(), neighbors).then([this, &neighbors, range] {
return do_for_each(table_ids.begin(), table_ids.end(), [this, &neighbors, range] (utils::UUID table_id) {
ranges_index, ranges_size(), id, shard, keyspace, table_names(), range, neighbors, live_neighbors);
return mm.sync_schema(db.local(), neighbors).then([this, &neighbors, range, table_id] {
sstring cf;
try {
cf = db.local().find_column_family(table_id).schema()->cf_name();
@@ -641,7 +653,6 @@ future<> repair_info::repair_range(const dht::token_range& range) {
nr_failed_ranges++;
return make_exception_future<>(std::move(ep));
});
});
});
});
}
@@ -914,27 +925,55 @@ private:
static future<> do_repair_ranges(lw_shared_ptr<repair_info> ri) {
// repair all the ranges in limited parallelism
return parallel_for_each(ri->ranges, [ri] (auto&& range) {
return with_semaphore(ri->rs.repair_tracker().range_parallelism_semaphore(), 1, [ri, &range] {
return ri->repair_range(range).then([ri] {
if (ri->reason == streaming::stream_reason::bootstrap) {
ri->rs.get_metrics().bootstrap_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::replace) {
ri->rs.get_metrics().replace_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::rebuild) {
ri->rs.get_metrics().rebuild_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::decommission) {
ri->rs.get_metrics().decommission_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::removenode) {
ri->rs.get_metrics().removenode_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::repair) {
ri->rs.get_metrics().repair_finished_ranges_sum++;
ri->nr_ranges_finished++;
}
// Repair tables in the keyspace one after another
assert(ri->table_names().size() == ri->table_ids.size());
for (int idx = 0; idx < ri->table_ids.size(); idx++) {
auto table_id = ri->table_ids[idx];
auto table_name = ri->table_names()[idx];
// repair all the ranges in limited parallelism
rlogger.info("repair[{}]: Started to repair {} out of {} tables in keyspace={}, table={}, table_id={}, repair_reason={}",
ri->id.uuid, idx + 1, ri->table_ids.size(), ri->keyspace, table_name, table_id, ri->reason);
co_await parallel_for_each(ri->ranges, [ri, table_id] (auto&& range) {
return with_semaphore(ri->rs.repair_tracker().range_parallelism_semaphore(), 1, [ri, &range, table_id] {
return ri->repair_range(range, table_id).then([ri] {
if (ri->reason == streaming::stream_reason::bootstrap) {
ri->rs.get_metrics().bootstrap_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::replace) {
ri->rs.get_metrics().replace_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::rebuild) {
ri->rs.get_metrics().rebuild_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::decommission) {
ri->rs.get_metrics().decommission_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::removenode) {
ri->rs.get_metrics().removenode_finished_ranges++;
} else if (ri->reason == streaming::stream_reason::repair) {
ri->rs.get_metrics().repair_finished_ranges_sum++;
ri->nr_ranges_finished++;
}
rlogger.debug("repair[{}]: node ops progress bootstrap={}, replace={}, rebuild={}, decommission={}, removenode={}, repair={}",
ri->id.uuid,
ri->rs.get_metrics().bootstrap_finished_percentage(),
ri->rs.get_metrics().replace_finished_percentage(),
ri->rs.get_metrics().rebuild_finished_percentage(),
ri->rs.get_metrics().decommission_finished_percentage(),
ri->rs.get_metrics().removenode_finished_percentage(),
ri->rs.get_metrics().repair_finished_percentage());
});
});
});
});
if (ri->reason != streaming::stream_reason::repair) {
try {
auto& table = ri->db.local().find_column_family(table_id);
rlogger.debug("repair[{}]: Trigger off-strategy compaction for keyspace={}, table={}",
ri->id.uuid, table.schema()->ks_name(), table.schema()->cf_name());
table.trigger_offstrategy_compaction();
} catch (replica::no_such_column_family&) {
// Ignore dropped table
}
}
}
co_return;
}
// repair_ranges repairs a list of token ranges, each assumed to be a token
@@ -1060,33 +1099,48 @@ int repair_service::do_repair_start(sstring keyspace, std::unordered_map<sstring
cfs = std::move(cfs), ranges = std::move(ranges), options = std::move(options), ignore_nodes = std::move(ignore_nodes)] () mutable {
auto uuid = id.uuid;
auto waiting_nodes = db.local().get_token_metadata().get_all_endpoints();
std::erase_if(waiting_nodes, [&] (const auto& addr) {
return ignore_nodes.contains(addr);
});
auto participants = get_hosts_participating_in_repair(db.local(), keyspace, ranges, options.data_centers, options.hosts, ignore_nodes).get();
auto hints_timeout = std::chrono::seconds(300);
auto batchlog_timeout = std::chrono::seconds(300);
repair_flush_hints_batchlog_request req{id.uuid, participants, hints_timeout, batchlog_timeout};
bool needs_flush_before_repair = false;
if (db.local().features().cluster_supports_tombstone_gc_options()) {
for (auto& table: cfs) {
auto s = db.local().find_column_family(keyspace, table).schema();
const auto& options = s->tombstone_gc_options();
if (options.mode() == tombstone_gc_mode::repair) {
needs_flush_before_repair = true;
}
}
}
bool hints_batchlog_flushed = false;
try {
parallel_for_each(waiting_nodes, [this, uuid, &req, &participants] (gms::inet_address node) -> future<> {
rlogger.info("repair[{}]: Sending repair_flush_hints_batchlog to node={}, participants={}, started",
uuid, node, participants);
try {
auto& ms = get_messaging();
auto resp = co_await ser::partition_checksum_rpc_verbs::send_repair_flush_hints_batchlog(&ms, netw::msg_addr(node), req);
} catch (...) {
rlogger.warn("repair[{}]: Sending repair_flush_hints_batchlog to node={}, participants={}, failed: {}",
uuid, node, participants, std::current_exception());
throw;
}
}).get();
hints_batchlog_flushed = true;
} catch (...) {
rlogger.warn("repair[{}]: Sending repair_flush_hints_batchlog to participants={} failed, continue to run repair",
uuid, participants);
auto participants = get_hosts_participating_in_repair(db.local(), keyspace, ranges, options.data_centers, options.hosts, ignore_nodes).get();
if (needs_flush_before_repair) {
auto waiting_nodes = db.local().get_token_metadata().get_all_endpoints();
std::erase_if(waiting_nodes, [&] (const auto& addr) {
return ignore_nodes.contains(addr);
});
auto hints_timeout = std::chrono::seconds(300);
auto batchlog_timeout = std::chrono::seconds(300);
repair_flush_hints_batchlog_request req{id.uuid, participants, hints_timeout, batchlog_timeout};
try {
parallel_for_each(waiting_nodes, [this, uuid, &req, &participants] (gms::inet_address node) -> future<> {
rlogger.info("repair[{}]: Sending repair_flush_hints_batchlog to node={}, participants={}, started",
uuid, node, participants);
try {
auto& ms = get_messaging();
auto resp = co_await ser::partition_checksum_rpc_verbs::send_repair_flush_hints_batchlog(&ms, netw::msg_addr(node), req);
} catch (...) {
rlogger.warn("repair[{}]: Sending repair_flush_hints_batchlog to node={}, participants={}, failed: {}",
uuid, node, participants, std::current_exception());
throw;
}
}).get();
hints_batchlog_flushed = true;
} catch (...) {
rlogger.warn("repair[{}]: Sending repair_flush_hints_batchlog to participants={} failed, continue to run repair",
uuid, participants);
}
} else {
rlogger.info("repair[{}]: Skipped sending repair_flush_hints_batchlog to nodes={}", uuid, participants);
}
std::vector<future<>> repair_results;
@@ -1288,7 +1342,8 @@ future<> repair_service::bootstrap_with_repair(locator::token_metadata_ptr tmptr
auto& strat = ks.get_replication_strategy();
dht::token_range_vector desired_ranges = strat.get_pending_address_ranges(tmptr, tokens, myip).get0();
seastar::thread::maybe_yield();
nr_ranges_total += desired_ranges.size();
auto nr_tables = get_nr_tables(db.local(), keyspace_name);
nr_ranges_total += desired_ranges.size() * nr_tables;
}
container().invoke_on_all([nr_ranges_total] (repair_service& rs) {
rs.get_metrics().bootstrap_finished_ranges = 0;
@@ -1320,7 +1375,8 @@ future<> repair_service::bootstrap_with_repair(locator::token_metadata_ptr tmptr
//Collects the source that will have its range moved to the new node
std::unordered_map<dht::token_range, repair_neighbors> range_sources;
rlogger.info("bootstrap_with_repair: started with keyspace={}, nr_ranges={}", keyspace_name, desired_ranges.size());
auto nr_tables = get_nr_tables(db.local(), keyspace_name);
rlogger.info("bootstrap_with_repair: started with keyspace={}, nr_ranges={}", keyspace_name, desired_ranges.size() * nr_tables);
for (auto& desired_range : desired_ranges) {
for (auto& x : range_addresses) {
const range<dht::token>& src_range = x.first;
@@ -1461,7 +1517,8 @@ future<> repair_service::do_decommission_removenode_with_repair(locator::token_m
}
auto& ks = db.local().find_keyspace(keyspace_name);
dht::token_range_vector ranges = ks.get_effective_replication_map()->get_ranges(leaving_node);
nr_ranges_total += ranges.size();
auto nr_tables = get_nr_tables(db.local(), keyspace_name);
nr_ranges_total += ranges.size() * nr_tables;
}
if (reason == streaming::stream_reason::decommission) {
container().invoke_on_all([nr_ranges_total] (repair_service& rs) {
@@ -1485,8 +1542,9 @@ future<> repair_service::do_decommission_removenode_with_repair(locator::token_m
auto erm = ks.get_effective_replication_map();
// First get all ranges the leaving node is responsible for
dht::token_range_vector ranges = erm->get_ranges(leaving_node);
rlogger.info("{}: started with keyspace={}, leaving_node={}, nr_ranges={}", op, keyspace_name, leaving_node, ranges.size());
size_t nr_ranges_total = ranges.size();
auto nr_tables = get_nr_tables(db.local(), keyspace_name);
rlogger.info("{}: started with keyspace={}, leaving_node={}, nr_ranges={}", op, keyspace_name, leaving_node, ranges.size() * nr_tables);
size_t nr_ranges_total = ranges.size() * nr_tables;
size_t nr_ranges_skipped = 0;
std::unordered_map<dht::token_range, inet_address_vector_replica_set> current_replica_endpoints;
// Find (for each range) all nodes that store replicas for these ranges as well
@@ -1677,7 +1735,8 @@ future<> repair_service::do_rebuild_replace_with_repair(locator::token_metadata_
auto& strat = ks.get_replication_strategy();
// Okay to yield since tm is immutable
dht::token_range_vector ranges = strat.get_ranges(myip, tmptr).get0();
nr_ranges_total += ranges.size();
auto nr_tables = get_nr_tables(db.local(), keyspace_name);
nr_ranges_total += ranges.size() * nr_tables;
}
if (reason == streaming::stream_reason::rebuild) {
@@ -1702,7 +1761,8 @@ future<> repair_service::do_rebuild_replace_with_repair(locator::token_metadata_
auto& strat = ks.get_replication_strategy();
dht::token_range_vector ranges = strat.get_ranges(myip, tmptr).get0();
std::unordered_map<dht::token_range, repair_neighbors> range_sources;
rlogger.info("{}: started with keyspace={}, source_dc={}, nr_ranges={}, ignore_nodes={}", op, keyspace_name, source_dc, ranges.size(), ignore_nodes);
auto nr_tables = get_nr_tables(db.local(), keyspace_name);
rlogger.info("{}: started with keyspace={}, source_dc={}, nr_ranges={}, ignore_nodes={}", op, keyspace_name, source_dc, ranges.size() * nr_tables, ignore_nodes);
for (auto it = ranges.begin(); it != ranges.end();) {
auto& r = *it;
seastar::thread::maybe_yield();
@@ -1730,12 +1790,12 @@ future<> repair_service::do_rebuild_replace_with_repair(locator::token_metadata_
}
}
if (reason == streaming::stream_reason::rebuild) {
container().invoke_on_all([nr_ranges_skipped] (repair_service& rs) {
rs.get_metrics().rebuild_finished_ranges += nr_ranges_skipped;
container().invoke_on_all([nr_ranges_skipped, nr_tables] (repair_service& rs) {
rs.get_metrics().rebuild_finished_ranges += nr_ranges_skipped * nr_tables;
}).get();
} else if (reason == streaming::stream_reason::replace) {
container().invoke_on_all([nr_ranges_skipped] (repair_service& rs) {
rs.get_metrics().replace_finished_ranges += nr_ranges_skipped;
container().invoke_on_all([nr_ranges_skipped, nr_tables] (repair_service& rs) {
rs.get_metrics().replace_finished_ranges += nr_ranges_skipped * nr_tables;
}).get();
}
auto nr_ranges = ranges.size();

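Throughout this file, repair progress accounting switches from counting token ranges to counting (range, table) pairs: get_nr_tables() counts the tables in a keyspace, ranges_size() returns ranges.size() * table_ids.size(), and repair_range() now takes the table it repairs. As a worked example, a node that owns 3 ranges in a keyspace with 4 tables now reports nr_ranges_total = 3 * 4 = 12, and each completed (range, table) pair bumps the corresponding finished-ranges counter by one, so the percentages reported by report_progress() line up with the per-table loop in do_repair_ranges().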

@@ -200,7 +200,9 @@ public:
return _hints_batchlog_flushed;
}
future<> repair_range(const dht::token_range& range);
future<> repair_range(const dht::token_range& range, utils::UUID table_id);
size_t ranges_size();
};
// The repair_tracker tracks ongoing repair operations and their progress.

View File

@@ -347,9 +347,9 @@ private:
// Only needed for local readers, the multishard reader takes care
// of pinning tables on used shards.
std::optional<utils::phased_barrier::operation> _local_read_op;
std::optional<evictable_reader_handle> _reader_handle;
// Local reader or multishard reader to read the range
flat_mutation_reader _reader;
std::optional<evictable_reader_handle> _reader_handle;
// Current partition read from disk
lw_shared_ptr<const decorated_key_with_hash> _current_dk;
uint64_t _reads_issued = 0;


@@ -67,6 +67,7 @@ public:
uint64_t repair_finished_ranges_sum{0};
private:
seastar::metrics::metric_groups _metrics;
public:
float bootstrap_finished_percentage();
float replace_finished_percentage();
float rebuild_finished_percentage();


@@ -910,10 +910,9 @@ bool database::update_column_family(schema_ptr new_schema) {
return columns_changed;
}
future<> database::remove(const column_family& cf) noexcept {
void database::remove(const table& cf) noexcept {
auto s = cf.schema();
auto& ks = find_keyspace(s->ks_name());
co_await _querier_cache.evict_all_for_table(s->id());
_column_families.erase(s->id());
ks.metadata()->remove_column_family(s);
_ks_cf_to_uuid.erase(std::make_pair(s->ks_name(), s->cf_name()));
@@ -937,13 +936,22 @@ future<> database::drop_column_family(const sstring& ks_name, const sstring& cf_
on_internal_error(dblog, fmt::format("drop_column_family {}.{}: UUID={} not found", ks_name, cf_name, uuid));
}
dblog.debug("Dropping {}.{}", ks_name, cf_name);
co_await remove(*cf);
remove(*cf);
cf->clear_views();
co_return co_await cf->await_pending_ops().then([this, &ks, cf, tsf = std::move(tsf), snapshot] {
return truncate(ks, *cf, std::move(tsf), snapshot).finally([this, cf] {
return cf->stop();
});
}).finally([cf] {});
co_await cf->await_pending_ops();
for (auto* sem : {&_read_concurrency_sem, &_streaming_concurrency_sem, &_compaction_concurrency_sem, &_system_read_concurrency_sem}) {
co_await sem->evict_inactive_reads_for_table(uuid);
}
std::exception_ptr ex;
try {
co_await truncate(ks, *cf, std::move(tsf), snapshot);
} catch (...) {
ex = std::current_exception();
}
co_await cf->stop();
if (ex) {
std::rethrow_exception(std::move(ex));
}
}
const utils::UUID& database::find_uuid(std::string_view ks, std::string_view cf) const {
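The drop_column_family() change above uses a capture-then-rethrow shape: the failure of truncate() is stashed in an exception_ptr, cf->stop() always runs, and only then is the original error propagated. A minimal plain-C++ sketch of that shape (names are illustrative; the coroutine version needs the same structure because the cleanup is itself a co_await and cannot sit inside a catch handler):

    #include <exception>
    #include <iostream>
    #include <stdexcept>

    void truncate_step() {   // stands in for truncate(ks, *cf, ...)
        throw std::runtime_error("truncate failed");
    }

    void stop_step() {       // stands in for cf->stop(): must always run
        std::cout << "table stopped\n";
    }

    void drop_like_flow() {
        std::exception_ptr ex;
        try {
            truncate_step();
        } catch (...) {
            ex = std::current_exception();  // remember the failure
        }
        stop_step();                        // cleanup runs regardless
        if (ex) {
            std::rethrow_exception(ex);     // propagate the original error last
        }
    }

    int main() {
        try {
            drop_like_flow();
        } catch (const std::exception& e) {
            std::cout << "propagated: " << e.what() << '\n';
        }
    }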
@@ -2054,6 +2062,53 @@ future<> database::flush(const sstring& ksname, const sstring& cfname) {
return cf.flush();
}
future<> database::flush_on_all(utils::UUID id) {
return container().invoke_on_all([id] (replica::database& db) {
return db.find_column_family(id).flush();
});
}
future<> database::flush_on_all(std::string_view ks_name, std::string_view table_name) {
return flush_on_all(find_uuid(ks_name, table_name));
}
future<> database::flush_on_all(std::string_view ks_name, std::vector<sstring> table_names) {
return parallel_for_each(table_names, [this, ks_name] (const auto& table_name) {
return flush_on_all(ks_name, table_name);
});
}
future<> database::flush_on_all(std::string_view ks_name) {
return parallel_for_each(find_keyspace(ks_name).metadata()->cf_meta_data(), [this] (auto& pair) {
return flush_on_all(pair.second->id());
});
}
future<> database::snapshot_on_all(std::string_view ks_name, std::vector<sstring> table_names, sstring tag, bool skip_flush) {
co_await parallel_for_each(table_names, [this, ks_name, tag = std::move(tag), skip_flush] (const auto& table_name) -> future<> {
if (!skip_flush) {
co_await flush_on_all(ks_name, table_name);
}
co_await container().invoke_on_all([ks_name, &table_name, tag, skip_flush] (replica::database& db) {
auto& t = db.find_column_family(ks_name, table_name);
return t.snapshot(db, tag);
});
});
}
future<> database::snapshot_on_all(std::string_view ks_name, sstring tag, bool skip_flush) {
auto& ks = find_keyspace(ks_name);
co_await parallel_for_each(ks.metadata()->cf_meta_data(), [this, tag = std::move(tag), skip_flush] (const auto& pair) -> future<> {
if (!skip_flush) {
co_await flush_on_all(pair.second->id());
}
co_await container().invoke_on_all([id = pair.second, tag, skip_flush] (replica::database& db) {
auto& t = db.find_column_family(id);
return t.snapshot(db, tag);
});
});
}
future<> database::truncate(sstring ksname, sstring cfname, timestamp_func tsf) {
auto& ks = find_keyspace(ksname);
auto& cf = find_column_family(ksname, cfname);
@@ -2062,80 +2117,77 @@ future<> database::truncate(sstring ksname, sstring cfname, timestamp_func tsf)
future<> database::truncate(const keyspace& ks, column_family& cf, timestamp_func tsf, bool with_snapshot) {
dblog.debug("Truncating {}.{}", cf.schema()->ks_name(), cf.schema()->cf_name());
return with_gate(cf.async_gate(), [this, &ks, &cf, tsf = std::move(tsf), with_snapshot] () mutable -> future<> {
const auto auto_snapshot = with_snapshot && get_config().auto_snapshot();
const auto should_flush = auto_snapshot;
auto holder = cf.async_gate().hold();
// Force mutations coming in to re-acquire higher rp:s
// This creates a "soft" ordering, in that we will guarantee that
// any sstable written _after_ we issue the flush below will
// only have higher rp:s than we will get from the discard_sstable
// call.
auto low_mark = cf.set_low_replay_position_mark();
const auto auto_snapshot = with_snapshot && get_config().auto_snapshot();
const auto should_flush = auto_snapshot;
const auto uuid = cf.schema()->id();
// Force mutations coming in to re-acquire higher rp:s
// This creates a "soft" ordering, in that we will guarantee that
// any sstable written _after_ we issue the flush below will
// only have higher rp:s than we will get from the discard_sstable
// call.
auto low_mark = cf.set_low_replay_position_mark();
return _compaction_manager->run_with_compaction_disabled(&cf, [this, &cf, should_flush, auto_snapshot, tsf = std::move(tsf), low_mark]() mutable {
future<> f = make_ready_future<>();
bool did_flush = false;
if (should_flush && cf.can_flush()) {
// TODO:
// this is not really a guarantee at all that we've actually
// gotten all things to disk. Again, need queue-ish or something.
f = cf.flush();
did_flush = true;
} else {
f = cf.clear();
}
return f.then([this, &cf, auto_snapshot, tsf = std::move(tsf), low_mark, should_flush, did_flush] {
dblog.debug("Discarding sstable data for truncated CF + indexes");
// TODO: notify truncation
const auto uuid = cf.schema()->id();
return tsf().then([this, &cf, auto_snapshot, low_mark, should_flush, did_flush](db_clock::time_point truncated_at) {
future<> f = make_ready_future<>();
if (auto_snapshot) {
auto name = format("{:d}-{}", truncated_at.time_since_epoch().count(), cf.schema()->cf_name());
f = cf.snapshot(*this, name);
}
return f.then([this, &cf, truncated_at, low_mark, should_flush, did_flush] {
return cf.discard_sstables(truncated_at).then([this, &cf, truncated_at, low_mark, should_flush, did_flush](db::replay_position rp) {
// TODO: indexes.
// Note: since discard_sstables was changed to only count tables owned by this shard,
// we can get zero rp back. Changed assert, and ensure we save at least low_mark.
// #6995 - the assert below was broken in c2c6c71 and remained so for many years.
// We nowadays do not flush tables with sstables but autosnapshot=false. This means
// the low_mark assertion does not hold, because we maybe/probably never got around to
// creating the sstables that would create them.
assert(!did_flush || low_mark <= rp || rp == db::replay_position());
rp = std::max(low_mark, rp);
return truncate_views(cf, truncated_at, should_flush).then([&cf, truncated_at, rp] {
// save_truncation_record() may actually fail after we cached the truncation time
// but this is no worse than failing without caching: at least the correct time
// will be available until next reboot and a client will have to retry truncation anyway.
cf.cache_truncation_record(truncated_at);
return db::system_keyspace::save_truncation_record(cf, truncated_at, rp);
});
});
});
});
});
}).then([this, uuid] {
drop_repair_history_map_for_table(uuid);
});
});
}
std::vector<compaction_manager::compaction_reenabler> cres;
cres.reserve(1 + cf.views().size());
future<> database::truncate_views(const column_family& base, db_clock::time_point truncated_at, bool should_flush) {
return parallel_for_each(base.views(), [this, truncated_at, should_flush] (view_ptr v) {
cres.emplace_back(co_await _compaction_manager->stop_and_disable_compaction(&cf));
co_await parallel_for_each(cf.views(), [&, this] (view_ptr v) -> future<> {
auto& vcf = find_column_family(v);
return _compaction_manager->run_with_compaction_disabled(&vcf, [&vcf, truncated_at, should_flush] {
return (should_flush ? vcf.flush() : vcf.clear()).then([&vcf, truncated_at, should_flush] {
return vcf.discard_sstables(truncated_at).then([&vcf, truncated_at, should_flush](db::replay_position rp) {
return db::system_keyspace::save_truncation_record(vcf, truncated_at, rp);
});
});
});
cres.emplace_back(co_await _compaction_manager->stop_and_disable_compaction(&vcf));
});
bool did_flush = false;
if (should_flush && cf.can_flush()) {
// TODO:
// this is not really a guarantee at all that we've actually
// gotten all things to disk. Again, need queue-ish or something.
co_await cf.flush();
did_flush = true;
} else {
co_await cf.clear();
}
dblog.debug("Discarding sstable data for truncated CF + indexes");
// TODO: notify truncation
db_clock::time_point truncated_at = co_await tsf();
if (auto_snapshot) {
auto name = format("{:d}-{}", truncated_at.time_since_epoch().count(), cf.schema()->cf_name());
co_await cf.snapshot(*this, name);
}
db::replay_position rp = co_await cf.discard_sstables(truncated_at);
// TODO: indexes.
// Note: since discard_sstables was changed to only count tables owned by this shard,
// we can get zero rp back. Changed assert, and ensure we save at least low_mark.
// #6995 - the assert below was broken in c2c6c71 and remained so for many years.
// We nowadays do not flush tables with sstables but autosnapshot=false. This means
// the low_mark assertion does not hold, because we maybe/probably never got around to
// creating the sstables that would create them.
assert(!did_flush || low_mark <= rp || rp == db::replay_position());
rp = std::max(low_mark, rp);
co_await parallel_for_each(cf.views(), [this, truncated_at, should_flush] (view_ptr v) -> future<> {
auto& vcf = find_column_family(v);
if (should_flush) {
co_await vcf.flush();
} else {
co_await vcf.clear();
}
db::replay_position rp = co_await vcf.discard_sstables(truncated_at);
co_await db::system_keyspace::save_truncation_record(vcf, truncated_at, rp);
});
// save_truncation_record() may actually fail after we cached the truncation time
// but this is no worse than failing without caching: at least the correct time
// will be available until next reboot and a client will have to retry truncation anyway.
cf.cache_truncation_record(truncated_at);
co_await db::system_keyspace::save_truncation_record(cf, truncated_at, rp);
drop_repair_history_map_for_table(uuid);
}
const sstring& database::get_snitch_name() const {


@@ -839,7 +839,11 @@ public:
db::replay_position set_low_replay_position_mark();
future<> snapshot(database& db, sstring name, bool skip_flush = false);
private:
future<> snapshot(database& db, sstring name);
friend class database;
public:
future<std::unordered_map<sstring, snapshot_details>> get_snapshot_details();
/*!
@@ -1217,7 +1221,7 @@ struct string_pair_eq {
// local metadata reads
// use shard_of() for data
class database {
class database : public peering_sharded_service<database> {
friend class ::database_test;
public:
enum class table_kind {
@@ -1371,6 +1375,7 @@ private:
Future update_write_metrics(Future&& f);
void update_write_metrics_for_timed_out_write();
future<> create_keyspace(const lw_shared_ptr<keyspace_metadata>&, locator::effective_replication_map_factory& erm_factory, bool is_bootstrap, system_keyspace system);
void remove(const table&) noexcept;
public:
static utils::UUID empty_version;
@@ -1560,6 +1565,17 @@ public:
future<> flush_all_memtables();
future<> flush(const sstring& ks, const sstring& cf);
// flush a table identified by the given id on all shards.
future<> flush_on_all(utils::UUID id);
// flush a single table in a keyspace on all shards.
future<> flush_on_all(std::string_view ks_name, std::string_view table_name);
// flush a list of tables in a keyspace on all shards.
future<> flush_on_all(std::string_view ks_name, std::vector<sstring> table_names);
// flush all tables in a keyspace on all shards.
future<> flush_on_all(std::string_view ks_name);
future<> snapshot_on_all(std::string_view ks_name, std::vector<sstring> table_names, sstring tag, bool skip_flush);
future<> snapshot_on_all(std::string_view ks_name, sstring tag, bool skip_flush);
// See #937. Truncation now requires a callback to get a time stamp
// that must be guaranteed to be the same for all shards.
@@ -1568,11 +1584,9 @@ public:
/** Truncates the given column family */
future<> truncate(sstring ksname, sstring cfname, timestamp_func);
future<> truncate(const keyspace& ks, column_family& cf, timestamp_func, bool with_snapshot = true);
future<> truncate_views(const column_family& base, db_clock::time_point truncated_at, bool should_flush);
bool update_column_family(schema_ptr s);
future<> drop_column_family(const sstring& ks_name, const sstring& cf_name, timestamp_func, bool with_snapshot = true);
future<> remove(const column_family&) noexcept;
const logalloc::region_group& dirty_memory_region_group() const {
return _dirty_memory_manager.region_group();


@@ -454,12 +454,13 @@ future<> distributed_loader::handle_sstables_pending_delete(sstring pending_dele
});
}
future<> distributed_loader::populate_column_family(distributed<replica::database>& db, sstring sstdir, sstring ks, sstring cf, bool must_exist) {
return async([&db, sstdir = std::move(sstdir), ks = std::move(ks), cf = std::move(cf), must_exist] {
future<> distributed_loader::populate_column_family(distributed<replica::database>& db, sstring sstdir, sstring ks, sstring cf, allow_offstrategy_compaction do_allow_offstrategy_compaction, must_exist dir_must_exist) {
dblog.debug("Populating {}/{}/{} allow_offstrategy_compaction={} must_exist={}", ks, cf, sstdir, do_allow_offstrategy_compaction, dir_must_exist);
return async([&db, sstdir = std::move(sstdir), ks = std::move(ks), cf = std::move(cf), do_allow_offstrategy_compaction, dir_must_exist] {
assert(this_shard_id() == 0);
if (!file_exists(sstdir).get0()) {
if (must_exist) {
if (dir_must_exist) {
throw std::runtime_error(format("Populating {}/{} failed: {} does not exist", ks, cf, sstdir));
}
return;
@@ -529,12 +530,14 @@ future<> distributed_loader::populate_column_family(distributed<replica::databas
return global_table->make_sstable(sstdir, gen, sst_version, sstables::sstable::format_types::big);
}, eligible_for_reshape_on_boot).get();
directory.invoke_on_all([global_table, &eligible_for_reshape_on_boot] (sstables::sstable_directory& dir) {
return dir.do_for_each_sstable([&global_table, &eligible_for_reshape_on_boot] (sstables::shared_sstable sst) {
auto requires_offstrategy = sstables::offstrategy(!eligible_for_reshape_on_boot(sst));
directory.invoke_on_all([global_table, &eligible_for_reshape_on_boot, do_allow_offstrategy_compaction] (sstables::sstable_directory& dir) {
return dir.do_for_each_sstable([&global_table, &eligible_for_reshape_on_boot, do_allow_offstrategy_compaction] (sstables::shared_sstable sst) {
auto requires_offstrategy = sstables::offstrategy(do_allow_offstrategy_compaction && !eligible_for_reshape_on_boot(sst));
return global_table->add_sstable_and_update_cache(sst, requires_offstrategy);
}).then([&global_table] {
}).then([&global_table, do_allow_offstrategy_compaction] {
if (do_allow_offstrategy_compaction) {
global_table->trigger_offstrategy_compaction();
}
});
}).get();
});
@@ -560,11 +563,11 @@ future<> distributed_loader::populate_keyspace(distributed<replica::database>& d
auto sstdir = ks.column_family_directory(ksdir, cfname, uuid);
dblog.info("Keyspace {}: Reading CF {} id={} version={}", ks_name, cfname, uuid, s->version());
return ks.make_directory_for_column_family(cfname, uuid).then([&db, sstdir, uuid, ks_name, cfname] {
return distributed_loader::populate_column_family(db, sstdir + "/" + sstables::staging_dir, ks_name, cfname);
return distributed_loader::populate_column_family(db, sstdir + "/" + sstables::staging_dir, ks_name, cfname, allow_offstrategy_compaction::no);
}).then([&db, sstdir, ks_name, cfname] {
return distributed_loader::populate_column_family(db, sstdir + "/" + sstables::quarantine_dir, ks_name, cfname, false /* must_exist */);
return distributed_loader::populate_column_family(db, sstdir + "/" + sstables::quarantine_dir, ks_name, cfname, allow_offstrategy_compaction::no, must_exist::no);
}).then([&db, sstdir, uuid, ks_name, cfname] {
return distributed_loader::populate_column_family(db, sstdir, ks_name, cfname);
return distributed_loader::populate_column_family(db, sstdir, ks_name, cfname, allow_offstrategy_compaction::yes);
}).handle_exception([ks_name, cfname, sstdir](std::exception_ptr eptr) {
std::string msg =
format("Exception while populating keyspace '{}' with column family '{}' from file '{}': {}",


@@ -13,6 +13,7 @@
#include <seastar/core/distributed.hh>
#include <seastar/core/sstring.hh>
#include <seastar/core/file.hh>
#include <seastar/util/bool_class.hh>
#include <vector>
#include <functional>
#include <filesystem>
@@ -67,7 +68,9 @@ class distributed_loader {
static future<size_t> make_sstables_available(sstables::sstable_directory& dir,
sharded<replica::database>& db, sharded<db::view::view_update_generator>& view_update_generator,
std::filesystem::path datadir, sstring ks, sstring cf);
static future<> populate_column_family(distributed<replica::database>& db, sstring sstdir, sstring ks, sstring cf, bool must_exist = true);
using allow_offstrategy_compaction = bool_class<struct allow_offstrategy_compaction_tag>;
using must_exist = bool_class<struct must_exist_tag>;
static future<> populate_column_family(distributed<replica::database>& db, sstring sstdir, sstring ks, sstring cf, allow_offstrategy_compaction, must_exist = must_exist::yes);
static future<> populate_keyspace(distributed<replica::database>& db, sstring datadir, sstring ks_name);
static future<> cleanup_column_family_temp_sst_dirs(sstring sstdir);
static future<> handle_sstables_pending_delete(sstring pending_deletes_dir);

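populate_column_family() drops its trailing bool must_exist = true parameter in favour of two bool_class-style flag types, so call sites read allow_offstrategy_compaction::no / must_exist::yes and the two booleans can no longer be swapped silently. A minimal standard-C++ sketch of the idiom (a hypothetical stand-in for seastar's bool_class from <seastar/util/bool_class.hh>):

    #include <iostream>

    // Hypothetical stand-in for seastar::bool_class<Tag>: one distinct type per flag.
    template <typename Tag>
    struct flag {
        bool value;
        explicit operator bool() const { return value; }
        static const flag yes;
        static const flag no;
    };
    template <typename Tag> const flag<Tag> flag<Tag>::yes{true};
    template <typename Tag> const flag<Tag> flag<Tag>::no{false};

    using allow_offstrategy_compaction = flag<struct allow_offstrategy_compaction_tag>;
    using must_exist = flag<struct must_exist_tag>;

    void populate(allow_offstrategy_compaction compact, must_exist exists) {
        std::cout << "offstrategy=" << bool(compact) << " must_exist=" << bool(exists) << '\n';
    }

    int main() {
        // The type names the meaning at each call site; passing the flags in
        // the wrong order is a compile error rather than a silent bug.
        populate(allow_offstrategy_compaction::no, must_exist::yes);
    }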

@@ -9,6 +9,7 @@
#include <seastar/core/seastar.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/coroutine/exception.hh>
#include <seastar/util/closeable.hh>
#include "replica/database.hh"
@@ -662,11 +663,21 @@ table::try_flush_memtable_to_sstable(lw_shared_ptr<memtable> old, sstable_write_
[] (const dht::decorated_key&) { return api::min_timestamp; });
}
mutation_fragment* fragment = co_await reader.peek();
if (!fragment) {
std::exception_ptr err;
try {
mutation_fragment* fragment = co_await reader.peek();
if (!fragment) {
co_await reader.close();
_memtables->erase(old);
co_return stop_iteration::yes;
}
} catch (...) {
err = std::current_exception();
}
if (err) {
tlogger.error("failed to flush memtable for {}.{}: {}", old->schema()->ks_name(), old->schema()->cf_name(), err);
co_await reader.close();
_memtables->erase(old);
co_return stop_iteration::yes;
co_return stop_iteration(_async_gate.is_closed());
}
auto f = consumer(upgrade_to_v2(std::move(reader)));
@@ -1426,70 +1437,86 @@ future<> table::write_schema_as_cql(database& db, sstring dir) const {
}
future<> table::snapshot(database& db, sstring name, bool skip_flush) {
future<> table::snapshot(database& db, sstring name) {
auto jsondir = _config.datadir + "/snapshots/" + name;
tlogger.debug("snapshot {}: skip_flush={}", jsondir, skip_flush);
auto f = skip_flush ? make_ready_future<>() : flush();
return f.then([this, &db, jsondir = std::move(jsondir)]() {
return with_semaphore(_sstable_deletion_sem, 1, [this, &db, jsondir = std::move(jsondir)]() {
auto tables = boost::copy_range<std::vector<sstables::shared_sstable>>(*_sstables->all());
return do_with(std::move(tables), std::move(jsondir), [this, &db] (std::vector<sstables::shared_sstable>& tables, const sstring& jsondir) {
return io_check([&jsondir] { return recursive_touch_directory(jsondir); }).then([this, &db, &jsondir, &tables] {
return max_concurrent_for_each(tables, db.get_config().initial_sstable_loading_concurrency(), [&db, &jsondir] (sstables::shared_sstable sstable) {
return with_semaphore(db.get_sharded_sst_dir_semaphore().local(), 1, [&jsondir, sstable] {
return io_check([sstable, &dir = jsondir] {
return sstable->create_links(dir);
});
});
});
}).then([&jsondir, &tables] {
return io_check(sync_directory, jsondir);
}).finally([this, &tables, &db, &jsondir] {
auto shard = std::hash<sstring>()(jsondir) % smp::count;
std::unordered_set<sstring> table_names;
for (auto& sst : tables) {
auto f = sst->get_filename();
auto rf = f.substr(sst->get_dir().size() + 1);
table_names.insert(std::move(rf));
}
return smp::submit_to(shard, [requester = this_shard_id(), &jsondir, this, &db,
tables = std::move(table_names), datadir = _config.datadir] {
tlogger.debug("snapshot {}", jsondir);
if (!pending_snapshots.contains(jsondir)) {
pending_snapshots.emplace(jsondir, make_lw_shared<snapshot_manager>());
}
auto snapshot = pending_snapshots.at(jsondir);
for (auto&& sst: tables) {
snapshot->files.insert(std::move(sst));
}
auto sstable_deletion_guard = co_await get_units(_sstable_deletion_sem, 1);
std::exception_ptr ex;
snapshot->requests.signal(1);
auto my_work = make_ready_future<>();
if (requester == this_shard_id()) {
my_work = snapshot->requests.wait(smp::count).then([&jsondir,
&db, snapshot, this] {
// this_shard_id() here == requester == this_shard_id() before submit_to() above,
// so the db reference is still local
return write_schema_as_cql(db, jsondir).handle_exception([&jsondir](std::exception_ptr ptr) {
tlogger.error("Failed writing schema file in snapshot in {} with exception {}", jsondir, ptr);
return make_ready_future<>();
}).finally([&jsondir, snapshot] () mutable {
return seal_snapshot(jsondir).handle_exception([&jsondir] (std::exception_ptr ex) {
tlogger.error("Failed to seal snapshot in {}: {}. Ignored.", jsondir, ex);
}).then([snapshot] {
snapshot->manifest_write.signal(smp::count);
return make_ready_future<>();
});
});
});
}
return my_work.finally([snapshot] {
return snapshot->manifest_write.wait(1);
}).then([snapshot] {});
std::vector<sstables::shared_sstable> tables;
try {
tables = boost::copy_range<std::vector<sstables::shared_sstable>>(*_sstables->all());
co_await io_check([&jsondir] { return recursive_touch_directory(jsondir); });
co_await max_concurrent_for_each(tables, db.get_config().initial_sstable_loading_concurrency(), [&db, &jsondir] (sstables::shared_sstable sstable) {
return with_semaphore(db.get_sharded_sst_dir_semaphore().local(), 1, [&jsondir, sstable] {
return io_check([sstable, &dir = jsondir] {
return sstable->create_links(dir);
});
});
});
});
co_await io_check(sync_directory, jsondir);
} catch (...) {
ex = std::current_exception();
}
auto shard = std::hash<sstring>()(jsondir) % smp::count;
std::unordered_set<sstring> table_names;
try {
for (auto& sst : tables) {
auto f = sst->get_filename();
auto rf = f.substr(sst->get_dir().size() + 1);
table_names.insert(std::move(rf));
}
} catch (...) {
ex = std::current_exception();
}
co_await smp::submit_to(shard, [requester = this_shard_id(), &jsondir, this, &db,
tables = std::move(table_names), datadir = _config.datadir, ex = std::move(ex)] () mutable -> future<> {
if (!pending_snapshots.contains(jsondir)) {
try {
pending_snapshots.emplace(jsondir, make_lw_shared<snapshot_manager>());
} catch (...) {
// abort since the process will hang if we can't coordinate
// snapshot across shards, similar to failing to allocate a continuation.
tlogger.error("Failed allocating snapshot_manager: {}. Aborting.", std::current_exception());
abort();
}
}
auto snapshot = pending_snapshots.at(jsondir);
try {
for (auto&& sst: tables) {
snapshot->files.insert(std::move(sst));
}
} catch (...) {
ex = std::current_exception();
}
tlogger.debug("snapshot {}: signal requests", jsondir);
snapshot->requests.signal(1);
if (requester == this_shard_id()) {
tlogger.debug("snapshot {}: waiting for all shards", jsondir);
co_await snapshot->requests.wait(smp::count);
// this_shard_id() here == requester == this_shard_id() before submit_to() above,
// so the db reference is still local
tlogger.debug("snapshot {}: writing schema.cql", jsondir);
co_await write_schema_as_cql(db, jsondir).handle_exception([&] (std::exception_ptr ptr) {
tlogger.error("Failed writing schema file in snapshot in {} with exception {}", jsondir, ptr);
ex = std::move(ptr);
});
tlogger.debug("snapshot {}: seal_snapshot", jsondir);
co_await seal_snapshot(jsondir).handle_exception([&] (std::exception_ptr ptr) {
tlogger.error("Failed to seal snapshot in {}: {}.", jsondir, ptr);
ex = std::move(ptr);
});
snapshot->manifest_write.signal(smp::count);
}
tlogger.debug("snapshot {}: waiting for manifest on behalf of shard {}", jsondir, requester);
co_await snapshot->manifest_write.wait(1);
tlogger.debug("snapshot {}: done: error={}", jsondir, ex);
if (ex) {
std::rethrow_exception(std::move(ex));
}
});
}
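table::snapshot() above becomes a coroutine, but its cross-shard rendezvous is unchanged: every shard signals snapshot->requests once, the requesting shard waits for smp::count signals, writes schema.cql and seals the manifest, then signals snapshot->manifest_write smp::count times, and every shard waits for one of those signals before returning. A minimal sketch of that rendezvous with standard threads and semaphores (illustrative only; the real code uses seastar semaphores and smp::submit_to):

    #include <iostream>
    #include <semaphore>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int smp_count = 4;
        std::counting_semaphore<64> requests(0);
        std::counting_semaphore<64> manifest_write(0);

        auto shard_work = [&](int shard, bool requester) {
            // ... link sstables into the snapshot directory ...
            requests.release(1);                       // snapshot->requests.signal(1)
            if (requester) {
                for (int i = 0; i < smp_count; ++i) {  // requests.wait(smp::count)
                    requests.acquire();
                }
                std::cout << "shard " << shard << " writes schema.cql and the manifest\n";
                manifest_write.release(smp_count);     // manifest_write.signal(smp::count)
            }
            manifest_write.acquire();                  // manifest_write.wait(1)
        };

        std::vector<std::thread> shards;
        for (int i = 0; i < smp_count; ++i) {
            shards.emplace_back(shard_work, i, i == 0);
        }
        for (auto& t : shards) {
            t.join();
        }
    }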
@@ -1571,13 +1598,14 @@ bool table::can_flush() const {
}
future<> table::clear() {
auto permits = co_await _config.dirty_memory_manager->get_all_flush_permits();
if (_commitlog) {
for (auto& t : *_memtables) {
_commitlog->discard_completed_segments(_schema->id(), t->get_and_discard_rp_set());
}
}
_memtables->clear_and_add();
return _cache.invalidate(row_cache::external_updater([] { /* There is no underlying mutation source */ }));
co_await _cache.invalidate(row_cache::external_updater([] { /* There is no underlying mutation source */ }));
}
// NOTE: does not need to be futurized, but might eventually, depending on
@@ -2235,7 +2263,7 @@ std::chrono::milliseconds table::get_coordinator_read_latency_percentile(double
void
table::enable_auto_compaction() {
// FIXME: unmute backlog. turn table backlog back on.
// XXX: unmute backlog. turn table backlog back on.
// see table::disable_auto_compaction() notes.
_compaction_disabled_by_user = false;
trigger_compaction();
@@ -2243,7 +2271,7 @@ table::enable_auto_compaction() {
future<>
table::disable_auto_compaction() {
// FIXME: mute backlog. When we disable background compactions
// XXX: mute backlog. When we disable background compactions
// for the table, we must also disable current backlog of the
// table compaction strategy that contributes to the scheduling
// group resources prioritization.
@@ -2270,9 +2298,8 @@ table::disable_auto_compaction() {
// - it will break computation of major compaction descriptor
// for new submissions
_compaction_disabled_by_user = true;
return with_gate(_async_gate, [this] {
return compaction_manager().stop_ongoing_compactions("disable auto-compaction", this, sstables::compaction_type::Compaction);
});
// FIXME: stop ongoing compactions
return make_ready_future<>();
}
flat_mutation_reader

Some files were not shown because too many files have changed in this diff.