36915 Commits

Author SHA1 Message Date
Pavel Emelyanov
42a930da3b Update seastar submodule
* seastar e45cef9c...1b299004 (3):
  > rpc: Abort server connection streams on stop
  > rpc: Do not register stream to dying parent
  > rpc: Fix client-side stream registration race

refs: #13100

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-06 12:35:37 +03:00
Takuya ASADA
660d68953d scylla_fstrim_setup: start scylla-fstrim.timer on setup
Currently, scylla_fstrim_setup does not start scylla-fstrim.timer and
just enables it, so the timer starts only after rebooted.
This is incorrect behavior, we start start it during the setup.

Also, unmask is unnecessary for enabling the timer.

Fixes #14249

Closes #14252

(cherry picked from commit c70a9cbffe)

Closes #14422
2023-07-18 16:03:53 +03:00
Botond Dénes
bff9b459ef repair: Release permit earlier when the repair_reader is done
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677

(cherry picked from commit 1b577e0414)
2023-07-14 18:18:05 +03:00
Anna Stuchlik
4c77b86f26 doc: document the minimum_keyspace_rf option
Fixes https://github.com/scylladb/scylladb/issues/14598

This commit adds the description of minimum_keyspace_rf
to the CREATE KEYSPACE section of the docs.
(When we have the reference section for all ScyllaDB options,
an appropriate link should be added.)

This commit must be backported to branch-5.3, because
the feature is already on that branch.

Closes #14686

(cherry picked from commit 9db9dedb41)
2023-07-14 15:48:28 +03:00
Marcin Maliszkiewicz
fb0afb04ae alternator: close output_stream when exception is thrown during response streaming
When exception occurs and we omit closing output_stream then the whole process is brought down
by an assertion in ~output_stream.

Fixes https://github.com/scylladb/scylladb/issues/14453
Relates https://github.com/scylladb/scylladb/issues/14403

Closes #14454

(cherry picked from commit 6424dd5ec4)
2023-07-13 22:48:36 +03:00
Nadav Har'El
bdb93af423 Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz
Adds preemption points used in Alternator when:
 - sending bigger json response
 - building results for BatchGetItem

I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to:

    auto start  = std::chrono::steady_clock::now();
    do { } while ((std::chrono::steady_clock::now() - start) < 100ms);

and seeing reactor stall times. After the patch they
were not increasing while before they kept building up due to no preemption.

Refs #7926
Fixes #13689

Closes #12351

* github.com:scylladb/scylladb:
  alternator: remove redundant flush call in make_streamed
  utils: yield when streaming json in print()
  alternator: yield during BatchGetItem operation

(cherry picked from commit d2e089777b)
2023-07-13 22:48:30 +03:00
Avi Kivity
2fffaf36c4 Merge ' message: match unknown tenants to the default tenant' from Botond Dénes
On connection setup, the isolation cookie of the connection is matched to the appropriate scheduling group. This is achieved by iterating over the known statement tenant connection types as well as the system connections and choosing the one with a matching name.

If a match is not found, it is assumed that the cluster is upgraded and the remote node has a scheduling group the local one doesn't have. To avoid demoting a scheduling group of unknown importance, in this case the default scheduling group is chosen.

This is problematic when upgrading an OSS cluster to an enterprise version, as the scheduling groups of the enterprise service-levels will match none of the statement tenants and will hence fall-back to the default scheduling group. As a consequence, while the cluster is mixed, user workload on old (OSS) nodes, will be executed under the system scheduling group and concurrency semaphore. Not only does this mean that user workloads are directly competing for resources with system ones, but the two workloads are now sharing the semaphore too, reducing the available throughput. This usually manifests in queries timing out on the old (OSS) nodes in the cluster.

This PR proposes to fix this, by recognizing that the unknown scheduling group is in fact a tenant this node doesn't know yet, and matching it with the default statement tenant. With this, order should be restored, with service-level connections being recognized as user connections and being executed in the statement scheduling group and the statement (user) concurrency semaphore.

I tested this manually, by creating a cluster of 2 OSS nodes, then upgrading one of the nodes to enterprise and verifying (with extra logging) that service level connections are matched to the default statement tenant after the PR and they indeed match to the default scheduling group before.

Fixes: #13841
Fixes: #12552

Closes #13843

* github.com:scylladb/scylladb:
  message: match unknown tenants to the default tenant
  message: generalize per-tenant connection types

(cherry picked from commit a7c2c9f92b)
2023-07-12 15:31:30 +03:00
Avi Kivity
dab150f3d8 Revert "Merge 'cql: update permissions when creating/altering a function/keyspace' from Wojciech Mitros"
This reverts commit 52e4edfd5e, reversing
changes made to d2d53fc1db. The associated test
fails with about 10% probablity, which blocks other work.

Fixes #14395

Closes #14662
2023-07-12 13:23:17 +03:00
Tomasz Grabiec
7b6db3b69a Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb/scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes scylladb/scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606cd6)

Closes #14647
2023-07-12 09:58:59 +03:00
Botond Dénes
b096c0d97d Merge '[backport 5.3] view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept mutation objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into mutations. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial mutation objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next mutation object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Backported from c25201c1a3. view_build_test.cc needed some tiny adjustments for the backport.

Closes #14622
Fixes #14503

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-11 15:05:56 +03:00
Michał Chojnowski
55e157be4d test: view_build_test: add range tombstones to test_view_update_generator_buffering
This patch adds a full-range tombstone to the compacted mutation.
This raises the coverage of the test. In particular, it reproduces
issue #14503, which should have been caught by this test, but wasn't.
2023-07-11 10:49:40 +02:00
Michał Chojnowski
75593a6178 test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
A random mutation test for view_updating_consumer's buffering logic.
Reproduces #14503.
2023-07-11 10:49:40 +02:00
Michał Chojnowski
bbbc4aafef view_updating_consumer: make buffer limit a variable
The limit doesn't change at runtime, but we this patch makes it variable for
unit testing purposes.
2023-07-11 10:48:34 +02:00
Michał Chojnowski
b74411f0ed view: fix range tombstone handling on flushes in view_updating_consumer
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of
mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to
convert the fragment stream into `mutation`s. This is done by piping
the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might
have to be split into multiple partial `mutation` objects.
view_update_consumer does that, but in improper way -- when the
split/flush happens inside an active range tombstone, the range
tombstone isn't closed properly. This is illegal, and triggers an
internal error.

This patch fixes the problem by closing the active range tombstone
(and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position.
This is not necessary for correctness -- for example we could delay
all processing of the range tombstone until we see its end
bound -- but it seems like the most natural semantic.

Fixes #14503
2023-07-11 10:48:34 +02:00
Calle Wilund
995ffd6ee0 storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14294

(cherry picked from commit f18e967939)
2023-07-11 11:42:28 +03:00
Piotr Dulikowski
88e843c9db combined: mergers: remove recursion in operator()()
In mutation_reader_merger and clustering_order_reader_merger, the
operator()() is responsible for producing mutation fragments that will
be merged and pushed to the combined reader's buffer. Sometimes, it
might have to advance existing readers, open new and / or close some
existing ones, which requires calling a helper method and then calling
operator()() recursively.

In some unlucky circumstances, a stack overflow can occur:

- Readers have to be opened incrementally,
- Most or all readers must not produce any fragments and need to report
  end of stream without preemption,
- There has to be enough readers opened within the lifetime of the
  combined reader (~500),
- All of the above needs to happen within a single task quota.

In order to prevent such a situation, the code of both reader merger
classes were modified not to perform recursion at all. Most of the code
of the operator()() was moved to maybe_produce_batch which does not
recur if it is not possible for it to produce a fragment, instead it
returns std::nullopt and operator()() calls this method in a loop via
seastar::repeat_until_value.

A regression test is added.

Fixes: scylladb/scylladb#14415

Closes #14452

(cherry picked from commit ee9bfb583c)

Closes #14606
2023-07-11 11:38:33 +03:00
Takuya ASADA
8cfeb6f509 test/perf/perf_fast_forward: avoid allocating AIO slots on startup
On main.cc, we have early commands which want to run prior to initialize
Seastar.
Currently, perf_fast_forward is breaking this, since it defined
"app_template app" on global variable.
To avoid that, we should defer running app_template's constructor in
scylla_fast_forward_main().

Fixes #13945

Closes #14026

(cherry picked from commit 45ef09218e)
2023-07-10 15:34:06 +03:00
Jan Ciolek
7adc9aa50c forward_service: fix forgetting case-sensitivity in aggregates
There was a bug that caused aggregates to fail when
used on column-sensitive columns.

For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.

The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.

Fixes: https://github.com/scylladb/scylladb/issues/14307

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 7fca350075)
2023-07-10 15:22:25 +03:00
Botond Dénes
15d4475870 Merge 'doc: fix rollback in the 4.3-to-2021.1, 5.0-to-2022.1, and 5.1-to-2022.2 upgrade guides' from Anna Stuchlik
This PR fixes the Restore System Tables section of the upgrade guides by adding a command to clean upgraded SStables during rollback or adding the entire section to restore system tables (which was missing from the older documents).

This PR fixes is a bug and must be backported to branch-5.3, branch-5.2., and branch-5.1.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

- [x]  5.1-to-2022.2 - update command (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  5.0-to-2022.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)
- [x]  4.3-to-2021.1 - add "Restore system tables" to rollback (backport to branch-5.3, branch-5.2, and branch-5.1)

(see https://github.com/scylladb/scylla-enterprise/issues/3046#issuecomment-1604232864)

Closes #14444

* github.com:scylladb/scylladb:
  doc: fix rollback in 4.3-to-2021.1 upgrade guide
  doc: fix rollback in 5.0-to-2022.1 upgrade guide
  doc: fix rollback in 5.1-to-2022.2 upgrade guide

(cherry picked from commit 8a7261fd70)
2023-07-10 15:15:48 +03:00
Kamil Braun
bbe5c323a9 storage_proxy: query_partition_key_range_concurrent: don't access empty range
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.

Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.

This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.

The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.

Also change the `assert` to `on_internal_error`.

Fixes #14284

Closes #14300

(cherry picked from commit 732feca115)

Backport note: the original `assert` was never added to branch-5.3, but
the fix is still applicable, so I backported the fix and the
`on_internal_error` check.
2023-07-05 13:06:04 +02:00
Mikołaj Grzebieluch
82f70a1c19 raft topology: wait_for_peers_to_enter_synchronize_state doesn't need to resolve all IPs
Another node can stop after it joined the group0 but before it advertised itself
in gossip. `get_inet_addrs` will try to resolve all IPs and
`wait_for_peers_to_enter_synchronize_state` will loop indefinitely.

But `wait_for_peers_to_enter_synchronize_state` can return early if one of
the nodes confirms that the upgrade procedure has finished. For that, it doesn't
need the IPs of all group 0 members - only the IP of some nodes which can do
the confirmation.

This commit restructures the code so that IPs of nodes are resolved inside the
`max_concurrent_for_each` that `wait_for_peers_to_enter_synchronize_state` performs.
Then, even if some IPs won't be resolved, but one of the nodes confirms a
successful upgrade, we can continue.

Fixes #13543

(cherry picked from commit a45e0765e4)
2023-07-05 13:02:22 +02:00
Yaron Kaikov
0668dc25df release: prepare for 5.3.0-rc1 2023-07-02 11:28:58 +00:00
Anna Stuchlik
148655dc21 doc: fix rollback in 5.2-to-2023.1 upgrade guide
This commit fixes the Restore System Tables section
in the 5.2-to-2023.1 upgrade guide by adding a command
to clean upgraded SStables during rollback.

This is a bug (an incomplete command) and must be
backported to branch-5.3 and branch-5.2.

Refs: https://github.com/scylladb/scylla-enterprise/issues/3046

Closes #14373

(cherry picked from commit f4ae2c095b)
2023-06-29 12:07:17 +03:00
Botond Dénes
c89e5f06ba Merge 'readers: evictable_reader: don't accidentally consume the entire partition' from Kamil Braun
The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between.

The code guranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction.

So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491.

There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order.

Fix the comparison and adjust one of the tests (added in #13563) to detect this case.

After the fix, the evictable reader starts generating some redundant (but expected) range tombstone change fragments since it's now being paused and resumed. For this we need to adjust mutation source tests which were a bit too specific. We modify `flat_mutation_reader_assertions` to squash the redundant `r_t_c`s.

Fixes #13491

Closes #14375

* github.com:scylladb/scylladb:
  readers: evictable_reader: don't accidentally consume the entire partition
  test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position

(cherry picked from commit 586102b42e)
2023-06-29 12:04:13 +03:00
Anna Stuchlik
b38d169f55 doc: add Ubuntu 22 to 2021.1 OS support
Fixes https://github.com/scylladb/scylla-enterprise/issues/3036

This commit adds support for Ubuntu 22.04 to the list
of OSes supported by ScyllaDB Enterprise 2021.1.

This commit fixex a bug and must be backported to
branch-5.3 and branch-5.2.

Closes #14372

(cherry picked from commit 74fc69c825)
2023-06-26 13:42:26 +03:00
Anna Stuchlik
5ac40ed1a8 doc: udpate the OSS docs landing page
Fixes https://github.com/scylladb/scylladb/issues/14333

This commit replaces the documentation landing page with
the Open Source-only documentation landing page.

This change is required as now there is a separate landing
page for the ScyllaDB documentation, so the page is duplicated,
creating bad user experience.

(cherry picked from commit f60f89df17)

Closes #14371
2023-06-23 14:01:19 +02:00
Avi Kivity
37e6e65211 Update seastar submodule (default priority class shares)
* seastar f94b1bb9cb...e45cef9ce8 (1):
  > reactor: change shares for default IO class from 1 to 200

Fixes #13753.
2023-06-21 21:21:29 +03:00
Avi Kivity
994645c03b Switch seastar submodule to scylla-seastar.git
This allows us to start backporting Seastar patches.

Ref #13753.
2023-06-21 21:20:23 +03:00
Botond Dénes
51ed9a0ec0 Merge 'doc: add OS support for ScyllaDB 5.3' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/14084

This commit adds OS support for version 5.3 to the table on the OS Support by Linux Distributions and Version page.

Closes #14228

* github.com:scylladb/scylladb:
  doc: remove OS support for outdated ScyllaDB versions 2.x and 3.x
  doc: add OS support for ScyllaDB 5.3

(cherry picked from commit aaac455ebe)
2023-06-15 10:33:28 +03:00
Raphael S. Carvalho
fa689c811e compaction: Fix sstable cleanup after resharding on refresh
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)

Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.

Fix is about not inserting shared sstables into cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor it should because
resharding itself cleaned up its input files).

Dtest: https://github.com/scylladb/scylla-dtest/pull/3206

Fixes #14001.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14147

(cherry picked from commit 156d771101)
2023-06-13 13:22:36 +03:00
Anna Stuchlik
18774b90a7 doc: remove support for Ubuntu 18
Fixes https://github.com/scylladb/scylladb/issues/14097

This commit removes support for Ubuntu 18 from
platform support for ScyllaDB Enterprise 2023.1.

The update is in sync with the change made for
ScyllaDB 5.2.

This commit must be backported to branch-5.2 and
branch-5.3.

Closes #14118

(cherry picked from commit b7022cd74e)
2023-06-13 12:06:04 +03:00
Avi Kivity
b20a85d651 Merge 'multishard_mutation_query: make reader_context::lookup_readers() exception safe' from Botond Dénes
With regards to closing the looked-up querier if an exception is thrown. In particular, this requires closing the querier if a semaphore mismatch is detected. Move the table lookup above the line where the querier is looked up, to avoid having to handle the exception from it. As a consequence of closing the querier on the error path, the lookup lambda has to be made a coroutine. This is sad, but this is executed once per page, so its cost should be insignificant when spread over an
entire page worth of work.

Also add a unit test checking that the mismatch is detected in the first place and that readers are closed.

Fixes: #13784

Closes #13790

* github.com:scylladb/scylladb:
  test/boost/database_test: add unit test for semaphore mismatch on range scans
  partition_slice_builder: add set_specific_ranges()
  multishard_mutation_query: make reader_context::lookup_readers() exception safe
  multishard_mutation_query: lookup_readers(): make inner lambda a coroutine

(cherry picked from commit 1c0e8c25ca)
2023-06-07 13:29:48 +03:00
Michał Chojnowski
1f0f3a4464 data_dictionary: fix forgetting of UDTs on ALTER KEYSPACE
Due to a simple programming oversight, one of keyspace_metadata
constructors is using empty user_types_metadata instead of the
passed one. Fix that.

Fixes #14139

Closes #14143

(cherry picked from commit 1a521172ec)
2023-06-06 21:52:23 +03:00
Kamil Braun
698ac3ac4e auth: don't use infinite timeout in default_role_row_satisfies query
A long long time ago there was an issue about removing infinite timeouts
from distributed queries: #3603. There was also a fix:
620e950fc8. But apparently some queries
escaped the fix, like the one in `default_role_row_satisfies`.

With the right conditions and timing this query may cause a node to hang
indefinitely on shutdown. A node tries to perform this query after it
starts. If we kill another node which is required to serve this query
right before that moment, the query will hang; when we try to shutdown
the querying node, it will wait for the query to finish (it's a
background task in auth service), which it never does due to infinite
timeout.

Use the same timeout configuration as other queries in this module do.

Fixes #13545.

Closes #14134

(cherry picked from commit f51312e580)
2023-06-06 19:38:53 +03:00
Raphael S. Carvalho
f975b7890e compaction: Fix incremental compaction for sstable cleanup
After c7826aa910, sstable runs are cleaned up together.

The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.

Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.

Therefore cleanup is affected by the 100% space overhead.

To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.

New unit test reproduces the failure, and passes with the fix.

Fixes #14035.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14038

(cherry picked from commit 23443e0574)
2023-06-06 11:59:31 +03:00
Tzach Livyatan
5da2489e0e Remove Ubuntu 18.04 support from 5.2
Ubuntu [18.04 will be soon out of standard support](https://ubuntu.com/blog/18-04-end-of-standard-support), and can be removed from 5.2 supported list
https://github.com/scylladb/scylla-pkg/issues/3346

Closes #13529

(cherry picked from commit e655060429)
2023-05-30 16:25:15 +03:00
Anna Stuchlik
d9961fc6a2 doc: add the upgrade guide from 5.2 to 5.3
Fixes https://github.com/scylladb/scylladb/issues/13288

This commit adds the upgrade guide from ScyllaDB Open Source 5.2
to 5.3.
The details of the metric update will be added with a separate commit.

Closes #13960

(cherry picked from commit 508b68377e)
2023-05-24 10:01:26 +02:00
Beni Peled
3c3621db07 release: prepare for 5.3.0-rc0 scylla-5.3.0-rc0 2023-05-21 10:38:17 +03:00
Botond Dénes
3b424e391b Merge 'perform_cleanup: wait until all candidates are cleaned up' from Benny Halevy
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.

Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.

Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.

Fixes #9559

Closes #13812

* github.com:scylladb/scylladb:
  table: signal compaction_manager when staging sstables become eligible for cleanup
  compaction_manager: perform_cleanup: wait until all candidates are cleaned up
  compaction_manager: perform_cleanup: perform_offstrategy if needed
  compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
  sstable_set: add for_each_sstable_gently* helpers
2023-05-19 12:35:59 +03:00
Kefu Chai
031f770557 install.sh: use scylla-jmx for detecting JRE
now that scylla-jmx has a dedicated script for detecting the existence
of OpenJDK, and this script is included in the unified package, let's
just leverage it instead of repeating it in `install.sh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13514
2023-05-19 11:22:57 +03:00
Kefu Chai
8bb1f15542 test: sstable_3_x_test: avoid using helper using generation_type::int_t
this change is one of the series which drops most of the callers
using SSTable generation as integer. as the generation of SSTable
is but an identifier, we should not use it as an integer out of
generation_type's implementation. so, in this change, instead of
using `generation_type::int_t` in the helper functions, we just
pass `generation_type` in place of integer.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13931
2023-05-19 11:21:35 +03:00
Tomasz Grabiec
05be5e969b migration_manager: Fix snapshot transfer failing if TABLETS feature is not enabled
Without the feature, the system schema doesn't have the table, and the
read will fail with:

   Transferring snapshot to ... failed with: seastar::rpc::remote_verb_error (Can't find a column family tablets in keyspace system)

We should not attempt to read tablet metadata in the experimental
feature is not enabled.

Fixes #13946
Closes #13947
2023-05-19 09:58:56 +02:00
Botond Dénes
c2aee26278 Merge 'Keep sstables garbage collection in sstable_directory' from Pavel Emelyanov
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.

Once g.c. code moved, it allows to clean the class sstable list of static helpers a bit.

refs: #13024
refs: #13020
refs: #12707

Closes #13767

* github.com:scylladb/scylladb:
  sstable: Toss tempdir extension usage
  sstable: Drop pending_delete_dir_basename()
  sstable: Drop is_pending_delete_dir() helper
  sstable_directory: Make garbage_collect() non-static
  sstable_directory: Move deletion log exists check
  distributed_loader: Move garbage collecting into sstable_directory
  distributed_loader: Collect garbace collecting in one call
  sstable: Coroutinize remove_temp_dir()
  sstable: Coroutinize touch_temp_dir()
  sstable: Use storage::temp_dir instead of hand-crafted path
2023-05-19 08:50:13 +03:00
Jan Ciolek
1bcb4c024c cql3/expr: print expressions in user-friendly way by default
When a CQL expression is printed, it can be done using
either the `debug` mode, or the `user` mode.

`user` mode is basically how you would expect the CQL
to be printed, it can be printed and then parsed back.

`debug` mode is more detailed, for example in `debug`
mode a column name can be displayed as
`unresolved_identifier(my_column)`, which can't
be parsed back to CQL.

The default way of printing is the `debug` mode,
but this requires us to remember to enable the `user`
mode each time we're printing a user-facing message,
for example for an invalid_request_exception.

It's cumbersome and people forget about it,
so let's change the default to `user`.

There issues about expressions being printed
in a `strange` way, this fixes them.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #13916
2023-05-18 20:57:00 +03:00
Kamil Braun
64dc76db55 test: pylib: fix read_barrier implementation
The previous implementation didn't actually do a read barrier, because
the statement failed on an early prepare/validate step which happened
before read barrier was even performed.

Change it to a statement which does not fail and doesn't perform any
schema change but requires a read barrier.

This breaks one test which uses `RandomTables.verify_schema()` when only
one node is alive, but `verify_schema` performs a read barrier. Unbreak
it by skipping the read barrier in this case (it makes sense in this
particular test).

Closes #13933
2023-05-18 18:30:11 +02:00
Kamil Braun
13df85ea11 Merge 'Cut feature_service -> system_keyspace dependency' from Pavel Emelyanov
This implicit link it pretty bad, because feature service is a low-level
one which lots of other services depend on. System keyspace is opposite
-- a high-level one that needs e.g. query processor and database to
operate. This inverse dependency is created by the feature service need
to commit enabled features' names into system keyspace on cluster join.
And it uses the qctx thing for that in a best-effort manner (not doing
anything if it's null).

The dependency can be cut. The only place when enabled features are
committed is when gossiper enables features on join or by receiving
state changes from other nodes. By that time the
sharded<system_keyspace> is up and running and can be used.

Despite gossiper already has system keyspace dependency, it's better not
to overload it with the need to mess with enabling and persisting
features. Instead, the feature_enabler instance is equipped with needed
dependencies and takes care of it. Eventually the enabler is also moved
to feature_service.cc where it naturally belongs.

Fixes: #13837

Closes #13172

* github.com:scylladb/scylladb:
  gossiper: Remove features and sysks from gossiper
  system_keyspace: De-static save_local_supported_features()
  system_keyspace: De-static load_|save_local_enabled_features()
  system_keyspace: Move enable_features_on_startup to feature_service (cont)
  system_keyspace: Move enable_features_on_startup to feature_service
  feature_service: Open-code persist_enabled_feature_info() into enabler
  gms: Move feature enabler to feature_service.cc
  gms: Move gossiper::enable_features() to feature_service::enable_features_on_join()
  gms: Persist features explicitly in features enabler
  feature_service: Make persist_enabled_feature_info() return a future
  system_keyspace: De-static load_peer_features()
  gms: Move gossiper::do_enable_features to persistent_feature_enabler::enable_features()
  gossiper: Enable features and register enabler from outside
  gms: Add feature_service and system_keyspace to feature_enabler
2023-05-18 18:21:06 +02:00
Gleb Natapov
701d6941a5 storage_proxy: raft topology: use gossiper state to populate peers table
Some state that is used to fill in 'peeers' table is still propagated
over gossiper.  When moving a node into the normal state in raft
topology code use the data from the gossiper to populate peers table because
storage_service::on_change() will not do it in case the node was not in
normal state at the time it was called.

Fixes: #13911

Message-Id: <ZGYk/V1ymIeb8qMK@scylladb.com>
2023-05-18 16:00:29 +02:00
Pavel Emelyanov
5216dcb1b3 Merge 'db/system_keyspace: remove the dependency on storage_proxy' from Botond Dénes
The `system_keyspace` has several methods to query the tables in it. These currently require a storage proxy parameter, because the read has to go through storage-proxy. This PR uses the observation that all these reads are really local-replica reads and they only actually need a relatively small code snippet from storage proxy. These small code snippets are exported into standalone function in a new header (`replica/query.hh`). Then the system keyspace code is patched to use these new standalone functions instead of their equivalent in storage proxy. This allows us to replace the storage proxy dependency with a much more reasonable dependency on `replica::database`.

This PR patches the system keyspace code and the signatures of the affected methods as well as their immediate callers. Indirect callers are only patched to the extent it was needed to avoid introducing new includes (some had only a forward-declaration of storage proxy and so couldn't get database from it). There are a lot of opportunities left to free other methods or maybe even entire subsystems from storage proxy dependency, but this is not pursued in this PR, instead being left for follow-ups.

This PR was conceived to help us break the storage proxy -> storage service -> system tables -> storage proxy dependency loop, which become a major roadblock in migrating from IP -> host_id. After this PR, system keyspace still indirectly depends on storage proxy, because it still uses `cql3::query_processor` in some places. This will be addressed in another PR.

Refs: #11870

Closes #13869

* github.com:scylladb/scylladb:
  db/system_keyspace: remove dependency on storage_proxy
  db/system_keyspace: replace storage_proxy::query*() with  replica:: equivalent
  replica: add query.hh
2023-05-18 10:53:27 +03:00
Raphael S. Carvalho
38b226f997 Resurrect optimization to avoid bloom filter checks during compaction
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.

Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.

This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.

The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.

Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
    45.64%    45.64%  sstable_compact  sstable_compaction_test_g
        [.] utils::filter::bloom_filter::is_present

With its resurrection, the problem is gone.

This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.

Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])

[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.

After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).

With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13908
2023-05-18 09:01:50 +03:00
Kefu Chai
03be1f438c sstables: move get_components_lister() into sstable_directory
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.

with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13853
2023-05-18 08:43:35 +03:00