Compare commits

245 Commits

Author SHA1 Message Date
Anna Stuchlik
ec9e5b82a0 doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
Fixes https://github.com/scylladb/scylla-enterprise/issues/5041
Fixes https://github.com/scylladb/scylladb/issues/21898

(cherry picked from commit 98860905d8)
2024-12-12 15:27:24 +02:00
Aleksandra Martyniuk
4640b3efd3 repair: use find_column_family in insert_repair_meta
repair_service::insert_repair_meta gets the reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.

Use find_column_family at each table occurrence in insert_repair_meta
instead.

Fixes: #20057

(cherry picked from commit 719999b34c)

Refs #19953

Closes scylladb/scylladb#20078
2024-08-14 22:20:38 +03:00
Michał Chojnowski
136ccff353 cql_test_env: ensure shutdown() before stop() for system_keyspace
If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.

Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.

Fixes scylladb/scylla-enterprise#4380

(cherry picked from commit eeaf4c3443)

Closes scylladb/scylladb#20146
2024-08-14 20:15:50 +03:00
Michael Litvak
29c352d9c8 db: fix waiting for counter update operations on table stop
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.

Fixes scylladb/scylla-enterprise#4475

(cherry picked from commit ff86c864ff)

Closes scylladb/scylladb#19980
2024-08-07 10:52:39 +02:00
Kamil Braun
888d5fe1a3 Merge '[Backport 5.4] raft: fix the shutdown phase being stuck' from Emil Maskovsky
Some of the calls inside the raft_group0_client::start_operation() method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the ownership of the raft group semaphore (get_units(semaphore)) - these waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes #19223

(cherry picked from commit 2dbe9ef2f2)

(cherry picked from commit 5dfc50d354)

Refs #19860

Closes scylladb/scylladb#19986

* github.com:scylladb/scylladb:
  raft: fix the shutdown phase being stuck
  raft: use the abort source reference in raft group0 client interface
2024-08-05 16:28:19 +02:00
Emil Maskovsky
6e8911ed51 raft: fix the shutdown phase being stuck
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

(cherry picked from commit 5dfc50d354)
2024-08-02 11:00:08 +02:00
Emil Maskovsky
9c4fa2652c raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.

(cherry picked from commit 2dbe9ef2f2)
2024-08-02 10:59:44 +02:00
Avi Kivity
58377036b0 Merge '[Backport 5.4] sstables: fix some mixups between the writer's schema and the sstable's schema ' from Michał Chojnowski
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstable's schema is used to configure some parameters for a newly created
sstable, such as the bloom filter false positive ratio, or compression.

This series fixes the known mixups between the two — when setting up compression,
and when setting up the bloom filters.

Fixes scylladb/scylladb#16065

The bug is present in all supported versions, so the patch has to be backported to all of them.

(cherry picked from commit a1834efd82)

(cherry picked from commit d10b38ba5b)

(cherry picked from commit 1a8ee69a43)

Refs scylladb/scylladb#19695

Closes scylladb/scylladb#19878

* github.com:scylladb/scylladb:
  sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
  sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
  sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
2024-07-28 18:15:27 +03:00
Michał Chojnowski
5b29da123f sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstable's schema is used to configure some parameters for a newly created
sstable, such as the bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
on the sstable's schema.
This patch forces the writer to use the sstable's schema for both.

(cherry picked from commit 1a8ee69a43)
2024-07-25 12:28:00 +02:00
Michał Chojnowski
92ee525f22 sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstable's schema is used to configure some parameters for a newly created
sstable, such as the bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.

This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.

(cherry picked from commit d10b38ba5b)
2024-07-25 12:28:00 +02:00
Michał Chojnowski
bc1c6275a4 sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
As of this patch, those static_casts are actually invalid in some cases
(they cast to the wrong type) because of an oversight.
A later patch will fix that. But to even write a reliable reproducer
for the problem, we must force the invalid casts to manifest as a crash
(instead of weird results).

This patch both allows writing a reproducer for the bug and serves
as a bit of defensive programming for the future.

(cherry picked from commit a1834efd82)

# Conflicts:
#	sstables/sstables.cc
2024-07-25 12:28:00 +02:00
Nadav Har'El
79629a80cd alternator: fix "/localnodes" to not return nodes still joining
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.

The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.

The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().
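
The difference between the two checks can be illustrated with a small Python sketch. This is not the Alternator C++ code: the `Node` class, its `alive` and `status` fields, and the `local_nodes` helper are hypothetical stand-ins for the gossiper state described above.

```python
from dataclasses import dataclass

@dataclass
class Node:
    address: str
    dc: str
    alive: bool   # hypothetical stand-in for what gossiper::is_alive() reports
    status: str   # hypothetical stand-in for the node state, e.g. "NORMAL" or "JOINING"

def local_nodes(nodes, local_dc):
    """Addresses of nodes in local_dc that have finished joining."""
    return [n.address for n in nodes if n.dc == local_dc and n.status == "NORMAL"]

cluster = [
    Node("10.0.0.1", "dc1", alive=True, status="NORMAL"),
    Node("10.0.0.2", "dc1", alive=True, status="JOINING"),  # alive, but not usable yet
    Node("10.0.0.3", "dc2", alive=True, status="NORMAL"),   # different DC
]
print(local_nodes(cluster, "dc1"))
```

Filtering on `alive` alone would also return the joining node, which is the behavior this commit removes.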

But the hard part of this patch is the testing:

1. An existing multi-node test for "/localnodes" assumed that right after
   a new node was created, it appears on "/localnodes". But after this
   patch, it may take a bit more time for the bootstrapping to complete
   and the new node to appear in /localnodes - so I had to add a retry loop.

2. I added a test that reproduces the bug fixed here, and verifies its
   fix. The test is in the multi-node topology framework. It adds an
   injection which delays the bootstrap, which leaves a new node in JOINING
   state for a long time. The test then verifies that the new node is
   alive (as checked by the REST API), but is not returned by "/localnodes".

3. The new injection for delaying the bootstrap is unfortunately not
   very pretty - I had to do it in three places because we have several
   code paths of how bootstrap works without repair, with repair, without
   Raft and with Raft - and I wanted to delay all of them.

Fixes #19694.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19725

(cherry picked from commit bac7c33313)
(deleted test for cherry-pick)
(cherry picked from commit af39675c38)
2024-07-24 11:58:16 +03:00
Lakshmi Narayanan Sreethar
9f0b75bcd2 [Backport 5.4]: sstables: do not reload components of unlinked sstables
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.

The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.

Fixes https://github.com/scylladb/scylladb/issues/19722

Closes scylladb/scylladb#19717

* github.com:scylladb/scylladb:
  boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
  sstables: do not reload components of unlinked sstables
  sstables/sstables_manager: introduce on_unlink method

(cherry picked from commit 591876b44e)

Backported from #19717 to 5.4

Closes scylladb/scylladb#19831
2024-07-23 21:31:23 +03:00
Kamil Braun
0fbec200e9 Merge '[Backport 5.4] Fix lwt semaphore guard accounting' from ScyllaDB
Currently the guard does not account correctly for an ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.

Should be backported to all supported versions.

(cherry picked from commit 87beebeed0)

(cherry picked from commit 4178589826)

Refs #19699

Closes scylladb/scylladb#19795

* github.com:scylladb/scylladb:
  test: add test to check that coordinator lwt semaphore continues functioning after locking failures
  paxos: do not signal semaphore if it was not acquired
2024-07-22 10:34:47 +02:00
Gleb Natapov
972b799773 test: add test to check that coordinator lwt semaphore continues functioning after locking failures
(cherry picked from commit 4178589826)
2024-07-19 19:18:35 +02:00
Gleb Natapov
68c581314a paxos: do not signal semaphore if it was not acquired
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.
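
A minimal Python sketch of the idea, assuming a guard whose `close()` plays the role of the C++ destructor. The `LockGuard` class is hypothetical and only models the accounting; the actual fix relies on Seastar's semaphore_units.

```python
import threading

class LockGuard:
    """Tracks the units it really holds instead of a separate 'locked' flag."""

    def __init__(self, sem):
        self._sem = sem
        self._units = 0  # units actually acquired by this guard

    def lock(self, timeout):
        # Ownership is recorded only after acquire() has succeeded.
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("could not acquire semaphore")
        self._units = 1

    def close(self):
        # Stand-in for the destructor: release only what is held.
        if self._units:
            self._sem.release()
            self._units = 0

sem = threading.Semaphore(1)
first = LockGuard(sem)
first.lock(timeout=0.01)

second = LockGuard(sem)
try:
    second.lock(timeout=0.01)  # fails: `first` still holds the only unit
except TimeoutError:
    pass
second.close()  # must not release, because `second` never held the semaphore
```

Because the guard derives "held" from the units themselves, a failed acquisition cannot leave it in a state where it later signals the semaphore.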

Fixes: https://github.com/scylladb/scylladb/issues/19698
(cherry picked from commit 87beebeed0)
2024-07-18 15:34:16 +00:00
Michael Litvak
380ce9a6d8 storage_proxy: remove response handler if no targets
When writing a mutation, it might happen that there are no live targets
to send the mutation to, yet the request can be satisfied. For example,
when writing with CL=ANY to a dead node, the request is completed by
storing a local hint.

Currently, in that case, a write response handler is created for the
request and it remains active until it times out because it is not
removed anywhere, even though the write is completed successfully after
storing the hint. The response handler is usually removed when
receiving responses from all targets, but in this case there are no
targets to trigger the removal.

In this commit we check if we don't have live targets to send the
mutation to. If so, we remove the response handler immediately.

Fixes scylladb/scylladb#19529

(cherry picked from commit a9fdd0a93a)

Closes scylladb/scylladb#19679
2024-07-15 16:43:28 +02:00
Wojciech Mitros
2c01dfe12b test: account for multiple flushes of commitlog segments
Currently, when we calculate the number of deactivated segments
in test_commitlog_delete_when_over_disk_limit, we only count the
segments that were active during the first flush. However, during
the test, there may have been more than one flush, and a segment
could have been created between them. This segment would sometimes
get deactivated and even destroyed, and as a result, the count of
destroyed segments would appear larger than the count of deactivated
ones.

This patch fixes this behavior by accounting for all segments that
were active during any flush instead of just segments active during
the first flush.

Fixes #10527

(cherry-picked from commit 39a8f4310d)

Closes scylladb/scylladb#19706
2024-07-14 23:21:02 +02:00
Jenkins Promoter
ab22cb7253 Update ScyllaDB version to: 5.4.10
2024-07-11 15:43:24 +03:00
Tomasz Grabiec
0e02128d28 Merge '[Backport 5.4] mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion' from ScyllaDB
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.

This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.

Fixes https://github.com/scylladb/scylladb/issues/19552

(cherry picked from commit f784be6a7e)

(cherry picked from commit 7b3f55a65f)

(cherry picked from commit 78d6471ce4)

Refs #19617

Closes scylladb/scylladb#19676

* github.com:scylladb/scylladb:
  mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
  logalloc: add hold_reserve
  logalloc: generalize refill_emergency_reserve()
2024-07-10 14:29:39 +02:00
Botond Dénes
1e548770cf Merge '[Backport 5.4] reader_concurrency_semaphore: make CPU concurrency configurable' from Botond Dénes
The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of them finishing later than they could if they were the only active read in the cache.
This was observed to backfire in the case where reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read holds up the fast reads in the queue, increasing the 99th percentile latencies significantly.

This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread.

Fixes: https://github.com/scylladb/scylladb/issues/19017

This PR backports https://github.com/scylladb/scylladb/pull/19018 and its follow-up https://github.com/scylladb/scylladb/pull/19600.

Closes scylladb/scylladb#19646

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop
  test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency
  test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
  reader_concurrency_semaphore: wire in the configurable cpu concurrency
  reader_concurrency_semaphore: add cpu_concurrency constructor parameter
  db/config: introduce reader_concurrency_semahore_cpu_concurrency
2024-07-10 13:32:07 +03:00
Wojciech Przytuła
3e879c1bfa storage_proxy: fix uninitialized LWT contention counter
When debugging the issue of high LWT contention metric, we (the
drivers team) discovered that at least 3 drivers (Go, Java, Rust)
cause high numbers in that metrics in LWT workloads - we doubted that
all those drivers route LWT queries badly. We tried to understand that
metric and its semantics. It took 3 people over 10 hours to figure out
what it is supposed to count.

People from core team suspected that it was the drivers sending
requests to different shards, causing contention. Then we ran the
workload against a single node single shard cluster... and observed
contention. Finally, we looked into the Scylla code and saw it.

**Uninitialized stack value.**

The core member was shocked. But we, the drivers people, felt we always
knew it. It's yet another time that we are blamed for a server-side
issue. We rebuilt scylla with the variable initialized to 0 and the
metric kept being 0.

To prevent such errors in the future, let's consider some lints that
warn against uninitialized variables. This is such an obvious feature
of e.g. Rust, and yet this has been shown to cause a painful bug in 2024.

Fixes: #19654
(cherry picked from commit 36a125bf97)

Closes scylladb/scylladb#19656
2024-07-10 13:29:31 +03:00
Michał Chojnowski
5e9a2193db mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.

This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.

Fixes scylladb/scylladb#19552

(cherry picked from commit 78d6471ce4)
2024-07-10 08:36:12 +00:00
Michał Chojnowski
c2e5d9e726 logalloc: add hold_reserve
mutation_partition_v2::apply_monotonically() needs to perform some allocations
in a destructor, to ensure that the invariants of the data structure are
restored before returning. But it is usually called with reclaiming disabled,
so the allocations might fail even in a perfectly healthy node with plenty of
reclaimable memory.

This patch adds a mechanism which allows to reserve some LSA memory (by
asking the allocator to keep it unused) and make it available for allocation
right when we need to guarantee allocation success.

(cherry picked from commit 7b3f55a65f)
2024-07-10 08:36:12 +00:00
Michał Chojnowski
80ff0688b1 logalloc: generalize refill_emergency_reserve()
In the next patch, we will want to do the same thing as
refill_emergency_reserve() does, just with a quantity different
than _emergency_reserve_max. So we split off the shareable part
to a new function, and use it to implement refill_emergency_reserve().

(cherry picked from commit f784be6a7e)
2024-07-10 08:36:10 +00:00
Raphael S. Carvalho
a319085870 compaction: Check for key presence in memtable when calculating max purgeable timestamp
It was observed that some use cases might append old data constantly to
memtable, blocking GC of expired tombstones.

That's because timestamp of memtable is unconditionally used for
calculating max purgeable, even when the memtable doesn't contain the
key of the tombstone we're trying to GC.

The idea is to treat memtable as we treat L0 sstables, i.e. it will
only prevent GC if it contains data that is possibly shadowed by the
expired tombstone (after checking for key presence and timestamp).

Memtable will usually have a small subset of keys in largest tier,
so after this change, a large fraction of keys containing expired
tombstones can be GCed when memtable contains old data.
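
The rule can be sketched in Python under a deliberately simplified model. The data shapes here are assumptions made for illustration: the memtable is a dict from key to the oldest write timestamp for that key, and each sstable is a `(keys, min_timestamp)` pair. The function names are hypothetical and do not come from the ScyllaDB code.

```python
import math

def max_purgeable(key, memtable, other_sstables):
    """Upper bound on tombstone timestamps that are safe to purge for `key`.

    Only sources that actually contain `key` lower the bound.
    """
    limit = math.inf
    if key in memtable:  # the memtable counts only if it holds this key
        limit = min(limit, memtable[key])
    for keys, min_timestamp in other_sstables:
        if key in keys:
            limit = min(limit, min_timestamp)
    return limit

def can_purge(tombstone_timestamp, key, memtable, other_sstables):
    return tombstone_timestamp < max_purgeable(key, memtable, other_sstables)

memtable = {"k2": 5}  # old data for k2 was appended to the memtable
print(can_purge(10, "k1", memtable, []))  # k1 is not in the memtable
print(can_purge(10, "k2", memtable, []))  # k2 has older data the tombstone may shadow
```

In this model, the old behavior corresponds to applying the memtable's timestamp for every key, which is what blocked GC for keys the memtable never contained.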

Fixes #17599.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 38699f6c3d)

Closes scylladb/scylladb#19551
2024-07-10 07:30:40 +03:00
Botond Dénes
b24bd4d176 Merge '[Backport 5.4] Reduce TWCS off-strategy space overhead' from Raphael "Raph" Carvalho
Normally, the space overhead for TWCS is 1/N, where N is the number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier.

Reshaping a TWCS table that takes ~50% of available space can result in system running out of space.

That's fixed by restricting every TWCS off-strategy job to 10% of free space in disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.
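
As an illustration only, the batching idea might look like the following Python sketch. The `plan_jobs` helper, its parameters, and the greedy grouping are assumptions, not the reshape code in the series; only the 10% figure comes from the description above.

```python
def plan_jobs(sstable_sizes, free_space, percent=10):
    """Group input sstables into jobs whose total size fits the space budget.

    A single sstable larger than the budget still gets a job of its own.
    """
    budget = free_space * percent // 100
    jobs, current, used = [], [], 0
    for size in sstable_sizes:
        if current and used + size > budget:
            jobs.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        jobs.append(current)
    return jobs

# Small table: everything fits in one round, so no extra write amplification.
print(plan_jobs([10, 10, 10], free_space=1000))
# Large table: the work is split so each job stays within 10% of free space.
print(plan_jobs([60, 60, 60], free_space=1000))
```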

Fixes https://github.com/scylladb/scylladb/issues/16514.

(cherry picked from commit b8bd4c51c2)

(cherry picked from commit 51c7ee889e)

(cherry picked from commit 0ce8ee03f1)

(cherry picked from commit ace4e5111e)

Refs https://github.com/scylladb/scylladb/pull/18137

note for maintainer: first cleanup (conflicting) patch was removed and doesn't change anything.

Closes scylladb/scylladb#19549

* github.com:scylladb/scylladb:
  compaction: Reduce twcs off-strategy space overhead to 10% of free space
  compaction: wire storage free space into reshape procedure
  sstables: Allow to get free space from underlying storage
2024-07-09 14:36:27 +03:00
Botond Dénes
f628e7439c reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop
Now that the CPU concurrency limit is configurable, new reads might be
ready to execute right after the current one was executed. So move the
poll for admitting new reads into the inner loop, to prevent the
situation where the inner loop yields and a concurrent
do_wait_admission() finds that there are waiters (queued because at the
time they arrived to the semaphore, the _ready_list was not empty) but it
is possible to admit a new read. When this happens the semaphore will
dump diagnostics to help debug the apparent contradiction, which can
generate a lot of log spam. Moving the poll into the inner loop prevents
the false-positive contradiction detection from firing.

Refs: scylladb/scylladb#19017

Closes scylladb/scylladb#19600

(cherry picked from commit 155acbb306)
2024-07-09 13:11:25 +03:00
Botond Dénes
42da43b5b4 test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency
(cherry picked from commit b4f3809ad2)
2024-07-09 13:11:25 +03:00
Botond Dénes
679fa0f72a test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
This is currently a lambda in a test, hoist it into the global scope and
make it into a function, so other tests can use it too (in the next
patch).

(cherry picked from commit 9cbdd8ef92)
2024-07-09 11:40:13 +03:00
Botond Dénes
89733a1f18 reader_concurrency_semaphore: wire in the configurable cpu concurrency
Before this patch, the semaphore was hard-wired to stop admission if
there is even a single permit in the need_cpu state, thereby keeping
the CPU concurrency at 1.
This patch makes use of the new cpu_concurrency parameter, which was
wired in in the last patches, allowing for a configurable amount of
concurrent need_cpu permits. This is to address workloads where some
small subset of reads are expected to be slow, and can hold up faster
reads behind them in the semaphore queue.
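
The admission rule described here can be modeled in a few lines of Python. This is a sketch of the CPU-concurrency check alone; the `CpuAdmission` class is hypothetical, and the other admission criteria the semaphore applies are left out.

```python
class CpuAdmission:
    """Models only the need_cpu part of the admission decision."""

    def __init__(self, cpu_concurrency=1):
        self.cpu_concurrency = cpu_concurrency
        self.need_cpu = 0  # number of permits currently in the need_cpu state

    def can_admit(self):
        # Old behavior is the special case cpu_concurrency == 1.
        return self.need_cpu < self.cpu_concurrency

strict = CpuAdmission(cpu_concurrency=1)
strict.need_cpu = 1    # one slow read is running
relaxed = CpuAdmission(cpu_concurrency=2)
relaxed.need_cpu = 1   # same situation, higher limit
print(strict.can_admit(), relaxed.can_admit())
```

With the limit at 1, a single slow read blocks admission; raising it lets fast reads proceed alongside the slow one.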

(cherry picked from commit 07c0a8a6f8)
2024-07-09 11:40:13 +03:00
Botond Dénes
f185a227a2 reader_concurrency_semaphore: add cpu_concurrency constructor parameter
In the case of the user semaphore, this receives the new
reader_concurrency_semaphore_cpu_limit config item.
Not used yet.

(cherry picked from commit 59faa6d4ff)
2024-07-09 11:40:12 +03:00
Botond Dénes
89fd08b955 db/config: introduce reader_concurrency_semahore_cpu_concurrency
To allow increasing the semaphore's CPU concurrency, which is currently
hard-limited to 1. Not wired yet.

(cherry picked from commit c7317be09a)
2024-07-08 08:41:02 +03:00
Nadav Har'El
f29d51d9c3 cql: don't crash when creating a view during a truncate
The test dtest materialized_views_test.py::TestMaterializedViews::
test_mv_populating_from_existing_data_during_truncate reproduces an
assertion failure, and crash, while doing a CREATE MATERIALIZED VIEW
during a TRUNCATE operation.

This patch fixes the crash by removing the assert() call for a view
(replacing it by a warning message) - we'll explain below why this is fine.
Also, for base tables, we change the assertion to an on_internal_error
(Refs #7871).
This makes the test stop crashing Scylla, but it still fails due to
issue #17635.

Let's explain the crash, and the fix:

The test starts TRUNCATE on table that doesn't yet have a view.
truncate_table_on_all_shards() begins by disabling compaction on
the table and all its views (of which there are none, at this
point). At this point, the test creates a new view on this table.
The new view has, by default, compaction enabled. Later, TRUNCATE
calls discard_sstables() on this new view, asserts that it has
compaction disabled - and this assertion fails.

The fix in this patch is to not do the assert() for views. In other words,
we acknowledge that in this use case, the view *will* have compactions
enabled while being truncated. I claim that this is "good enough", if we
remember *why* we disable compaction in the first place: It's important
to disable compaction while truncating because truncating during compaction
can lead us to data resurrection when the old sstable is deleted during
truncation but the result of the compaction is written back. True,
this can now happen in a new view (a view created *DURING* the
truncation). But I claim that worse things can happen for this
new view: Notably, we may truncate a view and then the ongoing
view building (which happens in a new view) might copy data from
the base to the view and only then truncate the base - ending up
with an empty base and non-empty view. This problem - issue #17635 -
is more likely, and more serious, than the compaction problem, so
will need to be solved in a separate patch.

Fixes #17543.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17634

(cherry picked from commit 8df2ea3f95)

Closes scylladb/scylladb#19580
2024-07-05 11:19:33 +03:00
Avi Kivity
01d5169593 Merge '[Backport 5.4] Close output_stream in get_compaction_history() API handler' from ScyllaDB
If an httpd body writer is called with output_stream<>, it must close the stream on its own regardless of any exceptions it may generate while working, otherwise stream destructor may step on non-closed assertion. Stepped on with different handler, see #19541

Coroutinize the handler as the first step while at it (though the fix would have been notably shorter if done with .finally() lambda)

(cherry picked from commit acb351f4ee)

(cherry picked from commit 6d4ba98796)

(cherry picked from commit b4f9387a9d)

Refs #19543

Closes scylladb/scylladb#19602

* github.com:scylladb/scylladb:
  api: Close response stream of get_compaction_history()
  api: Flush output stream in get_compaction_history() call
  api: Coroutinize get_compaction_history inner function
2024-07-04 15:06:53 +03:00
Pavel Emelyanov
3116ea7d8e api: Close response stream of get_compaction_history()
The function must close the stream even if it throws along the way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b4f9387a9d)
2024-07-03 18:30:16 +00:00
Pavel Emelyanov
36ccf67bee api: Flush output stream in get_compaction_history() call
It's currently implicitly flushed on its close, but in that case close
can throw while flushing. Next patch wants close not to throw and that's
possible if flushing the stream in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 6d4ba98796)
2024-07-03 18:30:15 +00:00
Pavel Emelyanov
bae47ca197 api: Coroutinize get_compaction_history inner function
The handler returns a function which is then invoked with output_stream
argument to render the json into. This function is converted into
coroutine. It has yet another inner lambda that's passed into
compaction_manager::get_compaction_history() as consumer lambda. It's
coroutinized too.

The indentation looks weird as preparation for future patching.
Hopefully it's still possible to understand what's going on.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit acb351f4ee)
2024-07-03 18:30:15 +00:00
Tzach Livyatan
fdcbbb85ad Docs: Fix a typo in sstable-corruption.rst
(cherry picked from commit a7115124ce)

Closes scylladb/scylladb#19590
2024-07-03 10:24:15 +02:00
Piotr Dulikowski
65daae0fbe Merge '[Backport 5.4] cql3/statement/select_statement: do not parallelize single-partition aggregations #19414' from Michał Jadwiszczak
This patch adds a check for whether an aggregation query is doing a single-partition read and, if so, makes the query not use forward_service and not parallelize the request.

Fixes scylladb/scylladb#19349

(cherry picked from commit e9ace7c203)

(cherry picked from commit 8eb5ca8202)

Refs scylladb/scylladb#19350

Closes scylladb/scylladb#19500

* github.com:scylladb/scylladb:
  test/boost/cql_query_test: add test for single-partition aggregation
  cql3/select_statement: do not parallelize single-partition aggregations
2024-07-03 05:57:49 +02:00
Amnon Heiman
f07fbcf929 replica/table.cc: Add metrics per-table-per-node
This patch adds metrics that will be reported per-table per-node.
The added metrics (that are part of the per-table per-shard metrics)
are:
scylla_column_family_cache_hit_rate
scylla_column_family_read_latency
scylla_column_family_write_latency
scylla_column_family_live_disk_space

Fixes #18642

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit bc3cc6777b)

Closes scylladb/scylladb#19553
2024-07-02 10:49:00 +02:00
Pavel Emelyanov
1d0a6672d6 Merge '[Backport 5.4] Close output stream in task manager's API get_tasks handler' from Pavel Emelyanov
If client stops reading response early, the server-side stream throws but must be closed anyway. Seen in another endpoint and fixed by https://github.com/scylladb/scylladb/pull/19541

Original commit is 0ce00ebfbd
Had simple conflict with ffb5ad494f

closes: #19560

Closes scylladb/scylladb#19571

* https://github.com/scylladb/scylladb:
  api: Fix indentation after previous patch
  api: Close response stream on error
  api: Flush response output stream before closing
2024-07-01 18:02:35 +03:00
Pavel Emelyanov
e16327034c [Backport 5.4] Close output_stream in get_snapshot_details() API handler
All streams used by httpd handlers are to be closed by the handler itself,
caller doesn't take care of that.

On master this place is coroutinized, thus the original fix doesn't fit
as is. Original commit is 0ce00ebfbd

refs: #19494
closes: #19561

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19570
2024-07-01 18:02:26 +03:00
Pavel Emelyanov
3510ff3179 api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1be8b2fd25)
2024-07-01 10:58:57 +03:00
Pavel Emelyanov
e9b8d08b74 api: Close response stream on error
The handler's lambda is called with && stream object and must close the
stream on its own no matter what.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 986a04cb11)
2024-07-01 10:58:47 +03:00
Pavel Emelyanov
d67af8da1c api: Flush response output stream before closing
The .close() method flushes the stream, but it may throw doing it. Next
patch will want .close() not to throw, for that stream must be flushed
explicitly before closing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4897d8f145)
2024-07-01 10:58:14 +03:00
Jenkins Promoter
66fc7c0494 Update ScyllaDB version to: 5.4.9 2024-06-30 16:23:57 +03:00
Raphael S. Carvalho
67be26ff7d compaction: Reduce twcs off-strategy space overhead to 10% of free space
TWCS off-strategy suffers from 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.

To not penalize TWCS tables that take a small percentage of disk
with increased write amplification, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.
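
As a rough illustration of the 10% budget, here is a minimal Python sketch. The function `pick_reshape_job` and its "always take at least one sstable" rule are assumptions made for the example; the commit message does not describe how the actual reshape code selects its input.

```python
def pick_reshape_job(sstable_sizes, free_space, max_fraction=0.10):
    """Return a prefix of sstable sizes whose total fits the space budget.

    Hypothetical helper: the budget is max_fraction of the free space.
    """
    budget = free_space * max_fraction
    job, total = [], 0
    for size in sstable_sizes:
        # Assumption: always take at least one sstable so the job makes progress.
        if job and total + size > budget:
            break
        job.append(size)
        total += size
    return job
```

A small table fits in one round, for example `pick_reshape_job([10, 10, 10], free_space=1000)` selects all three inputs, while larger inputs are split across several rounds.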

Fixes #16514.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ace4e5111e)
2024-06-29 11:29:59 -03:00
Raphael S. Carvalho
97893a4f6d compaction: wire storage free space into reshape procedure
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0ce8ee03f1)
2024-06-29 11:29:59 -03:00
Raphael S. Carvalho
ab9683d182 sstables: Allow to get free space from underlying storage
That will be used in turn to restrict reshape to 10% of available space
in underlying storage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 51c7ee889e)
2024-06-29 11:29:57 -03:00
Yaron Kaikov
892ffa966d .github/scripts/label_promoted_commits.py: fix adding labels when PR is closed
`prs = response.json().get("items", [])` will return empty when there are no merged PRs, and this will just skip the entire label replacement process.

This is a regression following the work done in #19442

Adding another part to handle closed PRs (which is the majority of the cases we have in Scylla core)

Fixes: https://github.com/scylladb/scylladb/issues/19441
(cherry picked from commit efa94b06c2)

Closes scylladb/scylladb#19526
2024-06-27 20:58:56 +03:00
Botond Dénes
4b6e462266 Merge 'alternator: fix REST API access to an Alternator LSI' from Nadav Har'El
The name of the Scylla table backing an Alternator LSI looks like `basename:!lsiname`. Some REST API clients (including Scylla Manager) when they send a "!" character in the REST API request path may decide to "URL encode" it - convert it to `%21`.

Because of a Seastar bug (https://github.com/scylladb/seastar/issues/725) Scylla's REST API server forgets to do the URL decoding on the path part of the request, which leads to the REST API request failing to address the LSI table.

The first patch in this PR fixes the bug by using a new Seastar API introduced in https://github.com/scylladb/seastar/pull/2125 that does the URL decoding as appropriate. The second patch in the PR is a new test for this bug, which fails without the fix, and passes afterwards.
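
The encoding mismatch itself is easy to reproduce with Python's standard library. The sketch below is only an illustration: `TABLES` and `find_table` are hypothetical, and the real fix lives in Seastar's path parameter handling.

```python
from urllib.parse import quote, unquote

TABLES = {"basename:!lsiname"}  # hypothetical LSI backing-table name


def find_table(path_segment, decode):
    """Look up a table by a raw request path segment (toy model)."""
    name = unquote(path_segment) if decode else path_segment
    return name if name in TABLES else None


# A client may percent-encode "!" as %21 before sending the request.
encoded = quote("basename:!lsiname", safe=":")
```

Here `encoded` is `basename:%21lsiname`: a lookup without decoding misses the table, while a lookup that decodes the segment first finds it.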

Fixes #5883.

Closes scylladb/scylladb#18286

* github.com:scylladb/scylladb:
  test/alternator: test addressing LSI using REST API
  REST API: stop using deprecated, buggy, path parameter

(cherry picked from commit 0438febdc9)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-06-27 20:58:56 +03:00
Nadav Har'El
ce22d0071b Update Seastar submodule
Update Seastar submodule with cherry-picked commit:

  > Merge 'http: path parameter parsing convenience methods' from Gellért Peresztegi-Nagy

This is required for cherry-picking the fix for #5883.
2024-06-27 20:58:56 +03:00
Aleksandra Martyniuk
a5d34b62ac repair: drop timeout from table_sync_and_check
Delete 10s timeout from read barrier in table_sync_and_check,
so that the function always considers all previous group0 changes.

Fixes: #18490.
(cherry picked from commit f947cc5477)

Closes scylladb/scylladb#19513
2024-06-27 14:58:09 +03:00
Botond Dénes
f121720898 Merge '[Backport 5.4] batchlog replay: bypass tombstones generated by past replays' from ScyllaDB
The `system.batchlog` table has a partition for each batch that failed to complete. After finally applying the batch, the partition is deleted. Although the table has gc_grace_seconds = 0, tombstones can still accumulate in memory, because we don't purge partition tombstones from either the memtable or the cache. This can lead to the cache and memtable of this table accumulating many thousands or even millions of tombstones, making batchlog replay very slow. We didn't notice this before, because we would only replay all failed batches on unbootstrap, which is rare and a heavy and slow operation in its own right already.
With repair-based tombstone-gc however, we do a full batchlog replay at the beginning of each repair, and now this extra delay is noticeable.
Fix this by making sure batchlog replays don't have to scan through all the tombstones generated by previous replays:
* flush the `system.batchlog` memtable at the end of each batchlog replay, so it is cleared of tombstones
* bypass the cache

Fixes: https://github.com/scylladb/scylladb/issues/19376

Although this is not a regression -- replay was like this since forever -- now that repair calls into batchlog replay, every release which uses repair-based tombstone-gc should get this fix

(cherry picked from commit 4e96e320b4)

(cherry picked from commit 2dd057c96d)

(cherry picked from commit 29f610d861)

(cherry picked from commit 31c0fa07d8)

Refs #19377

Closes scylladb/scylladb#19501

* github.com:scylladb/scylladb:
  db/batchlog_manager: bypass cache when scanning batchlog table
  db/batchlog_manager: replace open-coded paging with internal one
  db/batchlog_manager: implement cleanup after all batchlog replay
  cql3/query_processor: for_each_cql_result(): move func to the coro frame
2024-06-27 14:57:19 +03:00
Botond Dénes
d217ab9cc7 Merge ' [Backport 5.4] schema: Make "describe" use extensions to string #19358 ' from Calle Wilund
Fixes https://github.com/scylladb/scylladb/issues/19334

Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.

Note: required to make enterprise CI happy again.

(cherry picked from commit d27620e146)

(cherry picked from commit 73abc56d79)

Refs https://github.com/scylladb/scylladb/pull/19337

(Manually re-done from mergify fail #19358)

Closes scylladb/scylladb#19487

* github.com:scylladb/scylladb:
  schema: Make "describe" use extensions to string
  schema_extensions: Add an option to string method
2024-06-27 14:52:17 +03:00
Kamil Braun
6eac67628e Merge '[Backport 5.4] join_token_ring, gossip topology: recalculate sync nodes in wait_alive' from Patryk Jędrzejczak
The node booting in gossip topology waits until all NORMAL
nodes are UP. If we removed a different node just before,
the booting node could still see it as NORMAL and wait for
it to be UP, which would time out and fail the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
`wait_alive` loop.

Although the issue fixed by this PR caused only test flakiness,
it could also manifest in real clusters. It's best to backport this
PR to 5.4 and 6.0.

Fixes scylladb/scylladb#17526

The first patch of the original PR merged to master,
017134fd38, differs because
the conflicting patch, e4c3c07510,
hasn't been backported to 5.4. However, we don't have to
additionally backport that patch. These changes are orthogonal.

Closes scylladb/scylladb#19506

* github.com:scylladb/scylladb:
  join_token_ring, gossip topology: update obsolete comment
  join_token_ring, gossip topology: fix indendation after previous patch
  join_token_ring, gossip topology: recalculate sync nodes in wait_alive
2024-06-26 18:38:40 +02:00
Patryk Jędrzejczak
a28b38d0a9 join_token_ring, gossip topology: update obsolete comment
The code mentioned in the comment has already been added. We change
the comment to prevent confusion.
2024-06-26 16:09:31 +02:00
Patryk Jędrzejczak
614cabf9cd join_token_ring, gossip topology: fix indendation after previous patch 2024-06-26 16:09:31 +02:00
Patryk Jędrzejczak
193fda6dfb join_token_ring, gossip topology: recalculate sync nodes in wait_alive
Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
`wait_alive` loop.
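
The idea of recalculating the set on every step can be sketched in Python as follows. This is a simplified model, not the ScyllaDB code: `wait_alive` here takes two hypothetical callbacks and a step limit, and omits the sleeping and timeouts a real loop would have.

```python
def wait_alive(get_normal_nodes, is_up, max_steps):
    """Return True once every currently NORMAL node is UP (toy model)."""
    for _ in range(max_steps):
        # Recomputed on every step, so a node removed in the meantime
        # drops out of the set we wait for.
        expected = get_normal_nodes()
        if all(is_up(node) for node in expected):
            return True
    return False
```

If the set were computed once before the loop, a node that was removed just before the boot would stay in `expected` and the loop would run until it gave up.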

The original patch merged to master,
017134fd38, differs because
the conflicting patch, e4c3c07510,
hasn't been backported to 5.4. However, we don't have to
additionally backport that patch. These changes are orthogonal.
2024-06-26 16:09:30 +02:00
Yaron Kaikov
50f3f3d1a3 .github/workflow: close and replace label when backport promoted
Today, after Mergify opens a backport PR, it stays open until someone manually closes it. Also, we can't use labels to track which backports were done, since there is no indication of that other than digging into the PR and looking for a comment or a commit ref.

The following changes were made in this PR:
* trigger add-label-when-promoted.yaml also when the push was made to `branch-x.y`
* Replace label `backport/x.y` with `backport/x.y-done` in the original PR (this will automatically update the original Issue as well)
* Add a comment on the backport PR and close it
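
The label replacement step can be sketched as a pure function. This is only an illustration under assumptions: `mark_backport_done` is a hypothetical helper, and the actual workflow talks to the GitHub API rather than working on a plain list.

```python
def mark_backport_done(labels, version):
    """Replace `backport/<version>` with `backport/<version>-done`.

    Hypothetical helper; other labels are returned unchanged.
    """
    target = f"backport/{version}"
    return [f"{label}-done" if label == target else label for label in labels]
```

For example, promoting a 5.4 backport turns `["backport/5.4", "backport/6.0"]` into `["backport/5.4-done", "backport/6.0"]`, leaving the pending 6.0 backport visible.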

Fixes: https://github.com/scylladb/scylladb/issues/19441
(cherry picked from commit 394cba3e4b)

Closes scylladb/scylladb#19495
2024-06-26 12:44:27 +03:00
Botond Dénes
9ce5c2e6ce db/batchlog_manager: bypass cache when scanning batchlog table
Scans should not pollute the cache with cold data, in general. In the
case of the batchlog table, there is another reason to bypass the cache:
this table can have a lot of partition tombstones, which currently are
not purged from the cache. So in certain cases, using the cache can make
batch replay very slow, because it has to scan past tombstones of
already replayed batches.

(cherry picked from commit 31c0fa07d8)
2024-06-26 09:05:13 +00:00
Botond Dénes
2f2bc18376 db/batchlog_manager: replace open-coded paging with internal one
query_processor has built-in paging support, no need to open-code paging
in batchlog manager code.

(cherry picked from commit 29f610d861)
2024-06-26 09:05:13 +00:00
Botond Dénes
c795275675 db/batchlog_manager: implement cleanup after all batchlog replay
We have a commented code snippet from Origin with cleanup and a FIXME to
implement it. Origin flushes the memtables and kicks a compaction. We
only implement the flush here -- the flush will trigger a compaction
check and we leave it up to the compaction manager to decide when a
compaction is worthwhile.
This method used to be called only from unbootstrap, so a cleanup was
not really needed. Now it is also called at the end of repair, if the
table is using repair-based tombstone-gc. If the memtable is filled with
tombstones, this can add a lot of time to the runtime of each repair. So
flush the memtable at the end, so the tombstones can be purged (they
aren't purged from memtables yet).

(cherry picked from commit 2dd057c96d)
2024-06-26 09:05:12 +00:00
Botond Dénes
b13cee4c7c cql3/query_processor: for_each_cql_result(): move func to the coro frame
Said method has a func parameter (called just f), which it receives as
rvalue ref and just uses as a reference. This means that if caller
doesn't keep the func alive, for_each_cql_result() will run into
use-after-free after the first suspension point. This is unexpected for
callers, who don't expect to have to keep something alive, which they
passed in with std::move().
Adjust the signature to take a value instead, value parameters are moved
to the coro frame and survive suspension points.
Adjust internal callers (query_internal()) the same way.

There are no known vulnerable external callers.

(cherry picked from commit 4e96e320b4)
2024-06-26 09:05:12 +00:00
Michał Jadwiszczak
bea1f4891d test/boost/cql_query_test: add test for single-partition aggregation
(cherry picked from commit 8eb5ca8202)
2024-06-26 08:57:50 +02:00
Michał Jadwiszczak
9535abf552 cql3/select_statement: do not parallelize single-partition aggregations
Currently reads with WHERE clause which limits them to be
single-partition reads, are unnecessarily parallelized.

This commit checks this condition and the query doesn't use
forward_service in single-partition reads.

(cherry picked from commit e9ace7c203)
2024-06-26 08:56:34 +02:00
Calle Wilund
abb4751e00 schema: Make "describe" use extensions to string
Fixes #19334

Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.

(cherry picked from commit 73abc56d79)
2024-06-25 23:51:52 +00:00
Calle Wilund
39ec136a09 schema_extensions: Add an option to string method
Allow an extension to describe itself as the CQL property
string that created it (and is serialized to schema tables)

Only paxos extension requires override.

(cherry picked from commit d27620e146)
2024-06-25 12:54:13 +00:00
Nadav Har'El
b7ef9652fb test: unflake test test_alternator_ttl_scheduling_group
This test in topology_experimental_raft/test_alternator.py wants to
check that during Alternator TTL's expiration scans, ALL of the CPU was
used in the "streaming" scheduling group and not in the "statement"
scheduling group. But to allow for some fluke requests (e.g., from the
driver), the test actually allows work in the statement group to be
up to 1% of the work.

Unfortunately, in one test run - a very slow debug+aarch64 run - we
saw the work on the statement group reach 1.4%, failing the test.
I don't know exactly where this work comes from, perhaps the driver,
but before this bug was fixed we saw more than 58% of the work in the
wrong scheduling group, so neither 1% nor 1.4% is a sign that the bug
came back. In fact, let's just change the threshold in the test to 10%,
which is also much lower than the pre-fix value of 58%, so is still a
valid regression test.
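
The threshold check can be sketched as below. The function names are hypothetical, and measuring the wrong-group work relative to the streaming-group work is an assumption; the test itself reads these numbers from Scylla metrics.

```python
def wrong_group_ratio(statement_ms, streaming_ms):
    """Work in the 'statement' group relative to the 'streaming' group."""
    if streaming_ms <= 0:
        raise ValueError("no work recorded in the streaming group")
    return statement_ms / streaming_ms


def ttl_scheduling_ok(statement_ms, streaming_ms, threshold=0.10):
    # The 10% threshold tolerates fluke requests but still catches the
    # pre-fix behavior, where the ratio was above 58%.
    return wrong_group_ratio(statement_ms, streaming_ms) <= threshold
```

With these definitions, the slow run that reached 1.4% passes, while a pre-fix ratio of 58% still fails.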

Fixes #19307

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19323

(cherry picked from commit 9fc70a28ca)
2024-06-24 14:15:57 +03:00
Botond Dénes
d3b2702be1 replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
This config item is propagated to the table object via table::config.
Although the field in table::config, used to propagate the value, was
utils::updateable_value<T>, it was assigned a constant and so the
live-update chain was broken.
This patch fixes this.
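
A broken live-update chain of this kind can be modeled in a few lines of Python. `UpdateableValue` below is a toy class written for this sketch, not `utils::updateable_value<T>`; it only shows the difference between copying the current value and sharing the source.

```python
class UpdateableValue:
    """Toy live-updateable setting (hypothetical)."""

    def __init__(self, value):
        self._value = value

    def set(self, value):
        self._value = value

    def __call__(self):
        return self._value


db_setting = UpdateableValue(True)

# Broken chain: a new value is built from a constant snapshot.
broken_table_setting = UpdateableValue(db_setting())
# Working chain: the table config shares the source value.
fixed_table_setting = db_setting

db_setting.set(False)  # live update of the config item
```

After the update, `broken_table_setting()` still returns the old value while `fixed_table_setting()` reflects the change.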

(cherry picked from commit dbccb61636)

Closes scylladb/scylladb#19239
2024-06-21 20:14:57 +03:00
Wojciech Mitros
4ef4893f7e mv: replicate the gossiped backlog to all shards
On each shard of each node we store the view update backlogs of
other nodes to, depending on their size, delay responses to incoming
writes, lowering the load on these nodes and helping them get their
backlog to normal if it were too high.

These backlogs are propagated between nodes in two ways: the first
one is adding them to replica write responses. The second one
is gossiping any changes to the node's backlog every 1s. The gossip
becomes useful when writes stop to some node for some time and we
stop getting the backlog using the first method, but we still want
to be able to select a proper delay for new writes coming to this
node. It will also be needed for the mv admission control.

Currently, the backlog is gossiped from shard 0, as expected.
However, we also receive the backlog only on shard 0 and only
update this shard's backlogs for the other node. Instead, we'd
want to have the backlogs updated on all shards, allowing us
to use proper delays also when requests are received on shards
different than 0.

This patch changes the backlog update code, so that the backlogs
on all shards are updated instead. This will only be performed
up to once per second for each other node, and is done with
a lower priority, so it won't severely impact other work.

Fixes: scylladb/scylladb#19232
(cherry picked from commit d31437b589)

Closes scylladb/scylladb#19306
2024-06-21 09:58:40 +02:00
Calle Wilund
19999554e7 main/minio_server.py: Respect any preexisting AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY vars
Fixes scylladb/scylla-pkg#3845

Don't overwrite (or rather change) AWS credentials variables if already set in
enclosing environment. Ensures EAR tests for AWS KMS can run properly in CI.

v2:
* Allow environment variables in reading obj storage config - allows CI to
  use real credentials in env without risking putting them into less secure
  files
* Don't write credentials info from miniserver into config, instead use said
  environment vars to propagate creds.

v3:
* Fix python launch scripts to not clear environment, thus retaining above aws envs.
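
The "respect preexisting variables" behavior can be sketched with `dict.setdefault`. The helper `minio_env` and its parameters are hypothetical; the sketch works on a copy of the environment rather than on `os.environ` so it has no side effects.

```python
def minio_env(base_env, minio_access_key, minio_secret_key):
    """Build the environment for the test run (hypothetical helper)."""
    env = dict(base_env)  # copy: keep everything from the enclosing environment
    # setdefault only fills in a value when the key is not already set,
    # so real credentials provided by CI are left untouched.
    env.setdefault("AWS_ACCESS_KEY_ID", minio_access_key)
    env.setdefault("AWS_SECRET_ACCESS_KEY", minio_secret_key)
    return env
```

When CI exports real credentials they win; otherwise the values generated for the local server are used.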

(cherry picked from commit 5056a98289)

Closes scylladb/scylladb#19336
2024-06-20 13:23:40 +03:00
Botond Dénes
43f77c71c7 [Backport 5.4] : Merge 'Fix usage of utils/chunked_vector::reserve_partial' from Lakshmi Narayanan Sreethar
utils/chunked_vector::reserve_partial: fix usage in callers

The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.

Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.
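
The corrected calling pattern can be sketched with a toy model. `ToyChunkedVector` is not the real `utils::chunked_vector`; it only assumes that each `reserve_partial()` call grows capacity by at most one chunk, and `reserve_fully` is a hypothetical caller.

```python
class ToyChunkedVector:
    """Toy model: capacity grows by at most one chunk per call."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self._capacity = 0

    def capacity(self):
        return self._capacity

    def reserve_partial(self, size):
        if self._capacity < size:
            self._capacity = min(size, self._capacity + self.chunk_size)


def reserve_fully(vec, size, maybe_yield=lambda: None):
    # Call with the intended size every time, until capacity reaches it.
    while vec.capacity() < size:
        vec.reserve_partial(size)
        maybe_yield()  # gives other tasks a chance to run between steps
```

The loop condition is the vector's capacity, not a value returned by the previous call, which matches the usage described above.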

This PR updates the method comment and all the callers to use the
right way.

Fixes #19254

Closes scylladb/scylladb#19279

* github.com:scylladb/scylladb:
  utils/large_bitset: remove unused includes identified by clangd
  utils/large_bitset: use thread::maybe_yield()
  test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
  utils/lsa/chunked_managed_vector: fix reserve_partial()
  utils/chunked_vector: return void from reserve_partial and make_room
  test/boost/chunked_vector_test: fix testcase tests_reserve_partial
  utils/chunked_vector::reserve_partial: fix usage in callers

(cherry picked from commit b2ebc172d0)

Backported from #19308 to 5.4

Closes scylladb/scylladb#19355
2024-06-19 14:34:29 +03:00
Botond Dénes
4aa0b84ba7 Merge '[Backport 5.4] sstables_manager: use maintenance scheduling group to run components reload fiber' from Lakshmi Narayanan Sreethar
PR https://github.com/scylladb/scylladb/pull/18186 introduced a fiber that reloads reclaimed bloom filters when memory becomes available. Use maintenance scheduling group to run that fiber instead of running it in the main scheduling group.

Fixes https://github.com/scylladb/scylladb/issues/18675

(cherry picked from commit 79f6746298)

(cherry picked from commit 6f58768c46)

Backported from https://github.com/scylladb/scylladb/pull/18721 to 5.4.

Closes scylladb/scylladb#19354

* github.com:scylladb/scylladb:
  sstables_manager: use maintenance scheduling group to run components reload fiber
  sstables_manager: add member to store maintenance scheduling group
2024-06-18 16:29:07 +03:00
Botond Dénes
427127de57 Merge ' [Backport 5.4] alternator: keep TTL work in the maintenance scheduling group' from Nadav Har'El
This is a fairly elaborate backport of commit b2a500a9a1 to branch 5.4
The code patch itself is trivial, and backported cleanly. The big problem was the test, which was written using the "topology" test framework - because it needs to test a cluster, not a single node, because the scheduling group problem only happened when sending requests between different Scylla nodes.

I had to fix in the backport the following problems:
1. The test used a library function add_servers() which didn't exist in branch 5.4, so needed to switch to making three individual add_server() calls.
2. The test was randomly placed in the topology_experimental_raft directory, which runs with the tablets experimental flag enabled. In 5.4, the tablets code was broken with Alternator, and CreateTable fails (it fails in the callback to create tablets, and doesn't even get to check that tablets weren't requested). So I needed to move the test file to a different directory.
3. Even after moving the file, it still ran with the tablets experimental feature! Turns out that test.py enabled tablets experimental feature unconditionally. This is a mistake, and I'm sure was never intended (tablets were never meant to be supported in 5.4), so I removed enabling this feature. It's still enabled in the topology_experimental_raft directory, where it is explicitly enabled.

After all that, the test passes with the patch, showing that the code fix is correct also for 5.4.

Closes scylladb/scylladb#19321

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  test.py: don't enable "tablets" experimental feature
  main: add maintenance tenant to messaging_service's scheduling config
2024-06-18 12:29:20 +03:00
Lakshmi Narayanan Sreethar
d7b1116170 sstables_manager: use maintenance scheduling group to run components reload fiber
Fixes #18675

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 6f58768c46)
2024-06-18 14:41:46 +05:30
Lakshmi Narayanan Sreethar
72155312e5 sstables_manager: add member to store maintenance scheduling group
Store that maintenance scheduling group inside the sstables_manager. The
next patch will use this to run the components reloader fiber.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 79f6746298)
2024-06-18 14:39:45 +05:30
Nadav Har'El
f8dcbc6037 alternator, scheduler: test reproducing RPC scheduling group bug
This patch adds a test for issue #18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.

Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.

The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.

The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before #18719 was fixed, a lot of work
was done there (more than half of the work done in the right group).
After the issue was fixed in the previous patch, the work on the wrong
scheduling group went down to zero.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 1fe8f22d89)

Modifications in the cherry-pick:
 * Moved test to topology_custom directory, so it runs without tablets
 * use the server_add() function instead of the newer add_servers() which
   didn't yet exist in this branch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-06-18 10:17:45 +03:00
Nadav Har'El
dc1968cb9e test.py: don't enable "tablets" experimental feature
This branch (5.4) does NOT support tablets, and we don't want to run
any tests with the "tablets" experimental feature. When we made test.py
enable that feature by default, it was probably considered harmless -
the partial implementation we had in this branch won't do anything if
tablets aren't actually enabled for a specific keyspace.

But unfortunately, Alternator doesn't work with tablets enabled (there
was a bug in the callback during table creation), so we can't run any
Alternator tests from test.py (like the one we want to backport for
Alternator TTL scheduling groups) unless we drop that experimental
feature.

Note that one specific test subdirectory,
test/topology_experimental_raft, does enable this experimental
flag. The others shouldn't.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-06-18 10:14:53 +03:00
Botond Dénes
7e40b658c8 main: add maintenance tenant to messaging_service's scheduling config
Currently only the user tenant (statement scheduling group) and system
(default scheduling group) tenants exist, as we used to have only
user-initiated operations and system (internal) ones. Now there is a need
to distinguish between two kinds of system operation: foreground and
background ones. The former should use the system tenant while the
latter will use the new maintenance tenant (streaming scheduling group).

(cherry picked from commit 5d3f7c13f9)
2024-06-18 10:08:46 +03:00
Tomasz Grabiec
58671274d8 test: pylib: Fetch all pages by default in run_async
Fetching only the first page is not the intuitive behavior expected by users.

This causes flakiness in some tests which generate variable amount of
keys depending on execution speed and verify later that all keys were
written using a single SELECT statement. When the amount of keys
becomes larger than page size, the test fails.
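
The difference between fetching one page and fetching all pages can be shown with a small simulation. `fetch_pages` and `run_query` are hypothetical stand-ins; the real `run_async` helper goes through the Python driver.

```python
def fetch_pages(rows, page_size):
    """Yield the result set one page at a time (toy paging model)."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


def run_query(rows, page_size, all_pages=True):
    pages = fetch_pages(rows, page_size)
    if not all_pages:
        # Old behavior: only the first page is returned.
        return next(pages, [])
    # New default: drain every page into one result list.
    return [row for page in pages for row in page]
```

With 12 rows and a page size of 5, the first-page-only variant returns 5 rows, which is exactly the kind of silent truncation that made key-count checks flaky.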

Fixes #18774

(cherry picked from commit 43b907b499)

Closes scylladb/scylladb#19129
2024-06-17 10:41:20 +02:00
Botond Dénes
c18f14cd78 Merge '[Backport 5.4] test: memtable_test: increase unspooled_dirty_soft_limit ' from ScyllaDB
before this change, when performing memtable_test, we expect that
the memtables of ks.cf are the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessarily the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadic test failures from
this test, as we might check `t.min_memtable_timestamp()` too soon.

after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.

Fixes https://github.com/scylladb/scylladb/issues/19034

---

the issue applies to both 5.4 and 6.0, and this issue hurts the CI stability, hence we should backport it.

(cherry picked from commit 2df4e9cfc2)

(cherry picked from commit 223fba3243)

Refs #19252

Closes scylladb/scylladb#19256

* github.com:scylladb/scylladb:
  test: memtable_test: increase unspooled_dirty_soft_limit
  test: memtable_test: replace BOOST_ASSERT with BOOST_REQURE
2024-06-14 15:50:58 +03:00
Michał Chojnowski
c19f980802 storage_proxy: avoid infinite growth of _throttled_writes
storage_proxy has a throttling mechanism which attempts to limit the number
of background writes by forcefully raising CL to ALL
(it's not implemented exactly like that, but that's the effect) when
the amount of background and queued writes is above some fixed threshold.
If this is applied to a write, it becomes "throttled",
and its ID is appended to _throttled_writes.

Whenever the amount of background and queued writes falls below the threshold,
writes are "unthrottled" — some IDs are popped from _throttled_writes
and the writes represented by these IDs — if their handlers still exist —
have their CL lowered back.

The problem here is that IDs are only ever removed from _throttled_writes
if the number of queued and background writes falls below the threshold.
But this doesn't have to happen in any finite time, if there's constant write
pressure. And in fact, in one load test, it hasn't happened in 3 hours,
eventually causing the buffer to grow into gigabytes and trigger OOM.

This patch is intended to be a good-enough-in-practice fix for the problem.

Fixes #17476
Fixes #1834

(cherry picked from commit 97e1518eb9)

Closes scylladb/scylladb#19179
2024-06-14 15:49:34 +03:00
Kamil Braun
0abccd212d raft: fsm: add details to on_internal_error_noexcept message
If we receive a message in the same term but from a different leader
than we expect, we print:
```
Got append request/install snapshot/read_quorum from an unexpected leader
```
For some reason the message did not include the details (who the leader
was and who the sender was) which requires almost zero effort and might
be useful for debugging. So let's include them.

Ref: scylladb/scylla-enterprise#4276
(cherry picked from commit 99a0599e1e)

Closes scylladb/scylladb#19264
2024-06-13 11:24:28 +02:00
Kefu Chai
b8d0df24ed test: memtable_test: increase unspooled_dirty_soft_limit
before this change, when performing memtable_test, we expect that
the memtables of ks.cf are the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessarily the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadic test failures from
this test, as we might check `t.min_memtable_timestamp()` too soon.

after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.

Fixes #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 223fba3243)
2024-06-12 15:43:58 +00:00
Kefu Chai
b3de65a8fb test: memtable_test: replace BOOST_ASSERT with BOOST_REQURE
before this change, we verify the behavior of design under test using
`BOOST_ASSERT()`, which is a wrapper around `assert()`, so if a test
fails, the test just aborts. this is not very helpful for postmortem
debugging.

after this change, we use `BOOST_REQUIRE` macro for verifying the
behavior, so that Boost.Test prints out the condition if it does not
hold when we test it.

Refs #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 2df4e9cfc2)
2024-06-12 15:43:58 +00:00
Kefu Chai
3eb15e841a docs: correct the link pointing to Scylla U
before this change it points to
https://university.scylladb.com/courses/scylla-operations/lessons/change-data-capture-cdc/
which then redirects the browser to
https://university.scylladb.com/courses/scylla-operations/,
but it should have pointed to
https://university.scylladb.com/courses/data-modeling/lessons/change-data-capture-cdc/

in this change, the hyperlink is corrected.

Fixes #19163
Refs 6e97b83b60
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit b5dce7e3d0)

Closes scylladb/scylladb#19197
2024-06-11 18:13:58 +03:00
Jenkins Promoter
017524c7d8 Update ScyllaDB version to: 5.4.8 2024-06-10 12:22:35 +03:00
Wojciech Mitros
1680bc2902 mv gossip: check errno instead of value returned by strtoull
Currently, when a view update backlog is changed and sent
using gossip, we check whether the strtoll/strtoull
function used for reading the backlog returned
LLONG_MAX/ULLONG_MAX, signaling an error of a value
exceeding the type's limit, and if so, we do not store
it as the new value for the node.

However, the ULLONG_MAX value can also be used as the max
backlog size when sending empty backlogs that were never
updated. In theory, we could avoid sending the default
backlog because each node has its real backlog (based on
the node's memory, different than the ULLONG_MAX used in
the default backlog). In practice, if the node's
backlog changed to 0, the backlog sent by it will
likely be the default backlog, because when selecting
the biggest backlog across node's shards, we use the
operator<=>(), which treats the default backlog as
equal to an empty backlog and we may get the default
backlog during comparison if the backlog of some shard
was never changed (also it's the initial max value
we compare shard's backlogs against).

This patch removes the (U)LLONG_MAX check and replaces
it with an errno check: errno is also set to ERANGE on
a strtoll/strtoull range error, and checking it won't
prevent empty backlogs from being read.
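
The difference between the two checks can be sketched in Python. This is only an emulation under stated assumptions: `parse_u64`, `accept_old` and `accept_new` are hypothetical names, and the boolean flag stands in for `errno == ERANGE`; the real code calls the C `strtoull` function.

```python
ULLONG_MAX = 2**64 - 1

def parse_u64(text):
    """Emulate strtoull: return (value, range_error)."""
    value = int(text)
    if value > ULLONG_MAX:
        # Out of range: clamp the result and flag it, like errno == ERANGE.
        return ULLONG_MAX, True
    return value, False

def accept_old(text):
    """Old check: treat the clamped value itself as the error signal."""
    value, _ = parse_u64(text)
    return None if value == ULLONG_MAX else value

def accept_new(text):
    """New check: rely only on the out-of-band range-error flag."""
    value, range_error = parse_u64(text)
    return None if range_error else value
```

With these definitions, `accept_old(str(ULLONG_MAX))` rejects a legitimate maximum value, while `accept_new` accepts it and still rejects a value that genuinely overflows.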

Fixes: #18462
(cherry picked from commit 5154429713)

Closes scylladb/scylladb#18697
2024-06-05 09:16:07 +03:00
Lakshmi Narayanan Sreethar
2e836fa077 db/config.cc: increment components_memory_reclaim_threshold config default
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.

Fixes #18607

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)

Closes scylladb/scylladb#19013
2024-06-04 07:11:43 +03:00
Botond Dénes
98139a8716 Merge '[Backport 5.4] : Reload reclaimed bloom filters when memory is available' from Lakshmi Narayanan Sreethar
PR #17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).

The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.
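
The smallest-first reload order can be sketched as follows. This is a simplified Python model, not the SSTables manager code: `pick_filters_to_reload` and its parameters are hypothetical, and filters are represented only by their sizes.

```python
def pick_filters_to_reload(reclaimed_sizes, used, threshold):
    """Choose which reclaimed filters fit back under the memory threshold.

    reclaimed_sizes: sizes of filters currently evicted from memory
    used: memory currently used by in-memory filters
    threshold: maximum total memory allowed for filters
    """
    reloaded = []
    # Going from smallest to largest maximizes the number of filters reloaded.
    for size in sorted(reclaimed_sizes):
        if used + size > threshold:
            break  # every remaining filter is at least this large
        used += size
        reloaded.append(size)
    return reloaded
```

Because the sizes are visited in ascending order, the loop can stop at the first filter that does not fit.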

Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.4.

Closes scylladb/scylladb#18660

* github.com:scylladb/scylladb:
  sstable_datafile_test: add testcase to test reclaim during reload
  sstable_datafile_test: add test to verify auto reload of reclaimed components
  sstables_manager: reload previously reclaimed components when memory is available
  sstables_manager: start a fiber to reload components
  sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
  sstable_datafile_test: add test to verify reclaimed components reload
  sstables: support reloading reclaimed components
  sstables_manager: add new intrusive set to track the reclaimed sstables
  sstable: add link and comparator class to support new intrusive set
  sstable: renamed intrusive list link type
  sstable: track memory reclaimed from components per sstable
  sstable: rename local variable in sstable::total_reclaimable_memory_size
2024-05-30 11:09:51 +03:00
Kefu Chai
ee942874de docs: fix typos in upgrade document
s/Montioring/Monitoring/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit f1f3f009e7)

Closes scylladb/scylladb#18911
2024-05-30 11:06:40 +03:00
Nadav Har'El
4099833587 cql3, secondary index: consistently choose index to use in a query
When a table has secondary indexes on *multiple* columns, and several
such columns are used for filtering in a query, Scylla chooses one
of these indexes as the main driver of the query, and the second
column's restriction is implemented as filtering.

Before this patch, the index to use was chosen fairly randomly, based on
the order of the indexes in the schema. This order may be different in
different coordinators, and may even change across restarts on the same
coordinators. This is not only inconsistent, it can cause outright wrong
results when using *paging* and switching (or restarting) coordinators
in the middle of a paged scan... One coordinator saves one index's key
in the paging state, and then the other coordinator gets this paging
state and wrongly believes it is supposed to be a key of a *different*
index.

The fix in this patch is to pick the index suitable for the first
indexed column mentioned in the query. This has two benefits over
the situation before the patch:

1. The decision of which index to use no longer changes between
   coordinators or across restarts - it just depends on the schema
   and the specific query.

2. Different indexes can have different "specificity" so using one
   or the other can change the query's performance. After this patch,
   the user is in control over which index is used by changing the
   order of terms in the query. A curious user can use tracing to
   check which index was used to implement a particular query.
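
The selection rule described above can be sketched in Python. This is an illustration only: `choose_index` is a hypothetical helper, columns are plain strings, and the real logic lives in the C++ cql3 code.

```python
def choose_index(restricted_columns, indexed_columns):
    """Return the first restricted column, in query order, that has an index.

    restricted_columns: columns in the order they appear in the WHERE clause
    indexed_columns: set of columns that have a secondary index
    """
    for column in restricted_columns:
        if column in indexed_columns:
            return column
    return None  # no usable index; the query needs filtering only
```

The result depends only on the query text and the set of indexes, not on the order in which indexes happen to be stored in the schema, so every coordinator makes the same choice.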

An xfailing test we had for this issue no longer fails, so the "xfail"
marker is removed.

Fixes #7969

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 77c61f907e)

Closes scylladb/scylladb#18963
2024-05-29 18:04:16 +03:00
Botond Dénes
27802511c0 Merge '[Backport 5.4] repair: Introduce repair_partition_count_estimation_ratio config option' from Asias He
This PR backports the "repair_partition_count_estimation_ratio config option" support to the 5.4 branch.
In addition to the main patch "repair: Introduce repair_partition_count_estimation_ratio config option",
the patch "repair: Add missing db/config.hh" is added too.

Closes scylladb/scylladb#18881

* github.com:scylladb/scylladb:
  repair: Introduce repair_partition_count_estimation_ratio config option
  repair: Add missing db/config.hh
2024-05-27 15:13:11 +03:00
Asias He
30ffce4c79 repair: Introduce repair_partition_count_estimation_ratio config option
In commit 642f9a1966 (repair: Improve
estimated_partitions to reduce memory usage), a 10% hard coded
estimation ratio is used.

This patch introduces a new config option to specify the estimation
ratio of partitions written by repair out of the total partitions.

It is set to 0.1 by default.

Fixes #18615

(cherry picked from commit 340eae007a)
2024-05-27 16:32:56 +08:00
Asias He
9869276192 repair: Add missing db/config.hh
Since commit 952dfc6157 "repair: Introduce
repair_partition_count_estimation_ratio config option", get_config() is
used. We need to include db/config.hh for that.

Spotted when backporting to 5.4 branch.

Refs #18615

Closes scylladb/scylladb#18780

(cherry picked from commit 1a03e3d5ae)
2024-05-27 16:32:56 +08:00
Takuya ASADA
53a9dfba3a dist/docker: revert dropping systemd package
In 7ce6962141 we dropped openssh-server, which
also dropped the systemd package and caused an error in Scylla Operator
(#17787).

This reverts dropping the systemd package and fixes the issue.

Fix #17787

(cherry picked from commit 0c7aa9342d)

Closes scylladb/scylladb#18834
2024-05-23 12:00:16 +03:00
Nadav Har'El
0d4e22ef55 cql: fix hang during certain SELECT statements
The function intersection(r1,r2) in statement_restrictions.cc is used
when several WHERE restrictions were applied to the same column.
For example, for "WHERE b<1 AND b<2" the intersection of the two ranges
is calculated to be b<1.
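
For the upper-bound case in the example, the intersection can be sketched in Python. This is a minimal model under stated assumptions: `UpperBound` and `intersect_upper_bounds` are hypothetical names, values are integers, and it does not reproduce the structure of the real intersection() function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UpperBound:
    value: int
    inclusive: bool  # True for "<=", False for "<"

def intersect_upper_bounds(a, b):
    """Return the more restrictive of two upper bounds on the same column."""
    if a.value != b.value:
        return a if a.value < b.value else b
    # Same value: the exclusive bound ("<") is the tighter one.
    return a if not a.inclusive else b
```

For "WHERE b<1 AND b<2" this yields the bound b<1, matching the example above.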

As noted in issue #18690, Scylla is inconsistent in where it allows or
doesn't allow these intersecting restrictions. But where they are
allowed they must be implemented correctly. And it turns out the
function intersection() had a bug that caused it to sometimes enter
an infinite loop - when the intent was only to call itself once with
swapped parameters.

This patch includes a test reproducing this bug, and a fix for the
bug. The test hangs before the fix, and passes after the fix.

While at it, I carefully reviewed the entire code used to implement
the intersection() function to try to make sure that the bug we found
was the only one. I also added a few more comments where I thought they
were needed to understand complicated logic of the code.

The bug, the fix and the test were originally discovered by
Michał Chojnowski.

Fixes #18688
Refs #18690

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 27ab560abd)

Closes scylladb/scylladb#18717
2024-05-21 16:31:21 +03:00
Botond Dénes
ae6c8753e6 Merge '[Backport 5.4] utils: chunked_vector: fill ctor: make exception safe' from ScyllaDB
Currently, if the fill ctor throws an exception,
the destructor won't be called, as the object is not fully constructed yet.

Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.

Fixes scylladb/scylladb#18635

- [x] Although the fix is for a rare bug, it has very low risk and so it's worth backporting to all live versions

(cherry picked from commit 64c51cf32c)

(cherry picked from commit 88b3173d03)

(cherry picked from commit 4bbb66f805)

Refs #18636

Closes scylladb/scylladb#18679

* github.com:scylladb/scylladb:
  chunked_vector_test: add more exception safety tests
  chunked_vector_test: exception_safe_class: count also moved objects
  utils: chunked_vector: fill ctor: make exception safe
2024-05-21 16:29:46 +03:00
Benny Halevy
36c66d5a8f chunked_vector_test: add more exception safety tests
For insertion, with and without reservation,
and for fill and copy constructors.

Reproduces https://github.com/scylladb/scylladb/issues/18635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:31:23 +03:00
Benny Halevy
9413afce41 chunked_vector_test: exception_safe_class: count also moved objects
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.

However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:05:38 +03:00
Benny Halevy
8e20379305 utils: chunked_vector: fill ctor: make exception safe
Currently, if the fill ctor throws an exception,
the destructor won't be called, as the object is not
fully constructed yet.

Call the default ctor first (which doesn't throw)
to make sure the destructor will be called on exception.

Fixes scylladb/scylladb#18635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-21 11:05:38 +03:00
Kefu Chai
d32550f953 service/storage_proxy: capture tr_state by copy in handle_paxos_accept()
this change is inspired by following warning from clang-tidy

```
Warning: /home/runner/work/scylladb/scylladb/service/storage_proxy.cc:884:13: warning: 'tr_state' used after it was moved [bugprone-use-after-move]
  884 |         if (tr_state) {
      |             ^
/home/runner/work/scylladb/scylladb/service/storage_proxy.cc:872:139: note: move occurred here
  872 |         auto f = get_schema_for_read(proposal.update.schema_version(), src_addr, *timeout).then([&sp = _sp, &sys_ks = _sys_ks, tr_state = std::move(tr_state),
      |                                                                                                                                           ^
```

this is not a false positive, as `tr_state` is captured by move for
constructing a variable in the capture list of a lambda which is in
turn passed to the expression evaluated to `f`.

even though the expression itself is not evaluated yet when we reference
`tr_state` to check if it is empty after preparing the expression,
`tr_state` is already moved away into the captured variable. so
at that moment, the statement of `f = f.finally(...)` is never
evaluated, because `tr_state` is always empty by then.

so before this change, the trace message is never recorded.

in this change, we address this issue by capturing `tr_state` by
copying it. as `tr_state` is backed by a `lw_shared_ptr`, the overhead is
negligible.

after this change, the tracing message is recorded.

the change that introduced this issue was 548767f91e.

please note, we could coroutinize this function to improve its
readability, but since this is a fix and should be backported,
let's start with a minimal fix, and worry about the readability
in a follow-up change.

Refs 548767f91e
Fixes #18725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit a429e7b1fe)

Closes scylladb/scylladb#18763
2024-05-21 09:03:01 +03:00
Botond Dénes
4e9ed69a75 Merge '[Backport 5.4] mutation_fragment_stream_validating_filter: respect validating_level::none' from ScyllaDB
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.

Fixes: #18662

(cherry picked from commit f6511ca1b0)

(cherry picked from commit e7b07692b6)

(cherry picked from commit 78afb3644c)

Refs #18667

Closes scylladb/scylladb#18724

* github.com:scylladb/scylladb:
  test/boost/mutation_fragment_test.cc: add test for validator validation levels
  mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
  mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
2024-05-20 09:02:52 +03:00
Botond Dénes
7552c4b187 test/boost/mutation_fragment_test.cc: add test for validator validation levels
To make sure that the validator doesn't validate what the validation
level doesn't include.

(cherry picked from commit 78afb3644c)
2024-05-17 07:55:05 +00:00
Botond Dénes
87dcd29ec3 mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
Despite its name, this validation level still did some validation. Fix
this, by short-circuiting the catch-all operator(), preventing any
validation when the user asked for none.

(cherry picked from commit e7b07692b6)
2024-05-17 07:55:04 +00:00
Botond Dénes
9e7cd767dd mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
When set to false, no exceptions will be raised from the validator on
validation error. Instead, it will just return false from the respective
validator methods. This makes testing simpler, asserting exceptions is
clunky.
When true (default), the previous behaviour will remain: any validation
error will invoke on_internal_error(), resulting in either std::abort()
or an exception.
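
The effect of such a flag can be sketched in Python. This is an illustration only: `StreamValidator`, `on_partition_key` and the "strictly increasing keys" rule are hypothetical stand-ins, and the real validator calls on_internal_error() rather than raising a Python exception.

```python
class StreamValidator:
    def __init__(self, raises_error=True):
        self.raises_error = raises_error
        self._last_key = None

    def on_partition_key(self, key):
        """Check that keys arrive in strictly increasing order."""
        valid = self._last_key is None or key > self._last_key
        if valid:
            self._last_key = key
            return True
        if self.raises_error:
            # Default mode: a validation error is reported loudly.
            raise RuntimeError(f"out-of-order key: {key!r}")
        # Test-friendly mode: report the failure through the return value.
        return False
```

A test can construct the validator with `raises_error=False` and assert on the boolean result instead of catching exceptions.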

(cherry picked from commit f6511ca1b0)
2024-05-17 07:55:04 +00:00
Botond Dénes
63d1c763fc Merge '[Backport 5.4] tools/scylla-sstable: add scylla sstable shard-of command' from Kefu Chai
when migrating to the uuid-based identifiers, the mapping from the
integer-based generation to the shard-id is preserved. we used to have
"gen % smp_count" for calculating the shard which is responsible to host
a given sstable. despite that this is not a documented behavior, this is
handy when we try to correlate an sstable to a shard, typically when
looking at a performance issue.
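
The legacy mapping mentioned above is a plain modulo, sketched here in Python. `shard_of_generation` is a hypothetical helper name; it covers only integer generations and is not the new subcommand's implementation.

```python
def shard_of_generation(generation, smp_count):
    """Legacy rule: an integer generation maps to shard `gen % smp_count`."""
    return generation % smp_count
```

For example, with 4 shards, generation 10 maps to shard 2.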

in this change, a new subcommand is added to expose the connection
between the sstable and its "owner" shards.

Fixes #16343
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes https://github.com/scylladb/scylladb/pull/16345

(cherry picked from commit 273ee36bee)

Fixes #18381

- [x] need to backport, because we have needs in production to figure out the mapping from an sstable identifier to the shard which "owns" it.

Closes scylladb/scylladb#18681

* github.com:scylladb/scylladb:
  tools: Make sstable shard-of efficient by loading minimum to compute owners
  test/cql-pytest/test_tools.py: test shard-of with a single partition
  tools/scylla-sstable: add `scylla sstable shard-of` command
2024-05-16 11:07:47 +03:00
Pavel Emelyanov
29c892ea5a functions: Do not crash when schema is missing
Getting the token() function first tries to find a schema for the underlying
table and continues with nullptr if there is none. Later, when creating
token_fct, the schema is passed as is and dereferenced. If it is null, a crash
happens.

It used to throw before 5983e9e7b2 (cql3: test_assignment: pass optional
schema everywhere) on missing schema, but this commit changed the way
schema is looked up, so nullptr is now possible.

fixes: #18637

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit df8a446437)

Closes scylladb/scylladb#18698
2024-05-16 11:06:25 +03:00
Raphael S. Carvalho
9bb175852d tools: Make sstable shard-of efficient by loading minimum to compute owners
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18440

(cherry picked from commit d7a01598ce)
2024-05-15 14:32:43 +08:00
Kefu Chai
daf4ffb9b4 test/cql-pytest/test_tools.py: test shard-of with a single partition
test_scylla_sstable_shard_of takes lots of time preparing the keys for a
certain shard. with the debug build, it takes 3 minutes to complete the
test.

so in order to test the "shard-of" subcommand in a more efficient way,
in this change, we improve the test in two ways:

1. cache the output of `scylla types shardof`, so we can avoid the
   overhead of repeatedly running a seastar application for the
   same keys.
2. reduce the number of partitions from 42 to 1. as the number of
   partitions in an sstable does not matter when testing the
   output of "shard-of" command of a certain sstable. because,
   the sstable is always generated by a certain shard.
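
The caching in item 1 can be sketched with `functools.lru_cache`. This is a sketch under assumptions: `make_cached_shard_of` and `run_shardof` are hypothetical names, and `run_shardof` stands in for whatever function launches `scylla types shardof` in the real test.

```python
import functools

def make_cached_shard_of(run_shardof):
    """Wrap a slow shard lookup so each key is computed only once."""
    @functools.lru_cache(maxsize=None)
    def shard_of(key):
        # Only reached on a cache miss; repeated keys skip the slow call.
        return run_shardof(key)
    return shard_of
```

Keys passed to the wrapper must be hashable, which holds for integers, strings and tuples.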

before this change, with pytest-profiling:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      4/3    0.000    0.000  181.950   60.650 runner.py:219(call_and_report)
      4/3    0.000    0.000  181.948   60.649 runner.py:247(call_runtest_hook)
      4/3    0.000    0.000  181.948   60.649 runner.py:318(from_call)
      4/3    0.000    0.000  181.948   60.649 runner.py:262(<lambda>)
    44/11    0.000    0.000  181.935   16.540 _hooks.py:427(__call__)
    43/11    0.000    0.000  181.935   16.540 _manager.py:103(_hookexec)
    43/11    0.000    0.000  181.935   16.540 _callers.py:30(_multicall)
      361    0.001    0.000  181.531    0.503 contextlib.py:141(__exit__)
   782/81    0.001    0.000  177.578    2.192 {built-in method builtins.next}
     1044    0.006    0.000   92.452    0.089 base_events.py:1894(_run_once)
       11    0.000    0.000   91.129    8.284 fixtures.py:686(<lambda>)
    17/11    0.000    0.000   91.129    8.284 fixtures.py:1025(finish)
        4    0.000    0.000   91.128   22.782 fixtures.py:913(_teardown_yield_fixture)
      2/1    0.000    0.000   91.055   91.055 runner.py:111(pytest_runtest_protocol)
      2/1    0.000    0.000   91.055   91.055 runner.py:119(runtestprotocol)
        2    0.000    0.000   91.052   45.526 conftest.py:50(cql)
        2    0.000    0.000   91.040   45.520 util.py:161(cql_session)
        1    0.000    0.000   91.040   91.040 runner.py:180(pytest_runtest_teardown)
        1    0.000    0.000   91.040   91.040 runner.py:509(teardown_exact)
     1945    0.002    0.000   90.722    0.047 events.py:82(_run)
```

after this change:
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      4/3    0.000    0.000    8.271    2.757 runner.py:219(call_and_report)
    44/11    0.000    0.000    8.270    0.752 _hooks.py:427(__call__)
    44/11    0.000    0.000    8.270    0.752 _manager.py:103(_hookexec)
    44/11    0.000    0.000    8.270    0.752 _callers.py:30(_multicall)
      4/3    0.000    0.000    8.269    2.756 runner.py:247(call_runtest_hook)
      4/3    0.000    0.000    8.269    2.756 runner.py:318(from_call)
      4/3    0.000    0.000    8.269    2.756 runner.py:262(<lambda>)
       48    0.000    0.000    8.269    0.172 {method 'send' of 'generator' objects}
       27    0.000    0.000    5.671    0.210 contextlib.py:141(__exit__)
       11    0.000    0.000    4.297    0.391 fixtures.py:686(<lambda>)
      2/1    0.000    0.000    4.228    4.228 runner.py:111(pytest_runtest_protocol)
      2/1    0.000    0.000    4.228    4.228 runner.py:119(runtestprotocol)
        2    0.000    0.000    4.213    2.106 capture.py:877(pytest_runtest_teardown)
        1    0.000    0.000    4.213    4.213 runner.py:180(pytest_runtest_teardown)
        1    0.000    0.000    4.213    4.213 runner.py:509(teardown_exact)
        2    0.000    0.000    3.628    1.814 capture.py:872(pytest_runtest_call)
        1    0.000    0.000    3.627    3.627 runner.py:160(pytest_runtest_call)
        1    0.000    0.000    3.627    3.627 python.py:1797(runtest)
   114/81    0.001    0.000    3.505    0.043 {built-in method builtins.next}
       15    0.784    0.052    3.183    0.212 subprocess.py:417(check_output)
```

Fixes #16516
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16523

(cherry picked from commit 642652efab)
2024-05-15 14:32:43 +08:00
Kefu Chai
03a54a4c07 tools/scylla-sstable: add scylla sstable shard-of command
when migrating to the uuid-based identifiers, the mapping from the
integer-based generation to the shard-id is preserved. we used to have
"gen % smp_count" for calculating the shard which is responsible to host
a given sstable. despite that this is not a documented behavior, this is
handy when we try to correlate an sstable to a shard, typically when
looking at a performance issue.

in this change, a new subcommand is added to expose the connection
between the sstable and its "owner" shards.

Fixes #16343
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16345

(cherry picked from commit 273ee36bee)
2024-05-15 14:32:42 +08:00
Lakshmi Narayanan Sreethar
4b0c60cdc3 compaction: improve partition estimates for garbage collected sstables
When a compaction strategy uses garbage collected sstables to track
expired tombstones, do not use complete partition estimates for them,
instead, use a fraction of it based on the droppable tombstone ratio
estimate.

Fixes #18283

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18465

(cherry picked from commit d39adf6438)

Closes scylladb/scylladb#18656
2024-05-14 07:53:07 +03:00
Patryk Wrobel
28d0fc1b6b scylla_io_setup: ensure correct RLIMIT_NOFILE for iotune
The default limit of open file descriptors
per process may be too small for iotune on
certain machines with large number of cores.

In such a case iotune reports a failure due to
its inability to create files or to set up the seastar
framework.

This change configures the limit of open file
descriptors before running iotune to ensure
that the failure does not occur.

The limit is set via 'resource.setrlimit()' in
the parent process. The limit is then inherited
by the child process.
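
A minimal sketch of raising the limit with the standard `resource` module follows. The helper name `ensure_nofile_limit` and its `wanted` parameter are hypothetical; the value the script actually requests is not stated here.

```python
import resource

def ensure_nofile_limit(wanted):
    """Raise the soft RLIMIT_NOFILE towards `wanted`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    target = wanted if hard == unlimited else min(wanted, hard)
    if soft == unlimited or soft >= target:
        return soft  # already high enough, nothing to change
    # Child processes started after this call inherit the new limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

An unprivileged process can raise its soft limit only up to the hard limit, which is why the sketch caps the target there.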

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
(cherry picked from commit ec820e214c)

Closes scylladb/scylladb#18655
2024-05-14 07:48:53 +03:00
Israel Fruchter
393880f355 Update tools/cqlsh submodule to v6.0.17
Mostly a set of fixes in the area of ssl handling

* tools/cqlsh 99b2b777...9d49b385 (21):
  > cqlshlib/sslhandling: fix logic of `ssl_check_hostname`
  > cqlshlib/sslhandling.py: don't use empty userkey/usercert
  > Dockerfile: noninteractive isn't enough for answering yet on apt-get
  > fix cqlsh version print
  > cqlshlib/sslhandling: change `check_hostname` deafult to False
  > Introduce new ssl configuration for disableing check_hostname
  > set the hostname in ssl_options.server_hostname when SSL is used
  > issue-73 Fixed a bug where username and password from the credentials file were ignored.
  > issue-73 Fixed a bug where username and password from the credentials file were ignored.
  > issue-73
  > github actions: update `cibuildwheel==v2.16.5`
  > dist/debian: fix the trailer line format
  > `COPY TO STDOUT` shouldn't put None where a function is expected
  > Make cqlsh work with unix domain sockets
  > Bump python-driver version
  > dist/debian: add trailer line
  > dist/debian: wrap long line
  > Draft: explicit build-time packge dependencies
  > stop retruning status_code=2 on schema disagreement
  > Fix minor typos in the code
  > Dockerfile: apt-get update and apt-get upgrade to get latest OS packages

Ref: #18590

Closes scylladb/scylladb#18652
2024-05-14 07:47:37 +03:00
Lakshmi Narayanan Sreethar
e30a2af700 sstable_datafile_test: add testcase to test reclaim during reload
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4d22c4b68b)
2024-05-14 01:04:42 +05:30
Lakshmi Narayanan Sreethar
e0b4483bb8 sstable_datafile_test: add test to verify auto reload of reclaimed components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a080daaa94)
2024-05-14 00:10:28 +05:30
Lakshmi Narayanan Sreethar
8a6300be4c sstables_manager: reload previously reclaimed components when memory is available
When an SSTable is dropped, the associated bloom filter gets discarded
from memory, bringing down the total memory consumption of bloom
filters. Any bloom filter that was previously reclaimed from memory due
to the total usage crossing the threshold, can now be reloaded back into
memory if the total usage can still stay below the threshold. Added
support to reload such reclaimed filters back into memory when memory
becomes available.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 0b061194a7)
2024-05-14 00:10:21 +05:30
Lakshmi Narayanan Sreethar
d8e0cba45d sstables_manager: start a fiber to reload components
Start a fiber that gets notified whenever an sstable gets deleted. The
fiber doesn't do anything yet but the following patch will add support
to reload reclaimed components if there is sufficient memory.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f758d7b114)
2024-05-14 00:10:10 +05:30
Lakshmi Narayanan Sreethar
3933fc25de sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
The testcase uses an sstable whose mutation key and the generation are
owned by different shards. Due to this, when process_sstable_dir is
called, the sstable gets loaded into a different shard than the one that
was intended. This also means that the sstable and the sstable manager
end up in different shards.

The following patch will introduce a condition variable in sstables
manager which will be signalled from the sstables. If the sstable and
the sstable manager are in different shards, the signalling will cause
the testcase to fail in debug mode with this error : "Promise task was
set on shard x but made ready on shard y". So, fix it by supplying
appropriate generation number owned by the same shard which owns the
mutation key as well.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 24064064e9)
2024-05-14 00:10:01 +05:30
Lakshmi Narayanan Sreethar
a741202ef0 sstable_datafile_test: add test to verify reclaimed components reload
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 69b2a127b0)
2024-05-14 00:09:54 +05:30
Lakshmi Narayanan Sreethar
0529055b1e sstables: support reloading reclaimed components
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 54bb03cff8)
2024-05-14 00:09:47 +05:30
Lakshmi Narayanan Sreethar
81c0829993 sstables_manager: add new intrusive set to track the reclaimed sstables
The new set holds the sstables from where the memory has been reclaimed
and is sorted in ascending order of the total memory reclaimed.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2340ab63c6)
2024-05-14 00:09:40 +05:30
Lakshmi Narayanan Sreethar
1979cde07a sstable: add link and comparator class to support new intrusive set
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 140d8871e1)
2024-05-14 00:09:15 +05:30
Lakshmi Narayanan Sreethar
7e2e8ddda6 sstable: renamed intrusive list link type
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3ef2f79d14)
2024-05-14 00:06:39 +05:30
Lakshmi Narayanan Sreethar
986516fa4e sstable: track memory reclaimed from components per sstable
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 02d272fdb3)
2024-05-14 00:06:30 +05:30
Lakshmi Narayanan Sreethar
020c7b662b sstable: rename local variable in sstable::total_reclaimable_memory_size
Renamed local variable in sstable::total_reclaimable_memory_size in
preparation for the next patch which adds a new member variable
_total_memory_reclaimed to the sstable class.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a53af1f878)
2024-05-14 00:06:23 +05:30
Botond Dénes
e008060f39 Merge 'doc: fix Rust Driver release information' from Anna Stuchlik
This PR removes the incorrect information that the ScyllaDB Rust Driver is not GA.

In addition, it replaces "Scylla" with "ScyllaDB".

Fixes https://github.com/scylladb/scylladb/issues/16178

Closes scylladb/scylladb#16199

* github.com:scylladb/scylladb:
  doc: remove the "preview" label from Rust driver
  doc: fix Rust Driver release information

(cherry picked from commit 56c3515751)
2024-05-10 12:22:03 +02:00
Aleksandra Martyniuk
7589981898 tasks: use default task_ttl in scylla.yaml
Currently default task_ttl_in_seconds is 0, but scylla.yaml changes
the value to 10.

Change task_ttl_in_seconds in scylla.yaml to 0, so that there are
consistent defaults. Comment it out.

Fixes: #16714.
(cherry picked from commit 67bbaad62e)

Closes scylladb/scylladb#18584
2024-05-09 16:10:14 +03:00
Kamil Braun
ed89deab40 direct_failure_detector: increase ping timeout and make it tunable
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
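
The ping counts quoted above are consistent with a plain integer division of the listener threshold by the ping timeout. The sketch below is only that arithmetic; the helper name is hypothetical and the detector's real bookkeeping is more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper, not part of ScyllaDB: a rough count of consecutive
// timed-out pings that fit into the listener threshold. Integer division
// happens to reproduce the figures quoted in the message (3, 1 and 3).
constexpr long long timed_out_pings_to_mark_down(std::chrono::milliseconds threshold,
                                                 std::chrono::milliseconds ping_timeout) {
    return std::max<long long>(1, threshold.count() / ping_timeout.count());
}
```

With a 300ms timeout and a 1s threshold this gives 3; with 600ms and 1s it gives 1; with 600ms and 2s it gives 3 again.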

In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which happens with a certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.

With the change, the phenomenon no longer appears.

(cherry picked from commit 8df6d10e88)

Closes scylladb/scylladb#18559
2024-05-08 14:57:09 +02:00
Pavel Emelyanov
905b8f59bd Update seastar submodule (iotune iodepth underflow fix)
* seastar ae05c138...cfb015d0 (1):
  > iotune: ignore shards with id above max_iodepth

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-06 19:27:31 +03:00
Nadav Har'El
862e2affe0 cql3: Fix invalid JSON parsing for JSON object with different key types
More than three years ago, in issue #7949, we noticed that trying to
set a `map<ascii, int>` from JSON input (i.e., using INSERT JSON or the
fromJson() function) fails - the ascii key is incorrectly parsed.
We fixed that issue in commit 75109e9519
but unfortunately, did not do our due diligence: We did not write enough
tests inspired by this bug, and failed to discover that actually we have
the same bug for many other key types, not just for "ascii". Specifically,
the following key types have exactly the same bug:

  * blob
  * date
  * inet
  * time
  * timestamp
  * timeuuid
  * uuid

Other types, like numbers or boolean worked "by accident" - instead of
parsing them as a normal string, we asked the JSON parser to parse them
again after removing the quotes, and because unquoted numbers and
unquoted true/false happen to work in JSON, this didn't fail.

The fix here is very simple - for all *native* types (i.e., not
collections or tuples), the encoding of the key in JSON is simply a
quoted string - and removing the quotes is all we need to do and there's
no need to run the JSON parser a second time. Only for more elaborate
types - collections and tuples - we need to run the JSON parser a
second time on the key string to build the more elaborate object.
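
A minimal sketch of why some key types worked "by accident": once the quotes are removed, only numbers and true/false/null are still valid standalone JSON, so re-parsing them succeeds while an ascii string, a date or an inet address fails. All names below are hypothetical and no real JSON parser is involved; the regular expression follows the JSON number grammar.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical check: would `key`, taken without its surrounding quotes,
// still be accepted as a standalone JSON scalar (number, true, false, null)?
bool parses_as_bare_json_scalar(const std::string& key) {
    static const std::regex number(R"(-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?)");
    return key == "true" || key == "false" || key == "null"
        || std::regex_match(key, number);
}

// Sketch of the fixed rule: native key types use the unquoted string as-is;
// only collections and tuples need a second JSON parse (not shown here).
enum class key_kind { native, collection_or_tuple };

bool needs_second_json_parse(key_kind kind) {
    return kind == key_kind::collection_or_tuple;
}
```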

This patch also includes tests for fromJson() reading a map with all
native key types, confirming that all the aforementioned key types
were broken before this patch, and all key types (including the numbers
and booleans which worked even before this patch) work with this patch.

Fixes #18477.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 21557cfaa6)

Closes scylladb/scylladb#18522
2024-05-05 23:53:19 +03:00
Pavel Emelyanov
d68d765247 view-builder: Print correct exception in build step exception handler
Inside a .handle_exception() continuation, std::current_exception() doesn't
work; the exception is passed as an argument to the handler's lambda instead.
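
In plain C++ terms, std::current_exception() only returns something while a catch block is active, so a handler that receives the exception as an argument has to use that argument. A minimal sketch, assuming the handler gets a std::exception_ptr; the helper names are hypothetical and no Seastar code is used.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper: turn an exception_ptr received as an argument into text.
std::string describe(std::exception_ptr ep) {
    if (!ep) {
        return "no exception";  // rethrowing a null exception_ptr is not allowed
    }
    try {
        std::rethrow_exception(ep);
    } catch (const std::exception& e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}

// Outside of a catch block there is no "current" exception to report.
bool no_current_exception_outside_catch() {
    return std::current_exception() == nullptr;
}
```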

fixes #18423

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18349

(cherry picked from commit 4ac30e5337)
2024-05-01 10:19:28 +03:00
Michał Chojnowski
d8df02f490 docs: clarify that DELETE can be used with USING TIMEOUT
The current text seems to suggest that `USING TIMEOUT` doesn't work with `DELETE` and `BATCH`. But that's wrong.

Closes scylladb/scylladb#18424

(cherry picked from commit c1146314a1)
2024-05-01 10:17:05 +03:00
Anna Stuchlik
d85d37921a doc: run repair after changing RF of system_auth
This commit adds the requirement to run repair after changing
the replication factor of the system_auth keyspace
in the procedure of adding a new node to a cluster.

Refs: https://github.com/scylladb/scylla-enterprise/issues/4129

Closes scylladb/scylladb#18466
2024-04-30 19:15:55 +03:00
Asias He
2d4825835c streaming: Fix use after move in fire_stream_event
The event is used in a loop.

Found by clang-tidy:

```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
        listener->handle_stream_event(std::move(event));
                                                ^
streaming/stream_result_future.cc:80:39: note: move occurred here
        listener->handle_stream_event(std::move(event));
                                      ^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
        listener->handle_stream_event(std::move(event));
                                                ^
```
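
One way to avoid this kind of bug, sketched with hypothetical stand-in types (not the actual streaming classes): take the event by const reference and let each listener receive its own copy, so nothing is moved inside the loop.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct stream_event { std::string description; };  // hypothetical stand-in

struct listener {
    std::vector<std::string> seen;
    // Takes the event by value, so the listener owns its private copy.
    void handle_stream_event(stream_event ev) { seen.push_back(std::move(ev.description)); }
};

// `event` is never moved inside the loop, so later iterations
// still see the original contents.
void fire_stream_event(std::vector<listener>& listeners, const stream_event& event) {
    for (auto& l : listeners) {
        l.handle_stream_event(event);
    }
}
```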

Fixes #18332

Closes scylladb/scylladb#18333

(cherry picked from commit 1ca779d287)
2024-04-26 11:00:01 +03:00
Lakshmi Narayanan Sreethar
201d990072 sstables: reclaim_memory_from_components: do not update _recognised_components
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left behind on disk when the SSTable is deleted.

Fixes #18398

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#18400

(cherry picked from commit 6af2659b57)
2024-04-26 10:59:13 +03:00
Wojciech Mitros
678948e671 mv: keep semaphore units alive until the end of a remote view update
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.
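
The sharing idea can be sketched with a reference-counted RAII holder: both updates keep a reference, and the tracked memory is returned only when the last one drops it. The types below are hypothetical stand-ins, not Seastar's semaphore units.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for memory-tracking semaphore units:
// takes bytes from a budget on construction and returns them on destruction.
class tracked_units {
    std::size_t& _available;
    std::size_t _count;
public:
    tracked_units(std::size_t& available, std::size_t count)
        : _available(available), _count(count) {
        assert(_available >= _count);
        _available -= _count;
    }
    ~tracked_units() { _available += _count; }
    tracked_units(const tracked_units&) = delete;
    tracked_units& operator=(const tracked_units&) = delete;
};

// Local and remote updates each hold a reference; the units go back
// only when the last holder (whichever finishes later) drops it.
struct view_update { std::shared_ptr<tracked_units> units; };
```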

Fixes #17890

(cherry picked from commit 9789a3dc7c)

Refs #17891

Closes scylladb/scylladb#18108
2024-04-25 12:45:01 +02:00
Raphael S. Carvalho
8acedb9255 sstables: Fix use-after-move in an error path of FS-based sstable writer
```
sstables/storage.cc:152:21: warning: 'file_path' used after it was moved [bugprone-use-after-move]
        remove_file(file_path).get();
                    ^
sstables/storage.cc:145:64: note: move occurred here
    auto w = file_writer(output_stream<char>(std::move(sink)), std::move(file_path));

```

It's a regression when TOC is found for a new sstable, and we try to delete temporary TOC.

courtesy of clang-tidy.

Fixes #18323.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 2fba1f936d)

Closes scylladb/scylladb#18382
2024-04-25 13:13:48 +03:00
Kefu Chai
3ed0826292 thrift: avoid use-after-move in make_non_overlapping_ranges()
in handler.cc, `make_non_overlapping_ranges()` references a moved-from
instance of `ColumnSlice` when it formats the error message of an
exception after something unexpected happens. the move constructor of
`ColumnSlice` is default-generated, so the members' move constructors
are used to construct the new instance. this could lead to undefined
behavior when dereferencing the moved-from instance.

in this change, in order to avoid the use-after-move, let's keep
a copy of the referenced member variables and reference them when
formatting the error message in the exception.

this use-after-move issue was introduced in 822a315dfa, which implemented
`get_multi_slice` verb and this piece in the first place. since both 5.2
and 5.4 include this commit, we should backport this change to them.

Refs 822a315dfa
Fixes #18356
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1ad3744edc)

Closes scylladb/scylladb#18374
2024-04-25 11:35:38 +03:00
Botond Dénes
154fcffbc7 Merge '[branch-5.4] Fix false-positive errors in scrub validate-mode' from Botond Dénes
The new MX-native validator, which validates the index in tandem with the data file, was discovered to print false-positive errors, related to range-tombstones and promoted-index positions.
This series fixes that. But first, it refactors the scrub-related tests. These are currently dominated by boiler-plate code. They are hard to read and hard to write. In the first half of the series, a new scrub_test is introduced, which moves all the boiler-plate to a central place, allowing the tests to focus on just the aspect of scrub that is tested.
Then, all the found bugs in validate are fixed and finally a new test, checking validate with valid sstable is introduced.

This PR backports https://github.com/scylladb/scylladb/pull/16327.

Fixes: https://github.com/scylladb/scylladb/issues/16326

Closes scylladb/scylladb#18404

* github.com:scylladb/scylladb:
  test/boost/sstable_compaction_test: add validation test with valid sstable
  sstablex/mx/reader: validate(): print trace message when finishing the PI block
  sstablex/mx/reader: validate(): make index-data PI position check message consistent
  sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
  sstablex/mx/reader: validate(): reset the current PI block on partition-start
  sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
  sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
  test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
  test/boost/sstable_compaction_test: fix indentation
  test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
  test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
  test/boost/sstable_compaction_test: introduce scrub_test_framework
  test/lib/random_schema: add uncompatible_timestamp_generator()
2024-04-25 08:36:30 +03:00
Jenkins Promoter
7e946925c3 Update ScyllaDB version to: 5.4.7
2024-04-24 23:13:50 +03:00
Botond Dénes
d09e2a2311 test/boost/sstable_compaction_test: add validation test with valid sstable
Add a positive test, as it turns out we had some false-positive
validation bugs in the validator and we need a regression test for this.

(cherry picked from commit 2335f42b2b)
2024-04-24 09:38:57 -04:00
Botond Dénes
6462a2f391 sstablex/mx/reader: validate(): print trace message when finishing the PI block
(cherry picked from commit a19a2d76c9)
2024-04-24 09:37:15 -04:00
Botond Dénes
2e19e7cb6d sstablex/mx/reader: validate(): make index-data PI position check message consistent
The message says "index-data" but when printing the position, the data
position is printed first, causing confusion. Fix this and while at it,
also print the position of the partition start.

(cherry picked from commit 677be168c4)
2024-04-24 09:33:36 -04:00
Botond Dénes
fef7498da2 sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
The validate() consumes the content of partitions in a consume-loop.
Every time the consumer asks for a "break", the next PI block is loaded
and set on the validator, so it can validate that further clustering
elements are indeed from this block.
This loop assumed the consumer would only request interruption when the
current clustering block is finished. This is wrong, the consumer can
also request interruption when yielding is needed. When this is the
case, the next PI block doesn't have to be loaded yet, the current one
is not exhausted yet. Check this condition, before loading the next PI
block, to prevent false positive errors, due to mismatched PI block
and clustering elements from the sstable.

(cherry picked from commit 5bff7c40d3)
2024-04-24 09:32:28 -04:00
Botond Dénes
f45c878149 sstablex/mx/reader: validate(): reset the current PI block on partition-start
It is possible that the next partition has no PI and thus there won't be
a new PI block to overwrite the old one. This will result in
false-positive messages about rows being outside of the finished PI
block.

(cherry picked from commit e073df1dbb)
2024-04-24 09:31:10 -04:00
Botond Dénes
7225410af7 sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
Promoted index entries can be written on any clustering elements,
including range tombstones. So the validating consumer also has to check
whether the current expected clustering block is finished, when
consuming a range tombstone. If it is, consumption has to be
interrupted, so that the outer-loop can load up the next promoted index
block, before moving on to the next clustering element.

(cherry picked from commit 2737899c21)
2024-04-24 09:29:32 -04:00
Botond Dénes
564b01fda9 sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
For range tombstone end-bounds, the validate_fragment_order() should be
passed a null tombstone, not a disengaged optional. The latter means no
change in the current tombstone. This caused the end bound of range
tombstones to not make it to the validator and the latter complained
later on partition-end that the partition has an unclosed range tombstone.
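
The difference between a null tombstone and a disengaged optional can be sketched as follows. The types and method names are hypothetical simplifications; only the optional semantics described above are modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified type; a default-constructed tombstone is "null".
struct tombstone {
    std::int64_t timestamp = 0;
    bool is_null() const { return timestamp == 0; }
};

class order_validator {
    tombstone _current;  // active range tombstone, null if none
public:
    // nullopt: keep the current tombstone; engaged value: replace it.
    void on_fragment(std::optional<tombstone> new_current) {
        if (new_current) {
            _current = *new_current;
        }
    }
    bool has_unclosed_range_tombstone() const { return !_current.is_null(); }
};
```

Passing `std::nullopt` for an end bound leaves the tombstone open in this model, while passing a null `tombstone{}` closes it.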

(cherry picked from commit f46b458f0d)
2024-04-24 09:28:18 -04:00
Botond Dénes
1d80427888 test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper
It is not used anymore.

(cherry picked from commit 8be97884ec)
2024-04-24 09:26:46 -04:00
Botond Dénes
ff17ec81e4 test/boost/sstable_compaction_test: fix indentation
(cherry picked from commit da0f4d3a9f)
2024-04-24 09:26:30 -04:00
Botond Dénes
121f2a530e test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit c35092aff6)
2024-04-24 09:03:44 -04:00
Botond Dénes
b77581c84c test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 3f76aad609)
2024-04-24 08:58:18 -04:00
Botond Dénes
ea176bf4ce test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test
The test becomes a lot shorter and it now uses random schema and random
data. The test is also split in two: one test for abort mode and one for
skip mode.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 5237e8133b)
2024-04-24 08:45:53 -04:00
Botond Dénes
3835fd681d test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test
The test becomes a lot shorter and it now uses random schema and random
data.
Indentation is left broken, to be fixed in a future patch.

(cherry picked from commit 76785baf43)
2024-04-24 07:57:54 -04:00
Botond Dénes
14da273c4c test/boost/sstable_compaction_test: introduce scrub_test_framework
Scrub tests require a lot of boilerplate code to work. This has a lot of
disadvantages:
* Tests are long
* The "meat" of the test is lost between all the boiler-plate, it is
  hard to glean what a test actually does
* Tests are hard to write, so we have only a few of them and they test
  multiple things.
* The boiler-plate differs slightly from test-to-test.

To solve this, this patch introduces a new class, `scrub_test_framework`,
which is a central place for all the boiler-plate code needed to write
scrub-related tests. In the next patches, we will migrate scrub related
tests to this class.

(cherry picked from commit b6f0c4efa0)
2024-04-24 07:57:54 -04:00
Botond Dénes
33d5f27244 test/lib/random_schema: add uncompatible_timestamp_generator()
Guarantees that produced mutations will not be compactible.

(cherry picked from commit e412673c44)
2024-04-24 07:57:54 -04:00
Wojciech Mitros
c4515a9b99 mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

(cherry picked from commit efcb718e0a)

Closes scylladb/scylladb#18107
2024-04-24 09:42:18 +02:00
Asias He
10f137e367 repair: Improve estimated_partitions to reduce memory usage
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.

Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.

The problem is that overestimating the partitions makes the bloom filter
consume more memory. This has been observed to cause OOM in the field.

This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.
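
The two estimation strategies can be compared with a small sketch. The function names and the `fraction` parameter are hypothetical; the message does not state which fraction the patch uses.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Previous approach per the message: sum over all participants (upper bound).
std::uint64_t estimate_by_sum(const std::vector<std::uint64_t>& per_node) {
    return std::accumulate(per_node.begin(), per_node.end(), std::uint64_t{0});
}

// New approach, sketched: a fraction of the per-node average.
// `fraction` is a hypothetical parameter, not the value used by the patch.
std::uint64_t estimate_by_average_fraction(const std::vector<std::uint64_t>& per_node,
                                           double fraction) {
    if (per_node.empty()) {
        return 0;
    }
    const double average = static_cast<double>(estimate_by_sum(per_node)) / per_node.size();
    return static_cast<std::uint64_t>(average * fraction);
}
```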

Fixes #18140

Closes scylladb/scylladb#18141

(cherry picked from commit 642f9a1966)
2024-04-18 11:40:06 +03:00
Kamil Braun
53e1ed0ebb Merge '[Backport 5.4] gossiper: lock local endpoint when updating heart_beat' from ScyllaDB
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to a smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.
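
The scenario above can be reduced to a toy model of the version bookkeeping. Everything here is hypothetical and greatly simplified (one status state, one receiver, no shards); it only shows how a heart_beat bump that overtakes a not-yet-published state makes the receiver skip that state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

struct version_generator {
    std::int64_t last = 0;
    std::int64_t next() { return ++last; }  // monotonic
};

// Node A's view of node B.
struct node_a_view {
    std::int64_t max_version_seen = 0;
    bool saw_new_status = false;

    // States at or below the remembered maximum are ignored,
    // then the maximum is advanced.
    void receive(std::int64_t heart_beat, std::optional<std::int64_t> status_version) {
        if (status_version && *status_version > max_version_seen) {
            saw_new_status = true;
        }
        max_version_seen = std::max(max_version_seen, heart_beat);
        if (status_version) {
            max_version_seen = std::max(max_version_seen, *status_version);
        }
    }
};

bool a_observes_new_status(bool heart_beat_bumped_during_yield) {
    version_generator gen;
    node_a_view a;
    const std::int64_t status_version = gen.next();  // copy updated, not yet published
    if (heart_beat_bumped_during_yield) {
        // Race: heart_beat gets version X and is gossiped with the old states only.
        a.receive(gen.next(), std::nullopt);
    }
    // Phase 2 publishes the status; the next message carries it.
    a.receive(gen.next(), status_version);
    return a.saw_new_status;
}
```

Calling `a_observes_new_status(false)` models the heart_beat bump being serialized with the state update, in which case the new status is observed.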

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

The PR also adds a regression test.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720

(cherry picked from commit a0b331b310)

(cherry picked from commit 72955093eb)

Refs scylladb/scylladb#18184

Closes scylladb/scylladb#18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
2024-04-17 17:50:30 +02:00
Botond Dénes
1aedc7372d Merge '[Backport 5.4] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.

The feature can be manually verified by doing the following:

1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB

The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.

Fixes https://github.com/scylladb/scylladb/issues/17747

Backported from #17771 to 5.4.

Closes scylladb/scylladb#18248

* github.com:scylladb/scylladb:
  test_bloom_filter.py: disable reclaiming memory from components
  sstable_datafile_test: add tests to verify auto reclamation of components
  test/lib: allow overriding available memory via test_env_config
  sstables_manager: support reclaiming memory from components
  sstables_manager: store available memory size
  sstables_manager: add variable to track component memory usage
  db/config: add a new variable to limit memory used by table components
  sstable_datafile_test: add testcase to verify reclamation from sstables
  sstables: support reclaiming memory from components
2024-04-17 14:33:19 +03:00
Kamil Braun
28781ca37e test: reproducer for missing gossiper updates
Regression test for scylladb/scylladb#17493.

(cherry picked from commit 72955093eb)

Backport note: removed `timeout` parameter passed to `server_add`,
missing on this branch. (If server adding hangs, it will timeout after
`TOPOLOGY_TIMEOUT` from scylla_cluster.py)

Removed `force_gossip_join_boot` error injection from test, not present
in this branch. Starting nodes with `experimental_features` disabled.

Added missing `handle_state_normal.*finished` message.
2024-04-17 13:09:39 +02:00
Beni Peled
9218bbb9b9 test.py: add the pytest junit_suite_name parameter
By default, the suite name in the junit files generated by pytest
is `pytest` for all suites instead of the actual suite name, e.g. `topology_experimental_raft`.
With this change, the junit files will use the real suite name.

This change doesn't affect the Test Report in Jenkins, but it
was raised as part of another task, publishing the test results to
Elasticsearch (https://github.com/scylladb/scylla-pkg/pull/3950),
where we parse the XMLs and need the correct suite name.

Closes scylladb/scylladb#18172

(cherry picked from commit 223275b4d1)
2024-04-17 07:02:33 +03:00
Kamil Braun
8faef135cb gossiper: lock local endpoint when updating heart_beat
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to a smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
(cherry picked from commit a0b331b310)
2024-04-16 14:39:05 +02:00
Lakshmi Narayanan Sreethar
75962d3e94 test_bloom_filter.py: disable reclaiming memory from components
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d86505e399)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
034304127c sstable_datafile_test: add tests to verify auto reclamation of components
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d261f0fbea)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
95068d3c00 test/lib: allow overriding available memory via test_env_config
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 169629dd40)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
3dca49c524 sstables_manager: support reclaiming memory from components
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.
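
A minimal sketch of the policy described above, assuming a simplified model in
which each SSTable exposes only its reclaimable byte count. The names
`sstable_model` and `reclaim_until_under` are hypothetical, and repeating the
step until the total drops under the threshold is an assumption of this
sketch, not something the commit message states.

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an SSTable: only its reclaimable bytes matter here.
struct sstable_model {
    std::size_t reclaimable_bytes;
};

// Reclaims from the largest holder until the total is at or under the
// threshold. Returns the number of bytes reclaimed.
std::size_t reclaim_until_under(std::vector<sstable_model>& tables, std::size_t threshold) {
    std::size_t total = 0;
    for (const auto& t : tables) {
        total += t.reclaimable_bytes;
    }
    std::size_t reclaimed = 0;
    while (total > threshold) {
        auto it = std::max_element(tables.begin(), tables.end(),
            [](const sstable_model& a, const sstable_model& b) {
                return a.reclaimable_bytes < b.reclaimable_bytes;
            });
        if (it == tables.end() || it->reclaimable_bytes == 0) {
            break; // nothing left to reclaim
        }
        total -= it->reclaimable_bytes;
        reclaimed += it->reclaimable_bytes;
        it->reclaimable_bytes = 0; // e.g. the bloom filter was dropped
    }
    return reclaimed;
}
```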

Fixes #17747

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit a36965c474)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
505a0714a6 sstables_manager: store available memory size
The available memory size is required to calculate the reclaim memory
threshold, so store that within the sstables manager.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 2ca4b0a7a2)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
b8bd650e32 sstables_manager: add variable to track component memory usage
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit f05bb4ba36)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
fd0b083414 db/config: add a new variable to limit memory used by table components
A new configuration variable, components_memory_reclaim_threshold, has
been added to configure the maximum allowed percentage of available
memory for all SSTable components in a shard. If the total memory usage
exceeds this threshold, it will be reclaimed from the components to
bring it back under the limit. Currently, only the memory used by the
bloom filters will be restricted.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e8026197d2)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
1609b77b45 sstable_datafile_test: add testcase to verify reclamation from sstables
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit e0b6186d16)
2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar
aec4d157da sstables: support reclaiming memory from components
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 4f0aee62d1)
2024-04-16 13:05:36 +05:30
Tzach Livyatan
80d7bf7366 Update Driver root page
The right term is Amazon DynamoDB not AWS DynamoDB
See https://aws.amazon.com/dynamodb/

Closes scylladb/scylladb#18214

(cherry picked from commit 289793d964)
2024-04-16 09:53:53 +03:00
Pavel Emelyanov
0dc50ac449 view_update_generator: Unplug from database later
Patch 967ebacaa4 (view_update_generator: Move abort kicking to
do_abort()) moved unplugging v.u.g from database from .stop() to
.do_abort(). The latter call happens very early on stop -- once scylla
receives SIGINT. However, database may still need v.u.g. plugged to
flush views.

This patch moves unplug to later, namely to .stop() method of v.u.g.
which happens after database is drained and should no longer continue
view updates.

fixes: #16001

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18132

(cherry picked from commit 3471f30b58)
2024-04-15 14:33:56 +03:00
Botond Dénes
e9d38b3c9e Merge '[Backport 5.4] repair: fix memory counting in repair' from ScyllaDB
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.

Modify what's counted to repair memory limit.

Fixes: #16710.

(cherry picked from commit a4dc6553ab)

(cherry picked from commit 51c09a84cc)

Refs #17785

Closes scylladb/scylladb#18205

* github.com:scylladb/scylladb:
  test: add test for repair_row::size()
  repair: fix memory accounting in repair_row
2024-04-15 14:05:43 +03:00
Aleksandra Martyniuk
bfc4104eb9 test: add test for repair_row::size()
Add test which checks whether repair_row::size() considers external
memory.

(cherry picked from commit 51c09a84cc)
2024-04-06 22:44:51 +00:00
Aleksandra Martyniuk
c1c1fde90f repair: fix memory accounting in repair_row
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.

Include other repair_row's members' memory size in repair memory
limit.

(cherry picked from commit a4dc6553ab)
2024-04-06 22:44:51 +00:00
Ferenc Szili
94a551e671 logging: Don't log PK/CK in large partition/row/cell warning
Currently, Scylla logs a warning when it writes a cell, row or partition that is larger than a configured size. These warnings contain the partition key and, in the case of rows and cells, also the clustering key, which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into the tables system.large_partitions, system.large_rows and system.large_cells, respectively.

This change removes the partition and clustering keys from the log messages, but still inserts them into the system tables.

The logged data will look like this:

Large cells:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db

Large rows:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db

Large partitions:
WARN  2024-04-02 16:49:48,602 [shard 3:  mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db

Fixes #18041

Closes scylladb/scylladb#18166

(cherry picked from commit f1cc6252fd)
2024-04-05 16:02:22 +03:00
Kefu Chai
d6c7a26419 utils/logalloc: do not allocate memory in reclaim_timer::report()
before this change, `reclaim_timer::report()` calls

```c++
fmt::format(", at {}", current_backtrace())
```

which allocates a `std::string` on heap, so it can fail and throw. in
that case, `std::terminate()` is called. but at that moment, the reason
why `reclaim_timer::report()` gets called is that we fail to reclaim
memory for the caller. so we are more likely to run into this issue. anyway,
we should not allocate memory in this path.

in this change, a dedicated printer is created so that we don't format
to a temporary `std::string`, and instead write directly to the buffer
of logger. this avoids the memory allocation.

Fixes #18099
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18100

(cherry picked from commit fcf7ca5675)
2024-04-02 15:13:49 +03:00
Kamil Braun
7e05a54b9c schema_tables: pass reload flag when calling merge_schema cross-shard
In 0c86abab4d `merge_schema` obtained a new flag, `reload`.

Unfortunately, the flag was assigned a default value, which I think is
almost always a bad idea, and indeed it was in this case. When
`merge_schema` is called on a shard other than 0, it recursively calls
itself on shard 0. That recursive call forgot to pass the `reload` flag.

Fix this.
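
The pitfall can be illustrated with a small standalone model. This is a
sketch, not the actual `merge_schema` signature: `merge_schema_model` and its
`shard` parameter are hypothetical, and only the defaulted `reload` flag comes
from the description above.

```c++
#include <string>

// Hypothetical model: `reload` has a default value, as in the description.
std::string merge_schema_model(int shard, bool reload = false) {
    if (shard != 0) {
        // The buggy form would be `merge_schema_model(0)`, which compiles
        // fine but silently resets `reload` to its default. Forward the
        // flag explicitly instead.
        return merge_schema_model(0, reload);
    }
    return reload ? "reload" : "merge";
}
```

Without the default value, the forgotten argument would have been a compile
error rather than a silent change in behavior.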

(cherry picked from commit 5223d32fab)
2024-04-02 12:53:58 +02:00
Tzach Livyatan
2fa581d8fb Docs: Fix link fro scylla-sstable.rst to /architecture/sstable/
Fix https://github.com/scylladb/scylladb/issues/18096

Closes scylladb/scylladb#18097

(cherry picked from commit 4930095d39)
2024-04-02 13:46:43 +03:00
Wojciech Mitros
892c97295b mv: adjust the overhead estimation for view updates
In order to avoid running out of memory, we can't
underestimate the memory used when processing a view
update. Particularly, we need to handle the remote
view updates well, because we may create many of them
at the same time in contrast to local updates which
are processed synchronously.

After investigating a coredump generated in a crash
caused by running out of memory due to these remote
view updates, we found that the current estimation
is much lower than what we observed in practice; we
identified overhead of up to 2288 bytes for each
remote view update. The overhead consists of:
- 512 bytes - a write_response_handler
- less than 512 bytes - excessive memory allocation
for the mutation in bytes_ostream
- 448 bytes - the apply_to_remote_endpoints coroutine
started in mutate_MV()
- 192 bytes - a continuation to the coroutine above
- 320 bytes - the coroutine in result_parallel_for_each
started in mutate_begin()
- 112 bytes - a continuation to the coroutine above
- 192 bytes - 5 unspecified allocations of 32, 32, 32,
48 and 48 bytes

This patch changes the previous overhead estimate
of 256 bytes to 2288 bytes, which should take into
account all allocations in the current version of the
code. It's worth noting that changes in the related
pieces of code may result in a different overhead.
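
As a quick cross-check of the arithmetic, the itemized sizes above can be
summed at compile time. The array name and helper below are hypothetical, and
the "less than 512 bytes" item is counted at its upper bound of 512, which is
what makes the total come out to 2288.

```c++
#include <array>
#include <cstddef>

// Per-allocation sizes in bytes, copied from the list above.
constexpr std::array<std::size_t, 7> remote_view_update_overhead = {
    512, // write_response_handler
    512, // bytes_ostream over-allocation (upper bound)
    448, // apply_to_remote_endpoints coroutine
    192, // continuation to that coroutine
    320, // coroutine in result_parallel_for_each
    112, // continuation to that coroutine
    192, // five small allocations: 32 + 32 + 32 + 48 + 48
};

constexpr std::size_t total_overhead() {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < remote_view_update_overhead.size(); ++i) {
        sum += remote_view_update_overhead[i];
    }
    return sum;
}

static_assert(total_overhead() == 2288, "itemized overhead should add up to 2288 bytes");
```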

The allocations seem to be mostly captures for the
background tasks. Coroutines seem to allocate extra,
however testing shows that replacing a coroutine with
continuations may result in generating a few smaller
futures/continuations with a larger total size.
Besides that, considering that we're waiting for
a response for each remote view update, we need the
relatively large write_response_handler, which also
includes the mutation in case we needed to reuse it.

The change should not majorly affect workloads with many
local updates because we don't keep many of them at
the same time anyway, and an added benefit of correct
memory utilization estimation is avoiding evictions
of other memory that would be otherwise necessary
to handle the excessive memory used by view updates.

Fixes #17364

(cherry picked from commit 4c767c379c)

Closes scylladb/scylladb#18105
2024-04-02 10:26:28 +02:00
Jenkins Promoter
5f6b1dc5b9 Update ScyllaDB version to: 5.4.6 2024-04-01 23:37:05 +03:00
Kefu Chai
f1d547e74e bytes.hh: stop at '}' in fmt::formatter<fmt_hex>
according to {fmt}'s document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,

```
  // the range will contain "f} continued". The formatter should parse
  // specifiers until '}' or the end of the range. In this example the
  // formatter should parse the 'f' specifier and return an iterator
  // pointing to '}'.
```

so we should check for _both_ '}' and end of the range. when building
scylla with {fmt} 10.2.1, it fails to build code like

```c++
fmt::format_to(out, "{}", fmt_hex(frag))
```

as {fmt}'s compile-time checker fails to parse this format string
along with given argument, as at compile time,
```c++
throw format_error("invalid group_size")
```
is executed.

so, in this change, we check both '}' and the end of range.
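
The termination rule can be sketched without {fmt} itself. The function below
is a hypothetical stand-alone model of a `parse()` step, not the actual
`fmt::formatter<fmt_hex>` code. The digits-only grammar for the group size is
an assumption made for illustration; only the "stop at '}' or at the end of
the range" rule and the error text come from the description above.

```c++
#include <cstddef>
#include <stdexcept>
#include <string_view>

// `spec` is the text after ':' in a replacement field, possibly followed by
// '}' and the rest of the format string. Returns the index where parsing
// stopped and stores the parsed group size, if any digits were present.
std::size_t parse_group_size(std::string_view spec, unsigned& group_size) {
    std::size_t i = 0;
    unsigned value = 0;
    bool have_digits = false;
    // Stop at '}' as well as at the end of the range.
    while (i < spec.size() && spec[i] != '}') {
        if (spec[i] < '0' || spec[i] > '9') {
            throw std::invalid_argument("invalid group_size");
        }
        value = value * 10 + static_cast<unsigned>(spec[i] - '0');
        have_digits = true;
        ++i;
    }
    if (have_digits) {
        group_size = value;
    }
    return i;
}
```

A parser that only checked for the end of the range would try to consume the
'}' in an input such as `"} continued"` and report an error instead of
returning immediately.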

the change which introduced this formatter was
2f9dfba800

Refs 2f9dfba800
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit ef2afa75d1)

Closes scylladb/scylladb#18086
2024-03-29 09:45:00 +02:00
Michał Chojnowski
3df5de60a9 cache_flat_mutation_reader: only call get_iterator_in_latest() when pointing at a row
Calling `_next_row.get_iterator_in_latest()` is illegal when `_next_row` is not
pointing at a row. In particular, the iterator returned by such call might be
dangling.

We have observed this to cause a use-after-free in the field, when a reverse
read called `maybe_add_to_cache` after `_latest_it` was left dangling after
a dead row removal in `copy_from_cache_to_buffer`.

To fix this, we should ensure that we only call `_next_row.get_iterator_in_latest()`
when `_next_row` is pointing at a row.

Only the occurrences of this problem in `maybe_add_to_cache` are truly dangerous.
As far as I can see, other occurrences can't break anything as of now.
But we apply fixes to them anyway.

(cherry picked from commit 04db6d4bb1)

Closes scylladb/scylladb#18075
2024-03-28 11:04:28 +01:00
Botond Dénes
fd7d57b9fa tools/toolchain: update python driver
Backports scylladb/scylladb#17604 and scylladb/scylladb#17956.

Fixes scylladb/scylladb#16709
Fixes scylladb/scylladb#17353

Closes scylladb/scylladb#17653
2024-03-26 13:27:34 +02:00
Botond Dénes
8a6c543033 Merge '[Backport 5.4] tests: utils: error injection: print time duration instead of count' from ScyllaDB
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awakened
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision. the tests are updated to print the time duration
instead of count to provide information with a higher resolution.

Fixes #15902

(cherry picked from commit 8a5689e7a7)

(cherry picked from commit 1d33a68dd7)

Closes scylladb/scylladb#17909

* github.com:scylladb/scylladb:
  tests: utils: error injection: print time duration instead of count
  error_injection: do not cast to milliseconds when injecting timeout
2024-03-25 17:40:31 +02:00
Pavel Emelyanov
27beb0fe60 Update seastar submodule (dupliex IO queue activation fix)
* seastar 9d44e5eb...ae05c138 (1):
  > fair_queue: Do not pop unplugged class immediately

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-25 13:01:07 +03:00
Botond Dénes
e868ade258 repair: resolve start-up deadlock
Repairs have to obtain a permit to the reader concurrency semaphore on
each shard they have a presence on. This is prone to deadlocks:

node1                              node2
repair1_master (takes permit)      repair1_follower (waits on permit)
repair2_master (waits for permit)  repair2_follower (takes permit)

In lieu of strong central coordination, we solved this by making permits
evictable: repair2 can evict repair1's permit so it can obtain one
and make progress. This is not efficient as evicting a permit usually
means discarding already done work, but it prevents the deadlocks.
We recently discovered that there is a window when deadlocks can still
happen. The permit is made evictable when the disk reader is created.
This reader is an evictable one, which effectively makes the permit
evictable. But the permit is obtained when the repair control
structure -- repair meta -- is created. Between creating the repair meta
and reading the first row from disk, the deadlock is still possible. And
we know that what is possible, will happen (and did happen). Fix by
making the permit evictable as soon as the repair meta is created. This
is very clunky and we should have a better API for this (refs #17644),
but for now we go with this simple patch, to make it easy to backport.

Refs: #17644
Fixes: #17591

Closes scylladb/scylladb#17646

(cherry picked from commit 65b9e10543)

Closes scylladb/scylladb#17950
2024-03-21 13:12:40 +02:00
Kefu Chai
57fa61e2ca tests: utils: error injection: print time duration instead of count
instead of casting / comparing the count of duration unit, let's just
compare the durations, so that boost.test is able to print the duration
in a more informative and user friendly way (line wrapped)

test/boost/error_injection_test.cc(167): fatal error:
    in "test_inject_future_disabled":
      critical check wait_time > sleep_msec has failed [23839ns <= 10ms]

Refs #15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1d33a68dd7)
2024-03-20 09:32:00 +00:00
Kefu Chai
6fc3c62223 error_injection: do not cast to milliseconds when injecting timeout
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awakened
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.
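
The precision loss is easy to see with plain `std::chrono`. The helper below
is a hypothetical illustration, not code from the error injection utility: it
returns how much of a nanosecond-resolution wait is dropped by a
`duration_cast` to milliseconds.

```c++
#include <chrono>

using namespace std::chrono_literals;

// Truncating to milliseconds drops the sub-millisecond remainder, so a wait
// computed from the truncated value can end before the requested deadline.
constexpr std::chrono::nanoseconds lost_by_millisecond_cast(std::chrono::nanoseconds wait) {
    auto truncated = std::chrono::duration_cast<std::chrono::milliseconds>(wait);
    return wait - truncated;
}

static_assert(lost_by_millisecond_cast(1'999'999ns) == 999'999ns,
              "almost a full millisecond is dropped");
static_assert(lost_by_millisecond_cast(10ms) == 0ns,
              "whole milliseconds survive the cast");
```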

Fixes #15902

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 8a5689e7a7)
2024-03-20 09:32:00 +00:00
Anna Stuchlik
8414aa292a doc: fix the image upgrade page
This commit updates the Upgrade ScyllaDB Image page.

- It removes the incorrect information that updating underlying OS packages is mandatory.
- It adds information about the extended procedure for non-official images.

(cherry picked from commit fc90112b97)

Closes scylladb/scylladb#17886
2024-03-19 14:38:22 +02:00
Aleksandra Martyniuk
881ac7a9af test: fix regular compaction tasks check
Since 6b87778 regular compaction tasks are removed from task manager
immediately after they are finished.

test_regular_compaction_task lists compaction tasks and then requests
their statuses. Only one regular compaction task is guaranteed to still
be running at that time, the rest of them may finish before their status
is requested and so it will no longer be in task manager, causing the test
to fail.

Fix statuses check to consider the possibility of a regular compaction
task being removed from task manager.

Fixes: #17776.
(cherry picked from commit 80c5eb4ecb)

Closes scylladb/scylladb#17810
2024-03-15 08:54:00 +02:00
Tomasz Grabiec
24db04dbe4 Merge 'migration_manager: take group0 lock during raft snapshot taking' from Kamil Braun
This is a backport of 0c376043eb and follow-up fix 57b14580f0 to 5.4.

We haven't identified any specific issues in test or field in 5.4/2024.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways.

For 5.4 it's even more important than 5.2 because in 5.4 in Raft mode schema pulls are disabled.

Closes scylladb/scylladb#17641

* github.com:scylladb/scylladb:
  raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
  migration_manager: fix indentation after the previous patch.
  messaging_service: process migration_request rpc on shard 0
  migration_manager: take group0 lock during raft snapshot taking
2024-03-14 23:42:49 +01:00
Jenkins Promoter
9fcb7baa9e Update ScyllaDB version to: 5.4.5 2024-03-13 16:18:59 +02:00
Nadav Har'El
8a1f01ad88 alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.
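
A minimal sketch of the timestamp rule, using hypothetical helper names
(`view_row_timestamp`, `wins_over_deletion`) rather than the actual view
update code. It assumes integer write timestamps and that a write survives a
deletion only if its timestamp is strictly newer.

```c++
#include <algorithm>
#include <cstdint>

using timestamp_type = int64_t; // stand-in for a write timestamp

// Timestamp for a view row keyed by two base regular columns 'a' and 'b':
// take the higher of the two column timestamps.
timestamp_type view_row_timestamp(timestamp_type a_ts, timestamp_type b_ts) {
    return std::max(a_ts, b_ts);
}

// Assumed tie-breaking for this sketch: a deletion wins on equal timestamps.
bool wins_over_deletion(timestamp_type write_ts, timestamp_type deletion_ts) {
    return write_ts > deletion_ts;
}
```

For example, with 'a' written at timestamp 1, the view row deleted at 5 and
'b' rewritten at 6, using a's timestamp loses to the deletion, while the
higher of the two timestamps does not.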

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172

(cherry picked from commit 21e7deafeb)
2024-03-13 14:46:03 +02:00
Raphael S. Carvalho
db1c8e8754 Fix potential data resurrection when another compaction type does cleanup work
Since commit f1bbf70, many compaction types can do cleanup work, but it turns out
we forgot to invalidate cache on their completion.

So if a node regains ownership of a token whose partition was deleted on its previous
owner (and the tombstone is already gone), data can be resurrected.

Tablet is not affected, as it explicitly invalidates cache during migration
cleanup stage.

Scylla 5.4 is affected.

Fixes #17501.
Fixes #17452.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#17502

(cherry picked from commit f07c233ad5)
2024-03-13 14:10:05 +02:00
Kamil Braun
6e01e821d7 test: unflake test_topology_remove_garbage_group0
The test is booting nodes, and then immediately starts shutting down
nodes and removing them from the cluster. The shutting down and
removing may happen before driver manages to connect to all nodes in the
cluster. In particular, the driver didn't yet connect to the last
bootstrapped node. Or it can even happen that the driver has connected,
but the control connection is established to the first node, and the
driver fetched topology from the first node when the first node didn't
yet consider the last node to be normal. So the driver decides to close
connection to the last node like this:
```
22:34:03.159 DEBUG> [control connection] Removing host not found in
   peers metadata: <Host: 127.42.90.14:9042 datacenter1>
```

Eventually, at the end of the test, only the last node remains, all
other nodes have been removed or stopped. But the driver does not have a
connection to that last node.

Fix this problem by ensuring that:
- all nodes see each other as NORMAL,
- the driver has connected to all nodes
at the beginning of the test, before we start shutting down and removing
nodes.

Fixes scylladb/scylladb#16373

(cherry picked from commit a68701ed4f)

Closes scylladb/scylladb#17702
2024-03-12 10:50:27 +01:00
Israel Fruchter
02182caff4 Update tools/cqlsh submodule
* tools/cqlsh 426fa0ea...99b2b777 (1):
  > `COPY TO STDOUT` shouldn't put None where a function is expected

Closes scylladb/scylladb#17728
2024-03-11 21:52:02 +02:00
Michał Chojnowski
42d9e36454 sstables: fix a use-after-free in key_view::explode()
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.

This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.

Fixes #17625

(cherry picked from commit 7a7b8972e5)

Closes scylladb/scylladb#17717
2024-03-11 12:05:55 +02:00
Lakshmi Narayanan Sreethar
05d2078911 reader_permit: store schema_ptr instead of raw schema pointer
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.

Fixes #16180

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#16658

(cherry picked from commit 76f0d5e35b)

Closes scylladb/scylladb#17677
2024-03-08 08:29:22 +02:00
Gleb Natapov
7908f69b7c raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
group0 operations are valid on shard 0 only. Assert that.

(cherry picked from commit 9847e272f9)
2024-03-05 20:36:44 +01:00
Gleb Natapov
a762ab6283 migration_manager: fix indentation after the previous patch.
(cherry picked from commit 77907b97f1)
2024-03-05 20:36:34 +01:00
Gleb Natapov
9aa39850fd messaging_service: process migration_request rpc on shard 0
Commit 0c376043eb added access to the group0
semaphore, which can be done on shard0 only. Unlike all other group0 rpcs
(which are already always forwarded to shard0), migration_request is not,
since it is an rpc that was reused from non-raft days. The patch adds
the missing jump to shard0 before executing the rpc.

(cherry picked from commit 4a3c79625f)
2024-03-05 20:34:28 +01:00
Gleb Natapov
b51abf1853 migration_manager: take group0 lock during raft snapshot taking
Group0 state machine access atomicity is guaranteed by a mutex in group0
client. Code that reads or writes the state needs to hold the lock. To
transfer schema part of the snapshot we used existing "migration
request" verb which did not follow the rule. Fix the code to take group0
lock before accessing schema in case the verb is called as part of
group0 snapshot transfer.

Fixes scylladb/scylladb#16821

(cherry picked from commit 0c376043eb)
2024-03-05 20:33:17 +01:00
Kamil Braun
111264e3b1 Merge 'misc_services: fix data race from bad usage of get_next_version' from Piotr Dulikowski
The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc.

The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0.

The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function is called on a shard other than 0.

There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A:

1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted,
2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state,
3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state,
4. Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`,
5. Shard 1 resumes and stores version `v` in the global counter,
6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`.
7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4.
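
The suspected interleaving can be replayed deterministically in a
single-threaded model. This is only a sketch: `version_counter_model` and
`replay_interleaving` are hypothetical names, and the counter is modeled as an
explicit load followed by a store so that the stale write from shard 1 can be
placed exactly where the list above puts it.

```c++
#include <vector>

// Hypothetical model of an unsynchronized counter: an increment is a separate
// load and store, so another shard's stale store can roll the counter back.
struct version_counter_model {
    int value = 0;
    int load() const { return value; }
    void store(int v) { value = v; }
    int get_next_version() { store(load() + 1); return value; }
};

// Replays steps 1-6 with v == start + 1 and returns the versions handed out
// on shard 0.
std::vector<int> replay_interleaving(int start) {
    version_counter_model counter;
    counter.store(start);
    std::vector<int> shard0_versions;
    int shard1_loaded = counter.load();                     // step 1: shard 1 loads v - 1
    shard0_versions.push_back(counter.get_next_version()); // step 2: shard 0 gets v
    shard0_versions.push_back(counter.get_next_version()); // step 3: shard 0 gets v + 1
    counter.store(shard1_loaded + 1);                       // step 5: shard 1 stores v
    shard0_versions.push_back(counter.get_next_version()); // step 6: shard 0 gets v + 1 again
    return shard0_versions;
}
```

In this model shard 0 hands out the same version twice in a row, which is the
condition under which a peer that already saw `v + 1` would skip the later
update.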

Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by:

1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual,
2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value,
3. running a test which performs schema changes in a loop.

The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway.

Refs: #17493

Closes scylladb/scylladb#17499

* github.com:scylladb/scylladb:
  version_generator: check that get_next_version is called on shard 0
  misc_services: fix data race from bad usage of get_next_version

(cherry picked from commit fd32e2ee10)
2024-02-28 14:55:07 +01:00
Avi Kivity
58a1be93b2 Merge ' test/topology_custom: test_read_repair.py: reduce run-time ' from Botond Dénes
This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a6fa,  reducing this to a too small value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

The latter configuration is a new one, added by the first patches of this series. It allows configuring the page-size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any tests which wanted to check paging, with pages cut on size, to work with large datasets. This was especially pronounced in the tests fixed in this PR, because this test works with tombstones which are tiny and a lot of them were needed to trigger paging based on the size.

With these two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).

Fixes: https://github.com/scylladb/scylladb/issues/15425
Fixes: https://github.com/scylladb/scylladb/issues/16899

Closes scylladb/scylladb#17529

* github.com:scylladb/scylladb:
  test/topology_custom: test_read_repair.py: reduce run-time
  replica/database: get_query_max_result_size(): use query_page_size_in_bytes
  replica/database: use include page-size in max-result-size
  query-request: max_result_size: add without_page_limit()
  db/config: introduce query_page_size_in_bytes

(cherry picked from commit 616eec2214)
2024-02-28 11:23:22 +02:00
Botond Dénes
6a6450a82d Merge '[Backport 5.4] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk
RPC calls lose information about the type of returned exception.
Thus, if a table is dropped on receiver node, but it still exists
on a sender node and sender node streams the table's data, then
the whole operation fails.

To prevent that, add a method which synchronizes schema and then
checks, if the exception was caused by table drop. If so,
the exception is swallowed.

Use the method in streaming and repair to continue them when
the table is dropped in the meantime.

Fixes: https://github.com/scylladb/scylladb/issues/17028.
Fixes: https://github.com/scylladb/scylladb/issues/15370.
Fixes: https://github.com/scylladb/scylladb/issues/15598.

Closes scylladb/scylladb#17525

* github.com:scylladb/scylladb:
  repair: handle no_such_column_family from remote node gracefully
  test: test drop table on receiver side during streaming
  streaming: fix indentation
  streaming: handle no_such_column_family from remote node gracefully
  repair: add methods to skip dropped table
2024-02-27 10:57:48 +02:00
Botond Dénes
75805a7f23 Merge '[Backport 5.4] sstables: close index_reader in has_partition_key' from Aleksandra Martyniuk
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.

Close index_reader in has_partition_key before destroying it.

Fixes: https://github.com/scylladb/scylladb/issues/17232.

Closes scylladb/scylladb#17531

* github.com:scylladb/scylladb:
  test: add test to check if reader is closed
  sstables: close index_reader in has_partition_key
2024-02-27 10:41:42 +02:00
Aleksandra Martyniuk
f843e7181b test: add test to check if reader is closed
Add test to check if reader is closed in sstable::has_partition_key.

(cherry picked from commit 4530be9e5b)
2024-02-26 15:40:49 +01:00
Aleksandra Martyniuk
c76fa47cc4 sstables: close index_reader in has_partition_key
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.

Close index_reader in has_partition_key before destroying it.

(cherry picked from commit 5227336a32)
2024-02-26 15:38:49 +01:00
Aleksandra Martyniuk
2caef424fe repair: handle no_such_column_family from remote node gracefully
If no_such_column_family is thrown on remote node, then repair
operation fails as the type of exception cannot be determined.

Use repair::with_table_drop_silenced in repair to continue operation
if a table was dropped.

(cherry picked from commit cf36015591)
2024-02-26 13:05:41 +01:00
Aleksandra Martyniuk
5e665cd7fb test: test drop table on receiver side during streaming
(cherry picked from commit 2ea5d9b623)
2024-02-26 13:00:58 +01:00
Aleksandra Martyniuk
b770be8f78 streaming: fix indentation
(cherry picked from commit b08f539427)
2024-02-26 12:53:38 +01:00
Aleksandra Martyniuk
b5ff9a2bf8 streaming: handle no_such_column_family from remote node gracefully
If no_such_column_family is thrown on remote node, then streaming
operation fails as the type of exception cannot be determined.

Use repair::with_table_drop_silenced in streaming to continue
operation if a table was dropped.

(cherry picked from commit 219e1eda09)
2024-02-26 10:15:32 +01:00
Aleksandra Martyniuk
0da3772d50 repair: add methods to skip dropped table
Schema propagation is async so one node can see the table while on
the other node it is already dropped. So, if the nodes stream
the table data, the latter node throws no_such_column_family.
The exception is propagated to the other node, but its type is lost,
so the operation fails on the other node.

Add method which waits until all raft changes are applied and then
checks whether given table exists.

Add the function which uses the above to determine, whether the function
failed because of dropped table (eg. on the remote node so the exact
exception type is unknown). If so, the exception isn't rethrown.

(cherry picked from commit 5202bb9d3c)
2024-02-26 10:10:37 +01:00
Nadav Har'El
72e804306c mv: fix missing view deletions in some cases of range tombstones
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.

This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).

Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts, with the result
that some of the range tombstones in the update could be missed - and the
view could miss some deletions that happened in the base table.

This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:

1. The counter "_op_count" that decides where to break the output mutation
   should only be incremented when adding rows to this output mutation.
   The existing code strangely incremented it on every read (!?) which
   resulted in the counter being incremented on every *input* fragment,
   and in particular could reach the limit 100 between two range
   tombstone pieces.

2. Moreover, the length of output was checked in the wrong place...
   The existing code could get to 100 rows, not check at that point,
   read the next input - half a range tombstone - and only *then*
   check that we reached 100 rows and stop. The fix is to calculate
   the number of rows in the right place - exactly when it's needed,
   not before the step.

The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verified that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.

This patch also includes two tests reproducing this bug and confirming
it's fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure don't fail after this patch (they don't).

By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happened when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluded that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).
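
A simplified sketch of the first change, assuming each input fragment produces at most one output row. It is not the C++ implementation: `split_output` and its `count_inputs` switch are hypothetical names used only to contrast the old counting rule (counter bumped on every input fragment) with the fixed one (counter bumped only when an output row is added).

```python
def split_output(fragments, limit=100, count_inputs=False):
    """Split output rows into chunks of at most `limit` counted operations.

    fragments: iterable of booleans, True if that input fragment yields one
    output row. count_inputs=True mimics the old behaviour, False the fix.
    """
    chunks, current, op_count = [], [], 0
    for index, produces_row in enumerate(fragments):
        # The limit is checked before the step is taken, not after it.
        if op_count >= limit:
            chunks.append(current)
            current, op_count = [], 0
        if produces_row:
            current.append(index)
        if count_inputs or produces_row:
            op_count += 1
    if current:
        chunks.append(current)
    return chunks


# 800 input fragments, of which every fourth one yields an output row.
fragments = [i % 4 == 0 for i in range(800)]
old_chunks = split_output(fragments, count_inputs=True)
new_chunks = split_output(fragments, count_inputs=False)
print([len(c) for c in old_chunks])
print([len(c) for c in new_chunks])
```

In this model the old rule produces chunks of 25 rows, matching the observation in the message above, while the fixed rule fills each chunk to the intended 100 rows.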

Fixes #17117

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17164

(cherry picked from commit 14315fcbc3)
2024-02-22 15:04:28 +02:00
Avi Kivity
384a0628b0 Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak
Before this PR, writes to the previous CDC generations would
always be rejected. After this PR, they will be accepted if the
write's timestamp is greater than `now - generation_leeway`.

This change was proposed around 3 years ago. The motivation was
to improve user experience. If a client generates timestamps by
itself and its clock is desynchronized with the clock of the node
the client is connected to, there could be a period during
generation switching when writes fail. We didn't consider this
problem critical because the client could simply retry a failed
write with a higher timestamp. Eventually, it would succeed. This
approach is safe because these failed writes cannot have any side
effects. However, it can be inconvenient. Writing to previous
generations was proposed to improve it.

The idea was rejected 3 years ago. Recently, it turned out that
there is a case when the client cannot retry a write with the
increased timestamp. It happens when a table uses CDC and LWT,
which makes timestamps permanent. Once Paxos commits an entry
with a given timestamp, Scylla will keep trying to apply that entry
until it succeeds, with the same timestamp. Applying the entry
involves writing to the CDC log table. If it fails, we get stuck.
It's a major bug with no known perfect solution.

Allowing writes to previous generations for `generation_leeway` is
a probabilistic fix that should solve the problem in practice.
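
The acceptance rule can be sketched as follows. This is an illustration, not the Scylla implementation: the function name is hypothetical, and the 5-second leeway used in the usage lines is an assumed value chosen only for the example.

```python
from datetime import datetime, timedelta, timezone


def accepts_write_to_previous_generation(write_ts, now, generation_leeway):
    """Accept a write to a previous CDC generation only if its timestamp is
    greater than `now - generation_leeway`."""
    return write_ts > now - generation_leeway


now = datetime(2024, 2, 22, 12, 0, 0, tzinfo=timezone.utc)
leeway = timedelta(seconds=5)  # assumed value, for illustration only

recent = accepts_write_to_previous_generation(now - timedelta(seconds=2), now, leeway)
stale = accepts_write_to_previous_generation(now - timedelta(seconds=10), now, leeway)
print(recent, stale)
```

A write whose timestamp lags by less than the leeway is accepted, which covers a client whose clock is slightly behind during a generation switch; older writes are still rejected.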

Apart from this change, this PR adds tests for it and updates
the documentation.

This PR is sufficient to enable writes to the previous generations
only in the gossiper-based topology. The Raft-based topology
needs some adjustments in loading and cleaning CDC generations.
These changes won't interfere with the changes introduced in this
PR, so they are left for a follow-up.

Fixes scylladb/scylladb#7251
Fixes scylladb/scylladb#15260

Closes scylladb/scylladb#17134

* github.com:scylladb/scylladb:
  docs: using-scylla: cdc: remove info about failing writes to old generations
  docs: dev: cdc: document writing to previous CDC generations
  test: add test_writes_to_previous_cdc_generations
  cdc: generation: allow increasing generation_leeway through error injection
  cdc: metadata: allow sending writes to the previous generations

(cherry picked from commit 9bb4482ad0)

Backport note: in tests, replaced `servers_add` with loop of `server_add`
2024-02-22 12:44:24 +01:00
Wojciech Mitros
435000ee70 rust: update dependencies
The currently used versions of "time" and "rustix" dependencies
had minor security vulnerabilities.
In this patch:
- the "rustix" crate is updated
- the "chrono" crate that we depend on was not compatible
with the version of the "time" crate that had fixes, so
we updated the "chrono" crate, which actually removed the
dependency on "time" completely.
Both updates were performed using "cargo update" on the
relevant package and the corresponding version.

Refs #15772

Closes scylladb/scylladb#17407
2024-02-19 22:12:13 +02:00
Anna Stuchlik
e691604823 doc: remove Enterprise OS support from Open Source
With this commit:
- The information about ScyllaDB Enterprise OS support
  is removed from the Open Source documentation.
- The information about ScyllaDB Open Source OS support
  is moved to the os-support-info file in the _common folder.
- The os-support-info file is included in the os-support page
  using the scylladb_include_flag directive.

This update employs the solution we added with
https://github.com/scylladb/scylladb/pull/16753.
It allows dynamically adding content to a page
depending on the opensource/enterprise flag.

Refs https://github.com/scylladb/scylladb/issues/15484

Closes scylladb/scylladb#17310

(cherry picked from commit ef1468d5ec)
2024-02-19 11:16:19 +02:00
Lakshmi Narayanan Sreethar
46098c5a0e replica/database: quiesce compaction before closing system tables during shutdown
During shutdown, as all system tables are closed in parallel, there is a
possibility of a race condition between compaction stoppage and the
closure of the compaction_history table. So, quiesce all the compaction
tasks before attempting to close the tables.

Fixes #15721

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#17218

(cherry picked from commit 3b7b315f6a)
2024-02-19 09:45:17 +02:00
Anna Stuchlik
b2fe98bfc6 doc: add missing redirections
This commit adds the missing redirections
to the pages whose source files were
previously stored in the install-scylla folder
and were moved to another location.

Closes scylladb/scylladb#17367

(cherry picked from commit e132ffdb60)
2024-02-19 09:18:14 +02:00
Botond Dénes
e4526449a1 query: do not kill unpaged queries when they reach the tombstone-limit
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.

So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.
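
A toy model of the behaviour after this patch, under the assumption that a read is just a list of fragments. `scan` is a hypothetical helper, not a Scylla function: a paged read cuts the page once the tombstone limit is reached, while an unpaged read ignores the limit and runs to the end.

```python
def scan(fragments, tombstone_limit, paged):
    """Return (live_rows, fragments_consumed) for one page or one unpaged read."""
    rows, tombstones = [], 0
    for consumed, fragment in enumerate(fragments, start=1):
        if fragment == "tombstone":
            tombstones += 1
        else:
            rows.append(fragment)
        # Only paged reads are cut at the limit; the client resumes from
        # `consumed`. Unpaged reads are no longer killed here.
        if paged and tombstones >= tombstone_limit:
            return rows, consumed
    return rows, len(fragments)


data = ["tombstone"] * 10 + ["row-a"]
paged_result = scan(data, tombstone_limit=10, paged=True)
unpaged_result = scan(data, tombstone_limit=10, paged=False)
print(paged_result, unpaged_result)
```

The paged read returns an empty page after ten tombstones, and the unpaged read continues past them to the live row.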

Fixes: #17241

Closes scylladb/scylladb#17242

(cherry picked from commit f068d1a6fa)
2024-02-15 12:50:09 +02:00
Jenkins Promoter
c44bb1544d Update ScyllaDB version to: 5.4.4 2024-02-14 16:23:48 +02:00
Avi Kivity
fcfcd6d35a Regenerate frozen toolchain
For gnutls 3.8.3.

Fixes #17285.

Closes scylladb/scylladb#17291
2024-02-12 19:39:28 +02:00
Pavel Emelyanov
cf42ca0c2a Update seastar submodule
* seastar 95a38bb0...9d44e5eb (1):
  > Merge "Slowdown IO scheduler based on dispatched/completed ratio" int branch-5.4

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 12:21:47 +03:00
Botond Dénes
62d8c7274a Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El
This PR fixes a bug where certain calls to the `mintimeuuid()` CQL function with large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs.

The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without totimestamp() function, and one with that function.

Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw.

Fixes #17035

Closes scylladb/scylladb#17073

* github.com:scylladb/scylladb:
  utils: replace assert() by on_internal_error()
  utils: add on_internal_error with common logger
  utils: add a timeuuid minimum, like we had maximum
  test/cql-pytest: tests for "date" type

(cherry picked from commit 2a4b991772)
2024-02-07 13:47:55 +02:00
Botond Dénes
8080c15d7a Merge '[Backport 5.4] Raft snapshot fixes' from Kamil Braun
Backports required to fix https://github.com/scylladb/scylladb/issues/16683 in 5.4:
- add an API to trigger Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot (we may get such deployments by upgrading from 5.2)

Closes scylladb/scylladb#17123

* github.com:scylladb/scylladb:
  test_raft_snapshot_request: fix flakiness (again)
  test_raft_snapshot_request: fix flakiness
  Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
  Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
2024-02-07 11:52:51 +02:00
Botond Dénes
8398f361cd Merge 'doc: add the 5.4-to-2024.1 upgrade guide' from Anna Stuchlik
This PR:
- Adds the upgrade guide from ScyllaDB Open Source 5.4 to ScyllaDB Enterprise 2024.1. Note: The need to include the "Restore system tables" step in rollback has been confirmed; see https://github.com/scylladb/scylladb/issues/11907#issuecomment-1842657959.
- Removes the 5.1-to-2022.2 upgrade guide (unsupported versions).

Fixes https://github.com/scylladb/scylladb/issues/16445

Closes scylladb/scylladb#16887

* github.com:scylladb/scylladb:
  doc: fix the OSS version number
  doc: metric updates between 2024.1. and 5.4
  doc: remove the 5.1-to-2022.2 upgrade guide
  doc: add the 5.4-to-2024.1 upgrade guide

(cherry picked from commit edb983d165)
2024-02-06 12:24:26 +02:00
Anna Stuchlik
dba6070794 doc: add 2024.1 to the OSS vs. Enterprise matrix
This commit adds the information that
ScyllaDB Enterprise 2024.1 is based
on ScyllaDB Open Source 5.4
to the OSS vs. Enterprise matrix.

Closes scylladb/scylladb#16880

(cherry picked from commit a462b914cb)
2024-02-05 14:13:14 +02:00
Jenkins Promoter
0a6a52e08c Update ScyllaDB version to: 5.4.3 2024-02-04 20:35:41 +02:00
Michał Chojnowski
25c0510015 row_cache: update _prev_snapshot_pos even if apply_to_incomplete() is preempted
Commit e81fc1f095 accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurrences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on *all* queried replicas
at the time of the query.
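
The control-flow slip described above can be reproduced in a few lines. This is a heavily simplified Python analogue, not the C++ code: the `Cache` class and the `preempted_at` parameter are hypothetical, and only the `return` versus `break` difference is modelled.

```python
class Cache:
    def __init__(self):
        self.prev_snapshot_pos = 0  # how far earlier updates have progressed

    def do_update_buggy(self, partitions, preempted_at):
        pos = self.prev_snapshot_pos
        for partition in partitions:
            if partition == preempted_at:
                # Left over from when the loop body was a lambda: this now
                # leaves the whole function and skips the update below.
                return
            pos = partition
        self.prev_snapshot_pos = pos

    def do_update_fixed(self, partitions, preempted_at):
        pos = self.prev_snapshot_pos
        for partition in partitions:
            if partition == preempted_at:
                break  # leave only the loop
            pos = partition
        self.prev_snapshot_pos = pos


buggy, fixed = Cache(), Cache()
buggy.do_update_buggy([1, 2, 3, 4], preempted_at=3)
fixed.do_update_fixed([1, 2, 3, 4], preempted_at=3)
print(buggy.prev_snapshot_pos, fixed.prev_snapshot_pos)
```

In the buggy variant the position stays stale even though partitions 1 and 2 were processed, which is the state that lets a later read pick up data from the previous snapshot.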

Fixes #16759

Closes scylladb/scylladb#17138

(cherry picked from commit ed98102c45)
2024-02-04 14:46:26 +02:00
Kamil Braun
311e31b36f test_raft_snapshot_request: fix flakiness (again)
At the end of the test, we wait until a restarted node receives a
snapshot from the leader, and then verify that the log has been
truncated.

To check the snapshot, the test used the `system.raft_snapshots` table,
while the log is stored in `system.raft`.

Unfortunately, the two tables are not updated atomically when Raft
persists a snapshot (scylladb/scylladb#9603). We first update
`system.raft_snapshots`, then `system.raft` (see
`raft_sys_table_storage::store_snapshot_descriptor`). So after the wait
finishes, there's no guarantee the log has been truncated yet -- there's
a race between the test's last check and Scylla doing that last delete.

But we can check the snapshot using `system.raft` instead of
`system.raft_snapshots`, as `system.raft` has the latest ID. And since
1640f83fdc, storing that ID and truncating
the log in `system.raft` happens atomically.

Closes scylladb/scylladb#17106

(cherry picked from commit c911bf1a33)
2024-02-02 13:02:30 +01:00
Kamil Braun
6a6a4fde79 test_raft_snapshot_request: fix flakiness
Add workaround for scylladb/python-driver#295.

Also, an assert made at the end of the test was false; it is fixed, with an
appropriate comment added.

(cherry picked from commit 74bf60a8ca)
2024-02-02 13:02:30 +01:00
Botond Dénes
390414c99e Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
The persisted snapshot index may be 0 if the snapshot was created in
an older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade procedure, on a cluster with existing schema. So a
deployment with index=0 snapshot is broken and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.

Also add a test.

Fixes scylladb/scylladb#16683

Closes scylladb/scylladb#17072

* github.com:scylladb/scylladb:
  test: add test for fixing a broken group 0 snapshot
  raft_group0: trigger snapshot if existing snapshot index is 0

(cherry picked from commit 181f68f248)
2024-02-02 13:02:30 +01:00
Botond Dénes
26b812067b Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and an HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb/scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call

(cherry picked from commit d202d32f81)

Backport note: the HTTP API is only started if raft_group_registry is
started.
2024-02-02 12:35:46 +01:00
Tzach Livyatan
e83c4cc75c Update link to sizing / pricing calc
Closes scylladb/scylladb#17015

(cherry picked from commit 06a9a925a5)
2024-01-29 14:32:56 +02:00
Avi Kivity
df1843311a Merge 'Invalidate prepared statements for views when their schema changes.' from Eliran Sinvani
When a base table is altered, so are the views that might
refer to the added column (which includes "SELECT *" views and also
views that might need to use this column for rows lifetime (virtual
columns)).
However, the query processor implementation for views change notification
was an empty function.
Since views are tables, the query processor needs to at least treat them
as such (and maybe in the future, do also some MV specific stuff).
This commit adds a call to `on_update_column_family` from within
`on_update_view`.
The side effect true to this date is that prepared statements for views
which changed due to a base table change will be invalidated.

Fixes https://github.com/scylladb/scylladb/issues/16392

This series also adds a test which fails without this fix and passes when the fix is applied.

Closes scylladb/scylladb#16897

* github.com:scylladb/scylladb:
  Add test for mv prepared statements invalidation on base alter
  query processor: treat view changes at least as table changes

(cherry picked from commit 5810396ba1)
2024-01-23 19:34:10 +02:00
Anna Stuchlik
fcaae2ea78 doc: remove upgrade for unsupported versions
This commit removes the upgrade guides
from ScyllaDB Open Source to Enterprise
for versions we no longer support.

In addition, it removes a link to
one of the removed pages from
the Troubleshooting section (the link is
redundant).

(cherry picked from commit 0ad3ef4c55)

Closes scylladb/scylladb#16913
2024-01-22 16:45:36 +02:00
David Garcia
a1b6edd5d3 docs: dynamic include based on flag
docs: extend include options

Closes scylladb/scylladb#16753

(cherry picked from commit f555a2cb05)
2024-01-19 10:14:56 +02:00
Botond Dénes
6c625e8cd3 Merge '[Backport 5.4] tasks: compaction: drop regular compaction tasks after they are finished' from Aleksandra Martyniuk
Make compaction tasks internal. Drop all internal tasks without parents
immediately after they are done.

Fixes: https://github.com/scylladb/scylladb/issues/16735
Refs: https://github.com/scylladb/scylladb/issues/16694.

Closes scylladb/scylladb#16798

* github.com:scylladb/scylladb:
  compaction: make regular compaction tasks internal
  tasks: don't keep internal root tasks after they complete
2024-01-17 09:34:08 +02:00
Aleksandra Martyniuk
081a36e34f compaction: make regular compaction tasks internal
Regular compaction tasks are internal.

Adjust test_compaction_task accordingly: modify test_regular_compaction_task,
delete test_running_compaction_task_abort (relying on regular compaction)
whose checks are already achieved by test_not_created_compaction_task_abort.
Rename the latter.

(cherry picked from commit 6b87778ef2)
2024-01-16 11:15:41 +01:00
Aleksandra Martyniuk
c0c7de8fd1 tasks: don't keep internal root tasks after they complete
(cherry picked from commit 6b2b384c83)
2024-01-16 10:53:16 +01:00
279 changed files with 5921 additions and 3567 deletions

.github/scripts/label_promoted_commits.py vendored Executable file

@@ -0,0 +1,87 @@
from github import Github
import argparse
import re
import sys
import os

try:
    github_token = os.environ["GITHUB_TOKEN"]
except KeyError:
    print("Please set the 'GITHUB_TOKEN' environment variable")
    sys.exit(1)


def parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--repository', type=str, required=True,
                        help='Github repository name (e.g., scylladb/scylladb)')
    parser.add_argument('--commit_before_merge', type=str, required=True, help='Git commit ID to start labeling from ('
                                                                               'newest commit).')
    parser.add_argument('--commit_after_merge', type=str, required=True,
                        help='Git commit ID to end labeling at (oldest '
                             'commit, exclusive).')
    parser.add_argument('--update_issue', type=bool, default=False, help='Set True to update issues when backport was '
                                                                         'done')
    parser.add_argument('--ref', type=str, required=True, help='PR target branch')
    return parser.parse_args()


def add_comment_and_close_pr(pr, comment):
    if pr.state == 'open':
        pr.create_issue_comment(comment)
        pr.edit(state="closed")


def mark_backport_done(repo, ref_pr_number, branch):
    pr = repo.get_pull(int(ref_pr_number))
    label_to_remove = f'backport/{branch}'
    label_to_add = f'{label_to_remove}-done'
    current_labels = [label.name for label in pr.get_labels()]
    if label_to_remove in current_labels:
        pr.remove_from_labels(label_to_remove)
    if label_to_add not in current_labels:
        pr.add_to_labels(label_to_add)


def main():
    # This script is triggered by a push event to either the master branch or a branch named branch-x.y (where x and y represent version numbers). Based on the pushed branch, the script performs the following actions:
    # - When ref branch is `master`, it will add the `promoted-to-master` label, which we need later for the auto backport process
    # - When ref branch is `branch-x.y` (which means we backported a patch), it will replace in the original PR the `backport/x.y` label with `backport/x.y-done` and will close the backport PR (Since GitHub close only the one referring to default branch)
    args = parser()
    pr_pattern = re.compile(r'Closes .*#([0-9]+)')
    target_branch = re.search(r'branch-(\d+\.\d+)', args.ref)
    g = Github(github_token)
    repo = g.get_repo(args.repository, lazy=False)
    commits = repo.compare(head=args.commit_after_merge, base=args.commit_before_merge)
    processed_prs = set()
    # Print commit information
    for commit in commits.commits:
        print(f'Commit sha is: {commit.sha}')
        match = pr_pattern.search(commit.commit.message)
        if match:
            pr_number = int(match.group(1))
            if pr_number in processed_prs:
                continue
            if target_branch:
                pr = repo.get_pull(pr_number)
                branch_name = target_branch[1]
                refs_pr = re.findall(r'Refs (?:#|https.*?)(\d+)', pr.body)
                if refs_pr:
                    print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
                    # 1. change the backport label of the parent PR to note that
                    #    we've merge the corresponding backport PR
                    # 2. close the backport PR and leave a comment on it to note
                    #    that it has been merged with a certain git commit,
                    ref_pr_number = refs_pr[0]
                    mark_backport_done(repo, ref_pr_number, branch_name)
                    comment = f'Closed via {commit.sha}'
                    add_comment_and_close_pr(pr, comment)
            else:
                print(f'master branch, pr number is: {pr_number}')
                pr = repo.get_pull(pr_number)
                pr.add_to_labels('promoted-to-master')
            processed_prs.add(pr_number)


if __name__ == "__main__":
    main()
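
The parsing steps of the script can be tried in isolation, without a GitHub token or network access. The sketch below reuses the three regular expressions from the script verbatim; the helper names (`closed_pr`, `backport_labels`) and the module-level constants are hypothetical, introduced only for this illustration. The sample message is adapted from one of the commit messages listed earlier on this page.

```python
import re

# Patterns copied from label_promoted_commits.py.
PR_PATTERN = re.compile(r'Closes .*#([0-9]+)')
REFS_PATTERN = re.compile(r'Refs (?:#|https.*?)(\d+)')
BRANCH_PATTERN = re.compile(r'branch-(\d+\.\d+)')


def closed_pr(message):
    """Return the PR number named on a 'Closes ...#N' line, or None."""
    match = PR_PATTERN.search(message)
    return int(match.group(1)) if match else None


def backport_labels(branch):
    """Return the (label_to_remove, label_to_add) pair for a release branch."""
    label_to_remove = f'backport/{branch}'
    return label_to_remove, f'{label_to_remove}-done'


message = (
    "repair: use find_column_family in insert_repair_meta\n\n"
    "Fixes: #20057\n\n"
    "Refs #19953\n\n"
    "Closes scylladb/scylladb#20078\n"
)
pr_number = closed_pr(message)
refs = REFS_PATTERN.findall(message)
branch = BRANCH_PATTERN.search('refs/heads/branch-5.4')[1]
print(pr_number, refs, branch, backport_labels(branch))
```

Note that the script applies the `Refs` pattern to the pull request body rather than to the commit message; the commit message is used here only as convenient sample text.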


@@ -0,0 +1,36 @@
name: Check if commits are promoted
on:
  push:
    branches:
      - master
      - branch-*.*
env:
  DEFAULT_BRANCH: 'master'
jobs:
  check-commit:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      issues: write
    steps:
      - name: Dump GitHub context
        env:
          GITHUB_CONTEXT: ${{ toJson(github) }}
        run: echo "$GITHUB_CONTEXT"
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          repository: ${{ github.repository }}
          ref: ${{ env.DEFAULT_BRANCH }}
          fetch-depth: 0 # Fetch all history for all tags and branches
      - name: Install dependencies
        run: sudo apt-get install -y python3-github
      - name: Run python script
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --ref ${{ github.ref }}


@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
-VERSION=5.4.2
+VERSION=5.4.10
if test -f version
then


@@ -208,7 +208,10 @@ protected:
     sstring local_dc = topology.get_datacenter();
     std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);
     for (auto& ip : local_dc_nodes) {
-        if (_gossiper.is_alive(ip)) {
+        // Note that it's not enough for the node to be is_alive() - a
+        // node joining the cluster is also "alive" but not responsive to
+        // requests. We need the node to be in normal state. See #19694.
+        if (_gossiper.is_normal(ip)) {
             rjson::push_back(results, rjson::from_string(ip.to_sstring()));
         }
     }

api/api-doc/raft.json Normal file

@@ -0,0 +1,43 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/raft",
"produces":[
"application/json"
],
"apis":[
{
"path":"/raft/trigger_snapshot/{group_id}",
"operations":[
{
"method":"POST",
"summary":"Triggers snapshot creation and log truncation for the given Raft group",
"type":"string",
"nickname":"trigger_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"group_id",
"description":"The ID of the group which should get snapshotted",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"timeout",
"description":"Timeout in seconds after which the endpoint returns a failure. If not provided, 60s is used.",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
}
]
}
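As a rough illustration of the endpoint described in this API document, the sketch below builds (but does not send) a matching request using only the Python standard library. The base URL, port, and group ID are placeholders chosen for the example, and the helper name `trigger_snapshot_request` is hypothetical. Only the path, the `POST` method, and the optional `timeout` query parameter come from the JSON above.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request


def trigger_snapshot_request(base_url: str, group_id: str,
                             timeout: Optional[int] = None) -> Request:
    """Build a POST request for /raft/trigger_snapshot/{group_id}."""
    # group_id is a path parameter, so it is percent-encoded into the path.
    url = f"{base_url.rstrip('/')}/raft/trigger_snapshot/{quote(group_id, safe='')}"
    # timeout is an optional query parameter; per the API doc, 60s is used
    # when it is omitted.
    if timeout is not None:
        url += "?" + urlencode({"timeout": timeout})
    return Request(url, method="POST")


# Placeholder address and group ID, assumed for illustration only.
req = trigger_snapshot_request(
    "http://127.0.0.1:10000",
    "00000000-0000-0000-0000-000000000001",
    timeout=120,
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request. That step is left out so the sketch runs without a server.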


@@ -31,6 +31,7 @@
#include "api/config.hh"
#include "task_manager.hh"
#include "task_manager_test.hh"
#include "raft.hh"
logging::logger apilog("api");
@@ -294,6 +295,18 @@ future<> set_server_task_manager_test(http_context& ctx) {
#endif
future<> set_server_raft(http_context& ctx, sharded<service::raft_group_registry>& raft_gr) {
auto rb = std::make_shared<api_registry_builder>(ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx, &raft_gr] (routes& r) {
rb->register_function(r, "raft", "The Raft API");
set_raft(ctx, r, raft_gr);
});
}
future<> unset_server_raft(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_raft(ctx, r); });
}
void req_params::process(const request& req) {
// Process mandatory parameters
for (auto& [name, ent] : params) {
@@ -301,7 +314,7 @@ void req_params::process(const request& req) {
continue;
}
try {
ent.value = req.param[name];
ent.value = req.get_path_param(name);
} catch (std::out_of_range&) {
throw httpd::bad_param_exception(fmt::format("Mandatory parameter '{}' was not provided", name));
}
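The mandatory-parameter handling in this hunk follows a common pattern: look each required name up in the request's path parameters and turn a missing key into a descriptive "bad parameter" error. A minimal Python analogue is sketched below. `BadParam` and `process_mandatory` are hypothetical names standing in for `httpd::bad_param_exception` and `req_params::process`; this is not the C++ implementation.

```python
class BadParam(Exception):
    """Hypothetical analogue of httpd::bad_param_exception."""


def process_mandatory(names: list[str], path_params: dict[str, str]) -> dict[str, str]:
    """Collect every mandatory path parameter or fail with a clear message."""
    values = {}
    for name in names:
        try:
            # Plays the role of req.get_path_param(name) in the diff.
            values[name] = path_params[name]
        except KeyError:
            # Mirrors the std::out_of_range -> bad_param_exception translation.
            raise BadParam(f"Mandatory parameter '{name}' was not provided") from None
    return values


print(process_mandatory(["name"], {"name": "ks:tbl"}))
```

Translating the low-level lookup error at this one point means callers see a message that names the missing parameter rather than a generic out-of-range failure.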


@@ -23,6 +23,7 @@ class load_meter;
class storage_proxy;
class storage_service;
class raft_group0_client;
class raft_group_registry;
} // namespace service
@@ -117,5 +118,7 @@ future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
future<> set_server_task_manager(http_context& ctx, lw_shared_ptr<db::config> cfg);
future<> set_server_task_manager_test(http_context& ctx);
future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);
future<> unset_server_raft(http_context&);
}


@@ -54,7 +54,7 @@ static const char* str_to_regex(const sstring& v) {
void set_collectd(http_context& ctx, routes& r) {
cd::get_collectd.set(r, [](std::unique_ptr<request> req) {
auto id = ::make_shared<scollectd::type_instance_id>(req->param["pluginid"],
auto id = ::make_shared<scollectd::type_instance_id>(req->get_path_param("pluginid"),
req->get_query_param("instance"), req->get_query_param("type"),
req->get_query_param("type_instance"));
@@ -91,7 +91,7 @@ void set_collectd(http_context& ctx, routes& r) {
});
cd::enable_collectd.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {
std::regex plugin(req->param["pluginid"].c_str());
std::regex plugin(req->get_path_param("pluginid").c_str());
std::regex instance(str_to_regex(req->get_query_param("instance")));
std::regex type(str_to_regex(req->get_query_param("type")));
std::regex type_instance(str_to_regex(req->get_query_param("type_instance")));


@@ -333,7 +333,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t{0}, [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed(std::mem_fn(&replica::memtable::partition_count)), uint64_t(0));
}, std::plus<>());
});
@@ -353,7 +353,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_memtable_off_heap_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().total_space();
}), uint64_t(0));
@@ -369,7 +369,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_memtable_live_data_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return boost::accumulate(cf.active_memtables() | boost::adaptors::transformed([] (replica::memtable* active_memtable) {
return active_memtable->region().occupancy().used_space();
}), uint64_t(0));
@@ -394,7 +394,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return cf.occupancy().total_space();
}, std::plus<int64_t>());
});
@@ -410,7 +410,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_cf_all_memtables_live_data_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
warn(unimplemented::cause::INDEXES);
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return cf.occupancy().used_space();
}, std::plus<int64_t>());
});
@@ -425,7 +425,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::memtable_switch_count);
return get_cf_stats(ctx,req->get_path_param("name") ,&replica::column_family_stats::memtable_switch_count);
});
cf::get_all_memtable_switch_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -434,7 +434,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), utils::estimated_histogram(0), [](replica::column_family& cf) {
utils::estimated_histogram res(0);
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_partition_size);
@@ -446,7 +446,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
uint64_t res = 0;
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res += i->get_stats_metadata().estimated_partition_size.count();
@@ -457,7 +457,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), utils::estimated_histogram(0), [](replica::column_family& cf) {
utils::estimated_histogram res(0);
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_cells_count);
@@ -474,7 +474,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_pending_flushes.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats(ctx,req->param["name"] ,&replica::column_family_stats::pending_flushes);
return get_cf_stats(ctx,req->get_path_param("name") ,&replica::column_family_stats::pending_flushes);
});
cf::get_all_pending_flushes.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -482,7 +482,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_read.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_count(ctx,req->param["name"] ,&replica::column_family_stats::reads);
return get_cf_stats_count(ctx,req->get_path_param("name") ,&replica::column_family_stats::reads);
});
cf::get_all_read.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -490,7 +490,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_write.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_count(ctx, req->param["name"] ,&replica::column_family_stats::writes);
return get_cf_stats_count(ctx, req->get_path_param("name") ,&replica::column_family_stats::writes);
});
cf::get_all_write.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -498,19 +498,19 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::reads);
});
cf::get_read_latency_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::reads);
return get_cf_rate_and_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::reads);
});
cf::get_read_latency.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_sum(ctx,req->param["name"] ,&replica::column_family_stats::reads);
return get_cf_stats_sum(ctx,req->get_path_param("name") ,&replica::column_family_stats::reads);
});
cf::get_write_latency.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats_sum(ctx, req->param["name"] ,&replica::column_family_stats::writes);
return get_cf_stats_sum(ctx, req->get_path_param("name") ,&replica::column_family_stats::writes);
});
cf::get_all_read_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -522,11 +522,11 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::writes);
});
cf::get_write_latency_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_rate_and_histogram(ctx, req->param["name"], &replica::column_family_stats::writes);
return get_cf_rate_and_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::writes);
});
cf::get_all_write_latency_histogram_depricated.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -538,7 +538,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_pending_compactions.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), [](replica::column_family& cf) {
return cf.estimate_pending_compactions();
}, std::plus<int64_t>());
});
@@ -550,7 +550,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_stats(ctx, req->param["name"], &replica::column_family_stats::live_sstable_count);
return get_cf_stats(ctx, req->get_path_param("name"), &replica::column_family_stats::live_sstable_count);
});
cf::get_all_live_ss_table_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -558,11 +558,11 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_unleveled_sstables.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_unleveled_sstables(ctx, req->param["name"]);
return get_cf_unleveled_sstables(ctx, req->get_path_param("name"));
});
cf::get_live_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return sum_sstable(ctx, req->param["name"], false);
return sum_sstable(ctx, req->get_path_param("name"), false);
});
cf::get_all_live_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -570,7 +570,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_total_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return sum_sstable(ctx, req->param["name"], true);
return sum_sstable(ctx, req->get_path_param("name"), true);
});
cf::get_all_total_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
@@ -579,7 +579,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_min_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], INT64_MAX, min_partition_size, min_int64);
return map_reduce_cf(ctx, req->get_path_param("name"), INT64_MAX, min_partition_size, min_int64);
});
// FIXME: this refers to partitions, not rows.
@@ -589,7 +589,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_max_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), max_partition_size, max_int64);
return map_reduce_cf(ctx, req->get_path_param("name"), int64_t(0), max_partition_size, max_int64);
});
// FIXME: this refers to partitions, not rows.
@@ -600,7 +600,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// FIXME: this refers to partitions, not rows.
cf::get_mean_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, req->param["name"], integral_ratio_holder(), mean_partition_size, std::plus<integral_ratio_holder>());
return map_reduce_cf(ctx, req->get_path_param("name"), integral_ratio_holder(), mean_partition_size, std::plus<integral_ratio_holder>());
});
// FIXME: this refers to partitions, not rows.
@@ -610,7 +610,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
@@ -628,7 +628,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
@@ -646,7 +646,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
@@ -658,7 +658,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), ratio_holder(), [] (replica::column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
@@ -670,7 +670,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
@@ -688,7 +688,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
@@ -706,7 +706,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), uint64_t(0), [] (replica::column_family& cf) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
@@ -729,7 +729,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
// We are missing the off heap memory calculation
// Return 0 is the wrong value. It's a work around
// until the memory calculation will be available
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -742,7 +742,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_speculative_retries.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -755,7 +755,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_key_cache_hit_rate.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -780,7 +780,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_row_cache_hit_out_of_range.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
return make_ready_future<json::json_return_type>(0);
});
@@ -791,7 +791,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_row_cache_hit.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {
return map_reduce_cf_raw(ctx, req->get_path_param("name"), utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().hits.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -807,7 +807,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_row_cache_miss.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], utils::rate_moving_average(), [](const replica::column_family& cf) {
return map_reduce_cf_raw(ctx, req->get_path_param("name"), utils::rate_moving_average(), [](const replica::column_family& cf) {
return cf.get_row_cache().stats().misses.rate();
}, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {
return make_ready_future<json::json_return_type>(meter_to_json(m));
@@ -824,57 +824,57 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_cas_prepare.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().cas_prepare.histogram();
});
});
cf::get_cas_propose.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().cas_accept.histogram();
});
});
cf::get_cas_commit.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().cas_learn.histogram();
});
});
cf::get_sstables_per_read_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](replica::column_family& cf) {
return map_reduce_cf(ctx, req->get_path_param("name"), utils::estimated_histogram(0), [](replica::column_family& cf) {
return cf.get_stats().estimated_sstable_per_read;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
cf::get_tombstone_scanned_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::tombstone_scanned);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::tombstone_scanned);
});
cf::get_live_scanned_histogram.set(r, [&ctx] (std::unique_ptr<http::request> req) {
return get_cf_histogram(ctx, req->param["name"], &replica::column_family_stats::live_scanned);
return get_cf_histogram(ctx, req->get_path_param("name"), &replica::column_family_stats::live_scanned);
});
cf::get_col_update_time_delta_histogram.set(r, [] (std::unique_ptr<http::request> req) {
//TBD
unimplemented();
//auto id = get_uuid(req->param["name"], ctx.db.local());
//auto id = get_uuid(req->get_path_param("name"), ctx.db.local());
std::vector<double> res;
return make_ready_future<json::json_return_type>(res);
});
cf::get_auto_compaction.set(r, [&ctx] (const_req req) {
auto uuid = get_uuid(req.param["name"], ctx.db.local());
auto uuid = get_uuid(req.get_path_param("name"), ctx.db.local());
replica::column_family& cf = ctx.db.local().find_column_family(uuid);
return !cf.is_auto_compaction_disabled_by_user();
});
cf::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/enable_auto_compaction: name={}", req->param["name"]);
apilog.info("column_family/enable_auto_compaction: name={}", req->get_path_param("name"));
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
return foreach_column_family(ctx, req->get_path_param("name"), [](replica::column_family &cf) {
cf.enable_auto_compaction();
}).then([g = std::move(g)] {
return make_ready_future<json::json_return_type>(json_void());
@@ -883,10 +883,10 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/disable_auto_compaction: name={}", req->param["name"]);
apilog.info("column_family/disable_auto_compaction: name={}", req->get_path_param("name"));
return ctx.db.invoke_on(0, [&ctx, req = std::move(req)] (replica::database& db) {
auto g = replica::database::autocompaction_toggle_guard(db);
return foreach_column_family(ctx, req->param["name"], [](replica::column_family &cf) {
return foreach_column_family(ctx, req->get_path_param("name"), [](replica::column_family &cf) {
return cf.disable_auto_compaction();
}).then([g = std::move(g)] {
return make_ready_future<json::json_return_type>(json_void());
@@ -895,14 +895,14 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_tombstone_gc.set(r, [&ctx] (const_req req) {
auto uuid = get_uuid(req.param["name"], ctx.db.local());
auto uuid = get_uuid(req.get_path_param("name"), ctx.db.local());
replica::table& t = ctx.db.local().find_column_family(uuid);
return t.tombstone_gc_enabled();
});
cf::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/enable_tombstone_gc: name={}", req->param["name"]);
return foreach_column_family(ctx, req->param["name"], [](replica::table& t) {
apilog.info("column_family/enable_tombstone_gc: name={}", req->get_path_param("name"));
return foreach_column_family(ctx, req->get_path_param("name"), [](replica::table& t) {
t.set_tombstone_gc_enabled(true);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -910,8 +910,8 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
apilog.info("column_family/disable_tombstone_gc: name={}", req->param["name"]);
return foreach_column_family(ctx, req->param["name"], [](replica::table& t) {
apilog.info("column_family/disable_tombstone_gc: name={}", req->get_path_param("name"));
return foreach_column_family(ctx, req->get_path_param("name"), [](replica::table& t) {
t.set_tombstone_gc_enabled(false);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -919,7 +919,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_built_indexes.set(r, [&ctx, &sys_ks](std::unique_ptr<http::request> req) {
auto ks_cf = parse_fully_qualified_cf_name(req->param["name"]);
auto ks_cf = parse_fully_qualified_cf_name(req->get_path_param("name"));
auto&& ks = std::get<0>(ks_cf);
auto&& cf_name = std::get<1>(ks_cf);
return sys_ks.local().load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace_view_build_progress>& vb) mutable {
@@ -957,7 +957,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_compression_ratio.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto uuid = get_uuid(req->param["name"], ctx.db.local());
auto uuid = get_uuid(req->get_path_param("name"), ctx.db.local());
return ctx.db.map_reduce(sum_ratio<double>(), [uuid](replica::database& db) {
replica::column_family& cf = db.find_column_family(uuid);
@@ -968,21 +968,21 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().reads.histogram();
});
});
cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<http::request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->get_path_param("name"), [](const replica::column_family& cf) {
return cf.get_stats().writes.histogram();
});
});
cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<http::request> req) {
sstring strategy = req->get_query_param("class_name");
apilog.info("column_family/set_compaction_strategy_class: name={} strategy={}", req->param["name"], strategy);
return foreach_column_family(ctx, req->param["name"], [strategy](replica::column_family& cf) {
apilog.info("column_family/set_compaction_strategy_class: name={} strategy={}", req->get_path_param("name"), strategy);
return foreach_column_family(ctx, req->get_path_param("name"), [strategy](replica::column_family& cf) {
cf.set_compaction_strategy(sstables::compaction_strategy::type(strategy));
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -990,7 +990,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_compaction_strategy_class.set(r, [&ctx](const_req req) {
return ctx.db.local().find_column_family(get_uuid(req.param["name"], ctx.db.local())).get_compaction_strategy().name();
return ctx.db.local().find_column_family(get_uuid(req.get_path_param("name"), ctx.db.local())).get_compaction_strategy().name();
});
cf::set_compression_parameters.set(r, [](std::unique_ptr<http::request> req) {
@@ -1006,7 +1006,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
cf::get_sstable_count_per_level.set(r, [&ctx](std::unique_ptr<http::request> req) {
return map_reduce_cf_raw(ctx, req->param["name"], std::vector<uint64_t>(), [](const replica::column_family& cf) {
return map_reduce_cf_raw(ctx, req->get_path_param("name"), std::vector<uint64_t>(), [](const replica::column_family& cf) {
return cf.sstable_count_per_level();
}, concat_sstable_count_per_level).then([](const std::vector<uint64_t>& res) {
return make_ready_future<json::json_return_type>(res);
@@ -1015,7 +1015,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::get_sstables_for_key.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto key = req->get_query_param("key");
auto uuid = get_uuid(req->param["name"], ctx.db.local());
auto uuid = get_uuid(req->get_path_param("name"), ctx.db.local());
return ctx.db.map_reduce0([key, uuid] (replica::database& db) -> future<std::unordered_set<sstring>> {
auto sstables = co_await db.find_column_family(uuid).get_sstables_by_partition_key(key);
@@ -1031,7 +1031,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
cf::toppartitions.set(r, [&ctx] (std::unique_ptr<http::request> req) {
auto name = req->param["name"];
auto name = req->get_path_param("name");
auto [ks, cf] = parse_fully_qualified_cf_name(name);
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
@@ -1058,7 +1058,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
}
auto [ks, cf] = parse_fully_qualified_cf_name(*params.get("name"));
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
apilog.info("column_family/force_major_compaction: name={} flush={}", req->param["name"], flush);
apilog.info("column_family/force_major_compaction: name={} flush={}", req->get_path_param("name"), flush);
auto keyspace = validate_keyspace(ctx, ks);
std::vector<table_info> table_infos = {table_info{


@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include "compaction_manager.hh"
#include "compaction/compaction_manager.hh"
@@ -109,7 +110,7 @@ void set_compaction_manager(http_context& ctx, routes& r) {
});
cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto ks_name = validate_keyspace(ctx, req->param);
auto ks_name = validate_keyspace(ctx, req);
auto table_names = parse_tables(ks_name, ctx, req->query_parameters, "tables");
if (table_names.empty()) {
table_names = map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());
@@ -152,10 +153,13 @@ void set_compaction_manager(http_context& ctx, routes& r) {
});
cm::get_compaction_history.set(r, [&ctx] (std::unique_ptr<http::request> req) {
std::function<future<>(output_stream<char>&&)> f = [&ctx](output_stream<char>&& s) {
return do_with(output_stream<char>(std::move(s)), true, [&ctx] (output_stream<char>& s, bool& first){
return s.write("[").then([&ctx, &s, &first] {
return ctx.db.local().get_compaction_manager().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable {
std::function<future<>(output_stream<char>&&)> f = [&ctx] (output_stream<char>&& out) -> future<> {
auto s = std::move(out);
bool first = true;
std::exception_ptr ex;
try {
co_await s.write("[");
co_await ctx.db.local().get_compaction_manager().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {
cm::history h;
h.id = entry.id.to_sstring();
h.ks = std::move(entry.ks);
@@ -169,18 +173,21 @@ void set_compaction_manager(http_context& ctx, routes& r) {
e.value = it.second;
h.rows_merged.push(std::move(e));
}
auto fut = first ? make_ready_future<>() : s.write(", ");
if (!first) {
co_await s.write(", ");
}
first = false;
return fut.then([&s, h = std::move(h)] {
return formatter::write(s, h);
});
}).then([&s] {
return s.write("]").then([&s] {
return s.close();
});
co_await formatter::write(s, h);
});
});
});
co_await s.write("]");
co_await s.flush();
} catch (...) {
ex = std::current_exception();
}
co_await s.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
return make_ready_future<json::json_return_type>(std::move(f));
});
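
Note on the get_compaction_history hunk above: the rewritten handler stores any exception in a std::exception_ptr, closes the stream unconditionally, and only then propagates the error. C++ does not allow co_await inside a catch handler, which is presumably why the close happens after the try block. The sketch below is a synchronous analogue of that control flow in plain C++; fake_stream and write_array are hypothetical names used only for illustration and are not Seastar or ScyllaDB APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an output stream (not a Seastar type).
struct fake_stream {
    std::string data;
    bool closed = false;
    void write(const std::string& s) { data += s; }
    void flush() {}
    void close() { closed = true; }
};

// Writes `items` as a JSON-like array. If `fail_at` is a valid index, an
// exception is thrown before that item is written to simulate a failure.
void write_array(fake_stream& s, const std::vector<std::string>& items, int fail_at = -1) {
    std::exception_ptr ex;
    try {
        s.write("[");
        bool first = true;
        for (std::size_t i = 0; i < items.size(); ++i) {
            if (static_cast<int>(i) == fail_at) {
                throw std::runtime_error("simulated write failure");
            }
            if (!first) {
                s.write(", ");
            }
            first = false;
            s.write(items[i]);
        }
        s.write("]");
        s.flush();
    } catch (...) {
        // Remember the error; cleanup must run before it is propagated.
        ex = std::current_exception();
    }
    s.close();  // reached on both the success and the failure path
    if (ex) {
        std::rethrow_exception(ex);
    }
}
```

In this sketch the stream is closed whether or not a write fails, and the caller still sees the original exception.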


@@ -91,7 +91,7 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx
});
cs::find_config_id.set(r, [&cfg] (const_req r) {
auto id = r.param["id"];
auto id = r.get_path_param("id");
for (auto&& cfg_ref : cfg.values()) {
auto&& cfg = cfg_ref.get();
if (id == cfg.name()) {


@@ -24,7 +24,7 @@ namespace hf = httpd::error_injection_json;
void set_error_injection(http_context& ctx, routes& r) {
hf::enable_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
sstring injection = req->get_path_param("injection");
bool one_shot = req->get_query_param("one_shot") == "True";
auto params = req->content;
@@ -56,7 +56,7 @@ void set_error_injection(http_context& ctx, routes& r) {
});
hf::disable_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
sstring injection = req->get_path_param("injection");
auto& errinj = utils::get_local_injector();
return errinj.disable_on_all(injection).then([] {
@@ -72,7 +72,7 @@ void set_error_injection(http_context& ctx, routes& r) {
});
hf::message_injection.set(r, [](std::unique_ptr<request> req) {
sstring injection = req->param["injection"];
sstring injection = req->get_path_param("injection");
auto& errinj = utils::get_local_injector();
return errinj.receive_message_on_all(injection).then([] {
return make_ready_future<json::json_return_type>(json::json_void());


@@ -80,9 +80,9 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {
auto state = g.get_endpoint_state_ptr(gms::inet_address(req->param["addr"]));
auto state = g.get_endpoint_state_ptr(gms::inet_address(req->get_path_param("addr")));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->get_path_param("addr")));
}
std::stringstream ss;
g.append_endpoint_state(ss, *state);


@@ -31,21 +31,21 @@ void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
});
httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
// synchronize unreachable_members on all shards
co_await g.get_unreachable_members_synchronized();
co_return g.get_endpoint_downtime(ep);
});
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_generation_number(ep).then([] (gms::generation_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
});
httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_heart_beat_version(ep).then([] (gms::version_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
@@ -53,17 +53,17 @@ void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
if (req->get_query_param("unsafe") != "True") {
return g.assassinate_endpoint(req->param["addr"]).then([] {
return g.assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
return g.unsafe_assassinate_endpoint(req->param["addr"]).then([] {
return g.unsafe_assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
gms::inet_address ep(req->param["addr"]);
gms::inet_address ep(req->get_path_param("addr"));
return g.force_remove_endpoint(ep, gms::null_permit_id).then([] {
return make_ready_future<json::json_return_type>(json_void());
});

api/raft.cc Normal file

@@ -0,0 +1,70 @@
/*
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "api/api.hh"
#include "api/api-doc/raft.json.hh"
#include "service/raft/raft_group_registry.hh"
using namespace seastar::httpd;
extern logging::logger apilog;
namespace api {
namespace r = httpd::raft_json;
using namespace json;
void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr) {
r::trigger_snapshot.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {
raft::group_id gid{utils::UUID{req->get_path_param("group_id")}};
auto timeout_dur = std::invoke([timeout_str = req->get_query_param("timeout")] {
if (timeout_str.empty()) {
return std::chrono::seconds{60};
}
auto dur = std::stoll(timeout_str);
if (dur <= 0) {
throw std::runtime_error{"Timeout must be a positive number."};
}
return std::chrono::seconds{dur};
});
std::atomic<bool> found_srv{false};
co_await raft_gr.invoke_on_all([gid, timeout_dur, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {
auto* srv = raft_gr.find_server(gid);
if (!srv) {
co_return;
}
found_srv = true;
abort_on_expiry aoe(lowres_clock::now() + timeout_dur);
apilog.info("Triggering Raft group {} snapshot", gid);
auto result = co_await srv->trigger_snapshot(&aoe.abort_source());
if (result) {
apilog.info("New snapshot for Raft group {} created", gid);
} else {
apilog.info("Could not create new snapshot for Raft group {}, no new entries applied", gid);
}
});
if (!found_srv) {
throw std::runtime_error{fmt::format("Server for group ID {} not found", gid)};
}
co_return json_void{};
});
}
void unset_raft(http_context&, httpd::routes& r) {
r::trigger_snapshot.unset(r);
}
}
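
Note on the trigger_snapshot handler above: the timeout query parameter is optional, defaults to 60 seconds, and must be a positive integer. A minimal standalone sketch of that parsing rule follows; parse_timeout is a hypothetical name, and the handler itself uses an inline lambda instead.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <string>

// Hypothetical helper mirroring the inline lambda in the handler.
std::chrono::seconds parse_timeout(const std::string& timeout_str) {
    if (timeout_str.empty()) {
        // Parameter not supplied: fall back to the 60 second default.
        return std::chrono::seconds{60};
    }
    // std::stoll throws std::invalid_argument for non-numeric input.
    long long dur = std::stoll(timeout_str);
    if (dur <= 0) {
        throw std::runtime_error{"Timeout must be a positive number."};
    }
    return std::chrono::seconds{dur};
}
```

One property worth keeping in mind: std::stoll accepts a leading number followed by other characters (for example "5abc" parses as 5) unless the caller checks the parsed length, so this rule does not reject such input.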

api/raft.hh Normal file

@@ -0,0 +1,18 @@
/*
* Copyright (C) 2023-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api_init.hh"
namespace api {
void set_raft(http_context& ctx, httpd::routes& r, sharded<service::raft_group_registry>& raft_gr);
void unset_raft(http_context& ctx, httpd::routes& r);
}


@@ -58,15 +58,19 @@ namespace ss = httpd::storage_service_json;
namespace sp = httpd::storage_proxy_json;
using namespace json;
sstring validate_keyspace(http_context& ctx, sstring ks_name) {
sstring validate_keyspace(const http_context& ctx, sstring ks_name) {
if (ctx.db.local().has_keyspace(ks_name)) {
return ks_name;
}
throw bad_param_exception(replica::no_such_keyspace(ks_name).what());
}
sstring validate_keyspace(http_context& ctx, const parameters& param) {
return validate_keyspace(ctx, param["keyspace"]);
sstring validate_keyspace(const http_context& ctx, const std::unique_ptr<http::request>& req) {
return validate_keyspace(ctx, req->get_path_param("keyspace"));
}
sstring validate_keyspace(const http_context& ctx, const http::request& req) {
return validate_keyspace(ctx, req.get_path_param("keyspace"));
}
locator::host_id validate_host_id(const sstring& param) {
@@ -171,7 +175,7 @@ using ks_cf_func = std::function<future<json::json_return_type>(http_context&, s
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
@@ -338,7 +342,7 @@ void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair) {
// returns immediately, not waiting for the repair to finish. The user
// then has other mechanisms to track the ongoing repair's progress,
// or stop it.
return repair_start(repair, validate_keyspace(ctx, req->param),
return repair_start(repair, validate_keyspace(ctx, req),
options_map).then([] (int i) {
return make_ready_future<json::json_return_type>(i);
});
@@ -421,7 +425,7 @@ void unset_repair(http_context& ctx, routes& r) {
void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader) {
ss::load_new_ss_tables.set(r, [&ctx, &sst_loader](std::unique_ptr<http::request> req) {
auto ks = validate_keyspace(ctx, req->param);
auto ks = validate_keyspace(ctx, req);
auto cf = req->get_query_param("cf");
auto stream = req->get_query_param("load_and_stream");
auto primary_replica = req->get_query_param("primary_replica_only");
@@ -452,8 +456,8 @@ void unset_sstables_loader(http_context& ctx, routes& r) {
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb) {
ss::view_build_statuses.set(r, [&ctx, &vb] (std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto view = req->param["view"];
auto keyspace = validate_keyspace(ctx, req);
auto view = req->get_path_param("view");
return vb.local().view_build_statuses(std::move(keyspace), std::move(view)).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
@@ -590,7 +594,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::get_range_to_endpoint_map.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
std::vector<ss::maplist_mapper> res;
co_return stream_range_as_array(co_await ss.local().get_range_to_address_map(keyspace),
[](const std::pair<dht::token_range, inet_address_vector_replica_set>& entry){
@@ -615,7 +619,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::get_pending_range_to_endpoint_map.set(r, [&ctx](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
std::vector<ss::maplist_mapper> res;
return make_ready_future<json::json_return_type>(res);
});
@@ -631,7 +635,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::describe_ring.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) {
return describe_ring_as_json(ss, validate_keyspace(ctx, req->param));
return describe_ring_as_json(ss, validate_keyspace(ctx, req));
});
ss::get_host_id_map.set(r, [&ss](const_req req) {
@@ -664,7 +668,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::get_natural_endpoints.set(r, [&ctx, &ss](const_req req) {
auto keyspace = validate_keyspace(ctx, req.param);
auto keyspace = validate_keyspace(ctx, req);
return container_to_vec(ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"),
req.get_query_param("key")));
});
@@ -733,7 +737,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
@@ -796,7 +800,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::force_keyspace_flush.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);
auto& db = ctx.db;
@@ -905,7 +909,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::truncate.set(r, [&ctx](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto column_family = req->get_query_param("cf");
return make_ready_future<json::json_return_type>(json_void());
});
@@ -1039,14 +1043,14 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::bulk_load.set(r, [](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto path = req->param["path"];
auto path = req->get_path_param("path");
return make_ready_future<json::json_return_type>(json_void());
});
ss::bulk_load_async.set(r, [](std::unique_ptr<http::request> req) {
//TBD
unimplemented();
auto path = req->param["path"];
auto path = req->get_path_param("path");
return make_ready_future<json::json_return_type>(json_void());
});
@@ -1134,7 +1138,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
@@ -1142,7 +1146,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
@@ -1150,7 +1154,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
@@ -1158,7 +1162,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
@@ -1254,7 +1258,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::get_effective_ownership.set(r, [&ctx, &ss] (std::unique_ptr<http::request> req) {
auto keyspace_name = req->param["keyspace"] == "null" ? "" : validate_keyspace(ctx, req->param);
auto keyspace_name = req->get_path_param("keyspace") == "null" ? "" : validate_keyspace(ctx, req);
return ss.local().effective_ownership(keyspace_name).then([] (auto&& ownership) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
@@ -1542,8 +1546,10 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
});
}).then([&s] {
return s.write("]").then([&s] {
return s.close();
return s.flush();
});
}).finally([&s] {
return s.close();
});
});
};


@@ -37,11 +37,11 @@ namespace api {
// verify that the keyspace is found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective keyspace error.
sstring validate_keyspace(http_context& ctx, sstring ks_name);
sstring validate_keyspace(const http_context& ctx, sstring ks_name);
// verify that the keyspace parameter is found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective keyspace error.
sstring validate_keyspace(http_context& ctx, const httpd::parameters& param);
sstring validate_keyspace(const http_context& ctx, const std::unique_ptr<http::request>& req);
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown


@@ -106,7 +106,7 @@ void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_
});
hs::get_total_incoming_bytes.set(r, [&sm](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
gms::inet_address peer(req->get_path_param("peer"));
return sm.map_reduce0([peer](streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_received;
@@ -127,7 +127,7 @@ void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_
});
hs::get_total_outgoing_bytes.set(r, [&sm](std::unique_ptr<request> req) {
gms::inet_address peer(req->param["peer"]);
gms::inet_address peer(req->get_path_param("peer"));
return sm.map_reduce0([peer] (streaming::stream_manager& sm) {
return sm.get_progress_on_all_shards(peer).then([] (auto sbytes) {
return sbytes.bytes_sent;


@@ -119,9 +119,9 @@ void set_system(http_context& ctx, routes& r) {
hs::get_logger_level.set(r, [](const_req req) {
try {
return logging::level_name(logging::logger_registry().get_logger_level(req.param["name"]));
return logging::level_name(logging::logger_registry().get_logger_level(req.get_path_param("name")));
} catch (std::out_of_range& e) {
throw bad_param_exception("Unknown logger name " + req.param["name"]);
throw bad_param_exception("Unknown logger name " + req.get_path_param("name"));
}
// just to keep the compiler happy
return sstring();
@@ -130,9 +130,9 @@ void set_system(http_context& ctx, routes& r) {
hs::set_logger_level.set(r, [](const_req req) {
try {
logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
logging::logger_registry().set_logger_level(req.param["name"], level);
logging::logger_registry().set_logger_level(req.get_path_param("name"), level);
} catch (std::out_of_range& e) {
throw bad_param_exception("Unknown logger name " + req.param["name"]);
throw bad_param_exception("Unknown logger name " + req.get_path_param("name"));
} catch (boost::bad_lexical_cast& e) {
throw bad_param_exception("Unknown logging level " + req.get_query_param("level"));
}


@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include "task_manager.hh"
#include "api/api-doc/task_manager.json.hh"
@@ -124,7 +125,7 @@ void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
chunked_stats local_res;
tasks::task_manager::module_ptr module;
try {
module = tm.find_module(req->param["module"]);
module = tm.find_module(req->get_path_param("module"));
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
@@ -139,25 +140,34 @@ void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
std::exception_ptr ex;
try {
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
}
}
co_await s.write("]");
co_await s.flush();
} catch (...) {
ex = std::current_exception();
}
co_await s.write("]");
co_await s.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
co_return std::move(f);
});
tm::get_task_status.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
tasks::task_manager::foreign_task_ptr task;
try {
task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
@@ -174,7 +184,7 @@ void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
});
tm::abort_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
try {
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
if (!task->is_abortable()) {
@@ -189,7 +199,7 @@ void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
});
tm::wait_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
tasks::task_manager::foreign_task_ptr task;
try {
task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) {
@@ -210,7 +220,7 @@ void set_task_manager(http_context& ctx, routes& r, db::config& cfg) {
tm::get_task_status_recursively.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& _ctx = ctx;
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
std::queue<tasks::task_manager::foreign_task_ptr> q;
utils::chunked_vector<full_task_status> res;


@@ -83,7 +83,7 @@ void set_task_manager_test(http_context& ctx, routes& r) {
});
tmt::finish_test_task.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
auto it = req->query_parameters.find("error");
bool fail = it != req->query_parameters.end();
std::string error = fail ? it->second : "";


@@ -89,7 +89,7 @@ public:
// get the delimeter if any
auto it = ctx.begin();
auto end = ctx.end();
if (it != end) {
if (it != end && *it != '}') {
int group_size = *it++ - '0';
if (group_size < 0 ||
static_cast<size_t>(group_size) > sizeof(uint64_t)) {
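
Note on the format-spec hunk above: the added `*it != '}'` check matters when the parse range is non-empty but starts at the closing brace, as with an empty spec. Without it, '}' would be read as a group-size digit and rejected. The sketch below reproduces the check on a plain std::string_view; parse_group_size, the return value 0 for "no group size given", and the exception type are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>

// Returns the group size encoded in the first character of `spec`,
// or 0 (assumed here to mean "none given") when the spec is empty
// or begins at the closing brace.
int parse_group_size(std::string_view spec) {
    auto it = spec.begin();
    auto end = spec.end();
    if (it != end && *it != '}') {
        int group_size = *it++ - '0';
        if (group_size < 0 ||
                static_cast<std::size_t>(group_size) > sizeof(std::uint64_t)) {
            throw std::invalid_argument("invalid group size");
        }
        return group_size;
    }
    // Without the '}' check, '}' - '0' would evaluate to 77 in ASCII
    // and be rejected by the range test above.
    return 0;
}
```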


@@ -453,7 +453,10 @@ future<> cache_flat_mutation_reader::read_from_underlying() {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), std::move(e), cmp);
auto insert_result = rows.insert_before_hint(
_next_row.at_a_row() ? _next_row.get_iterator_in_latest_version() : rows.begin(),
std::move(e),
cmp);
if (insert_result.second) {
auto it = insert_result.first;
_snp->tracker()->insert(*it);
@@ -470,7 +473,10 @@ future<> cache_flat_mutation_reader::read_from_underlying() {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(table_s, to_table_domain(_upper_bound), is_dummy::yes, is_continuous::no));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), std::move(e), cmp);
auto insert_result = rows.insert_before_hint(
_next_row.at_a_row() ? _next_row.get_iterator_in_latest_version() : rows.begin(),
std::move(e),
cmp);
if (insert_result.second) {
clogger.trace("csm {}: L{}: inserted dummy at {}", fmt::ptr(this), __LINE__, _upper_bound);
_snp->tracker()->insert(*insert_result.first);
@@ -631,7 +637,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
current_allocator().construct<rows_entry>(table_schema(), cr.key(), cr.as_deletable_row()));
new_entry->set_continuous(false);
new_entry->set_range_tombstone(_current_tombstone);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
auto it = _next_row.iterators_valid() && _next_row.at_a_row() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), cmp);
auto insert_result = mp.mutable_clustered_rows().insert_before_hint(it, std::move(new_entry), cmp);
it = insert_result.first;
@@ -696,7 +702,7 @@ bool cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone_change
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(table_schema(), to_table_domain(rtc.position()), is_dummy::yes, is_continuous::no));
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
auto it = _next_row.iterators_valid() && _next_row.at_a_row() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(to_table_domain(rtc.position()), cmp);
auto insert_result = mp.mutable_clustered_rows().insert_before_hint(it, std::move(new_entry), cmp);
it = insert_result.first;
@@ -899,7 +905,10 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
auto& rows = _snp->version()->partition().mutable_clustered_rows();
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(table_schema(),
to_table_domain(_lower_bound), is_dummy::yes, is_continuous::no));
return rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), std::move(new_entry), cmp);
return rows.insert_before_hint(
_next_row.at_a_row() ? _next_row.get_iterator_in_latest_version() : rows.begin(),
std::move(new_entry),
cmp);
});
auto it = insert_result.first;
if (insert_result.second) {
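
Note on the insert_before_hint hunks above: the diff keeps using the cursor's iterator as a hint only when `_next_row.at_a_row()` is true and otherwise falls back to `rows.begin()`. The code comments state that the iterator is "only a hint", which suggests that the fallback affects where the search starts rather than where the entry ends up. The sketch below shows the analogous behavior of hinted insertion in std::set; it is an analogy under that assumption, not a model of ScyllaDB's container, and insert_with_hint is a hypothetical name.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Inserts `value` using either a precise hint (lower_bound) or the
// fallback hint begin(), then returns the resulting order.
std::vector<int> insert_with_hint(std::set<int> rows, int value, bool have_position) {
    auto hint = have_position ? rows.lower_bound(value) : rows.begin();
    rows.insert(hint, value);  // the hint affects cost, not the final position
    return {rows.begin(), rows.end()};
}
```

For std::set, both calls yield the same sorted contents; a poor hint only costs extra comparisons.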


@@ -51,8 +51,16 @@ namespace db {
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway =
std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::seconds(5));
api::timestamp_clock::duration get_generation_leeway() {
static thread_local auto generation_leeway =
std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::seconds(5));
utils::get_local_injector().inject("increase_cdc_generation_leeway", [&] {
generation_leeway = std::chrono::duration_cast<api::timestamp_clock::duration>(std::chrono::minutes(5));
});
return generation_leeway;
}
static void copy_int_to_bytes(int64_t i, size_t offset, bytes& b) {
i = net::hton(i);
@@ -372,7 +380,7 @@ db_clock::time_point new_generation_timestamp(bool add_delay, std::chrono::milli
auto ts = db_clock::now();
if (add_delay && ring_delay != 0ms) {
ts += 2 * ring_delay + duration_cast<milliseconds>(generation_leeway);
ts += 2 * ring_delay + duration_cast<milliseconds>(get_generation_leeway());
}
return ts;
}
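
Note on the generation-leeway hunks above: the constant is replaced by get_generation_leeway(), which returns 5 seconds unless the "increase_cdc_generation_leeway" error injection raises it to 5 minutes, and new_generation_timestamp adds `2 * ring_delay + leeway` to the current time. The sketch below works through that arithmetic with std::chrono only. The function names, the `injected` flag, and the use of microseconds for the timestamp duration are assumptions; in the diff the raised value also persists in a thread_local once set, which this sketch does not model.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical stand-in for get_generation_leeway(); `injected` models
// the error injection being enabled.
std::chrono::microseconds generation_leeway(bool injected) {
    return injected ? std::chrono::microseconds(5min)
                    : std::chrono::microseconds(5s);
}

// Delay added to "now" when computing a new generation timestamp.
std::chrono::milliseconds generation_delay(bool add_delay,
                                           std::chrono::milliseconds ring_delay,
                                           bool injected = false) {
    if (add_delay && ring_delay != 0ms) {
        return 2 * ring_delay
            + std::chrono::duration_cast<std::chrono::milliseconds>(generation_leeway(injected));
    }
    return 0ms;
}
```

With a ring delay of 30 seconds the default leeway gives 2 * 30 s + 5 s = 65 s, and the injected leeway gives 2 * 30 s + 5 min = 6 min.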


@@ -46,6 +46,8 @@ namespace gms {
namespace cdc {
api::timestamp_clock::duration get_generation_leeway();
class stream_id final {
bytes _value;
public:


@@ -15,10 +15,6 @@
extern logging::logger cdc_log;
namespace cdc {
extern const api::timestamp_clock::duration generation_leeway;
} // namespace cdc
static api::timestamp_type to_ts(db_clock::time_point tp) {
// This assumes that timestamp_clock and db_clock have the same epochs.
return std::chrono::duration_cast<api::timestamp_clock::duration>(tp.time_since_epoch()).count();
@@ -73,7 +69,7 @@ bool cdc::metadata::streams_available() const {
cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok) {
auto now = api::new_timestamp();
if (ts > now + generation_leeway.count()) {
if (ts > now + get_generation_leeway().count()) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream \"from the future\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps arbitrarily into the future, because we don't"
@@ -86,27 +82,43 @@ cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok)
// Nothing protects us from that until we start using transactions for generation switching.
}
auto it = gen_used_at(now);
if (it == _gens.end()) {
auto it = gen_used_at(now - get_generation_leeway().count());
if (it != _gens.end()) {
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
}
if (ts <= now - get_generation_leeway().count()) {
// We reject the write if `ts <= now - generation_leeway` and the write is not to the current generation, which
// happens iff one of the following is true:
// - the write is to no generation,
// - the write is to a generation older than the generation under `it`,
// - the write is to the generation under `it` and that generation is not the current generation.
// Note that we cannot distinguish the first and second cases because we garbage-collect obsolete generations,
// but we can check if one of them takes place (`it == _gens.end() || ts < it->first`). These three conditions
// are sufficient. The write with `ts <= now - generation_leeway` cannot be to one of the generations following
// the generation under `it` because that generation was operating at `now - generation_leeway`.
bool is_previous_gen = it != _gens.end() && std::next(it) != _gens.end() && std::next(it)->first <= now;
if (it == _gens.end() || ts < it->first || is_previous_gen) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream \"from the past\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties.\n"
"We *do* allow sending writes into the near past, but our ability to do that is limited."
" Are you using client-side timestamps? Make sure your clocks are well-synchronized"
" with the database's clocks.", format_timestamp(ts), format_timestamp(now)));
}
}
it = _gens.begin();
if (it == _gens.end() || ts < it->first) {
throw std::runtime_error(format(
"cdc::metadata::get_stream: could not find any CDC stream (current time: {})."
" Are we in the middle of a cluster upgrade?", format_timestamp(now)));
"cdc::metadata::get_stream: could not find any CDC stream for timestamp {}."
" Are we in the middle of a cluster upgrade?", format_timestamp(ts)));
}
// Garbage-collect generations that will no longer be used.
it = _gens.erase(_gens.begin(), it);
if (it->first > ts) {
throw exceptions::invalid_request_exception(format(
"cdc: attempted to get a stream from an earlier generation than the currently used one."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties (write timestamp: {}, current generation started at: {})",
format_timestamp(ts), format_timestamp(it->first)));
}
// With `generation_leeway` we allow sending writes to the near future. It might happen
// that `ts` doesn't belong to the current generation ("current" according to our clock),
// but to the next generation. Adjust for this case:
// Find the generation operating at `ts`.
{
auto next_it = std::next(it);
while (next_it != _gens.end() && next_it->first <= ts) {
@@ -147,8 +159,8 @@ bool cdc::metadata::known_or_obsolete(db_clock::time_point tp) const {
++it;
}
// Check if some new generation has already superseded this one.
return it != _gens.end() && it->first <= api::new_timestamp();
// Check if the generation is obsolete.
return it != _gens.end() && it->first <= api::new_timestamp() - get_generation_leeway().count();
}
bool cdc::metadata::insert(db_clock::time_point tp, topology_description&& gen) {
@@ -157,7 +169,7 @@ bool cdc::metadata::insert(db_clock::time_point tp, topology_description&& gen)
}
auto now = api::new_timestamp();
auto it = gen_used_at(now);
auto it = gen_used_at(now - get_generation_leeway().count());
if (it != _gens.end()) {
// Garbage-collect generations that will no longer be used.


@@ -42,7 +42,9 @@ class metadata final {
container_t::const_iterator gen_used_at(api::timestamp_type ts) const;
public:
/* Is a generation with the given timestamp already known or superseded by a newer generation? */
/* Is a generation with the given timestamp already known or obsolete? It is obsolete if and only if
* it is older than the generation operating at `now - get_generation_leeway()`.
*/
bool known_or_obsolete(db_clock::time_point) const;
/* Are there streams available. I.e. valid for time == now. If this is false, any writes to
@@ -54,8 +56,9 @@ public:
*
* If the provided timestamp is too far away "into the future" (where "now" is defined according to our local clock),
* we reject the get_stream query. This is because the resulting stream might belong to a generation which we don't
* yet know about. The amount of leeway (how much "into the future" we allow `ts` to be) is defined
* by the `cdc::generation_leeway` constant.
* yet know about. Similarly, we reject queries to the previous generations if the timestamp is too far away "into
* the past". The amount of leeway (how much "into the future" or "into the past" we allow `ts` to be) is defined by
* `get_generation_leeway()`.
*/
stream_id get_stream(api::timestamp_type ts, dht::token tok);
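The acceptance window described in the comment above can be sketched in isolation. This is a minimal illustration, not the code from this change: it assumes generations are kept in a `std::map` keyed by start timestamp, and the names `generation_map`, `gen_used_at` (reimplemented here as a free function) and `pick_generation` are hypothetical. Unlike `cdc::metadata::get_stream`, the sketch does no garbage collection and returns `std::nullopt` instead of throwing. It needs C++17.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>

using timestamp = std::int64_t;
// Hypothetical model: generation start timestamp -> generation id.
using generation_map = std::map<timestamp, int>;

// Generation operating at `ts`: the last one whose start is <= ts, or end().
generation_map::const_iterator gen_used_at(const generation_map& gens, timestamp ts) {
    auto it = gens.upper_bound(ts);
    return it == gens.begin() ? gens.end() : std::prev(it);
}

// Returns the id of the generation a write with timestamp `ts` goes to,
// or nullopt if the write falls outside the allowed window.
std::optional<int> pick_generation(const generation_map& gens, timestamp ts,
                                   timestamp now, timestamp leeway) {
    // Too far into the future: the target generation may not be known yet.
    if (ts > now + leeway) {
        return std::nullopt;
    }
    // Older than the leeway: only writes to the current generation are accepted.
    if (ts <= now - leeway) {
        auto it = gen_used_at(gens, now - leeway);
        bool is_previous_gen = it != gens.end()
                && std::next(it) != gens.end()
                && std::next(it)->first <= now;
        if (it == gens.end() || ts < it->first || is_previous_gen) {
            return std::nullopt;
        }
    }
    // Find the generation operating at `ts`.
    auto found = gen_used_at(gens, ts);
    if (found == gens.end()) {
        return std::nullopt;
    }
    return found->second;
}
```

With generations starting at 100 and 200 and a leeway of 10, a write stamped 196 arriving at time 205 still lands in the first generation, while one stamped 195 is rejected.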


@@ -144,12 +144,21 @@ std::ostream& operator<<(std::ostream& os, compaction_type_options::scrub::quara
}
static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_s, sstable_set::incremental_selector& selector,
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk, uint64_t& bloom_filter_checks) {
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk, uint64_t& bloom_filter_checks,
const api::timestamp_type compacting_max_timestamp) {
if (!table_s.tombstone_gc_enabled()) [[unlikely]] {
return api::min_timestamp;
}
auto timestamp = table_s.min_memtable_timestamp();
auto timestamp = api::max_timestamp;
auto memtable_min_timestamp = table_s.min_memtable_timestamp();
// Use memtable timestamp if it contains data older than the sstables being compacted,
// and if the memtable also contains the key we're calculating max purgeable timestamp for.
// First condition helps to not penalize the common scenario where memtable only contains
// newer data.
if (memtable_min_timestamp <= compacting_max_timestamp && table_s.memtable_has_key(dk)) {
timestamp = memtable_min_timestamp;
}
std::optional<utils::hashed_key> hk;
for (auto&& sst : boost::range::join(selector.select(dk).sstables, table_s.compacted_undeleted_sstables())) {
if (compacting_set.contains(sst)) {
@@ -441,7 +450,9 @@ protected:
uint64_t _end_size = 0;
// fully expired files, which are skipped, aren't taken into account.
uint64_t _compacting_data_file_size = 0;
api::timestamp_type _compacting_max_timestamp = api::min_timestamp;
uint64_t _estimated_partitions = 0;
double _estimated_droppable_tombstone_ratio = 0;
uint64_t _bloom_filter_checks = 0;
db::replay_position _rp;
encoding_stats_collector _stats_collector;
@@ -470,6 +481,26 @@ private:
cdata.compaction_fan_in = descriptor.fan_in();
return cdata;
}
// Called in a seastar thread
dht::partition_range_vector
get_ranges_for_invalidation(const std::vector<shared_sstable>& sstables) {
// If owned ranges is disengaged, it means no cleanup work was done and
// so nothing needs to be invalidated.
if (!_owned_ranges) {
return dht::partition_range_vector{};
}
auto owned_ranges = dht::to_partition_ranges(*_owned_ranges, utils::can_yield::yes);
auto non_owned_ranges = boost::copy_range<dht::partition_range_vector>(sstables
| boost::adaptors::transformed([] (const shared_sstable& sst) {
seastar::thread::maybe_yield();
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
}));
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
}
protected:
compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata)
: _cdata(init_compaction_data(cdata, descriptor))
@@ -549,9 +580,10 @@ protected:
return _stats_collector.get();
}
virtual compaction_completion_desc
compaction_completion_desc
get_compaction_completion_desc(std::vector<shared_sstable> input_sstables, std::vector<shared_sstable> output_sstables) {
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables)};
auto ranges_for_for_invalidation = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges_for_for_invalidation)};
}
// Tombstone expiration is enabled based on the presence of sstable set.
@@ -567,7 +599,8 @@ protected:
sstable_writer_config cfg = _table_s.configure_writer("garbage_collection");
cfg.run_identifier = gc_run;
cfg.monitor = monitor.get();
auto writer = sst->get_writer(*schema(), partitions_per_sstable(), cfg, get_encoding_stats());
uint64_t estimated_partitions = std::max(1UL, uint64_t(ceil(partitions_per_sstable() * _estimated_droppable_tombstone_ratio)));
auto writer = sst->get_writer(*schema(), estimated_partitions, cfg, get_encoding_stats());
return compaction_writer(std::move(monitor), std::move(writer), std::move(sst));
}
@@ -686,6 +719,7 @@ private:
auto fully_expired = _table_s.fully_expired_sstables(_sstables, gc_clock::now());
min_max_tracker<api::timestamp_type> timestamp_tracker;
double sum_of_estimated_droppable_tombstone_ratio = 0;
_input_sstable_generations.reserve(_sstables.size());
for (auto& sst : _sstables) {
co_await coroutine::maybe_yield();
@@ -712,7 +746,10 @@ private:
// for a better estimate for the number of partitions in the merged
// sstable than just adding up the lengths of individual sstables.
_estimated_partitions += sst->get_estimated_key_count();
auto gc_before = sst->get_gc_before_for_drop_estimation(gc_clock::now(), _table_s.get_tombstone_gc_state(), _schema);
sum_of_estimated_droppable_tombstone_ratio += sst->estimate_droppable_tombstone_ratio(gc_before);
_compacting_data_file_size += sst->ondisk_data_size();
// TODO:
// Note that this is not fully correct. Since we might be merging sstables that originated on
// another shard (#cpu changed), we might be comparing RP:s with differing shard ids,
@@ -721,12 +758,16 @@ private:
// this is kind of ok, esp. since we will hopefully not be trying to recover based on
// compacted sstables anyway (CL should be clean by then).
_rp = std::max(_rp, sst_stats.position);
_compacting_max_timestamp = std::max(_compacting_max_timestamp, sst->get_stats_metadata().max_timestamp);
}
log_info("{} {}", report_start_desc(), formatted_msg);
if (ssts->size() < _sstables.size()) {
log_debug("{} out of {} input sstables are fully expired sstables that will not be actually compacted",
_sstables.size() - ssts->size(), _sstables.size());
}
// _estimated_droppable_tombstone_ratio could exceed 1.0 in certain cases, so limit it to 1.0.
_estimated_droppable_tombstone_ratio = std::min(1.0, sum_of_estimated_droppable_tombstone_ratio / ssts->size());
_compacting = std::move(ssts);
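The arithmetic behind the garbage-collection writer's partition estimate can be shown on its own. The sketch below assumes the per-sstable droppable-tombstone ratios are already available as plain doubles. The function names are hypothetical, and the empty-input guard is an addition that the hunks above do not contain.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Average of the per-sstable droppable-tombstone ratios, capped at 1.0
// because the individual estimates may exceed 1.0.
double average_droppable_ratio(const std::vector<double>& per_sstable_ratios) {
    if (per_sstable_ratios.empty()) {
        return 0.0; // assumption: an empty input yields no droppable data
    }
    double sum = 0;
    for (double r : per_sstable_ratios) {
        sum += r;
    }
    return std::min(1.0, sum / per_sstable_ratios.size());
}

// Partition estimate handed to the garbage-collection writer: scale the
// regular per-sstable estimate by the ratio, but never go below one.
std::uint64_t gc_writer_partition_estimate(std::uint64_t partitions_per_sstable, double ratio) {
    auto scaled = static_cast<std::uint64_t>(std::ceil(partitions_per_sstable * ratio));
    return std::max<std::uint64_t>(1, scaled);
}
```

For example, a ratio of 0.25 applied to 1000 partitions gives an estimate of 250, and a ratio of 0 still gives 1.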
@@ -841,7 +882,7 @@ private:
};
}
return [this] (const dht::decorated_key& dk) {
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks);
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks, _compacting_max_timestamp);
};
}
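The memtable part of `get_max_purgeable_timestamp` reduces to a small decision that can be sketched separately. The function name `memtable_purge_bound` and the plain-integer timestamp type are assumptions made for illustration. The real function goes on to lower the bound further using the sstables that are not being compacted.

```cpp
#include <cstdint>
#include <limits>

using timestamp = std::int64_t;
constexpr timestamp max_timestamp = std::numeric_limits<timestamp>::max();

// Starting bound for the max purgeable timestamp of one key.
// The memtable only constrains purging when it holds data at least as old
// as the newest data being compacted AND it actually contains the key.
timestamp memtable_purge_bound(timestamp memtable_min_timestamp,
                               bool memtable_has_key,
                               timestamp compacting_max_timestamp) {
    if (memtable_min_timestamp <= compacting_max_timestamp && memtable_has_key) {
        return memtable_min_timestamp;
    }
    // Common case: the memtable only holds newer data, so it imposes no limit.
    return max_timestamp;
}
```

The timestamp comparison comes first because it is cheap and, in the common case where the memtable only holds newer data, it avoids the key lookup altogether.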
@@ -1248,28 +1289,6 @@ public:
};
class cleanup_compaction final : public regular_compaction {
private:
// Called in a seastar thread
dht::partition_range_vector
get_ranges_for_invalidation(const std::vector<shared_sstable>& sstables) {
auto owned_ranges = dht::to_partition_ranges(*_owned_ranges, utils::can_yield::yes);
auto non_owned_ranges = boost::copy_range<dht::partition_range_vector>(sstables
| boost::adaptors::transformed([] (const shared_sstable& sst) {
seastar::thread::maybe_yield();
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
}));
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
}
protected:
virtual compaction_completion_desc
get_compaction_completion_desc(std::vector<shared_sstable> input_sstables, std::vector<shared_sstable> output_sstables) override {
auto ranges_for_for_invalidation = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges_for_for_invalidation)};
}
public:
cleanup_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata)
: regular_compaction(table_s, std::move(descriptor), cdata)


@@ -22,6 +22,7 @@
#include "sstables/exceptions.hh"
#include "sstables/sstable_directory.hh"
#include "locator/abstract_replication_strategy.hh"
#include "utils/error_injection.hh"
#include "utils/fb_utilities.hh"
#include "utils/UUID_gen.hh"
#include "db/system_keyspace.hh"
@@ -1147,6 +1148,11 @@ protected:
}
virtual future<compaction_manager::compaction_stats_opt> do_run() override {
if (!is_system_keyspace(_status.keyspace)) {
co_await utils::get_local_injector().inject_with_handler("compaction_regular_compaction_task_executor_do_run",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + 10s); });
}
co_await coroutine::switch_to(_cm.compaction_sg());
for (;;) {
@@ -1321,13 +1327,20 @@ private:
}));
};
auto get_next_job = [&] () -> std::optional<sstables::compaction_descriptor> {
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), sstables::reshape_mode::strict);
return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
auto get_next_job = [&] () -> future<std::optional<sstables::compaction_descriptor>> {
auto candidates = get_reshape_candidates();
if (candidates.empty()) {
co_return std::nullopt;
}
// all sstables added to maintenance set share the same underlying storage.
auto& storage = candidates.front()->get_storage();
sstables::reshape_config cfg = co_await sstables::make_reshape_config(storage, sstables::reshape_mode::strict);
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), cfg);
co_return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
};
std::exception_ptr err;
while (auto desc = get_next_job()) {
while (auto desc = co_await get_next_job()) {
auto compacting = compacting_sstable_registration(_cm, _cm.get_compaction_state(&t), desc->sstables);
auto on_replace = compacting.update_on_sstable_replacement();
@@ -1845,6 +1858,9 @@ future<> compaction_manager::try_perform_cleanup(owned_ranges_ptr sorted_owned_r
if (found_maintenance_sstables) {
co_await perform_offstrategy(t, info);
}
if (utils::get_local_injector().enter("major_compaction_before_cleanup")) {
co_await perform_major_compaction(t, info);
}
// Called with compaction_disabled
auto get_sstables = [this, &t] () -> future<std::vector<sstables::shared_sstable>> {


@@ -75,7 +75,7 @@ reader_consumer_v2 compaction_strategy_impl::make_interposer_consumer(const muta
}
compaction_descriptor
compaction_strategy_impl::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
compaction_strategy_impl::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
return compaction_descriptor();
}
@@ -700,8 +700,8 @@ compaction_backlog_tracker compaction_strategy::make_backlog_tracker() const {
}
sstables::compaction_descriptor
compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, mode);
compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, cfg);
}
uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema) const {
@@ -739,6 +739,13 @@ compaction_strategy make_compaction_strategy(compaction_strategy_type strategy,
return compaction_strategy(std::move(impl));
}
future<reshape_config> make_reshape_config(const sstables::storage& storage, reshape_mode mode) {
co_return sstables::reshape_config{
.mode = mode,
.free_storage_space = co_await storage.free_space() / smp::count,
};
}
}
namespace compaction {


@@ -31,6 +31,7 @@ class sstable;
class sstable_set;
struct compaction_descriptor;
struct resharding_descriptor;
class storage;
class compaction_strategy {
::shared_ptr<compaction_strategy_impl> _compaction_strategy_impl;
@@ -122,11 +123,13 @@ public:
//
// The caller should also pass a maximum number of SSTables which is the maximum amount of
// SSTables that can be added into a single job.
compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const;
compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const;
};
// Creates a compaction_strategy object from one of the strategies available.
compaction_strategy make_compaction_strategy(compaction_strategy_type strategy, const std::map<sstring, sstring>& options);
future<reshape_config> make_reshape_config(const sstables::storage& storage, reshape_mode mode);
}


@@ -76,6 +76,6 @@ public:
return false;
}
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const;
};
}


@@ -8,6 +8,8 @@
#pragma once
#include <cstdint>
namespace sstables {
enum class compaction_strategy_type {
@@ -18,4 +20,10 @@ enum class compaction_strategy_type {
};
enum class reshape_mode { strict, relaxed };
struct reshape_config {
reshape_mode mode;
const uint64_t free_storage_space;
};
}


@@ -146,7 +146,8 @@ int64_t leveled_compaction_strategy::estimated_pending_compactions(table_state&
}
compaction_descriptor
leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::array<std::vector<shared_sstable>, leveled_manifest::MAX_LEVELS> level_info;
auto is_disjoint = [schema] (const std::vector<shared_sstable>& sstables, unsigned tolerance) -> std::tuple<bool, unsigned> {
@@ -203,7 +204,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
if (level_info[0].size() > offstrategy_threshold) {
size_tiered_compaction_strategy stcs(_stcs_options);
return stcs.get_reshaping_job(std::move(level_info[0]), schema, mode);
return stcs.get_reshaping_job(std::move(level_info[0]), schema, cfg);
}
for (unsigned level = leveled_manifest::MAX_LEVELS - 1; level > 0; --level) {


@@ -74,7 +74,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
};
}


@@ -297,8 +297,9 @@ size_tiered_compaction_strategy::most_interesting_bucket(const std::vector<sstab
}
compaction_descriptor
size_tiered_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const
size_tiered_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const
{
auto mode = cfg.mode;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));


@@ -96,7 +96,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
friend class ::size_tiered_backlog_tracker;
};


@@ -48,6 +48,7 @@ public:
virtual sstables::shared_sstable make_sstable() const = 0;
virtual sstables::sstable_writer_config configure_writer(sstring origin) const = 0;
virtual api::timestamp_type min_memtable_timestamp() const = 0;
virtual bool memtable_has_key(const dht::decorated_key& key) const = 0;
virtual future<> on_compaction_completion(sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy) = 0;
virtual bool is_auto_compaction_disabled_by_user() const noexcept = 0;
virtual bool tombstone_gc_enabled() const noexcept = 0;


@@ -555,7 +555,13 @@ future<> shard_reshaping_compaction_task_impl::run() {
| boost::adaptors::filtered([&filter = _filter] (const auto& sst) {
return filter(sst);
}));
auto desc = table.get_compaction_strategy().get_reshaping_job(std::move(reshape_candidates), table.schema(), _mode);
if (reshape_candidates.empty()) {
break;
}
// all sstables were found in the same sstable_directory instance, so they share the same underlying storage.
auto& storage = reshape_candidates.front()->get_storage();
auto cfg = co_await sstables::make_reshape_config(storage, _mode);
auto desc = table.get_compaction_strategy().get_reshaping_job(std::move(reshape_candidates), table.schema(), cfg);
if (desc.sstables.empty()) {
break;
}


@@ -704,6 +704,10 @@ public:
virtual std::string type() const override {
return "regular compaction";
}
virtual tasks::is_internal is_internal() const noexcept override {
return tasks::is_internal::yes;
}
protected:
virtual future<> run() override = 0;
};


@@ -223,12 +223,14 @@ reader_consumer_v2 time_window_compaction_strategy::make_interposer_consumer(con
}
compaction_descriptor
time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::vector<shared_sstable> single_window;
std::vector<shared_sstable> multi_window;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));
const uint64_t target_job_size = cfg.free_storage_space * reshape_target_space_overhead;
if (mode == reshape_mode::relaxed) {
offstrategy_threshold = max_sstables;
@@ -260,22 +262,40 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
multi_window.size(), !multi_window.empty() && sstable_set_overlapping_count(schema, multi_window) == 0,
single_window.size(), !single_window.empty() && sstable_set_overlapping_count(schema, single_window) == 0);
auto need_trimming = [max_sstables, schema, &is_disjoint] (const std::vector<shared_sstable>& ssts) {
// All sstables can be compacted at once if they're disjoint, given that partitioned set
// will incrementally open sstables which translates into bounded memory usage.
return ssts.size() > max_sstables && !is_disjoint(ssts);
auto get_job_size = [] (const std::vector<shared_sstable>& ssts) {
return boost::accumulate(ssts | boost::adaptors::transformed(std::mem_fn(&sstable::bytes_on_disk)), uint64_t(0));
};
// Targets a space overhead of 10%. All disjoint sstables can be compacted together as long as they won't
// cause an overhead above target. Otherwise, the job targets a maximum of #max_threshold sstables.
auto need_trimming = [&] (const std::vector<shared_sstable>& ssts, const uint64_t job_size, bool is_disjoint) {
const size_t min_sstables = 2;
auto is_above_target_size = job_size > target_job_size;
return (ssts.size() > max_sstables && !is_disjoint) ||
(ssts.size() > min_sstables && is_above_target_size);
};
auto maybe_trim_job = [&need_trimming] (std::vector<shared_sstable>& ssts, uint64_t job_size, bool is_disjoint) {
while (need_trimming(ssts, job_size, is_disjoint)) {
auto sst = ssts.back();
ssts.pop_back();
job_size -= sst->bytes_on_disk();
}
};
if (!multi_window.empty()) {
auto disjoint = is_disjoint(multi_window);
auto job_size = get_job_size(multi_window);
// Everything that spans multiple windows will need reshaping
if (need_trimming(multi_window)) {
if (need_trimming(multi_window, job_size, disjoint)) {
// When trimming, let's keep sstables with overlapping time window, so as to reduce write amplification.
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
boost::partial_sort(multi_window, multi_window.begin() + max_sstables, [](const shared_sstable &a, const shared_sstable &b) {
return a->get_stats_metadata().max_timestamp < b->get_stats_metadata().max_timestamp;
});
multi_window.resize(max_sstables);
maybe_trim_job(multi_window, job_size, disjoint);
}
compaction_descriptor desc(std::move(multi_window));
desc.options = compaction_type_options::make_reshape();
@@ -294,15 +314,17 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
std::copy(ssts.begin(), ssts.end(), std::back_inserter(single_window));
continue;
}
// reuse STCS reshape logic which will only compact similar-sized files, to increase overall efficiency
// when reshaping time buckets containing a huge amount of files
auto desc = size_tiered_compaction_strategy(_stcs_options).get_reshaping_job(std::move(ssts), schema, mode);
auto desc = size_tiered_compaction_strategy(_stcs_options).get_reshaping_job(std::move(ssts), schema, cfg);
if (!desc.sstables.empty()) {
return desc;
}
}
}
if (!single_window.empty()) {
maybe_trim_job(single_window, get_job_size(single_window), all_disjoint);
compaction_descriptor desc(std::move(single_window));
desc.options = compaction_type_options::make_reshape();
return desc;
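The trimming rule introduced above can be modelled with plain sizes in place of sstables. In this sketch `trim_params` and `trim_job` are hypothetical names, each sstable is represented only by its on-disk size in bytes, and `double` replaces the `float` constant `reshape_target_space_overhead`. It is an illustration of the loop, not the strategy's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct trim_params {
    std::size_t max_sstables;           // cap applied to overlapping inputs
    std::uint64_t free_storage_space;   // free bytes available to this shard
    double target_space_overhead = 0.1; // job may use about 10% of the free space
};

// Drops sstables from the back of `sizes` until the job fits the rules:
// overlapping jobs hold at most `max_sstables` inputs, and any job with more
// than two inputs stays within the target size.
void trim_job(std::vector<std::uint64_t>& sizes, bool is_disjoint, const trim_params& p) {
    constexpr std::size_t min_sstables = 2;
    const auto target_job_size =
        static_cast<std::uint64_t>(p.free_storage_space * p.target_space_overhead);
    std::uint64_t job_size = std::accumulate(sizes.begin(), sizes.end(), std::uint64_t(0));

    auto need_trimming = [&] {
        return (sizes.size() > p.max_sstables && !is_disjoint)
            || (sizes.size() > min_sstables && job_size > target_job_size);
    };
    while (need_trimming()) {
        job_size -= sizes.back();
        sizes.pop_back();
    }
}
```

The two-sstable floor means a job is never trimmed below two inputs, even when it still exceeds the target size.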


@@ -78,6 +78,7 @@ public:
// To prevent an explosion in the number of sstables we cap it.
// Better co-locate some windows into the same sstables than OOM.
static constexpr uint64_t max_data_segregation_window_count = 100;
static constexpr float reshape_target_space_overhead = 0.1f;
using bucket_t = std::vector<shared_sstable>;
enum class bucket_compaction_mode { none, size_tiered, major };
@@ -170,7 +171,7 @@ public:
return true;
}
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
};
}

View File

@@ -572,7 +572,7 @@ murmur3_partitioner_ignore_msb_bits: 12
force_schema_commit_log: true
# Time for which task manager task is kept in memory after it completes.
task_ttl_in_seconds: 10
# task_ttl_in_seconds: 0
# Use Raft to consistently manage schema information in the cluster.
# Refer to https://docs.scylladb.com/master/architecture/raft.html for more details.


@@ -852,6 +852,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/rjson.cc',
'utils/human_readable.cc',
'utils/histogram_metrics_helper.cc',
'utils/on_internal_error.cc',
'utils/pretty_printers.cc',
'converting_mutation_partition_applier.cc',
'readers/combined.cc',
@@ -1126,6 +1127,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/lister.cc',
'repair/repair.cc',
'repair/row_level.cc',
'repair/table_check.cc',
'exceptions/exceptions.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
@@ -1240,6 +1242,8 @@ api = ['api/api.cc',
Json2Code('api/api-doc/error_injection.json'),
'api/authorization_cache.cc',
Json2Code('api/api-doc/authorization_cache.json'),
'api/raft.cc',
Json2Code('api/api-doc/raft.json'),
]
alternator = [
@@ -1451,7 +1455,7 @@ deps['test/boost/bytes_ostream_test'] = [
"test/lib/log.cc",
]
deps['test/boost/input_stream_test'] = ['test/boost/input_stream_test.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']


@@ -338,6 +338,9 @@ functions::get(data_dictionary::database db,
if (!receiver_cf.has_value()) {
throw exceptions::invalid_request_exception("functions::get for token doesn't have a known column family");
}
if (schema == nullptr) {
throw exceptions::invalid_request_exception(format("functions::get for token cannot find {} table", *receiver_cf));
}
auto fun = ::make_shared<token_fct>(schema);
validate_types(db, keyspace, schema.get(), fun, provided_args, receiver_ks, receiver_cf);
return fun;


@@ -815,7 +815,7 @@ bool query_processor::has_more_results(cql3::internal_query_state& state) const
future<> query_processor::for_each_cql_result(
cql3::internal_query_state& state,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)>&& f) {
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)> f) {
do {
auto msg = co_await execute_paged_internal(state);
for (auto& row : *msg) {
@@ -1065,6 +1065,9 @@ void query_processor::migration_subscriber::on_update_aggregate(const sstring& k
void query_processor::migration_subscriber::on_update_view(
const sstring& ks_name,
const sstring& view_name, bool columns_changed) {
// scylladb/scylladb#16392 - Materialized views are also tables so we need at least handle
// them as such when changed.
on_update_column_family(ks_name, view_name, columns_changed);
}
void query_processor::migration_subscriber::on_update_tablet_metadata() {
@@ -1113,14 +1116,14 @@ future<> query_processor::query_internal(
db::consistency_level cl,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
auto query_state = create_paged_state(query_string, cl, values, page_size);
co_return co_await for_each_cql_result(query_state, std::move(f));
}
future<> query_processor::query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
}


@@ -307,7 +307,7 @@ public:
db::consistency_level cl,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
/*
* \brief iterate over all cql results using paging
@@ -322,7 +322,7 @@ public:
*/
future<> query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
class cache_internal_tag;
using cache_internal = bool_class<cache_internal_tag>;
@@ -479,7 +479,7 @@ private:
*/
future<> for_each_cql_result(
cql3::internal_query_state& state,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
/*!
* \brief check, based on the state if there are additional results


@@ -541,22 +541,32 @@ std::pair<std::optional<secondary_index::index>, expr::expression> statement_res
int chosen_index_score = 0;
expr::expression chosen_index_restrictions = expr::conjunction({});
for (const auto& index : sim.list_indexes()) {
auto cdef = _schema->get_column_definition(to_bytes(index.target_column()));
for (const expr::expression& restriction : index_restrictions()) {
if (has_partition_token(restriction, *_schema) || contains_multi_column_restriction(restriction)) {
continue;
}
expr::single_column_restrictions_map rmap = expr::get_single_column_restrictions_map(restriction);
const auto found = rmap.find(cdef);
if (found != rmap.end() && is_supported_by(found->second, index)
&& score(index) > chosen_index_score) {
chosen_index = index;
chosen_index_score = score(index);
chosen_index_restrictions = restriction;
}
// Several indexes may be usable for this query. When their score is tied,
// let's pick one by order of the columns mentioned in the restriction
// expression. This specific order isn't important (and maybe in the
// future we could plan a better order based on the specificity of each
// index), but it is critical that two coordinators - or the same
// coordinator over time - must choose the same index for the same query.
// Otherwise, paging can break (see issue #7969).
for (const expr::expression& restriction : index_restrictions()) {
if (has_partition_token(restriction, *_schema) || contains_multi_column_restriction(restriction)) {
continue;
}
expr::for_each_expression<expr::column_value>(restriction, [&](const expr::column_value& cval) {
auto& cdef = cval.col;
expr::expression col_restrictions = expr::conjunction {
.children = expr::extract_single_column_restrictions_for_column(restriction, *cdef)
};
for (const auto& index : sim.list_indexes()) {
if (cdef->name_as_text() == index.target_column() &&
expr::is_supported_by(col_restrictions, index) &&
score(index) > chosen_index_score) {
chosen_index = index;
chosen_index_score = score(index);
chosen_index_restrictions = restriction;
}
}
});
}
return {chosen_index, chosen_index_restrictions};
}
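Reviewer note: the reordered loops above make the tie-break depend on the order of columns in the restriction rather than on the order in which indexes happen to be listed. A minimal sketch of that idea follows. It assumes a simplified model: `index_info` and `choose_index` are hypothetical names, each index targets one column, and the score is a plain integer. It is not the actual `statement_restrictions` code.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for secondary_index::index (illustration only).
struct index_info {
    std::string name;
    std::string target_column;
    int score;
};

// Walk the restricted columns in the order they appear in the query, and for
// each column walk the index list. The strict '>' means a later candidate
// replaces the current choice only when it scores strictly higher, so a tie
// is resolved in favor of the column mentioned first in the restriction.
std::optional<index_info> choose_index(const std::vector<std::string>& restricted_columns,
                                       const std::vector<index_info>& indexes) {
    std::optional<index_info> chosen;
    int chosen_score = 0;
    for (const auto& column : restricted_columns) {
        for (const auto& idx : indexes) {
            if (idx.target_column == column && idx.score > chosen_score) {
                chosen = idx;
                chosen_score = idx.score;
            }
        }
    }
    return chosen;
}
```

With tied scores, two coordinators that list the same indexes in different orders still agree on the result, which is the property the comment in the hunk calls critical for paging.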
@@ -1132,13 +1142,14 @@ bool starts_before_start(
const auto len1 = r1.start()->value().representation().size();
const auto len2 = r2.start()->value().representation().size();
if (len1 == len2) { // The values truly are equal.
// (a)>=(1) starts before (a)>(1)
return r1.start()->is_inclusive() && !r2.start()->is_inclusive();
} else if (len1 < len2) { // r1 start is a prefix of r2 start.
// (a)>=(1) starts before (a,b)>=(1,1), but (a)>(1) doesn't.
return r1.start()->is_inclusive();
} else { // r2 start is a prefix of r1 start.
// (a,b)>=(1,1) starts before (a)>(1) but after (a)>=(1).
return r2.start()->is_inclusive();
return !r2.start()->is_inclusive();
}
}
@@ -1163,6 +1174,7 @@ bool starts_before_or_at_end(
const auto len1 = r1.start()->value().representation().size();
const auto len2 = r2.end()->value().representation().size();
if (len1 == len2) { // The values truly are equal.
// (a)>=(1) starts at end of (a)<=(1)
return r1.start()->is_inclusive() && r2.end()->is_inclusive();
} else if (len1 < len2) { // r1 start is a prefix of r2 end.
// (a)>=(1) starts before (a,b)<=(1,1) ends, but (a)>(1) doesn't.
@@ -1194,6 +1206,7 @@ bool ends_before_end(
const auto len1 = r1.end()->value().representation().size();
const auto len2 = r2.end()->value().representation().size();
if (len1 == len2) { // The values truly are equal.
// (a)<(1) ends before (a)<=(1) ends
return !r1.end()->is_inclusive() && r2.end()->is_inclusive();
} else if (len1 < len2) { // r1 end is a prefix of r2 end.
// (a)<(1) ends before (a,b)<=(1,1), but (a)<=(1) doesn't.
@@ -1209,7 +1222,10 @@ std::optional<query::clustering_range> intersection(
const query::clustering_range& r1,
const query::clustering_range& r2,
const clustering_key_prefix::prefix_equal_tri_compare& cmp) {
// Assume r1's start is to the left of r2's start.
// If needed, swap r1 and r2 so that r1's start is to the left of r2's
// start. Note that to avoid infinite recursion (#18688) the function
// starts_before_start() must never return true for both (r1,r2) and
// (r2,r1) - in other words, it must be a *strict* partial order.
if (starts_before_start(r2, r1, cmp)) {
return intersection(r2, r1, cmp);
}
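Reviewer note: the swap in `intersection()` terminates only if `starts_before_start()` never holds in both directions. The sketch below models just the lower bound of a range, under stated assumptions: `range_start` is a hypothetical type, key components are plain integers, and every range has a start bound. It mirrors the three branches of the fixed function so the asymmetry can be checked directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the start bound of a clustering range:
// a key prefix plus a flag saying whether the bound is inclusive.
struct range_start {
    std::vector<int> prefix;
    bool inclusive;
};

// Returns true if b1 starts strictly before b2.
bool starts_before_start(const range_start& b1, const range_start& b2) {
    const std::size_t common = std::min(b1.prefix.size(), b2.prefix.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (b1.prefix[i] != b2.prefix[i]) {
            return b1.prefix[i] < b2.prefix[i];
        }
    }
    if (b1.prefix.size() == b2.prefix.size()) {
        // (a)>=(1) starts before (a)>(1)
        return b1.inclusive && !b2.inclusive;
    } else if (b1.prefix.size() < b2.prefix.size()) {
        // (a)>=(1) starts before (a,b)>=(1,1), but (a)>(1) doesn't.
        return b1.inclusive;
    } else {
        // (a,b)>=(1,1) starts before (a)>(1) but after (a)>=(1).
        return !b2.inclusive;
    }
}
```

With the previous last branch (`return b2.inclusive`), both `starts_before_start((a)>=(1), (a,b)>=(1,1))` and the reversed call would return true, so the swap would repeat forever.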


@@ -2004,7 +2004,10 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
)
&& !restrictions->need_filtering() // No filtering
&& group_by_cell_indices->empty() // No GROUP BY
&& db.get_config().enable_parallelized_aggregation();
&& db.get_config().enable_parallelized_aggregation()
&& !( // Do not parallelize the request if it's single partition read
restrictions->partition_key_restrictions_is_all_eq()
&& restrictions->partition_key_restrictions_size() == schema->partition_key_size());
};
if (_parameters->is_prune_materialized_view()) {


@@ -151,13 +151,15 @@ static bytes from_json_object_aux(const map_type_impl& t, const rjson::value& va
std::map<bytes, bytes, serialized_compare> raw_map(t.get_keys_type()->as_less_comparator());
for (auto it = value.MemberBegin(); it != value.MemberEnd(); ++it) {
bytes value = from_json_object(*t.get_values_type(), it->value);
if (t.get_keys_type()->underlying_type() == ascii_type ||
t.get_keys_type()->underlying_type() == utf8_type) {
// For all native (non-collection, non-tuple) key types, they are
// represented as a string in JSON. For more elaborate types, they
// can also be a string representation of another JSON type, which
// needs to be reparsed as JSON. For example,
// map<frozen<list<int>>, int> will be represented as:
// { "[1, 3, 6]": 3, "[]": 0, "[1, 2]": 2 }
if (t.get_keys_type()->underlying_type()->is_native()) {
raw_map.emplace(from_json_object(*t.get_keys_type(), it->name), std::move(value));
} else {
// Keys in maps can only be strings in JSON, but they can also be a string representation
// of another JSON type, which needs to be reparsed. Example - map<frozen<list<int>>, int>
// will be represented like this: { "[1, 3, 6]": 3, "[]": 0, "[1, 2]": 2 }
try {
rjson::value map_key = rjson::parse(rjson::to_string_view(it->name));
raw_map.emplace(from_json_object(*t.get_keys_type(), map_key), std::move(value));


@@ -135,7 +135,7 @@ future<> db::batchlog_manager::stop() {
}
future<size_t> db::batchlog_manager::count_all_batches() const {
sstring query = format("SELECT count(*) FROM {}.{}", system_keyspace::NAME, system_keyspace::BATCHLOG);
sstring query = format("SELECT count(*) FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG);
return _qp.execute_internal(query, cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> rs) {
return size_t(rs->one().get_as<int64_t>("count"));
});
@@ -154,26 +154,26 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
auto limiter = make_lw_shared<utils::rate_limiter>(throttle);
auto batch = [this, limiter](const cql3::untyped_result_set::row& row) {
auto batch = [this, limiter](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto timeout = get_batch_log_timeout();
if (db_clock::now() < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
return make_ready_future<>();
return make_ready_future<stop_iteration>(stop_iteration::no);
}
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Skipping logged batch because of unknown version");
return make_ready_future<>();
return make_ready_future<stop_iteration>(stop_iteration::no);
}
auto version = row.get_as<int32_t>("version");
if (version != netw::messaging_service::current_version) {
blogger.warn("Skipping logged batch because of incorrect version");
return make_ready_future<>();
return make_ready_future<stop_iteration>(stop_iteration::no);
}
auto data = row.get_blob("data");
@@ -255,49 +255,20 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
});
}).then([] { return make_ready_future<stop_iteration>(stop_iteration::no); });
};
return seastar::with_gate(_gate, [this, batch = std::move(batch)] {
return seastar::with_gate(_gate, [this, batch = std::move(batch)] () mutable {
blogger.debug("Started replayAllFailedBatches (cpu {})", this_shard_id());
typedef ::shared_ptr<cql3::untyped_result_set> page_ptr;
sstring query = format("SELECT id, data, written_at, version FROM {}.{} LIMIT {:d}", system_keyspace::NAME, system_keyspace::BATCHLOG, page_size);
return _qp.execute_internal(query, cql3::query_processor::cache_internal::yes).then([this, batch = std::move(batch)](page_ptr page) {
return do_with(std::move(page), [this, batch = std::move(batch)](page_ptr & page) mutable {
return repeat([this, &page, batch = std::move(batch)]() mutable {
if (page->empty()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
auto id = page->back().get_as<utils::UUID>("id");
return parallel_for_each(*page, batch).then([this, &page, id]() {
if (page->size() < page_size) {
return make_ready_future<stop_iteration>(stop_iteration::yes); // we've exhausted the batchlog, next query would be empty.
}
sstring query = format("SELECT id, data, written_at, version FROM {}.{} WHERE token(id) > token(?) LIMIT {:d}",
system_keyspace::NAME,
system_keyspace::BATCHLOG,
page_size);
return _qp.execute_internal(query, {id}, cql3::query_processor::cache_internal::yes).then([&page](auto res) {
page = std::move(res);
return make_ready_future<stop_iteration>(stop_iteration::no);
});
});
});
});
}).then([] {
// TODO FIXME : cleanup()
#if 0
ColumnFamilyStore cfs = Keyspace.open(SystemKeyspace.NAME).getColumnFamilyStore(SystemKeyspace.BATCHLOG);
cfs.forceBlockingFlush();
Collection<Descriptor> descriptors = new ArrayList<>();
for (SSTableReader sstr : cfs.getSSTables())
descriptors.add(sstr.descriptor);
if (!descriptors.isEmpty()) // don't pollute the logs if there is nothing to compact.
CompactionManager.instance.submitUserDefined(cfs, descriptors, Integer.MAX_VALUE).get();
#endif
return _qp.query_internal(
format("SELECT id, data, written_at, version FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG),
db::consistency_level::ONE,
{},
page_size,
std::move(batch)).then([this] {
// Replaying batches could have generated tombstones, flush to disk,
// where they can be compacted away.
return replica::database::flush_table_on_all_shards(_qp.proxy().get_db(), system_keyspace::NAME, system_keyspace::BATCHLOG);
}).then([] {
blogger.debug("Finished replayAllFailedBatches");
});
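Reviewer note: the replay loop now delegates paging to `query_internal()`, and the per-row callback signals whether to continue by returning `stop_iteration`. The sketch below shows that control flow in plain synchronous C++, without Seastar futures. The names `for_each_paged` and the integer rows are assumptions made for illustration, not the real API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

enum class stop_iteration { no, yes };

// Feeds rows to `f` one page at a time and returns how many rows were visited.
// Iteration ends when the rows run out or when `f` returns stop_iteration::yes.
std::size_t for_each_paged(const std::vector<int>& rows, std::size_t page_size,
                           const std::function<stop_iteration(int)>& f) {
    if (page_size == 0) {
        return 0; // nothing can be fetched with an empty page
    }
    std::size_t visited = 0;
    for (std::size_t page_start = 0; page_start < rows.size(); page_start += page_size) {
        const std::size_t page_end = std::min(rows.size(), page_start + page_size);
        for (std::size_t i = page_start; i < page_end; ++i) {
            ++visited;
            if (f(rows[i]) == stop_iteration::yes) {
                return visited;
            }
        }
    }
    return visited;
}
```

In the hunk above the batch callback always returns `stop_iteration::no`, so every batchlog row is visited. The caller no longer has to track the last `id` and reissue a `token(id) > token(?)` query by hand.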


@@ -489,6 +489,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Adjusts the sensitivity of the failure detector on an exponential scale. Generally this setting never needs adjusting.\n"
"Related information: Failure detection and recovery")
, failure_detector_timeout_in_ms(this, "failure_detector_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 20 * 1000, "Maximum time between two successful echo message before gossip mark a node down in milliseconds.\n")
, direct_failure_detector_ping_timeout_in_ms(this, "direct_failure_detector_ping_timeout_in_ms", value_status::Used, 600, "Duration after which the direct failure detector aborts a ping message, so the next ping can start.\n"
"Note: this failure detector is used by Raft, and is different from gossiper's failure detector (configured by `failure_detector_timeout_in_ms`).\n")
/**
* @Group Performance tuning properties
* @GroupDescription Tuning performance and system resource utilization, including commit log, compaction, memory, disk I/O, CPU, reads, and writes.
@@ -678,6 +680,9 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"The maximum number of tombstones a query can scan before aborting.")
, query_tombstone_page_limit(this, "query_tombstone_page_limit", liveness::LiveUpdate, value_status::Used, 10000,
"The number of tombstones after which a query cuts a page, even if not full or even empty.")
, query_page_size_in_bytes(this, "query_page_size_in_bytes", liveness::LiveUpdate, value_status::Used, 1 << 20,
"The size of pages in bytes, after a page accumulates this much data, the page is cut and sent to the client."
" Setting a too large value increases the risk of OOM.")
/**
* @Group Network timeout settings
*/
@@ -926,6 +931,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, true, "Set true to use enable repair based node operations instead of streaming based")
, allowed_repair_based_node_ops(this, "allowed_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, "replace,removenode,rebuild,bootstrap,decommission", "A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild")
, enable_compacting_data_for_streaming_and_repair(this, "enable_compacting_data_for_streaming_and_repair", liveness::LiveUpdate, value_status::Used, true, "Enable the compacting reader, which compacts the data for streaming and repair (load'n'stream included) before sending it to, or synchronizing it with peers. Can reduce the amount of data to be processed by removing dead data, but adds CPU overhead.")
, repair_partition_count_estimation_ratio(this, "repair_partition_count_estimation_ratio", liveness::LiveUpdate, value_status::Used, 0.1,
"Specify the fraction of partitions written by repair out of the total partitions. The value is currently only used for bloom filter estimation. Value is between 0 and 1.")
, ring_delay_ms(this, "ring_delay_ms", value_status::Used, 30 * 1000, "Time a node waits to hear from other nodes before joining the ring in milliseconds. Same as -Dcassandra.ring_delay_ms in cassandra.")
, shadow_round_ms(this, "shadow_round_ms", value_status::Used, 300 * 1000, "The maximum gossip shadow round time. Can be used to reduce the gossip feature check time during node boot up.")
, fd_max_interval_ms(this, "fd_max_interval_ms", value_status::Used, 2 * 1000, "The maximum failure_detector interval time in milliseconds. Interval larger than the maximum will be ignored. Larger cluster may need to increase the default.")
@@ -944,6 +951,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, unspooled_dirty_soft_limit(this, "unspooled_dirty_soft_limit", value_status::Used, 0.6, "Soft limit of unspooled dirty memory expressed as a portion of the hard limit")
, sstable_summary_ratio(this, "sstable_summary_ratio", value_status::Used, 0.0005, "Enforces that 1 byte of summary is written for every N (2000 by default) "
"bytes written to data file. Value must be between 0 and 1.")
, components_memory_reclaim_threshold(this, "components_memory_reclaim_threshold", liveness::LiveUpdate, value_status::Used, .2, "Ratio of available memory for all in-memory components of SSTables in a shard beyond which the memory will be reclaimed from components until it falls back under the threshold. Currently, this limit is only enforced for bloom filters.")
, large_memory_allocation_warning_threshold(this, "large_memory_allocation_warning_threshold", value_status::Used, size_t(1) << 20, "Warn about memory allocations above this size; set to zero to disable")
, enable_deprecated_partitioners(this, "enable_deprecated_partitioners", value_status::Used, false, "Enable the byteordered and random partitioners. These partitioners are deprecated and will be removed in a future version.")
, enable_keyspace_column_family_metrics(this, "enable_keyspace_column_family_metrics", value_status::Used, false, "Enable per keyspace and per column family metrics reporting")
@@ -983,6 +991,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Start serializing reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, reader_concurrency_semaphore_kill_limit_multiplier(this, "reader_concurrency_semaphore_kill_limit_multiplier", liveness::LiveUpdate, value_status::Used, 4,
"Start killing reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, reader_concurrency_semaphore_cpu_concurrency(this, "reader_concurrency_semaphore_cpu_concurrency", liveness::LiveUpdate, value_status::Used, 1,
"Admit new reads while there are less than this number of requests that need CPU.")
, twcs_max_window_count(this, "twcs_max_window_count", liveness::LiveUpdate, value_status::Used, 50,
"The maximum number of compaction windows allowed when making use of TimeWindowCompactionStrategy. A setting of 0 effectively disables the restriction.")
, initial_sstable_loading_concurrency(this, "initial_sstable_loading_concurrency", value_status::Used, 4u,


@@ -196,6 +196,7 @@ public:
named_value<bool> snapshot_before_compaction;
named_value<uint32_t> phi_convict_threshold;
named_value<uint32_t> failure_detector_timeout_in_ms;
named_value<uint32_t> direct_failure_detector_ping_timeout_in_ms;
named_value<sstring> commitlog_sync;
named_value<uint32_t> commitlog_segment_size_in_mb;
named_value<uint32_t> schema_commitlog_segment_size_in_mb;
@@ -254,6 +255,7 @@ public:
named_value<uint32_t> tombstone_warn_threshold;
named_value<uint32_t> tombstone_failure_threshold;
named_value<uint64_t> query_tombstone_page_limit;
named_value<uint64_t> query_page_size_in_bytes;
named_value<uint32_t> range_request_timeout_in_ms;
named_value<uint32_t> read_request_timeout_in_ms;
named_value<uint32_t> counter_write_request_timeout_in_ms;
@@ -329,6 +331,7 @@ public:
named_value<bool> enable_repair_based_node_ops;
named_value<sstring> allowed_repair_based_node_ops;
named_value<bool> enable_compacting_data_for_streaming_and_repair;
named_value<double> repair_partition_count_estimation_ratio;
named_value<uint32_t> ring_delay_ms;
named_value<uint32_t> shadow_round_ms;
named_value<uint32_t> fd_max_interval_ms;
@@ -346,6 +349,7 @@ public:
named_value<unsigned> murmur3_partitioner_ignore_msb_bits;
named_value<double> unspooled_dirty_soft_limit;
named_value<double> sstable_summary_ratio;
named_value<double> components_memory_reclaim_threshold;
named_value<size_t> large_memory_allocation_warning_threshold;
named_value<bool> enable_deprecated_partitioners;
named_value<bool> enable_keyspace_column_family_metrics;
@@ -369,6 +373,7 @@ public:
named_value<uint64_t> max_memory_for_unlimited_query_hard_limit;
named_value<uint32_t> reader_concurrency_semaphore_serialize_limit_multiplier;
named_value<uint32_t> reader_concurrency_semaphore_kill_limit_multiplier;
named_value<uint32_t> reader_concurrency_semaphore_cpu_concurrency;
named_value<uint32_t> twcs_max_window_count;
named_value<unsigned> initial_sstable_loading_concurrency;
named_value<bool> enable_3_1_0_compatibility_mode;


@@ -155,7 +155,7 @@ future<> cql_table_large_data_handler::try_record(std::string_view large_table,
const auto sstable_name = large_data_handler::sst_filename(sst);
std::string pk_str = key_to_str(partition_key.to_partition_key(s), s);
auto timestamp = db_clock::now();
large_data_logger.warn("Writing large {} {}/{}: {}{} ({} bytes) to {}", desc, ks_name, cf_name, pk_str, extra_path, size, sstable_name);
large_data_logger.warn("Writing large {} {}/{}: {} ({} bytes) to {}", desc, ks_name, cf_name, extra_path, size, sstable_name);
return _sys_ks->execute_cql(req, ks_name, cf_name, sstable_name, size, pk_str, timestamp, args...)
.discard_result()
.handle_exception([ks_name, cf_name, large_table, sstable_name] (std::exception_ptr ep) {
@@ -182,10 +182,10 @@ future<> cql_table_large_data_handler::internal_record_large_cells(const sstable
if (clustering_key) {
const schema &s = *sst.get_schema();
auto ck_str = key_to_str(*clustering_key, s);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, format("/{}/{}", ck_str, column_name), extra_fields, ck_str, column_name);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, ck_str, column_name);
} else {
auto desc = format("static {}", cell_type);
return try_record("cell", sst, partition_key, int64_t(cell_size), desc, format("//{}", column_name), extra_fields, data_value::make_null(utf8_type), column_name);
return try_record("cell", sst, partition_key, int64_t(cell_size), desc, column_name, extra_fields, data_value::make_null(utf8_type), column_name);
}
}
@@ -197,10 +197,10 @@ future<> cql_table_large_data_handler::internal_record_large_cells_and_collectio
if (clustering_key) {
const schema &s = *sst.get_schema();
auto ck_str = key_to_str(*clustering_key, s);
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, format("/{}/{}", ck_str, column_name), extra_fields, ck_str, column_name, data_value((int64_t)collection_elements));
return try_record("cell", sst, partition_key, int64_t(cell_size), cell_type, column_name, extra_fields, ck_str, column_name, data_value((int64_t)collection_elements));
} else {
auto desc = format("static {}", cell_type);
return try_record("cell", sst, partition_key, int64_t(cell_size), desc, format("//{}", column_name), extra_fields, data_value::make_null(utf8_type), column_name, data_value((int64_t)collection_elements));
return try_record("cell", sst, partition_key, int64_t(cell_size), desc, column_name, extra_fields, data_value::make_null(utf8_type), column_name, data_value((int64_t)collection_elements));
}
}
@@ -210,7 +210,7 @@ future<> cql_table_large_data_handler::record_large_rows(const sstables::sstable
if (clustering_key) {
const schema &s = *sst.get_schema();
std::string ck_str = key_to_str(*clustering_key, s);
return try_record("row", sst, partition_key, int64_t(row_size), "row", format("/{}", ck_str), extra_fields, ck_str);
return try_record("row", sst, partition_key, int64_t(row_size), "row", "", extra_fields, ck_str);
} else {
return try_record("row", sst, partition_key, int64_t(row_size), "static row", "", extra_fields, data_value::make_null(utf8_type));
}


@@ -55,6 +55,10 @@ public:
return ser::serialize_to_buffer<bytes>(_paxos_gc_sec);
}
std::string options_to_string() const override {
return std::to_string(_paxos_gc_sec);
}
static int32_t deserialize(const bytes_view& buffer) {
return ser::deserialize_from_buffer(buffer, boost::type<int32_t>());
}


@@ -973,7 +973,7 @@ future<> merge_schema(sharded<db::system_keyspace>& sys_ks, distributed<service:
if (this_shard_id() != 0) {
// mutations must be applied on the owning shard (0).
co_await smp::submit_to(0, [&, fmuts = freeze(mutations)] () mutable -> future<> {
return merge_schema(sys_ks, proxy, feat, unfreeze(fmuts));
return merge_schema(sys_ks, proxy, feat, unfreeze(fmuts), reload);
});
co_return;
}


@@ -493,37 +493,56 @@ mutation_partition& view_updates::partition_for(partition_key&& key) {
}
size_t view_updates::op_count() const {
return _op_count++;;
return _op_count;
}
row_marker view_updates::compute_row_marker(const clustering_or_static_row& base_row) const {
/*
* We need to compute both the timestamp and expiration.
* We need to compute both the timestamp and expiration for view rows.
*
* There are 3 cases:
* 1) There is a column that is not in the base PK but is in the view PK. In that case, as long as that column
* lives, the view entry does too, but as soon as it expires (or is deleted for that matter) the entry also
* should expire. So the expiration for the view is the one of that column, regardless of any other expiration.
* To take an example of that case, if you have:
* CREATE TABLE t (a int, b int, c int, PRIMARY KEY (a, b))
* CREATE MATERIALIZED VIEW mv AS SELECT * FROM t WHERE c IS NOT NULL AND a IS NOT NULL AND b IS NOT NULL PRIMARY KEY (c, a, b)
* INSERT INTO t(a, b) VALUES (0, 0) USING TTL 3;
* UPDATE t SET c = 0 WHERE a = 0 AND b = 0;
* then even after 3 seconds elapsed, the row will still exist (it just won't have a "row marker" anymore) and so
* the MV should still have a corresponding entry.
* This cell determines the liveness of the view row.
* 2) The columns for the base and view PKs are exactly the same, and all base columns are selected by the view.
* In that case, all components (marker, deletion and cells) are the same and trivially mapped.
* 3) The columns for the base and view PKs are exactly the same, but some base columns are not selected in the view.
* Use the max timestamp out of the base row marker and all the unselected columns - this ensures we can keep the
* view row alive. Do the same thing for the expiration, if the marker is dead or will expire, and so
* will all unselected columns.
* Below there are several distinct cases depending on how many new key
* columns the view has - i.e., how many of the view's key columns were
* regular columns in the base. base_regular_columns_in_view_pk.size():
*
* Zero new key columns:
* The view row's key is composed only of base key columns, and those
* cannot be changed in an update, so the view row remains alive as
* long as the base row is alive. We need to return the same row
* marker as the base for the view - to keep an empty view row alive
* for as long as an empty base row exists.
* Note that in this case, if there are *unselected* base columns, we
* may need to keep an empty view row alive even without a row marker
* because the base row (which has additional columns) is still alive.
* For that we have the "virtual columns" feature: In the zero new
* key columns case, we put unselected columns in the view as empty
* columns, to keep the view row alive.
*
* One new key column:
* In this case, there is a regular base column that is part of the
* view key. This regular column can be added or deleted in an update,
* or its expiration be set, and those can cause the view row -
* including its row marker - to need to appear or disappear as well.
* So the liveness of the cell of this one column determines the liveness
* of the view row and the row marker that we return.
*
* Two or more new key columns:
* This case is explicitly NOT supported in CQL - one cannot create a
* view with more than one base-regular column in its key. In general
* picking one liveness (timestamp and expiration) is not possible
* if there are multiple regular base columns in the view key, as
* those can have different liveness.
* However, we do allow this case for Alternator - we need to allow
* the case of two (but not more) because the DynamoDB API allows
* creating a GSI whose two key columns (hash and range key) were
* regular columns.
* We can support this case in Alternator because it doesn't use
* expiration (the "TTL" it does support is different), and doesn't
* support user-defined timestamps. But, the two columns can still
* have different timestamps - this happens if an update modifies
* just one of them. In this case the timestamp of the view update
* (and that of the row marker we return) is the later of these two
* updated columns.
*/
// WARNING: The code assumes that if multiple regular base columns are present in the view key,
// they share liveness information. It's true especially in the only case currently allowed by CQL,
// which assumes there's up to one non-pk column in the view key. It's also true in alternator,
// which does not carry TTL information.
const auto& col_ids = base_row.is_clustering_row()
? _base_info->base_regular_columns_in_view_pk()
: _base_info->base_static_columns_in_view_pk();
@@ -531,7 +550,20 @@ row_marker view_updates::compute_row_marker(const clustering_or_static_row& base
auto& def = _base->column_at(base_row.column_kind(), col_ids[0]);
// Note: multi-cell columns can't be part of the primary key.
auto cell = base_row.cells().cell_at(col_ids[0]).as_atomic_cell(def);
return cell.is_live_and_has_ttl() ? row_marker(cell.timestamp(), cell.ttl(), cell.expiry()) : row_marker(cell.timestamp());
auto ts = cell.timestamp();
if (col_ids.size() > 1){
// As explained above, this case only happens in Alternator,
// and we may need to pick a higher ts:
auto& second_def = _base->column_at(base_row.column_kind(), col_ids[1]);
auto second_cell = base_row.cells().cell_at(col_ids[1]).as_atomic_cell(second_def);
auto second_ts = second_cell.timestamp();
ts = std::max(ts, second_ts);
// Alternator isn't supposed to have TTL or more than two col_ids!
if (col_ids.size() != 2 || cell.is_live_and_has_ttl() || second_cell.is_live_and_has_ttl()) [[unlikely]] {
utils::on_internal_error(format("Unexpected col_ids length {} or has TTL", col_ids.size()));
}
}
return cell.is_live_and_has_ttl() ? row_marker(ts, cell.ttl(), cell.expiry()) : row_marker(ts);
}
return base_row.marker();
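Reviewer note: the timestamp rule described in the comment can be sketched in isolation. This is a simplified model under stated assumptions: `key_cell` and `view_marker_timestamp` are hypothetical names, cells carry only a timestamp and a TTL flag, and `std::logic_error` stands in for `utils::on_internal_error`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an atomic cell of a base regular column
// that is part of the view's primary key.
struct key_cell {
    std::int64_t timestamp;
    bool has_ttl;
};

// One new key column: use that cell's timestamp.
// Two new key columns (the Alternator-only case): use the later timestamp,
// and reject TTLs, since two cells could not share one expiration.
std::int64_t view_marker_timestamp(const std::vector<key_cell>& cells) {
    if (cells.empty() || cells.size() > 2) {
        throw std::logic_error("unexpected number of new view key columns");
    }
    std::int64_t ts = cells[0].timestamp;
    if (cells.size() == 2) {
        if (cells[0].has_ttl || cells[1].has_ttl) {
            throw std::logic_error("TTL is not expected with two new view key columns");
        }
        ts = std::max(ts, cells[1].timestamp);
    }
    return ts;
}
```

Taking the maximum covers an update that modifies only one of the two columns, where the untouched column still carries an older timestamp.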
@@ -930,8 +962,22 @@ void view_updates::do_delete_old_entry(const partition_key& base_key, const clus
// Note: multi-cell columns can't be part of the primary key.
auto& def = _base->column_at(kind, col_ids[0]);
auto cell = existing.cells().cell_at(col_ids[0]).as_atomic_cell(def);
auto ts = cell.timestamp();
if (col_ids.size() > 1) {
// This is the Alternator-only support for two regular base
// columns that become view key columns. See explanation in
// view_updates::compute_row_marker().
auto& second_def = _base->column_at(kind, col_ids[1]);
auto second_cell = existing.cells().cell_at(col_ids[1]).as_atomic_cell(second_def);
auto second_ts = second_cell.timestamp();
ts = std::max(ts, second_ts);
// Alternator isn't supposed to have more than two col_ids!
if (col_ids.size() != 2) [[unlikely]] {
utils::on_internal_error(format("Unexpected col_ids length {}", col_ids.size()));
}
}
if (cell.is_live()) {
r->apply(shadowable_tombstone(cell.timestamp(), now));
r->apply(shadowable_tombstone(ts, now));
}
} else {
// "update" caused the base row to have been deleted, and !col_id
@@ -1316,11 +1362,12 @@ void view_update_builder::generate_update(static_row&& update, const tombstone&
future<stop_iteration> view_update_builder::on_results() {
constexpr size_t max_rows_for_view_updates = 100;
size_t rows_for_view_updates = std::accumulate(_view_updates.begin(), _view_updates.end(), 0, [] (size_t acc, const view_updates& vu) {
return acc + vu.op_count();
});
const bool stop_updates = rows_for_view_updates >= max_rows_for_view_updates;
auto should_stop_updates = [this] () -> bool {
size_t rows_for_view_updates = std::accumulate(_view_updates.begin(), _view_updates.end(), 0, [] (size_t acc, const view_updates& vu) {
return acc + vu.op_count();
});
return rows_for_view_updates >= max_rows_for_view_updates;
};
if (_update && !_update->is_end_of_partition() && _existing && !_existing->is_end_of_partition()) {
auto cmp = position_in_partition::tri_compare(*_schema)(_update->position(), _existing->position());
if (cmp < 0) {
@@ -1343,7 +1390,7 @@ future<stop_iteration> view_update_builder::on_results() {
: std::nullopt;
generate_update(std::move(update), _update_partition_tombstone, std::move(existing), _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_updates();
return should_stop_updates() ? stop() : advance_updates();
}
if (cmp > 0) {
// We have something existing but no update (which will happen either because it's a range tombstone marker in
@@ -1379,7 +1426,7 @@ future<stop_iteration> view_update_builder::on_results() {
generate_update(std::move(update), _update_partition_tombstone, { std::move(existing) }, _existing_partition_tombstone);
}
}
return stop_updates ? stop () : advance_existings();
return should_stop_updates() ? stop () : advance_existings();
}
// We're updating a row that had pre-existing data
if (_update->is_range_tombstone_change()) {
@@ -1401,8 +1448,9 @@ future<stop_iteration> view_update_builder::on_results() {
mutation_fragment_v2::printer(*_schema, *_update), mutation_fragment_v2::printer(*_schema, *_existing)));
}
generate_update(std::move(*_update).as_static_row(), _update_partition_tombstone, { std::move(*_existing).as_static_row() }, _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_all();
return should_stop_updates() ? stop() : advance_all();
}
auto tombstone = std::max(_update_partition_tombstone, _update_current_tombstone);
@@ -1417,7 +1465,7 @@ future<stop_iteration> view_update_builder::on_results() {
auto update = static_row();
generate_update(std::move(update), _update_partition_tombstone, { std::move(existing) }, _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_existings();
return should_stop_updates() ? stop() : advance_existings();
}
// If we have updates and it's a range tombstone, it removes nothing pre-existing, so we can ignore it
@@ -1438,7 +1486,7 @@ future<stop_iteration> view_update_builder::on_results() {
: std::nullopt;
generate_update(std::move(*_update).as_static_row(), _update_partition_tombstone, std::move(existing), _existing_partition_tombstone);
}
return stop_updates ? stop() : advance_updates();
return should_stop_updates() ? stop() : advance_updates();
}
return stop();
@@ -1619,6 +1667,13 @@ static bool should_update_synchronously(const schema& s) {
return *tag_opt == "true";
}
size_t memory_usage_of(const frozen_mutation_and_schema& mut) {
// Overhead of sending a view mutation, in terms of data structures used by the storage_proxy, as well as possible background tasks
// allocated for a remote view update.
constexpr size_t base_overhead_bytes = 2288;
return base_overhead_bytes + mut.fm.representation().size();
}
// Take the view mutations generated by generate_view_updates(), which pertain
// to a modification of a single base partition, and apply them to the
// appropriate paired replicas. This is done asynchronously - we do not wait
@@ -1643,7 +1698,7 @@ future<> view_update_generator::mutate_MV(
bool network_topology = dynamic_cast<const locator::network_topology_strategy*>(&ks.get_replication_strategy());
auto target_endpoint = get_view_natural_endpoint(ermp, network_topology, base_token, view_token);
auto remote_endpoints = ermp->get_pending_endpoints(view_token);
auto sem_units = pending_view_updates.split(mut.fm.representation().size());
auto sem_units = seastar::make_lw_shared<db::timeout_semaphore_units>(pending_view_updates.split(memory_usage_of(mut)));
const bool update_synchronously = should_update_synchronously(*mut.s);
if (update_synchronously) {
@@ -1691,7 +1746,7 @@ future<> view_update_generator::mutate_MV(
mut.s->ks_name(), mut.s->cf_name(), base_token, view_token);
local_view_update = _proxy.local().mutate_mv_locally(mut.s, *mut_ptr, tr_state, db::commitlog::force_sync::no).then_wrapped(
[s = mut.s, &stats, &cf_stats, tr_state, base_token, view_token, my_address, mut_ptr = std::move(mut_ptr),
units = sem_units.split(sem_units.count())] (future<>&& f) {
sem_units] (future<>&& f) {
--stats.writes;
if (f.failed()) {
++stats.view_updates_failed_local;
@@ -1728,7 +1783,7 @@ future<> view_update_generator::mutate_MV(
schema_ptr s = mut.s;
future<> view_update = apply_to_remote_endpoints(_proxy.local(), std::move(ermp), *target_endpoint, std::move(remote_endpoints), std::move(mut), base_token, view_token, allow_hints, tr_state).then_wrapped(
[s = std::move(s), &stats, &cf_stats, tr_state, base_token, view_token, target_endpoint, updates_pushed_remote,
units = sem_units.split(sem_units.count()), apply_update_synchronously] (future<>&& f) mutable {
sem_units, apply_update_synchronously] (future<>&& f) mutable {
if (f.failed()) {
stats.view_updates_failed_remote += updates_pushed_remote;
cf_stats.total_view_updates_failed_remote += updates_pushed_remote;
@@ -2255,7 +2310,7 @@ future<> view_builder::do_build_step() {
}
}
}).handle_exception([] (std::exception_ptr ex) {
vlogger.warn("Unexcepted error executing build step: {}. Ignored.", std::current_exception());
vlogger.warn("Unexcepted error executing build step: {}. Ignored.", ex);
});
}

View File

@@ -209,7 +209,7 @@ class view_updates final {
schema_ptr _base;
base_info_ptr _base_info;
std::unordered_map<partition_key, mutation_partition, partition_key::hashing, partition_key::equality> _updates;
mutable size_t _op_count = 0;
size_t _op_count = 0;
const bool _backing_secondary_index;
public:
explicit view_updates(view_and_base vab, bool backing_secondary_index)
@@ -318,6 +318,8 @@ future<query::clustering_row_ranges> calculate_affected_clustering_ranges(
bool needs_static_row(const mutation_partition& mp, const std::vector<view_and_base>& views);
size_t memory_usage_of(const frozen_mutation_and_schema& mut);
/**
* create_virtual_column() adds a "virtual column" to a schema builder.
* The definition of a "virtual column" is based on the given definition

View File

@@ -234,12 +234,12 @@ void view_update_generator::do_abort() noexcept {
}
vug_logger.info("Terminating background fiber");
_db.unplug_view_update_generator();
_as.request_abort();
_pending_sstables.signal();
}
future<> view_update_generator::stop() {
_db.unplug_view_update_generator();
do_abort();
return std::move(_started).then([this] {
_registration_sem.broken();

View File

@@ -96,6 +96,7 @@ struct failure_detector::impl {
clock& _clock;
clock::interval_t _ping_period;
clock::interval_t _ping_timeout;
// Number of workers on each shard.
// We use this to decide where to create new workers (we pick a shard with the smallest number of workers).
@@ -138,7 +139,7 @@ struct failure_detector::impl {
// The unregistering process requires cross-shard operations which we perform on this fiber.
future<> _destroy_subscriptions = make_ready_future<>();
impl(failure_detector& parent, pinger&, clock&, clock::interval_t ping_period);
impl(failure_detector& parent, pinger&, clock&, clock::interval_t ping_period, clock::interval_t ping_timeout);
~impl();
// Inform update_endpoint_fiber() about an added/removed endpoint.
@@ -174,12 +175,14 @@ struct failure_detector::impl {
future<> mark(listener* l, pinger::endpoint_id ep, bool alive);
};
failure_detector::failure_detector(pinger& pinger, clock& clock, clock::interval_t ping_period)
: _impl(std::make_unique<impl>(*this, pinger, clock, ping_period))
failure_detector::failure_detector(
pinger& pinger, clock& clock, clock::interval_t ping_period, clock::interval_t ping_timeout)
: _impl(std::make_unique<impl>(*this, pinger, clock, ping_period, ping_timeout))
{}
failure_detector::impl::impl(failure_detector& parent, pinger& pinger, clock& clock, clock::interval_t ping_period)
: _parent(parent), _pinger(pinger), _clock(clock), _ping_period(ping_period) {
failure_detector::impl::impl(
failure_detector& parent, pinger& pinger, clock& clock, clock::interval_t ping_period, clock::interval_t ping_timeout)
: _parent(parent), _pinger(pinger), _clock(clock), _ping_period(ping_period), _ping_timeout(ping_timeout) {
if (this_shard_id() != 0) {
return;
}
@@ -536,11 +539,9 @@ future<> endpoint_worker::ping_fiber() noexcept {
auto start = clock.now();
auto next_ping_start = start + _fd._ping_period;
// A ping should take significantly less time than _ping_period, but we give it a multiple of ping_period before it times out
// just in case of transient network partitions.
// However, if there's a listener that's going to timeout soon (before the ping returns), we abort the ping in order to handle
auto timeout = start + _fd._ping_timeout;
// If there's a listener that's going to timeout soon (before the ping returns), we abort the ping in order to handle
// the listener (mark it as dead).
auto timeout = start + 3 * _fd._ping_period;
for (auto& [threshold, l]: _fd._listeners_liveness) {
if (l.endpoint_liveness[_id].alive && last_response + threshold < timeout) {
timeout = last_response + threshold;

View File

@@ -120,14 +120,14 @@ public:
// Every endpoint in the detected set will be periodically pinged every `ping_period`,
// assuming that the pings return in a timely manner. A ping may take longer than `ping_period`
// before it's aborted (up to a certain multiple of `ping_period`), in which case the next ping
// will start immediately.
//
// `ping_period` should be chosen so that during normal operation, a ping takes significantly
// less time than `ping_period` (preferably at least an order of magnitude less).
// before it's aborted (up to `ping_timeout`), in which case the next ping will start immediately.
//
// The passed-in value must be the same on every shard.
clock::interval_t ping_period
clock::interval_t ping_period,
// Duration after which a ping is aborted, so that the next ping can be started
// (pings are sent sequentially).
clock::interval_t ping_timeout
);
~failure_detector();
@@ -147,7 +147,7 @@ public:
// The listener stops being called when the returned subscription is destroyed.
// The subscription must be destroyed before service is stopped.
//
// `threshold` should be significantly larger than `ping_period`, preferably at least an order of magnitude larger.
// `threshold` should be significantly larger than `ping_timeout`, preferably at least an order of magnitude larger.
//
// Different listeners may use different thresholds, depending on the use case:
// some listeners may want to mark endpoints as dead more aggressively if fast reaction times are important
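
The deadline computation changed in `ping_fiber()` above can be illustrated with a small sketch. This is a simplified Python model, not the actual implementation: the `Listener` class and the `ping_deadline` function are hypothetical names, and times are plain integers (for example, milliseconds) rather than `clock` values. It shows only that a ping is aborted at `start + ping_timeout`, or earlier if a listener that still considers the endpoint alive would hit its threshold first.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Listener:
    # Hypothetical stand-in for one entry of _listeners_liveness.
    threshold: int        # time without a response before the endpoint is marked dead
    endpoint_alive: bool  # whether this listener currently considers the endpoint alive


def ping_deadline(start: int, last_response: int, ping_timeout: int,
                  listeners: Iterable[Listener]) -> int:
    """Return the absolute time at which the current ping should be aborted."""
    # Default: abort the ping after ping_timeout.
    deadline = start + ping_timeout
    for listener in listeners:
        # If a listener would time out before the ping returns, abort earlier
        # so that the listener can be handled (endpoint marked as dead).
        if listener.endpoint_alive and last_response + listener.threshold < deadline:
            deadline = last_response + listener.threshold
    return deadline
```

With a `threshold` much larger than `ping_timeout`, as the header comment recommends, the early-abort branch rarely triggers and the deadline is simply `start + ping_timeout`.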

View File

@@ -10,6 +10,7 @@
import os
import re
from scylla_util import *
import resource
import subprocess
import argparse
import yaml
@@ -102,6 +103,34 @@ class scylla_cpuinfo:
else:
return len(self._cpu_data["system"])
def configure_iotune_open_fd_limit(shards_count):
try:
fd_limits = resource.getrlimit(resource.RLIMIT_NOFILE)
except (OSError, ValueError) as e:
logging.warning("Could not get the limit of count of open file descriptors!")
logging.warning("iotune will proceed with the default limit. This may cause problems.")
return
precalculated_fds_count = (10 * shards_count) + 500
soft_limit, hard_limit = fd_limits
if hard_limit == resource.RLIM_INFINITY:
# If there is no hard limit, then ensure that soft limit allows enough FDs.
soft_limit = max(soft_limit, precalculated_fds_count)
else:
# If hard_limit is greater than precalculated_fds_count, then set it as soft and as hard limit.
required_fds_count = max(hard_limit, precalculated_fds_count)
soft_limit = max(soft_limit, required_fds_count)
hard_limit = max(hard_limit, required_fds_count)
try:
resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))
except (OSError, ValueError) as e:
logging.error(e)
logging.error("Could not set the limit of open file descriptors for iotune!")
logging.error(f"Required FDs count: {precalculated_fds_count}, default limit: {fd_limits}!")
sys.exit(1)
def run_iotune():
if "SCYLLA_CONF" in os.environ:
conf_dir = os.environ["SCYLLA_CONF"]
@@ -142,6 +171,8 @@ def run_iotune():
elif cpudata.smp():
iotune_args += [ "--smp", str(cpudata.smp()) ]
configure_iotune_open_fd_limit(cpudata.nr_shards())
try:
subprocess.check_call([bindir() + "/iotune",
"--format", "envfile",

View File

@@ -77,7 +77,7 @@ run apt-get -y upgrade
run apt-get -y install dialog apt-utils
run bash -ec "echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections"
run bash -ec "rm -rf /etc/rsyslog.conf"
run apt-get -y install hostname supervisor openjdk-11-jre-headless python2 python3 python3-yaml curl rsyslog sudo
run apt-get -y install hostname supervisor openjdk-11-jre-headless python2 python3 python3-yaml curl rsyslog sudo systemd
run bash -ec "echo LANG=C.UTF-8 > /etc/default/locale"
run bash -ec "dpkg -i packages/*.deb"
run apt-get -y clean all

View File

@@ -0,0 +1,25 @@
from sphinx.directives.other import Include
from docutils.parsers.rst import directives
class IncludeFlagDirective(Include):
option_spec = Include.option_spec.copy()
option_spec['base_path'] = directives.unchanged
def run(self):
env = self.state.document.settings.env
base_path = self.options.get('base_path', '_common')
if env.app.tags.has('enterprise'):
self.arguments[0] = base_path + "_enterprise/" + self.arguments[0]
else:
self.arguments[0] = base_path + "/" + self.arguments[0]
return super().run()
def setup(app):
app.add_directive('scylladb_include_flag', IncludeFlagDirective, override=True)
return {
"version": "0.1",
"parallel_read_safe": True,
"parallel_write_safe": True,
}
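
The path selection performed by `IncludeFlagDirective.run()` can be sketched as a standalone function. This is an illustration only: `resolve_include_path` is a hypothetical helper that does not exist in the extension, and the `enterprise` argument stands in for the `env.app.tags.has('enterprise')` check.

```python
def resolve_include_path(filename: str, enterprise: bool,
                         base_path: str = "_common") -> str:
    """Mirror the argument rewriting done before delegating to Include.run()."""
    if enterprise:
        # Enterprise builds read from a sibling directory with the "_enterprise" suffix.
        return base_path + "_enterprise/" + filename
    # Open Source builds read from the base directory itself.
    return base_path + "/" + filename


# Example: the directive argument used in the OS support page.
print(resolve_include_path("os-support-info.rst", enterprise=False))
print(resolve_include_path("os-support-info.rst", enterprise=True))
```

Under this reading, the same `.. scylladb_include_flag::` line pulls in a different file depending on the build tag, so the two editions need files with identical names in the two directories.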

View File

@@ -1,6 +1,14 @@
### a dictionary of redirections
#old path: new path
# Moving pages from the install-scylla folder
/stable/getting-started/install-scylla/scylla-web-installer.html: /stable/getting-started/installation-common/scylla-web-installer.html
/stable/getting-started/install-scylla/unified-installer.html: /stable/getting-started/installation-common/unified-installer.html
/stable/getting-started/install-scylla/air-gapped-install.html: /stable/getting-started/installation-common/air-gapped-install.html
/stable/getting-started/install-scylla/disable-housekeeping.html: /stable/getting-started/installation-common/disable-housekeeping.html
/stable/getting-started/install-scylla/dev-mod.html: /stable/getting-started/installation-common/dev-mod.html
/stable/getting-started/install-scylla/config-commands.html: /stable/getting-started/config-commands.html
# Removed the outdated upgrade guides

View File

@@ -39,7 +39,8 @@ extensions = [
"recommonmark", # optional
"sphinxcontrib.datatemplates",
"scylladb_cc_properties",
"scylladb_aws_images"
"scylladb_aws_images",
"scylladb_include_flag"
]
# The suffix(es) of source filenames.

View File

@@ -20,7 +20,7 @@ sections common to data updating statements.
Update parameters
~~~~~~~~~~~~~~~~~
The ``UPDATE``, ``INSERT`` (and ``DELETE`` and ``BATCH`` for the ``TIMESTAMP``) statements support the following
The ``UPDATE``, ``INSERT`` (and ``DELETE`` and ``BATCH`` for the ``TIMESTAMP`` and ``TIMEOUT``) statements support the following
parameters:
- ``TIMESTAMP``: sets the timestamp for the operation. If not specified, the coordinator will use the current time, in

View File

@@ -198,11 +198,27 @@ We're not able to prevent a node learning about a new generation too late due to
After committing the generation ID, the topology coordinator publishes the generation data to user-facing description tables (`system_distributed.cdc_streams_descriptions_v2` and `system_distributed.cdc_generation_timestamps`).
#### Generation switching: other notes
#### Generation switching: accepting writes
Due to the need of maintaining colocation we don't allow the client to send writes with arbitrary timestamps.
Suppose that a write is requested and the write coordinator's local clock has time `C` and the generation operating at time `C` has timestamp `T` (`T <= C`). Then we only allow the write if its timestamp is in the interval [`T`, `C + generation_leeway`), where `generation_leeway` is a small time-inteval constant (e.g. 5 seconds).
Reason: we cannot allow writes before `T`, because they belong to the old generation whose token ranges might no longer refine the current vnodes, so the corresponding log write would not necessarily be colocated with the base write. We also cannot allow writes too far "into the future" because we don't know what generation will be operating at that time (the node which will introduce this generation might not have joined yet). But, as mentioned before, we assume that we'll learn about the next generation in time. Again --- the need for this assumption will be gone in a future patch.
Because colocation must be maintained, we don't allow the client to send writes with arbitrary timestamps. We allow:
- writes to the current and next generations unless they are too far into the future,
- writes to the previous generations unless they are too far into the past.
##### Writes to the current and next generations
Suppose that a write with timestamp `W` is requested, the write coordinator's local clock has time `C`, and the generation operating at time `C` has timestamp `T` (`T <= C`) such that `T <= W`. Then we only allow the write if `W < C + generation_leeway`, where `generation_leeway` is a small time-interval constant (e.g. 5 seconds).
We cannot allow writes too far "into the future" because we don't know what generation will be operating at that time (the node which will introduce this generation might not have joined yet). But, as mentioned before, we assume that we'll learn about the next generation in time. Again --- the need for this assumption will be gone in a future patch.
##### Writes to the previous generations
This time suppose that `T > W`. Then we only allow the write if `W > C - generation_leeway` and there was a generation operating at `W`.
We allow writes to previous generations to improve user experience. If a client generates timestamps by itself and clocks are not perfectly synchronized, there may be short periods of time around the moment of switching generations when the client's writes are rejected because they fall into one of the previous generations. Usually, the client can easily overcome this problem by repeating the write a few times with a higher timestamp. Unfortunately, if a table additionally uses LWT, the client cannot increase the timestamp because LWT makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry with the same timestamp until it succeeds. Applying the entry involves doing a CDC log table write. If it fails, we are stuck. Allowing writes to the previous generations is also a probabilistic fix for this bug.
Note that writing only to the previous generation might not be enough. With the Raft-based topology and tablets, we can add multiple nodes almost instantly. Then, we can have multiple generations with almost identical timestamps.
We allow writes only to the recent past to reduce the number of generations that must be stored in memory.
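
The two acceptance rules above can be summarized in a short sketch. This is a simplified Python model under stated assumptions, not the actual implementation: `generation_at` and `is_write_allowed` are hypothetical names, generations are modeled as a sorted list of their timestamps, timestamps are plain numbers, and a write is rejected when no generation is operating at the coordinator's current time. The sketch also assumes that all past generations are available, whereas in practice only recent ones are kept in memory.

```python
from bisect import bisect_right
from typing import Optional, Sequence

GENERATION_LEEWAY = 5  # assumption: the "e.g. 5 seconds" example value


def generation_at(generation_timestamps: Sequence[int], t: int) -> Optional[int]:
    """Return the timestamp of the generation operating at time t, if any.

    generation_timestamps must be sorted in ascending order.
    """
    i = bisect_right(generation_timestamps, t)
    return generation_timestamps[i - 1] if i > 0 else None


def is_write_allowed(generation_timestamps: Sequence[int], now: int, write_ts: int,
                     leeway: int = GENERATION_LEEWAY) -> bool:
    """Decide whether a write with timestamp W = write_ts is accepted at time C = now."""
    current = generation_at(generation_timestamps, now)  # T
    if current is None:
        return False  # assumption: no generation is operating yet
    if write_ts >= current:
        # Current or next generation: reject writes too far into the future.
        return write_ts < now + leeway
    # Previous generation: reject writes too far into the past,
    # and require that some generation was operating at W.
    return (write_ts > now - leeway
            and generation_at(generation_timestamps, write_ts) is not None)
```

For example, with generations at timestamps 100 and 200 and a coordinator clock of 203, a write with timestamp 199 is accepted (it falls into the previous generation but within the leeway), while a write with timestamp 150 is rejected.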
### Streams description tables

View File

@@ -0,0 +1,21 @@
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
+----------------------------+-------------+---------------+---------+---------------+
| Linux Distributions |Ubuntu | Debian | CentOS /| Rocky / |
| | | | RHEL | RHEL |
+----------------------------+------+------+-------+-------+---------+-------+-------+
| ScyllaDB Version / Version |20.04 |22.04 | 10 | 11 | 7 | 8 | 9 |
+============================+======+======+=======+=======+=========+=======+=======+
| 5.4 | |v| | |v| | |v| | |v| | |x| | |v| | |v| |
+----------------------------+------+------+-------+-------+---------+-------+-------+
| 5.2 | |v| | |v| | |v| | |v| | |v| | |v| | |x| |
+----------------------------+------+------+-------+-------+---------+-------+-------+
* The recommended OS for ScyllaDB Open Source is Ubuntu 22.04.
* All releases are available as a Docker container and EC2 AMI, GCP, and Azure images.
Supported Architecture
-----------------------------
ScyllaDB Open Source supports x86_64 for all versions and AArch64 starting from ScyllaDB 4.6 and nightly build.
In particular, aarch64 support includes AWS EC2 Graviton.

View File

@@ -9,65 +9,5 @@ Where *supported* in this scope means:
- The download and install procedures are tested as part of ScyllaDB release process for each version.
- An automated install is included from :doc:`ScyllaDB Web Installer for Linux tool </getting-started/installation-common/scylla-web-installer>` (for latest versions)
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
.. scylladb_include_flag:: os-support-info.rst
.. note::
**Supported Architecture**
ScyllaDB Open Source supports x86_64 for all versions and AArch64 starting from ScyllaDB 4.6 and nightly build. In particular, aarch64 support includes AWS EC2 Graviton.
ScyllaDB Open Source
----------------------
.. note::
The recommended OS for ScyllaDB Open Source is Ubuntu 22.04.
+----------------------------+-------------+---------------+---------+---------------+
| Linux Distributions |Ubuntu | Debian | CentOS /| Rocky / |
| | | | RHEL | RHEL |
+----------------------------+------+------+-------+-------+---------+-------+-------+
| ScyllaDB Version / Version |20.04 |22.04 | 10 | 11 | 7 | 8 | 9 |
+============================+======+======+=======+=======+=========+=======+=======+
| 5.4 | |v| | |v| | |v| | |v| | |x| | |v| | |v| |
+----------------------------+------+------+-------+-------+---------+-------+-------+
| 5.2 | |v| | |v| | |v| | |v| | |v| | |v| | |x| |
+----------------------------+------+------+-------+-------+---------+-------+-------+
All releases are available as a Docker container and EC2 AMI, GCP, and Azure images.
ScyllaDB Enterprise
--------------------
.. note::
The recommended OS for ScyllaDB Enterprise is Ubuntu 22.04.
+----------------------------+-----------------------------------+---------------------------+--------+-------+
| Linux Distributions | Ubuntu | Debian | CentOS/| Rocky/|
| | | | RHEL | RHEL |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
| ScyllaDB Version / Version | 14.04| 16.04| 18.04| 20.04| 22.04 | 8 | 9 | 10 | 11 | 7 | 8 |
+============================+======+======+======+======+=======+======+======+======+======+========+=======+
| 2023.1 | |x| | |x| | |x| | |v| | |v| | |x| | |x| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
| 2022.2 | |x| | |x| | |v| | |v| | |v| | |x| | |x| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
| 2022.1 | |x| | |x| | |v| | |v| | |v| | |x| | |x| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
| 2021.1 | |x| | |v| | |v| | |v| | |v| | |x| | |v| | |v| | |x| | |v| | |v| |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
| 2020.1 | |x| | |v| | |v| | |x| | |x| | |x| | |v| | |v| | |x| | |v| | |v| |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
| 2019.1 | |x| | |v| | |v| | |x| | |x| | |x| | |v| | |x| | |x| | |v| | |x| |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
| 2018.1 | |v| | |v| | |x| | |x| | |v| | |x| | |x| | |x| | |x| | |v| | |x| |
+----------------------------+------+------+------+------+-------+------+------+------+------+--------+-------+
All releases are available as a Docker container, EC2 AMI, and a GCP image (GCP image from version 2021.1). Since
version 2023.1, the ScyllaDB AMI/Image OS for ScyllaDB Enterprise is based on Ubuntu 22.04.

View File

@@ -23,7 +23,7 @@ It's recommended to have a balanced setup. If there are only 4-8 :term:`Logica
This works in the opposite direction as well.
ScyllaDB can be used in many types of installation environments.
To see which system would best suit your workload requirements, use the `ScyllaDB Sizing Calculator <https://price-calc.gh.scylladb.com/>`_ to customize ScyllaDB for your usage.
To see which system would best suit your workload requirements, use the `ScyllaDB Sizing Calculator <https://www.scylladb.com/product/scylla-cloud/get-pricing/>`_ to customize ScyllaDB for your usage.

View File

@@ -129,7 +129,7 @@ SStable Content
.. _SStable: /architecture/sstable
All operations target either one specific sstable component or all of them as a whole.
For more information about the sstable components and the format itself, visit SStable_.
For more information about the sstable components and the format itself, visit :doc:`SSTable Format </architecture/sstable/index>`.
On a conceptual level, the data in SStables is represented by objects called mutation fragments. There are the following kinds of fragments:
@@ -634,6 +634,22 @@ Note that levels are cumulative - each contains all the checks of the previous l
By default, the strictest level is used.
This can be relaxed, for example, if you want to produce intentionally corrupt SStables for tests.
shard-of
^^^^^^^^
Print out the shards that own the specified SSTables.
The content is dumped in JSON, using the following schema:
.. code-block:: none
:class: hide-copy-button
$ROOT := { "$sstable_path": $SHARD_IDS, ... }
$SHARD_IDS := [$SHARD_ID, ...]
$SHARD_ID := Uint
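
As an illustration of consuming output that follows this schema, the sketch below inverts the mapping so that SSTables are grouped by owning shard. It assumes only the JSON shape described above; the function name `sstables_by_shard` and the sample paths are hypothetical, and how the JSON is obtained from the tool is not shown.

```python
import json
from typing import Dict, List


def sstables_by_shard(shard_of_json: str) -> Dict[int, List[str]]:
    """Group SSTable paths by the shard IDs that own them."""
    ownership = json.loads(shard_of_json)  # { "$sstable_path": [shard_id, ...], ... }
    by_shard: Dict[int, List[str]] = {}
    for sstable_path, shard_ids in ownership.items():
        for shard_id in shard_ids:
            by_shard.setdefault(shard_id, []).append(sstable_path)
    return by_shard


# Hypothetical sample document matching the schema.
sample = '{"/tmp/a-Data.db": [0, 2], "/tmp/b-Data.db": [2]}'
print(sstables_by_shard(sample))
```

An SSTable listed under more than one shard ID appears in each of the corresponding groups.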
script
^^^^^^

View File

@@ -7,10 +7,11 @@
.. Note::
If ``authenticator`` is set to ``PasswordAuthenticator`` - increase the replication factor of the ``system_auth`` keyspace.
For example:
If ``authenticator`` is set to ``PasswordAuthenticator``, increase the replication factor of the ``system_auth`` keyspace.
For example:
``ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : <new_replication_factor>};``
Ensure you run repair after you alter the keyspace. See :doc:`How to Safely Increase the Replication Factor </kb/rf-increase>`.
It is recommended to set ``system_auth`` replication factor to the number of nodes in each DC.

View File

@@ -14,6 +14,8 @@ The following table shows ScyllaDB Enterprise versions and their corresponding S
* - ScyllaDB Enterprise
- ScyllaDB Open Source
* - 2024.1
- 5.4
* - 2023.1
- 5.2
* - 2022.2

View File

@@ -21,7 +21,7 @@ Any of the following:
.. code-block:: none
WARN 2022-09-22 17:33:11,075 [shard 1]large_data - Writing large partition Some_KS/Some_table: PK[/CK[/COL]] (SIZE bytes) to SSTABLE_NAME
WARN 2022-09-22 17:33:11,075 [shard 1]large_data - Writing large partition Some_KS/Some_table: [COL] (SIZE bytes) to SSTABLE_NAME
In this case, refer to :ref:`Troubleshooting Large Partition Tables <large-partition-table-configure>` for more information.

View File

@@ -12,7 +12,7 @@ the ``/etc/systemd/system/var-lib-scylla.mount`` and ``/etc/systemd/system/var-l
deleted by RPM.
To avoid losing the files, the upgrade procedure includes a step to back up the .mount files. The following
example shows the command to backup the files before the :doc:`upgrade from version 5.0 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.0-to-2022.1/upgrade-guide-from-5.0-to-2022.1-rpm/>`:
example shows the command to back up the files before the upgrade from version 5.0:
.. code-block:: console

View File

@@ -21,7 +21,7 @@ For example:
In this scenario, a missing ``TOC`` file will prevent the Scylla node from starting.
The SSTable corporation problem can be different, for example, other missing or unreadable files. The following solution apply for all of the scenarios.
The SSTable corruption problem can be different, for example, other missing or unreadable files. The following solution applies to all scenarios.
Solution
^^^^^^^^

View File

@@ -31,7 +31,7 @@ Apply the following procedure **serially** on each node. Do not move to the next
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/index.html>`_ for suspending ScyllaDB Manager (only available for ScyllaDB Enterprise) scheduled or running repairs.
* Not to apply schema changes
.. note:: Before upgrading, make sure to use the latest `ScyllaDB Montioring <https://monitoring.docs.scylladb.com/>`_ stack.
.. note:: Before upgrading, make sure to use the latest `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
Upgrade Steps
=============
@@ -182,4 +182,4 @@ Start the node
Validate
--------
Check the upgrade instructions above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.
Check the upgrade instructions above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.

View File

@@ -34,7 +34,7 @@ Apply the following procedure **serially** on each node. Do not move to the next
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/index.html>`_ for suspending Scylla Manager (only available for Scylla Enterprise) scheduled or running repairs.
* Not to apply schema changes
.. note:: Before upgrading, make sure to use the latest `Scylla Montioring <https://monitoring.docs.scylladb.com/>`_ stack.
.. note:: Before upgrading, make sure to use the latest `Scylla Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
Upgrade steps
=============

View File

@@ -32,7 +32,7 @@ Apply the following procedure **serially** on each node. Do not move to the next
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/>`_ for suspending ScyllaDB Manager (only available for ScyllaDB Enterprise) scheduled or running repairs.
* Not to apply schema changes
.. note:: Before upgrading, make sure to use the latest `ScyllaDB Montioring <https://monitoring.docs.scylladb.com/>`_ stack.
.. note:: Before upgrading, make sure to use the latest `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_ stack.
Upgrade Steps
=============

View File

@@ -1 +1 @@
.. note:: Execute the following commands one node at the time, moving to the next node only **after** the rollback procedure completed successfully.
.. note:: Execute the following commands one node at a time, moving to the next node only **after** the rollback procedure is completed successfully.

View File

@@ -2,13 +2,14 @@
Upgrade ScyllaDB Image: EC2 AMI, GCP, and Azure Images
======================================================
To upgrade ScyllaDB images, you need to update:
ScyllaDB images are based on **Ubuntu 22.04**.
#. ScyllaDB packages. Since ScyllaDB Open Source **5.2** and ScyllaDB
Enterprise **2023.1**, the images are based on **Ubuntu 22.04**.
See the :doc:`upgrade guide <./index>` for your ScyllaDB version
for instructions for updating ScyllaDB packages on Ubuntu.
#. Underlying OS packages. ScyllaDB includes a list of 3rd party and OS packages
tested with the ScyllaDB release.
If you're using the ScyllaDB official image (recommended), follow the upgrade
instructions on the **Debian/Ubuntu** tab in the :doc:`upgrade guide </upgrade/index/>`
for your ScyllaDB version.
If you're using your own image and have installed ScyllaDB packages for Ubuntu or Debian,
follow the extended upgrade procedure on the **EC2/GCP/Azure Ubuntu image** tab
in the :doc:`upgrade guide </upgrade/index/>` for your ScyllaDB version.
To check your Scylla version, run the ``scylla --version`` command.

View File

@@ -1,20 +1,13 @@
====================================================
Upgrade from Scylla Open Source to Scylla Enterprise
====================================================
=========================================================
Upgrade from ScyllaDB Open Source to ScyllaDB Enterprise
=========================================================
.. toctree::
:titlesonly:
:hidden:
ScyllaDB 5.4 to ScyllaDB Enterprise 2024.1 <upgrade-guide-from-5.4-to-2024.1/index>
ScyllaDB 5.2 to ScyllaDB Enterprise 2023.1 <upgrade-guide-from-5.2-to-2023.1/index>
ScyllaDB 5.1 to ScyllaDB Enterprise 2022.2 <upgrade-guide-from-5.1-to-2022.2/index>
ScyllaDB 5.0 to ScyllaDB Enterprise 2022.1 <upgrade-guide-from-5.0-to-2022.1/index>
Scylla 4.3 to Scylla Enterprise 2021.1 <upgrade-guide-from-4.3-to-2021.1/index>
Scylla 4.0 to Scylla Enterprise 2020.1 <upgrade-guide-from-4.0-to-2020.1/index>
Scylla 3.0 to Scylla Enterprise 2019.1 <upgrade-guide-from-3.0-to-2019.1/index>
Scylla 2.1 to Scylla Enterprise 2018.1 <upgrade-guide-from-2.1-to-2018.1/index>
Scylla 1.6 to Scylla Enterprise 2017.1 <upgrade-guide-from-1.6-to-2017.1/index>
.. raw:: html
@@ -23,21 +16,14 @@ Upgrade from Scylla Open Source to Scylla Enterprise
<div class="panel callout radius animated">
<div class="row">
<div class="medium-3 columns">
<h5 id="getting-started">Upgrade to Scylla Enterprise</h5>
<h5 id="getting-started">Upgrade to ScyllaDB Enterprise</h5>
</div>
<div class="medium-9 columns">
Procedures for upgrading from Scylla Open Source to Scylla Enterprise.
* :doc:`Upgrade - ScyllaDB 5.2 to Scylla Enterprise 2023.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.2-to-2023.1/index>`
* :doc:`Upgrade - ScyllaDB 5.1 to Scylla Enterprise 2022.2 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.1-to-2022.2/index>`
* :doc:`Upgrade - ScyllaDB 5.0 to Scylla Enterprise 2022.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.0-to-2022.1/index>`
* :doc:`Upgrade - Scylla 4.3 to Scylla Enterprise 2021.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-4.3-to-2021.1/index>`
* :doc:`Upgrade - Scylla 4.0 to Scylla Enterprise 2020.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-4.0-to-2020.1/index>`
* :doc:`Upgrade - Scylla 3.0 to Scylla Enterprise 2019.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-3.0-to-2019.1/index>`
* :doc:`Upgrade - Scylla 2.1 to Scylla Enterprise 2018.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-2.1-to-2018.1/index>`
* :doc:`Upgrade - Scylla 1.6 to Scylla Enterprise 2017.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-1.6-to-2017.1/index>`
Procedures for upgrading from ScyllaDB Open Source to ScyllaDB Enterprise:
* :doc:`ScyllaDB 5.4 to ScyllaDB Enterprise 2024.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.4-to-2024.1/index>`
* :doc:`ScyllaDB 5.2 to ScyllaDB Enterprise 2023.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.2-to-2023.1/index>`
.. raw:: html

View File

@@ -1,36 +0,0 @@
================================================
Upgrade - Scylla 1.6 to Scylla Enterprise 2017.1
================================================
.. toctree::
:titlesonly:
:hidden:
Red Hat Enterprise Linux and CentOS <upgrade-guide-from-1.6-to-2017.1-rpm>
Ubuntu <upgrade-guide-from-1.6-to-2017.1-ubuntu>
Debian <upgrade-guide-from-1.6-to-2017.1-debian>
.. raw:: html
<div class="panel callout radius animated">
<div class="row">
<div class="medium-3 columns">
<h5 id="getting-started">Upgrade Scylla 1.6 to Scylla Enterprise 2017.1</h5>
</div>
<div class="medium-9 columns">
Upgrade guides are available for:
* :doc:`Upgrade Scylla from 1.6.x to Scylla Enterprise 2017.1.y on Red Hat Enterprise Linux and CentOS <upgrade-guide-from-1.6-to-2017.1-rpm>`
* :doc:`Upgrade Scylla from 1.6.x to Scylla Enterprise 2017.1.y on Ubuntu <upgrade-guide-from-1.6-to-2017.1-ubuntu>`
* :doc:`Upgrade Scylla from 1.6.x to Scylla Enterprise 2017.1.y on Debian <upgrade-guide-from-1.6-to-2017.1-debian>`
.. raw:: html
</div>
</div>
</div>

View File

@@ -1,6 +0,0 @@
.. |OS| replace:: Debian 8
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-1.6-to-2017.1/upgrade-guide-from-1.6-to-2017.1-debian/#rollback-procedure
.. |APT| replace:: Scylla Enterprise Deb repo
.. _APT: http://www.scylladb.com/enterprise-download/debian8/
.. include:: /upgrade/_common/upgrade-guide-from-1.6-to-2017.1-ubuntu-and-debian.rst

View File

@@ -1,156 +0,0 @@
===========================================================================================
Upgrade Guide - Scylla 1.6 to Scylla Enterprise 2017.1 for Red Hat Enterprise 7 or CentOS 7
===========================================================================================
This document is a step-by-step procedure for upgrading from Scylla 1.6 to Scylla Enterprise 2017.1, and for rolling back to 1.6 if required.
Applicable versions
===================
This guide covers upgrading Scylla from the following versions: 1.6.x to Scylla Enterprise version 2017.1.y, on the following platforms:
* Red Hat Enterprise Linux, version 7 and later
* CentOS, version 7 and later
* Fedora packages are no longer provided
Upgrade Procedure
=================
.. include:: /upgrade/_common/warning.rst
A Scylla upgrade is a rolling procedure which does not require full cluster shutdown. For each of the nodes in the cluster, serially (i.e. one at a time), you will:
* drain node and backup the data
* check your current release
* backup configuration file
* stop Scylla
* download and install new Scylla packages
* start Scylla
* validate that the upgrade was successful
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
**During** the rolling upgrade it is highly recommended:
* Not to use new Scylla Enterprise 2017.1 features
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes
* Not to apply schema changes
Upgrade steps
=============
Drain node and backup the data
------------------------------
Before any major procedure, like an upgrade, it is recommended to back up all the data to an external device. In Scylla, backups are done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following commands:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is complete (all nodes), the snapshot should be removed by ``nodetool clearsnapshot -t <snapshot>``, or you risk running out of space.
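As a rough sketch of the copy step above: assuming the usual layout in which each table directory under the data root holds a ``snapshots/<tag>`` subdirectory, the helper below lists every directory that belongs to a given snapshot tag. The function name ``find_snapshot_dirs``, the example tag, and the backup destination are illustrative assumptions, not part of Scylla's tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of Scylla): print every snapshot directory
# named <tag> under <data_root>.
# Assumes the layout <data_root>/.../<table>/snapshots/<tag>.
find_snapshot_dirs() {
    data_root=$1
    tag=$2
    find "$data_root" -type d -path "*/snapshots/$tag" 2>/dev/null | sort
}

# Example usage (tag and destination are assumptions; adjust to your setup):
#   find_snapshot_dirs /var/lib/scylla 1500000000000 |
#       while IFS= read -r dir; do
#           sudo cp -a --parents "$dir" /mnt/backup/
#       done
```

Listing the directories first makes it easy to review what will be copied before any data is moved.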
Backup configuration file
-------------------------
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-1.6
Stop Scylla
-----------
.. code:: sh
sudo systemctl stop scylla-server
Download and install the new release
------------------------------------
Before upgrading, check what version you are running now using ``rpm -qa | grep scylla-server``. You should use the same version in case you want to :ref:`rollback <upgrade-1.6-2017.1-rpm-rollback-procedure>` the upgrade. If you are not running a 1.6.x version, stop right here! This guide only covers 1.6.x to 2017.1.y upgrades.
To upgrade:
1. Update the `Scylla RPM Enterprise repo <http://www.scylladb.com/enterprise-download/centos_rpm/>`_ to **2017.1**
2. Install
.. code:: sh
sudo yum update scylla\* -y
Start the node
--------------
.. code:: sh
sudo systemctl start scylla-server
Validate
--------
1. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in UN status.
2. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check the Scylla version.
3. Use ``journalctl _COMM=scylla`` to check there are no new errors in the log.
4. Check again after 2 minutes, to validate no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in the cluster.
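The first validation step can be scripted along these lines. This is a minimal sketch that assumes node lines in ``nodetool status`` output begin with a two-letter state code (``UN``, ``DN``, ``UJ``, and so on); the function name ``all_nodes_un`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `nodetool status` output on stdin and succeed
# only if at least one node line is present and every node line is "UN"
# (Up/Normal). A node line is assumed to have a first field made of
# U or D followed by N, L, J or M.
all_nodes_un() {
    awk '
        $1 ~ /^[UD][NLJM]$/ { total++; if ($1 != "UN") bad++ }
        END {
            if (total > 0 && bad == 0) exit 0
            exit 1
        }
    '
}

# Example usage on a live node:
#   nodetool status | all_nodes_un && echo "all nodes are UN"
```

The check fails on empty input on purpose, so a failed ``nodetool`` call is not mistaken for a healthy cluster.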
.. _upgrade-1.6-2017.1-rpm-rollback-procedure:
Rollback Procedure
==================
.. include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from Scylla release 2017.1.x to 1.6.y. Apply this procedure if an upgrade from 1.6 to 2017.1 failed before completing on all nodes. Use this procedure only for nodes you upgraded to 2017.1.
Scylla rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes rollback to 1.6, you will:
* drain the node and stop Scylla
* retrieve the old Scylla packages
* restore the configuration file
* restart Scylla
* validate the rollback success
Apply the following procedure **serially** on each node. Do not move to the next node before validating that the node is up and running with the rolled-back version.
Rollback steps
==============
Gracefully shutdown Scylla
--------------------------
.. code:: sh
nodetool drain
sudo systemctl stop scylla-server
Download and install the old release
------------------------------------
1. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/yum.repos.d/scylla.repo
2. Update the `Scylla RPM repo <http://www.scylladb.com/download/centos_rpm>`_ to **1.6**
3. Install
.. code:: sh
sudo yum clean all
sudo yum downgrade scylla\* -y
Restore the configuration file
------------------------------
.. code:: sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-1.6 /etc/scylla/scylla.yaml
Start the node
--------------
.. code:: sh
sudo systemctl start scylla-server
Validate
--------
Check the upgrade instructions above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.

View File

@@ -1,6 +0,0 @@
.. |OS| replace:: Ubuntu 14.04 or 16.04
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-1.6-to-2017.1/upgrade-guide-from-1.6-to-2017.1-ubuntu/#rollback-procedure
.. |APT| replace:: Scylla Enterprise Deb repo
.. _APT: http://www.scylladb.com/enterprise-download/ubuntu-16-04/
.. include:: /upgrade/_common/upgrade-guide-from-1.6-to-2017.1-ubuntu-and-debian.rst

View File

@@ -1,38 +0,0 @@
================================================
Upgrade - Scylla 2.1 to Scylla Enterprise 2018.1
================================================
.. toctree::
:titlesonly:
:hidden:
Red Hat Enterprise Linux and CentOS <upgrade-guide-from-2.1-to-2018.1-rpm>
Ubuntu 14.04 <upgrade-guide-from-2.1-to-2018.1-ubuntu>
Ubuntu 16.04 <upgrade-guide-from-2.1-to-2018.1-ubuntu-16-04>
Debian <upgrade-guide-from-2.1-to-2018.1-debian>
Metrics <metric-update-2.1-to-2018.1>
.. raw:: html
<div class="panel callout radius animated">
<div class="row">
<div class="medium-3 columns">
<h5 id="getting-started">Upgrade Scylla 2.1 to Scylla Enterprise 2018.1</h5>
</div>
<div class="medium-9 columns">
Upgrade guides are available for:
* :doc:`Upgrade Scylla from 2.1.x to Scylla Enterprise 2018.1.y on Red Hat Enterprise Linux and CentOS <upgrade-guide-from-2.1-to-2018.1-rpm>`
* :doc:`Upgrade Scylla from 2.1.x to Scylla Enterprise 2018.1.y on Ubuntu 14.04 <upgrade-guide-from-2.1-to-2018.1-ubuntu>`
* :doc:`Upgrade Scylla from 2.1.x to Scylla Enterprise 2018.1.y on Ubuntu 16.04 <upgrade-guide-from-2.1-to-2018.1-ubuntu-16-04>`
* :doc:`Upgrade Scylla from 2.1.x to Scylla Enterprise 2018.1.y on Debian <upgrade-guide-from-2.1-to-2018.1-debian>`
* :doc:`Scylla Metrics Update - Scylla 2.1 to 2018.1 <metric-update-2.1-to-2018.1>`
.. raw:: html
</div>
</div>
</div>

View File

@@ -1,10 +0,0 @@
=============================================================
Scylla Metric Update - Scylla 2.1 to Scylla Enterprise 2018.1
=============================================================
The following metrics are new in Scylla 2018.1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* scylla_evictions_from_garbage
* scylla_garbage_partitions

View File

@@ -1,10 +0,0 @@
.. |OS| replace:: Debian 8
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-2.1-to-2018.1/upgrade-guide-from-2.1-to-2018.1-debian/#rollback-procedure
.. |APT| replace:: Scylla deb repo
.. _APT: http://www.scylladb.com/download/debian8/
.. |APT_ENTERPRISE| replace:: Scylla Enterprise Deb repo
.. _APT_ENTERPRISE: http://www.scylladb.com/enterprise-download/debian8/
.. |ENABLE_APT_REPO| replace:: echo 'deb http://http.debian.net/debian jessie-backports main' > /etc/apt/sources.list.d/jessie-backports.list
.. |JESSIE_BACKPORTS| replace:: -t jessie-backports openjdk-8-jre-headless
.. include:: /upgrade/_common/upgrade-guide-from-2.1-to-2018.1-ubuntu-and-debian.rst

View File

@@ -1,170 +0,0 @@
=============================================================================================
Upgrade Guide - Scylla 2.1 to 2018.1 for Red Hat Enterprise Linux 7 or CentOS 7
=============================================================================================
This document is a step-by-step procedure for upgrading from Scylla 2.1 to Scylla Enterprise 2018.1, and for rolling back to 2.1 if required.
Applicable versions
===================
This guide covers upgrading Scylla from the following versions: 2.1.x to Scylla Enterprise version 2018.1.y, on the following platforms:
* Red Hat Enterprise Linux, version 7 and later
* CentOS, version 7 and later
* Fedora packages are no longer provided
Upgrade Procedure
=================
.. include:: /upgrade/_common/warning.rst
A Scylla upgrade is a rolling procedure which does not require full cluster shutdown. For each of the nodes in the cluster, serially (i.e. one at a time), you will:
* Check cluster schema
* Drain node and backup the data
* Backup configuration file
* Stop Scylla
* Download and install new Scylla packages
* Start Scylla
* Validate that the upgrade was successful
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
**During** the rolling upgrade it is highly recommended:
* Not to use new 2018.1 features
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes
* Not to apply schema changes
Upgrade steps
=============
Check cluster schema
--------------------
Make sure that all nodes have their schema in sync before the upgrade. The upgrade will fail if there is a schema disagreement between nodes.
.. code:: sh
nodetool describecluster
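One way to turn this check into a pass/fail test is to count the schema versions the command reports. The sketch below assumes that ``nodetool describecluster`` prints each schema version as ``<uuid>: [<addresses>]``; the function name ``count_schema_versions`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `nodetool describecluster` output on stdin and
# print the number of distinct schema versions it lists.
# Assumes each version is printed as "<uuid>: [<addresses>]".
count_schema_versions() {
    grep -Ec '^[[:space:]]*[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}: \['
}

# Example usage on a live node:
#   [ "$(nodetool describecluster | count_schema_versions)" = "1" ] \
#       || echo "schema disagreement, do not upgrade yet"
```

A result of ``1`` means all listed nodes agree on the schema; any other value calls for investigation before continuing.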
Drain node and backup the data
------------------------------
Before any major procedure, like an upgrade, it is recommended to back up all the data to an external device. In Scylla, backups are done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following commands:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is complete (all nodes), the snapshot should be removed by ``nodetool clearsnapshot -t <snapshot>``, or you risk running out of space.
Backup configuration files
--------------------------
.. code:: sh
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ); do sudo cp -v $conf $conf.backup-2.1; done
Stop Scylla
-----------
.. code:: sh
sudo systemctl stop scylla-server
Download and install the new release
------------------------------------
Before upgrading, check what version you are running now using ``rpm -qa | grep scylla-server``. You should use the same version in case you want to :ref:`rollback <upgrade-2.1-2018.1-rpm-rollback-procedure>` the upgrade. If you are not running a 2.1.x version, stop right here! This guide only covers 2.1.x to 2018.1.y upgrades.
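The version gate described above could be scripted roughly as follows. This is a sketch only: the function name ``in_release_series`` is hypothetical, and the ``rpm`` query in the usage comment assumes the installed package is named ``scylla-server``.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if <version> belongs to <series>,
# e.g. "2.1.3" belongs to series "2.1", but "2.10.0" and "3.0.1" do not.
in_release_series() {
    version=$1
    series=$2
    case $version in
        "$series" | "$series".*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage before upgrading:
#   current=$(rpm -q --qf '%{VERSION}\n' scylla-server)
#   in_release_series "$current" 2.1 || { echo "not a 2.1.x node, stop"; exit 1; }
```

Matching on ``<series>.`` rather than on a plain prefix avoids treating ``2.10`` as part of the ``2.1`` series.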
To upgrade:
1. Update the `Scylla RPM Enterprise repo <http://www.scylladb.com/enterprise-download/centos_rpm/>`_ to **2018.1**
2. Install
.. code:: sh
sudo yum clean all
sudo rm -rf /var/cache/yum
sudo yum remove scylla\*
sudo yum install scylla-enterprise
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ); do sudo cp -v $conf.backup-2.1 $conf; done
Start the node
--------------
.. code:: sh
sudo systemctl start scylla-server
Validate
--------
1. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in UN status.
2. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check the Scylla version.
3. Use ``journalctl _COMM=scylla`` to check there are no new errors in the log.
4. Check again after 2 minutes, to validate no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in the cluster.
* More on :doc:`Scylla Metrics Update - Scylla 2.1 to 2018.1<metric-update-2.1-to-2018.1>`
.. _upgrade-2.1-2018.1-rpm-rollback-procedure:
Rollback Procedure
==================
.. include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from Scylla Enterprise release 2018.1.x to 2.1.y. Apply this procedure if an upgrade from 2.1 to 2018.1 failed before completing on all nodes. Use this procedure only for nodes you upgraded to 2018.1.
Scylla rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes rollback to 2.1, you will:
* Drain the node and stop Scylla
* Retrieve the old Scylla packages
* Restore the configuration file
* Restart Scylla
* Validate the rollback success
Apply the following procedure **serially** on each node. Do not move to the next node before validating that the node is up and running with the rolled-back version.
Rollback steps
==============
Gracefully shutdown Scylla
--------------------------
.. code:: sh
nodetool drain
sudo systemctl stop scylla-server
Download and install the old release
------------------------------------
1. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/yum.repos.d/scylla.repo
2. Update the `Scylla RPM repo <http://www.scylladb.com/download/?platform=centos>`_ to **2.1**
3. Install
.. code:: sh
sudo yum clean all
sudo yum remove scylla\*
sudo yum install scylla
Restore the configuration file
------------------------------
.. code:: sh
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ); do sudo cp -v $conf.backup-2.1 $conf; done
Start the node
--------------
.. code:: sh
sudo systemctl start scylla-server
Validate
--------
Check the upgrade instructions above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.

View File

@@ -1,10 +0,0 @@
.. |OS| replace:: Ubuntu 16.04
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-2.1-to-2018.1/upgrade-guide-from-2.1-to-2018.1-ubuntu-16-04/#rollback-procedure
.. |APT| replace:: Scylla deb repo
.. _APT: http://www.scylladb.com/download/
.. |APT_ENTERPRISE| replace:: Scylla Enterprise Deb repo
.. _APT_ENTERPRISE: http://www.scylladb.com/enterprise-download/ubuntu-16-04/
.. |ENABLE_APT_REPO| replace:: sudo add-apt-repository -y ppa:openjdk-r/ppa
.. |JESSIE_BACKPORTS| replace:: openjdk-8-jre-headless
.. include:: /upgrade/_common/upgrade-guide-from-2.1-to-2018.1-ubuntu-and-debian.rst

View File

@@ -1,10 +0,0 @@
.. |OS| replace:: Ubuntu 14.04
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-2.1-to-2018.1/upgrade-guide-from-2.1-to-2018.1-ubuntu/#rollback-procedure
.. |APT| replace:: Scylla deb repo
.. _APT: http://www.scylladb.com/download/
.. |APT_ENTERPRISE| replace:: Scylla Enterprise Deb repo
.. _APT_ENTERPRISE: http://www.scylladb.com/enterprise-download/ubuntu/
.. |ENABLE_APT_REPO| replace:: sudo add-apt-repository -y ppa:openjdk-r/ppa
.. |JESSIE_BACKPORTS| replace:: openjdk-8-jre-headless
.. include:: /upgrade/_common/upgrade-guide-from-2.1-to-2018.1-ubuntu-and-debian.rst

View File

@@ -1,38 +0,0 @@
================================================
Upgrade - Scylla 3.0 to Scylla Enterprise 2019.1
================================================
.. toctree::
:titlesonly:
:hidden:
Red Hat Enterprise Linux and CentOS <upgrade-guide-from-3.0-to-2019.1-rpm>
Ubuntu 16.04 <upgrade-guide-from-3.0-to-2019.1-ubuntu-16-04>
Ubuntu 18.04 <upgrade-guide-from-3.0-to-2019.1-ubuntu-18-04>
Debian <upgrade-guide-from-3.0-to-2019.1-debian>
Metrics <metric-update-3.0-to-2019.1>
.. raw:: html
<div class="panel callout radius animated">
<div class="row">
<div class="medium-3 columns">
<h5 id="getting-started">Upgrade Scylla 3.0 to Scylla Enterprise 2019.1</h5>
</div>
<div class="medium-9 columns">
Upgrade guides are available for:
* :doc:`Upgrade Scylla from 3.0.x to Scylla Enterprise 2019.1.y on Red Hat Enterprise Linux and CentOS <upgrade-guide-from-3.0-to-2019.1-rpm>`
* :doc:`Upgrade Scylla from 3.0.x to Scylla Enterprise 2019.1.y on Ubuntu 16.04 <upgrade-guide-from-3.0-to-2019.1-ubuntu-16-04>`
* :doc:`Upgrade Scylla from 3.0.x to Scylla Enterprise 2019.1.y on Ubuntu 18.04 <upgrade-guide-from-3.0-to-2019.1-ubuntu-18-04>`
* :doc:`Upgrade Scylla from 3.0.x to Scylla Enterprise 2019.1.y on Debian <upgrade-guide-from-3.0-to-2019.1-debian>`
* :doc:`Scylla Metrics Update - Scylla 3.0 to 2019.1 <metric-update-3.0-to-2019.1>`
.. raw:: html
</div>
</div>
</div>

View File

@@ -1,87 +0,0 @@
=============================================================
Scylla Metric Update - Scylla 3.0 to Scylla Enterprise 2019.1
=============================================================
The following metrics are new in Scylla Enterprise 2019.1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* scylla_database_paused_reads
* scylla_database_paused_reads_permit_based_evictions
* scylla_database_total_view_updates_failed_local
* scylla_database_total_view_updates_failed_remote
* scylla_database_total_view_updates_pushed_local
* scylla_database_total_view_updates_pushed_remote
* scylla_database_view_building_paused
* scylla_hints_for_views_manager_corrupted_files
* scylla_hints_for_views_manager_discarded
* scylla_hints_manager_corrupted_files
* scylla_hints_manager_discarded
* scylla_query_processor_queries
* scylla_reactor_aio_errors
* scylla_sstables_capped_local_deletion_time
* scylla_sstables_capped_tombstone_deletion_time
* scylla_sstables_cell_tombstone_writes
* scylla_sstables_cell_writes
* scylla_sstables_partition_reads
* scylla_sstables_partition_seeks
* scylla_sstables_partition_writes
* scylla_sstables_range_partition_reads
* scylla_sstables_range_tombstone_writes
* scylla_sstables_row_reads
* scylla_sstables_row_writes
* scylla_sstables_single_partition_reads
* scylla_sstables_sstable_partition_reads
* scylla_sstables_static_row_writes
* scylla_sstables_tombstone_writes
* scylla_storage_proxy_coordinator_last_mv_flow_control_delay
The following metric names changed from Scylla 3.0 to Scylla Enterprise 2019.1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
:widths: 30 30
:header-rows: 1
* - Scylla 3.0 Name
- Scylla 2019.1 Name
* - scylla_io_queue_commitlog_delay
- scylla_io_queue_delay
* - scylla_io_queue_commitlog_queue_length
- scylla_io_queue_queue_length
* - scylla_io_queue_commitlog_shares
- scylla_io_queue_shares
* - scylla_io_queue_commitlog_total_bytes
- scylla_io_queue_total_bytes
* - scylla_io_queue_commitlog_total_operations
- scylla_io_queue_total_operations
* - scylla_io_queue_compaction_delay
- scylla_io_queue_delay
* - scylla_io_queue_compaction_queue_length
- scylla_io_queue_queue_length
* - scylla_io_queue_compaction_shares
- scylla_io_queue_shares
* - scylla_io_queue_compaction_total_bytes
- scylla_io_queue_total_bytes
* - scylla_io_queue_compaction_total_operations
- scylla_io_queue_total_operations
* - scylla_io_queue_default_delay
- scylla_io_queue_delay
* - scylla_io_queue_default_queue_length
- scylla_io_queue_queue_length
* - scylla_io_queue_default_shares
- scylla_io_queue_shares
* - scylla_io_queue_default_total_bytes
- scylla_io_queue_total_bytes
* - scylla_io_queue_default_total_operations
- scylla_io_queue_total_operations
* - scylla_io_queue_memtable_flush_delay
- scylla_io_queue_delay
* - scylla_io_queue_memtable_flush_queue_length
- scylla_io_queue_queue_length
* - scylla_io_queue_memtable_flush_shares
- scylla_io_queue_shares
* - scylla_io_queue_memtable_flush_total_bytes
- scylla_io_queue_total_bytes
* - scylla_io_queue_memtable_flush_total_operations
- scylla_io_queue_total_operations
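Every rename in the table above follows one pattern: the I/O class segment (``commitlog``, ``compaction``, ``default``, or ``memtable_flush``) is dropped from the metric name. A minimal sketch of that mapping, useful for example when updating saved dashboard queries, is shown below; the function name ``rename_io_queue_metrics`` is hypothetical.

```shell
#!/bin/sh
# Sketch: rewrite Scylla 3.0 per-class I/O queue metric names (read from
# stdin, one per line) to their 2019.1 names by dropping the class segment.
# Covers only the four classes listed in the table above; any other line
# is passed through unchanged.
rename_io_queue_metrics() {
    sed -E 's/^scylla_io_queue_(commitlog|compaction|default|memtable_flush)_/scylla_io_queue_/'
}

# Example usage:
#   echo scylla_io_queue_commitlog_delay | rename_io_queue_metrics
```

Note that several 3.0 names map to the same 2019.1 name, as the table shows, so the mapping cannot be reversed from the name alone.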

View File

@@ -1,8 +0,0 @@
.. |OS| replace:: Debian 9
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-3.0-to-2019.1/upgrade-guide-from-3.0-to-2019.1-debian/#rollback-procedure
.. |APT| replace:: Scylla deb repo
.. _APT: http://www.scylladb.com/download/?platform=debian-9
.. |APT_ENTERPRISE| replace:: Scylla Enterprise Deb repo
.. _APT_ENTERPRISE: http://www.scylladb.com/enterprise-download/debian9/
.. include:: /upgrade/_common/upgrade-guide-from-3.0-to-2019.1-ubuntu-and-debian.rst

View File

@@ -1,172 +0,0 @@
=============================================================================================
Upgrade Guide - Scylla 3.0 to 2019.1 for Red Hat Enterprise Linux 7 or CentOS 7
=============================================================================================
This document is a step-by-step procedure for upgrading from Scylla 3.0 to Scylla Enterprise 2019.1, and for rolling back to 3.0 if required.
Applicable versions
===================
This guide covers upgrading Scylla from the following versions: 3.0.x to Scylla Enterprise version 2019.1.y, on the following platforms:
* Red Hat Enterprise Linux, version 7 and later
* CentOS, version 7 and later
* Fedora packages are no longer provided
.. include:: /upgrade/_common/upgrade_to_2019_warning.rst
Upgrade Procedure
=================
.. include:: /upgrade/_common/warning.rst
A Scylla upgrade is a rolling procedure which does not require full cluster shutdown. For each of the nodes in the cluster, serially (i.e. one at a time), you will:
* Check cluster schema
* Drain node and backup the data
* Backup configuration file
* Stop Scylla
* Download and install new Scylla packages
* Start Scylla
* Validate that the upgrade was successful
Apply the following procedure **serially** on each node. Do not move to the next node before validating the node is up and running with the new version.
**During** the rolling upgrade it is highly recommended:
* Not to use new 2019.1 features
* Not to run administration functions, like repairs, refresh, rebuild or add or remove nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/index.html>`_ for suspending Scylla Manager scheduled or running repairs.
* Not to apply schema changes
Upgrade steps
=============
Check cluster schema
--------------------
Make sure that all nodes have their schema in sync before the upgrade. The upgrade will fail if there is a schema disagreement between nodes.
.. code:: sh
nodetool describecluster
Drain node and backup the data
------------------------------
Before any major procedure, like an upgrade, it is recommended to back up all the data to an external device. In Scylla, backups are done using the ``nodetool snapshot`` command. For **each** node in the cluster, run the following commands:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories having this name under ``/var/lib/scylla`` to a backup device.
When the upgrade is complete (all nodes), the snapshot should be removed by ``nodetool clearsnapshot -t <snapshot>``, or you risk running out of space.
Backup configuration files
--------------------------
.. code:: sh
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ); do sudo cp -v $conf $conf.backup-3.0; done
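The one-liner above can be unpacked into two small helpers, one for the backup and one for the matching restore used later in this guide. This is a sketch under assumptions: the function names are hypothetical, the file list is passed on stdin rather than computed inside the function, and the caller is assumed to have write access to the files (for example by running as root).

```shell
#!/bin/sh
# Hypothetical helpers: read configuration file paths on stdin, one per line.
# backup_configs copies each existing file to <file>.backup-<suffix>.
backup_configs() {
    suffix=$1
    while IFS= read -r conf; do
        if [ -f "$conf" ]; then
            cp -p "$conf" "$conf.backup-$suffix"
        fi
    done
    return 0
}

# restore_configs copies each existing <file>.backup-<suffix> over <file>.
restore_configs() {
    suffix=$1
    while IFS= read -r conf; do
        if [ -f "$conf.backup-$suffix" ]; then
            cp -p "$conf.backup-$suffix" "$conf"
        fi
    done
    return 0
}

# Example usage (same file list as the one-liner above):
#   rpm -qc $(rpm -qa | grep scylla) | grep -v contains | backup_configs 3.0
```

Reading the list from stdin keeps the helpers independent of the package manager, so the same functions work with any source of file paths.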
Stop Scylla
-----------
.. code:: sh
sudo systemctl stop scylla-server
Download and install the new release
------------------------------------
Before upgrading, check what version you are running now using ``rpm -qa | grep scylla-server``. You should use the same version in case you want to :ref:`rollback <upgrade-3.0-2019.1-rpm-rollback-procedure>` the upgrade. If you are not running a 3.0.x version, stop right here! This guide only covers 3.0.x to 2019.1.y upgrades.
To upgrade:
1. Update the `Scylla RPM Enterprise repo <http://www.scylladb.com/enterprise-download/centos_rpm/>`_ to **2019.1**
2. Install
.. code:: sh
sudo yum clean all
sudo rm -rf /var/cache/yum
sudo yum remove scylla\*
sudo yum install scylla-enterprise
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ); do sudo cp -v $conf.backup-3.0 $conf; done
Start the node
--------------
.. code:: sh
sudo systemctl start scylla-server
Validate
--------
1. Check cluster status with ``nodetool status`` and make sure **all** nodes, including the one you just upgraded, are in UN status.
2. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"`` to check the Scylla version.
3. Use ``journalctl _COMM=scylla`` to check there are no new errors in the log.
4. Check again after 2 minutes, to validate no new issues are introduced.
Once you are sure the node upgrade is successful, move to the next node in the cluster.
* More on :doc:`Scylla Metrics Update - Scylla 3.0 to 2019.1<metric-update-3.0-to-2019.1>`
.. _upgrade-3.0-2019.1-rpm-rollback-procedure:
Rollback Procedure
==================
.. include:: /upgrade/_common/warning_rollback.rst
The following procedure describes a rollback from Scylla Enterprise release 2019.1.x to 3.0.y. Apply this procedure if an upgrade from 3.0 to 2019.1 failed before completing on all nodes. Use this procedure only for nodes you upgraded to 2019.1.
Scylla rollback is a rolling procedure which does **not** require full cluster shutdown.
For each of the nodes rollback to 3.0, you will:
* Drain the node and stop Scylla
* Retrieve the old Scylla packages
* Restore the configuration file
* Restart Scylla
* Validate the rollback success
Apply the following procedure **serially** on each node. Do not move to the next node before validating that the node is up and running with the rolled-back version.
Rollback steps
==============
Gracefully shutdown Scylla
--------------------------
.. code:: sh
nodetool drain
sudo systemctl stop scylla-server
Download and install the old release
------------------------------------
1. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/yum.repos.d/scylla.repo
2. Update the `Scylla RPM repo <http://www.scylladb.com/download/?platform=centos>`_ to **3.0**
3. Install
.. code:: sh
sudo yum clean all
sudo yum remove scylla\*
sudo yum install scylla
Restore the configuration file
------------------------------
.. code:: sh
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ); do sudo cp -v $conf.backup-3.0 $conf; done
Start the node
--------------
.. code:: sh
sudo systemctl start scylla-server
Validate
--------
Check the upgrade instructions above for validation. Once you are sure the node rollback is successful, move to the next node in the cluster.

View File

@@ -1,8 +0,0 @@
.. |OS| replace:: Ubuntu 16.04
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-3.0-to-2019.1/upgrade-guide-from-3.0-to-2019.1-ubuntu-16-04/#rollback-procedure
.. |APT| replace:: Scylla deb repo
.. _APT: http://www.scylladb.com/download/
.. |APT_ENTERPRISE| replace:: Scylla Enterprise Deb repo
.. _APT_ENTERPRISE: http://www.scylladb.com/enterprise-download/ubuntu-16-04/
.. include:: /upgrade/_common/upgrade-guide-from-3.0-to-2019.1-ubuntu-and-debian.rst

View File

@@ -1,8 +0,0 @@
.. |OS| replace:: Ubuntu 18.04
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: /upgrade/upgrade-to-enterprise/upgrade-guide-from-3.0-to-2019.1/upgrade-guide-from-3.0-to-2019.1-ubuntu-18-04/#id4
.. |APT| replace:: Scylla deb repo
.. _APT: http://www.scylladb.com/download/
.. |APT_ENTERPRISE| replace:: Scylla Enterprise Deb repo
.. _APT_ENTERPRISE: http://www.scylladb.com/enterprise-download/ubuntu/
.. include:: /upgrade/_common/upgrade-guide-from-3.0-to-2019.1-ubuntu-and-debian.rst
