Compare commits

280 Commits

Author SHA1 Message Date
Kamil Braun
a56f7ce21a storage_service: raft topology: warn when raft_topology_cmd_handler fails due to abort
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.

Turn exceptions from aborts into WARN level.

Also improve logging by printing the command that failed.
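
The idea of logging aborts at a lower level than other failures can be sketched in Python as follows. This is an illustration only, not the C++ handler: `AbortRequested`, `run_topology_cmd`, the message texts, and the choice to re-raise are all assumptions made for the example.

```python
import logging


class AbortRequested(Exception):
    """Hypothetical stand-in for an exception caused by an abort (e.g. shutdown)."""


def run_topology_cmd(handler, cmd, log):
    """Run handler(cmd); log aborts at WARNING and any other failure at ERROR.

    The failed command is included in the message in both cases.
    """
    try:
        return handler(cmd)
    except AbortRequested as e:
        # Expected during shutdown, so it does not deserve an ERROR.
        log.warning("topology command %s failed due to abort: %s", cmd, e)
        raise
    except Exception as e:
        log.error("topology command %s failed: %s", cmd, e)
        raise
```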

Fixes scylladb/scylladb#19754

(cherry picked from commit 7506709573)

Closes scylladb/scylladb#20072
2024-08-08 18:14:24 +02:00
Kamil Braun
4948029666 raft topology: improve logging
Add more logging for raft-based topology operations in INFO and DEBUG
levels.

Improve the existing logging, adding more details.

Fix a FIXME in test_coordinator_queue_management (by re-adding a log
message that was removed in the past -- probably by accident -- and
properly waiting for it to appear in the test).

Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.

These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).

Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945
(cherry picked from commit e8d5974961)

Closes scylladb/scylladb#20049
2024-08-08 11:59:34 +03:00
Tomasz Grabiec
89a93a784e tablets: Do not allocate tablets on nodes being decommissioned
If a tablet-based table is created concurrently with a node being
decommissioned, after tablets are already drained, the new table may be
permanently left with replicas on a node which is no longer in the
topology. That creates an immediate availability risk because we are
running with one replica down.

This also violates invariants about replica placement and this state
cannot be fixed by topology operations.

One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:

  load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at:  ...

Fixes #20032

(cherry picked from commit f5c74a5df2)

Closes scylladb/scylladb#20067
2024-08-08 11:57:09 +03:00
Dawid Medrek
d065d6f05d db/hints: Log when ignoring invalid hint directories
In 58784cd, aa4b06a and other commits migrating
hinted handoff from IPs to host IDs (scylladb/scylladb#15567),
we started ignoring hint directories of invalid names,
i.e. those that represent neither an IP address, nor a host ID.
They remain on disk and are taken into account while computing
e.g. the total size of hints, but they're not used in any way.

These changes add logs informing the user when Scylla
encounters such a directory.

Closes scylladb/scylladb#17566

(cherry picked from commit a5528a2093)

Closes scylladb/scylladb#19892
2024-08-07 10:55:06 +02:00
Michael Litvak
ccd01caed8 db: fix waiting for counter update operations on table stop
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.

Fixes scylladb/scylla-enterprise#4475

Closes scylladb/scylladb#20017
2024-08-05 12:52:32 +02:00
Piotr Dulikowski
78e3f0f208 Merge '[Backport 6.0] hinted handoff: migrate sync point to host ID ' from Dawid Mędrek
Change the format of sync points to use host ID instead of IPs, to be consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores host IDs instead of IPs.
The decoding supports both formats, with host IDs and with IPs, so a sync point now contains a variant of either type, and in the case of the new type the translation is avoided.

Fixes scylladb/scylladb#18653

(cherry picked from commit scylladb/scylladb@b824e73)
(cherry picked from commit scylladb/scylladb@afc9a1a)
(cherry picked from commit scylladb/scylladb@c56de90)
(cherry picked from commit scylladb/scylladb@222dbf2)

In scylladb/scylladb#18733, we were experiencing a test failure
because the test code was receiving the reply `"DONE"` instead of
`"IN_PROGRESS"` when awaiting a sync point. The cluster consisted
of two nodes and the last few steps of the test that are relevant were:

1. Stop node 2.
2. Enable an error injection on node 1 to prevent it from sending hints.
3. Perform mutations on node 1 leading it to save hints towards node 2.
4. Start node 2 again.
5. Create a sync point on node 1.
6. Decommission node 2.
7. Await the created sync point on node 1.

Decommissioning node 2 led to node 1 trying to drain hints saved
towards it. However, due to the error injection, the draining process
was stuck and never finished. Because of that, when node 1 received
a request to await the sync point, the hint endpoint manager
corresponding to node 2 was still present -- all of that was expected
by the test.

What was unexpected by the test was the fact that now that hinted
handoff has started identifying nodes by their host IDs, but sync points
themselves still used IP addresses internally, there had to be a point
in the code where mapping one data type to the other would happen.
That place in the code is `manager::wait_for_sync_point()`.

When a node is decommissioned/removed, its host ID--IP mapping
is removed from the locator::token_metadata. Since node 2 had been
decommissioned, we no longer had access to the mapping we needed
and so the code used the "default" replay position, which, when compared,
is smaller than any other replay position except for itself.

Because of that, Scylla thought that all of the hints corresponding to
the sync point it got had been replayed and returned `"DONE"` to
the test's code, effectively leading to its failure.

These changes prevent that from happening as we start using host IDs
in the internal format used by sync points. Similar failures might still
occur if a sync point is created before the migration to host-ID-based
hinted handoff takes place, but awaited only after the migration.
However, the chances that that would happen are quite slim. The test
itself should proceed without any failures now.
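
The failure mode described above can be illustrated with a small Python sketch. It is not Scylla's code: `ReplayPosition`, the dictionaries, and both function names are hypothetical, and the comparison is reduced to the one property the text relies on, namely that a default replay position is never greater than any other.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True, order=True)
class ReplayPosition:
    """Simplified stand-in: the default value sorts below every other value."""
    segment: int = 0
    offset: int = 0


def await_sync_point_by_ip(sync_point: Dict[str, ReplayPosition],
                           host_id_to_ip: Dict[str, str],
                           host_id: str,
                           replayed_up_to: ReplayPosition) -> str:
    """IP-keyed sync point: needs a host ID -> IP mapping to find the target."""
    ip = host_id_to_ip.get(host_id)
    # If the mapping is gone (node decommissioned), fall back to the default
    # position, which every replayed position trivially satisfies.
    target = sync_point.get(ip, ReplayPosition()) if ip else ReplayPosition()
    return "DONE" if replayed_up_to >= target else "IN_PROGRESS"


def await_sync_point_by_host_id(sync_point: Dict[str, ReplayPosition],
                                host_id: str,
                                replayed_up_to: ReplayPosition) -> str:
    """Host-ID-keyed sync point: no mapping lookup is involved."""
    target = sync_point.get(host_id, ReplayPosition())
    return "DONE" if replayed_up_to >= target else "IN_PROGRESS"
```

With the mapping present, the IP-keyed variant reports `"IN_PROGRESS"` for hints that have not been replayed; once the mapping is removed, the same call reports `"DONE"`, which matches the unexpected reply seen by the test.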

Fixes scylladb/scylladb#18733

Closes scylladb/scylladb#19967

* github.com:scylladb/scylladb:
  test/boost: include test/lib/test_utils.hh
  test/boost/hint_test.cc: Add missing parse() callback
  db/hints: migrate sync point to host ID
  db/hints: rename sync point structures with _v1 suffix to _v1_v2
2024-08-05 09:46:48 +02:00
Kefu Chai
e1dab2779d test/boost: include test/lib/test_utils.hh
this change was created in the same spirit of 505900f18f. because
we are deprecating the operator<< for vector and unordered_map in
Seastar, some tests do not compile anymore if we disable these
operators. so to be prepared for the change disabling them, let's
include test/lib/test_utils.hh for accessing the printer dedicated
for Boost.test. and also '#include <fmt/ranges.h>' when necessary,
because, in order to format the ranges using {fmt}, we need to
use fmt/ranges.h.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-02 15:04:34 +02:00
Kamil Braun
7a2396867b Merge '[Backport 6.0] raft: fix the shutdown phase being stuck' from Emil Maskovsky
Some of the calls inside the raft_group0_client::start_operation() method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the ownership of the raft group semaphore (get_units(semaphore)) - these waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes #19223

(cherry picked from commit 2dbe9ef2f2)

(cherry picked from commit 5dfc50d354)

Refs #19860

Closes scylladb/scylladb#19985

* github.com:scylladb/scylladb:
  raft: fix the shutdown phase being stuck
  raft: use the abort source reference in raft group0 client interface
2024-08-02 11:25:37 +02:00
Emil Maskovsky
b99d87863d raft: fix the shutdown phase being stuck
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.
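
A rough Python analogue of an abortable semaphore wait is sketched below. It uses `asyncio` rather than Seastar, and `AbortRequested` and this `get_units` helper are hypothetical names chosen for the example; the point is only that the wait races against the abort signal instead of blocking unconditionally.

```python
import asyncio


class AbortRequested(Exception):
    """Hypothetical exception raised when the abort signal fires first."""


async def get_units(sem: asyncio.Semaphore, abort: asyncio.Event) -> None:
    """Acquire sem, but give up as soon as abort is set."""
    if abort.is_set():
        raise AbortRequested()
    acquire = asyncio.ensure_future(sem.acquire())
    aborted = asyncio.ensure_future(abort.wait())
    try:
        # Wake up when either the semaphore is acquired or the abort fires.
        await asyncio.wait({acquire, aborted},
                           return_when=asyncio.FIRST_COMPLETED)
    finally:
        aborted.cancel()
    if acquire.done() and not acquire.cancelled():
        return  # the caller now holds one unit of the semaphore
    # Abort won the race: cancel the pending acquire and wait for it to settle.
    acquire.cancel()
    await asyncio.gather(acquire, return_exceptions=True)
    raise AbortRequested()
```

Without the `abort` argument, a waiter on a semaphore that never becomes available would block forever, which is the kind of stuck shutdown described above.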

This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

(cherry picked from commit 5dfc50d354)
2024-08-01 19:37:02 +02:00
Emil Maskovsky
0770069dda raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.

(cherry picked from commit 2dbe9ef2f2)
2024-08-01 19:36:00 +02:00
Dawid Medrek
13183069f7 test/boost/hint_test.cc: Add missing parse() callback
Before these changes, compilation was failing with the following
error:

In file included from test/boost/hint_test.cc:12:
/usr/include/fmt/ranges.h:298:7: error: no member named 'parse' in 'fmt::formatter<db::hints::sync_point::host_id_or_addr>'
  298 |     f.parse(ctx);
      |     ~ ^

We add the missing callback.

Closes scylladb/scylladb#19375
2024-08-01 14:49:36 +02:00
Michael Litvak
df0503afd6 db/hints: migrate sync point to host ID
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats with host IDs and IPs, so a sync point
now contains a variant of either type, and in the case of the new
format the translation from IP to host ID is avoided.
2024-07-31 18:00:28 +02:00
Michael Litvak
42ee9f9e59 db/hints: rename sync point structures with _v1 suffix to _v1_v2
rename sync point types and variables to have v1/v2 suffix according to
their use.
2024-07-31 17:59:08 +02:00
Kamil Braun
9572674f25 docs: extend "forbidden operations" section for Raft-topology upgrade
The Raft-topology upgrade procedure must not be run concurrently with
version upgrade.

(cherry picked from commit bb0c3cdc65)

Closes scylladb/scylladb#19837
2024-07-29 16:53:01 +02:00
Tomasz Grabiec
416cbafd16 Merge '[Backport 6.0] sstables: fix some mixups between the writer's schema and the sstable's schema' from Michał Chojnowski
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstable's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

This series fixes the known mixups between the two — when setting up compression,
and when setting up the bloom filters.

Fixes scylladb/scylladb#16065

The bug is present in all supported versions, so the patch has to be backported to all of them.

(cherry picked from commit a1834efd82)

(cherry picked from commit d10b38ba5b)

(cherry picked from commit 1a8ee69a43)

Refs scylladb/scylladb#19695

Closes scylladb/scylladb#19877

* github.com:scylladb/scylladb:
  sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
  sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
  sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
2024-07-29 15:36:52 +02:00
Jenkins Promoter
36cb61589d Update ScyllaDB version to: 6.0.3
2024-07-29 15:21:14 +03:00
Takuya ASADA
fefa76bffc scylla_raid_setup: install update-initramfs when it's not available
scylla_raid_setup may fail on Ubuntu minimal image since it calls
update-initramfs without installing.

(cherry picked from commit b6dedf1ee1)

Closes scylladb/scylladb#19871
2024-07-25 13:58:11 +03:00
Michał Chojnowski
43ba44ce97 sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstable's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
on the sstable's schema.
This patch forces the writer to use the sstable's schema for both.

(cherry picked from commit 1a8ee69a43)
2024-07-25 12:23:58 +02:00
Michał Chojnowski
d6d3a91283 sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstable's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.

This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.

(cherry picked from commit d10b38ba5b)
2024-07-25 12:23:58 +02:00
Michał Chojnowski
0555d4c30b sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
As of this patch, those static_casts are actually invalid in some cases
(they cast to the wrong type) because of an oversight.
A later patch will fix that. But to even write a reliable reproducer
for the problem, we must force the invalid casts to manifest as a crash
(instead of weird results).

This patch both allows writing a reproducer for the bug and serves
as a bit of defensive programming for the future.

(cherry picked from commit a1834efd82)

# Conflicts:
#	sstables/sstables.cc
2024-07-25 12:23:58 +02:00
Nadav Har'El
af39675c38 alternator: fix "/localnodes" to not return nodes still joining
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.

The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.

The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().

But the hard part of this patch is the testing:

1. An existing multi-node test for "/localnodes" assumed that right after
   a new node was created, it appears on "/localnodes". But after this
   patch, it may take a bit more time for the bootstrapping to complete
   and the new node to appear in /localnodes - so I had to add a retry loop.

2. I added a test that reproduces the bug fixed here, and verifies its
   fix. The test is in the multi-node topology framework. It adds an
   injection which delays the bootstrap, which leaves a new node in JOINING
   state for a long time. The test then verifies that the new node is
   alive (as checked by the REST API), but is not returned by "/localnodes".

3. The new injection for delaying the bootstrap is unfortunately not
   very pretty - I had to do it in three places because we have several
   code paths of how bootstrap works without repair, with repair, without
   Raft and with Raft - and I wanted to delay all of them.
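
The fix and the retry loop from item 1 can be sketched in Python. This is a simplified model, not Alternator's implementation: the `Node` record, its `state` strings, and the `wait_for` helper are assumptions made for the example, and the real is_normal() check lives in the gossiper.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    address: str
    dc: str
    alive: bool
    state: str  # illustrative values: "JOINING" or "NORMAL"


def local_nodes(nodes: List[Node], local_dc: str) -> List[str]:
    """Return addresses in local_dc that are ready to serve requests.

    Filtering on `alive` alone would also return nodes that are still joining.
    """
    return [n.address for n in nodes
            if n.dc == local_dc and n.state == "NORMAL"]


def wait_for(predicate: Callable[[], bool],
             timeout: float = 5.0, period: float = 0.1) -> bool:
    """Retry predicate until it holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(period)
    return predicate()
```

A test that adds a node would then wrap its check in `wait_for(...)` instead of asserting immediately, since the node only appears once bootstrapping completes.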

Fixes #19694.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19725

(cherry picked from commit bac7c33313)
(deleted test for cherry-pick)
2024-07-24 11:29:36 +03:00
Lakshmi Narayanan Sreethar
3c1fd843c8 [Backport 6.0]: sstables: do not reload components of unlinked sstables
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.

The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.

Fixes https://github.com/scylladb/scylladb/issues/19722

Closes scylladb/scylladb#19717

* github.com:scylladb/scylladb:
  boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
  sstables: do not reload components of unlinked sstables
  sstables/sstables_manager: introduce on_unlink method

(cherry picked from commit 591876b44e)

Backported from #19717 to 6.0

Closes scylladb/scylladb#19830
2024-07-23 23:16:53 +03:00
Piotr Dulikowski
9cc20e7c4d Merge '[Backport 6.0] schema: fix describe of indexes on collections' from ScyllaDB
If the index was created on a collection (frozen or not), its description wasn't a correct create statement.
This patch fixes the bug and includes functions like `full()`, `keys()`, `values()`, ... used to create index on collections.

Fixes scylladb/scylladb#19278

(cherry picked from commit 253feb6811)

(cherry picked from commit b65a4c66f0)

 Refs #19381

Closes scylladb/scylladb#19700

* github.com:scylladb/scylladb:
  cql-pytest/test_describe: add a test for describe indexes
  schema/schema: fix column names in index description
2024-07-22 12:33:47 +02:00
Kamil Braun
9efaca0bd2 Merge '[Backport 6.0] test: raft: fix the flaky test_raft_recovery_stuck' from ScyllaDB
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is fixed.

Fixes scylladb/scylladb#19154

(cherry picked from commit ef3393bd36)

(cherry picked from commit a89facbc74)

 Refs #19771

Closes scylladb/scylladb#19809

* github.com:scylladb/scylladb:
  test: raft: fix the flaky `test_raft_recovery_stuck`
  test: raft: code cleanup in `test_raft_recovery_stuck`
2024-07-22 11:14:44 +02:00
Emil Maskovsky
62c9709f4a test: raft: fix the flaky test_raft_recovery_stuck
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is
fixed.

Fixes scylladb/scylladb#19154

(cherry picked from commit a89facbc74)
2024-07-20 02:17:50 +00:00
Emil Maskovsky
64d414f10a test: raft: code cleanup in test_raft_recovery_stuck
Cleaning up the imports.

(cherry picked from commit ef3393bd36)
2024-07-20 02:17:50 +00:00
Kamil Braun
f32ed716ed Merge '[Backport 6.0] Fix lwt semaphore guard accounting' from ScyllaDB
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.

Should be backported to all supported versions.

(cherry picked from commit 87beebeed0)

(cherry picked from commit 4178589826)

 Refs #19699

Closes scylladb/scylladb#19796

* github.com:scylladb/scylladb:
  test: add test to check that coordinator lwt semaphore continues functioning after locking failures
  paxos: do not signal semaphore if it was not acquired
2024-07-19 19:06:36 +02:00
Gleb Natapov
c437c8be36 test: add test to check that coordinator lwt semaphore continues functioning after locking failures
(cherry picked from commit 4178589826)
2024-07-18 15:34:17 +00:00
Gleb Natapov
1c04b95c68 paxos: do not signal semaphore if it was not acquired
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.
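
The difference between a manually managed flag and a units-style object can be shown with a Python sketch. It uses `threading.Semaphore` instead of Seastar, and `FlagGuard`, `Units`, and `available` are hypothetical names; the flawed guard shows one way such a bug can arise, not the exact original code.

```python
import threading


class FlagGuard:
    """Flawed pattern: ownership is tracked with a manually managed flag."""

    def __init__(self, sem):
        self._sem = sem
        self._locked = False

    def lock(self):
        self._locked = True  # set before the outcome of acquire is known
        if not self._sem.acquire(blocking=False):
            raise TimeoutError("could not lock")

    def close(self):
        if self._locked:
            self._sem.release()  # may signal a semaphore that was never held


class Units:
    """Exists only after a successful acquire and releases at most once."""

    def __init__(self, sem):
        if not sem.acquire(blocking=False):
            raise TimeoutError("could not lock")
        self._sem = sem

    def close(self):
        if self._sem is not None:
            self._sem.release()
            self._sem = None


def available(sem):
    """Count free units by draining the semaphore and restoring it."""
    n = 0
    while sem.acquire(blocking=False):
        n += 1
    for _ in range(n):
        sem.release()
    return n
```

When locking fails, `FlagGuard.close()` still releases, so the semaphore ends up with more units than it started with; with `Units`, a failed acquire never produces an object that could release.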

Fixes: https://github.com/scylladb/scylladb/issues/19698
(cherry picked from commit 87beebeed0)
2024-07-18 15:34:16 +00:00
Emil Maskovsky
5649b55e08 test: raft: fix the flaky test_change_ip
The python driver might currently trigger spurious reconnects that cause
the `NoHostAvailable` to be thrown, which is not expected.

This patch adds a retry mechanism to the test to skip this failure
if it occurs, as a work-around.

The proper fix is expected to be done in the scylladb/python-driver#295;
once it is fixed there, this work-around can be reverted.

Fixes: scylladb/scylla#18547
(cherry picked from commit 6b9992737a)

Closes scylladb/scylladb#19773
2024-07-18 15:06:23 +02:00
Emil Maskovsky
b4745406da raft: Fix crash in leader_host API handler
The leader_host API handler was eventually using the `req` unique_ptr
after it had already been destroyed (passed down to the future lambda
by reference). This was causing an occasional crash in some tests.

Reworked the leader_host handler to use the req only outside of the
future lambda.

Also updated the code to handle the possibility that the non-default
leader group (other than Group 0) might reside on a different shard
than the shard 0 - using the same concept of calling on all shards via
`invoke_on_all()` as done for the other requests.

Fixes scylladb/scylladb#19714

(cherry picked from commit b2db8f4b9b)

Closes scylladb/scylladb#19742
2024-07-16 13:29:37 +02:00
Anna Stuchlik
27faec3015 doc: replace a link on the CDC+Kafka page
This commit replaces a link to the installation section with a link to the getting started section.

(cherry picked from commit f90867c740)

Closes scylladb/scylladb#19712
2024-07-16 13:15:45 +02:00
Emil Maskovsky
06c356df8f test: raft: fix the topology failure recovery test flakiness
Setting the error condition for all nodes in the cluster to avoid
having to check which one is the coordinator. This should make the test
more stable and avoid the flakiness observed when the coordinator node
is the one that got the error condition injected.

Randomizing the retrieved running servers to reproduce the issue more
frequently and to avoid making any assumptions about the order of the
servers.

Note that only the "raft_topology_barrier_fail" needs to run
on a non-coordinator node, the other error "stream_ranges_fail" can be
injected on any node (including the coordinator).

Fixes: #18614
(cherry picked from commit 9dbad34205)

Closes scylladb/scylladb#19708
2024-07-15 16:27:22 +02:00
Michael Litvak
815a707b0a storage_proxy: remove response handler if no targets
When writing a mutation, it might happen that there are no live targets
to send the mutation to, yet the request can be satisfied. For example,
when writing with CL=ANY to a dead node, the request is completed by
storing a local hint.

Currently, in that case, a write response handler is created for the
request and it remains active until it times out because it is not
removed anywhere, even though the write is completed successfully after
storing the hint. The response handler should be removed usually when
receiving responses from all targets, but in this case there are no
targets to trigger the removal.

In this commit we check if we don't have live targets to send the
mutation to. If so, we remove the response handler immediately.
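
The bookkeeping problem can be modelled with a short Python sketch. `WriteCoordinator` and its fields are hypothetical and much simpler than storage_proxy; the sketch only shows that a handler with no live targets has nothing to trigger its removal, so it must be removed right away.

```python
class WriteCoordinator:
    """Toy model of write response handlers keyed by an integer id."""

    def __init__(self):
        self.handlers = {}  # handler id -> set of targets still pending
        self.hints = []     # (dead target, mutation) pairs stored locally
        self._next_id = 0

    def write(self, mutation, targets, live):
        handler_id = self._next_id
        self._next_id += 1
        live_targets = {t for t in targets if t in live}
        # Dead targets get a local hint instead of a message.
        self.hints.extend((t, mutation) for t in targets if t not in live)
        self.handlers[handler_id] = live_targets
        if not live_targets:
            # No response will ever arrive, so remove the handler immediately
            # rather than leaving it registered until a timeout.
            del self.handlers[handler_id]
        return handler_id

    def on_response(self, handler_id, target):
        pending = self.handlers.get(handler_id)
        if pending is None:
            return
        pending.discard(target)
        if not pending:
            del self.handlers[handler_id]
```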

Fixes scylladb/scylladb#19529

(cherry picked from commit a9fdd0a93a)

Closes scylladb/scylladb#19680
2024-07-15 08:24:18 +02:00
Botond Dénes
16452f9cf5 Merge '[Backport 6.0] scylla-sstable: add method to load the schema from the sstable itself' from ScyllaDB
As it turns out, each sstable carries its own schema in its serialization header (Statistics component). This schema is incomplete -- the names of the key columns are not stored, just their type. Static and regular columns do have names and types stored however. This bare-bones schema is enough to parse and display the content of the sstable. Another thing missing is schema options (the stuff after the `WITH` keyword, except the clustering order). The only options stored are the compression options (in the CompressionInfo component), this is actually needed to read the Data component.

This series adds a new method to `tools/schema_loader.cc` to extract the schema stored in the sstable itself. This new schema load method is used as the last fall-back for obtaining the schema, in case scylla-sstable is trying to autodetect the schema of the sstable. Although, right now this bare-bones schema is enough for everything scylla-sstable does, it is more future proof to stick to the "full" schema if possible, so this new method is the last resort for now.

Fixes: https://github.com/scylladb/scylladb/issues/17869
Fixes: https://github.com/scylladb/scylladb/issues/18809

New functionality, no backport needed.

(cherry picked from commit 435c01d1e6)

(cherry picked from commit 0d7335dd27)

(cherry picked from commit 8f2ba03465)

(cherry picked from commit 43c44f0af5)

(cherry picked from commit 145a67f77c)

 Refs #19169

Closes scylladb/scylladb#19711

* github.com:scylladb/scylladb:
  tools/scylla-sstable: log loaded schema with trace level
  tools/scylla-sstable: load schema from the sstable as fallback
  tools/schema_loader: introduce load_schema_from_sstable()
  test/lib/random_schema: remove assert on min number of regular columns
  sstables: introduce load_metadata()
2024-07-12 16:55:44 +03:00
Botond Dénes
5d94a08250 tools/scylla-sstable: log loaded schema with trace level
The schema of the sstable can be interesting, so log it with trace
level. Unfortunately, this is not the nice CQL statement we are used to
(that requires a database object), but the not-nearly-so-nice CFMetadata
printout. Still, it is better than nothing.

(cherry picked from commit 145a67f77c)
2024-07-12 10:36:59 +00:00
Botond Dénes
4f74e6f28e tools/scylla-sstable: load schema from the sstable as fallback
When auto-detecting the schema of the sstable, if all other methods
failed, load the schema from the sstable's serialization header. This
schema is incomplete. It is just enough to parse and display the content
of the sstable. Although parsing and displaying the content of the
sstable is all scylla-sstable does, it is more future-compatible to use
the full schema when possible. So the always-available but minimal
schema that each sstable has on itself, is used just as a fallback.

The test which tested the case when all schema load attempts fail,
doesn't work now, because loading the serialization header always
succeeds. So convert this test into two positive tests, testing the
serialization header schema fallback instead.

(cherry picked from commit 43c44f0af5)
2024-07-12 10:36:59 +00:00
Botond Dénes
f42e8e872a tools/schema_loader: introduce load_schema_from_sstable()
Allows loading the schema from an sstable's serialization header. This
schema is incomplete, but it is enough to parse and display the content
of the sstable.

(cherry picked from commit 8f2ba03465)
2024-07-12 10:36:59 +00:00
Botond Dénes
f7c8c32929 test/lib/random_schema: remove assert on min number of regular columns
It is legal for a schema to have 0 regular columns, so remove the assert
on the schema specification's regular column count.

(cherry picked from commit 0d7335dd27)
2024-07-12 10:36:59 +00:00
Botond Dénes
4f165eb3e9 sstables: introduce load_metadata()
Loads just the metadata components. No validation.
Split off from load(), to allow scylla-sstable to partially load an
sstable.

(cherry picked from commit 435c01d1e6)
2024-07-12 10:36:59 +00:00
Michał Jadwiszczak
25f8fd0b5c cql-pytest/test_describe: add a test for describe indexes
(cherry picked from commit b65a4c66f0)
2024-07-11 12:59:27 +00:00
Michał Jadwiszczak
67764e7d66 schema/schema: fix column names in index description
Previously, the description of an index didn't include the functions used
for indexes on collections, like full(), keys(), values(), etc.

(cherry picked from commit 253feb6811)
2024-07-11 12:59:27 +00:00
Tomasz Grabiec
43ff19273c Merge '[Backport 6.0] mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion' from ScyllaDB
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.

This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.

Fixes https://github.com/scylladb/scylladb/issues/19552

(cherry picked from commit f784be6a7e)

(cherry picked from commit 7b3f55a65f)

(cherry picked from commit 78d6471ce4)

Refs #19617

Closes scylladb/scylladb#19675

* github.com:scylladb/scylladb:
  mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
  logalloc: add hold_reserve
  logalloc: generalize refill_emergency_reserve()
2024-07-10 14:28:01 +02:00
Michał Chojnowski
aee0150506 mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.

This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.

Fixes scylladb/scylladb#19552

(cherry picked from commit 78d6471ce4)
2024-07-10 08:36:11 +00:00
Michał Chojnowski
c5c19e90ac logalloc: add hold_reserve
mutation_partition_v2::apply_monotonically() needs to perform some allocations
in a destructor, to ensure that the invariants of the data structure are
restored before returning. But it is usually called with reclaiming disabled,
so the allocations might fail even in a perfectly healthy node with plenty of
reclaimable memory.

This patch adds a mechanism which allows reserving some LSA memory (by
asking the allocator to keep it unused) and making it available for allocation
right when we need to guarantee allocation success.
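
As a rough illustration of the idea (a toy Python model, not the logalloc API; `ToyPool`, `hold` and `release_hold` are hypothetical names), segments held in reserve are invisible to ordinary allocations until they are released right before the critical step:

```python
class ToyPool:
    """Toy model of an allocator with a held-back reserve (not logalloc)."""

    def __init__(self, free_segments: int) -> None:
        self.free = free_segments  # segments ordinary allocations may use
        self.held = 0              # segments kept unused on request

    def allocate(self) -> None:
        if self.free == 0:
            raise MemoryError("no free segments")
        self.free -= 1

    def hold(self, n: int) -> None:
        # Move n segments out of the free pool so nothing else can consume them.
        if self.free < n:
            raise MemoryError("cannot reserve")
        self.free -= n
        self.held += n

    def release_hold(self) -> None:
        # Make the reserve available right before the critical allocation.
        self.free += self.held
        self.held = 0


pool = ToyPool(free_segments=2)
pool.hold(1)         # reserve 1 segment up front
pool.allocate()      # ordinary work uses up the rest
pool.release_hold()  # just before the allocation that must not fail
pool.allocate()      # succeeds thanks to the reserve
```

Without the `hold(1)` call, the last allocation in this toy model would depend on whatever earlier work left behind.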

(cherry picked from commit 7b3f55a65f)
2024-07-10 08:36:11 +00:00
Michał Chojnowski
985f5a50f6 logalloc: generalize refill_emergency_reserve()
In the next patch, we will want to do the same thing as
refill_emergency_reserve() does, just with a quantity different
than _emergency_reserve_max. So we split off the shareable part
to a new function, and use it to implement refill_emergency_reserve().

(cherry picked from commit f784be6a7e)
2024-07-10 08:36:11 +00:00
Botond Dénes
ae11381d7c Merge '[Backport 6.0] reader_concurrency_semaphore: make CPU concurrency configurable' from Botond Dénes
The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of them finishing later than they could if they were the only active read in the cache.
This was observed to backfire in the case where reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read holds up the fast reads in the queue, increasing the 99th percentile latencies significantly.

This series proposes to fix this by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow cutting any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be widespread.

Fixes: https://github.com/scylladb/scylladb/issues/19017

This PR backports https://github.com/scylladb/scylladb/pull/19018 and its follow-up https://github.com/scylladb/scylladb/pull/19600.

Closes scylladb/scylladb#19644

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop
  test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency
  test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
  reader_concurrency_semaphore: wire in the configurable cpu concurrency
  reader_concurrency_semaphore: add cpu_concurrency constructor parameter
  db/config: introduce reader_concurrency_semahore_cpu_concurrency
2024-07-10 07:23:08 +03:00
Anna Stuchlik
4ec5a06101 doc: update Scylla Doctor installation
This commit updates the instructions on how to download and run Scylla Doctor,
following the changes in how Scylla Doctor is released.

(cherry picked from commit 2ffda9b262)

Closes scylladb/scylladb#19525
2024-07-09 14:32:21 +03:00
Anna Stuchlik
dcf4c757b2 doc: remove support for Debian 10
This PR removes support for Debian 10, which reached end of life on June 30, 2024.

Refs https://github.com/scylladb/scylla-enterprise/issues/4377

(cherry picked from commit 1f340428ea)

Closes scylladb/scylladb#19630
2024-07-09 12:55:11 +02:00
Wojciech Przytuła
a7fe9eeffd storage_proxy: fix uninitialized LWT contention counter
When debugging the issue of high LWT contention metric, we (the
drivers team) discovered that at least 3 drivers (Go, Java, Rust)
cause high numbers in that metric in LWT workloads - we doubted that
all those drivers route LWT queries badly. We tried to understand that
metric and its semantics. It took 3 people over 10 hours to figure out
what it is supposed to count.

People from core team suspected that it was the drivers sending
requests to different shards, causing contention. Then we ran the
workload against a single node single shard cluster... and observed
contention. Finally, we looked into the Scylla code and saw it.

**Uninitialized stack value.**

The core member was shocked. But we, the drivers people, felt we always
knew it. It's yet another time that we are blamed for a server-side
issue. We rebuilt scylla with the variable initialized to 0 and the
metric kept being 0.

To prevent such errors in the future, let's consider some lints that
warn against uninitialized variables. This is such an obvious feature
of e.g. Rust, and yet this has been shown to cause a painful bug in 2024.

Fixes: scylladb/scylladb#19654
(cherry picked from commit 36a125bf97)

Closes scylladb/scylladb#19657
2024-07-09 11:41:10 +02:00
Michael Litvak
ad6eb1cadf view: drain view builder before database
The view builder is doing write operations to the database.
In order for the view builder to shutdown gracefully without errors, we
need to ensure the database can handle writes while it is drained.
The commit changes the drain order, so that view builder is drained
before the database shuts down.

Fixes scylladb/scylladb#18929

(cherry picked from commit 9d9318c564)

Closes scylladb/scylladb#19636
2024-07-08 19:16:26 +02:00
Botond Dénes
dadc0c32e1 reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop
Now that the CPU concurrency limit is configurable, new reads might be
ready to execute right after the current one was executed. So move the
poll for admitting new reads into the inner loop, to prevent the
situation where the inner loop yields and a concurrent
do_wait_admission() finds that there are waiters (queued because at the
time they arrived at the semaphore, the _ready_list was not empty) but it
is possible to admit a new read. When this happens the semaphore will
dump diagnostics to help debug the apparent contradiction, which can
generate a lot of log spam. Moving the poll into the inner loop prevents
the false-positive contradiction detection from firing.

Refs: scylladb/scylladb#19017

Closes scylladb/scylladb#19600

(cherry picked from commit 155acbb306)
2024-07-08 08:13:40 +03:00
Botond Dénes
88d3c2eb4b test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency
(cherry picked from commit b4f3809ad2)
2024-07-08 08:13:07 +03:00
Botond Dénes
4307631950 test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
This is currently a lambda in a test, hoist it into the global scope and
make it into a function, so other tests can use it too (in the next
patch).

(cherry picked from commit 9cbdd8ef92)
2024-07-08 08:12:34 +03:00
Botond Dénes
abc4a9b635 reader_concurrency_semaphore: wire in the configurable cpu concurrency
Before this patch, the semaphore was hard-wired to stop admission, if
there is even a single permit, which is in the need_cpu state.
Therefore, keeping the CPU concurrency at 1.
This patch makes use of the new cpu_concurrency parameter, which was
wired in in the last patches, allowing for a configurable amount of
concurrent need_cpu permits. This is to address workloads where some
small subset of reads are expected to be slow, and can hold up faster
reads behind them in the semaphore queue.
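
A minimal sketch of the admission rule described above, written as a toy Python model rather than the actual semaphore code. `ToySemaphore` and its fields are hypothetical, and the real semaphore also accounts for memory and count resources, which this sketch ignores:

```python
from dataclasses import dataclass


@dataclass
class ToySemaphore:
    """Hypothetical stand-in for the CPU part of the admission check."""
    cpu_concurrency: int = 1   # 1 reproduces the old hard-wired behaviour
    need_cpu_permits: int = 0  # permits currently in the need_cpu state

    def can_admit(self) -> bool:
        # Admission stops once the number of need_cpu permits reaches the limit.
        return self.need_cpu_permits < self.cpu_concurrency


strict = ToySemaphore(cpu_concurrency=1, need_cpu_permits=1)
relaxed = ToySemaphore(cpu_concurrency=2, need_cpu_permits=1)
```

With one slow read already holding a need_cpu permit, `strict` refuses new reads while `relaxed` still admits one more.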

(cherry picked from commit 07c0a8a6f8)
2024-07-08 08:12:34 +03:00
Botond Dénes
052cef2621 reader_concurrency_semaphore: add cpu_concurrency constructor parameter
In the case of the user semaphore, this receives the new
reader_concurrency_semaphore_cpu_limit config item.
Not used yet.

(cherry picked from commit 59faa6d4ff)
2024-07-08 08:12:20 +03:00
Botond Dénes
5a7af93c7c db/config: introduce reader_concurrency_semahore_cpu_concurrency
To allow increasing the semaphore's CPU concurrency, which is currently
hard-limited to 1. Not wired yet.

(cherry picked from commit c7317be09a)
2024-07-08 08:06:28 +03:00
Pavel Emelyanov
78f3fc8890 tablet_allocator: Put more info into failed-to-drain exception
When the balancer fails to find a node to balance drained tablets into, it
throws an exception with the tablet id and node id, but it's also good to
know more details about the balancing state that led to the failure.

refs: #19504

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c3d9831c5f)

Closes scylladb/scylladb#19619
2024-07-05 11:17:37 +03:00
None
3e06c882f0 .github: remove pull_request_template
The reason for the PR template is to explain why we need to backport
a PR.

On release branches there is no need for it.

Closes scylladb/scylladb#19615
2024-07-04 16:52:27 +03:00
Avi Kivity
c6e8a7f762 Merge '[Backport 6.0] Close output_stream in get_compaction_history() API handler' from ScyllaDB
If an httpd body writer is called with output_stream<>, it must close the stream on its own regardless of any exceptions it may generate while working, otherwise the stream destructor may trip the non-closed assertion. This was hit with a different handler, see #19541

Coroutinize the handler as the first step while at it (though the fix would have been notably shorter if done with .finally() lambda)

(cherry picked from commit acb351f4ee)

(cherry picked from commit 6d4ba98796)

(cherry picked from commit b4f9387a9d)

 Refs #19543

Closes scylladb/scylladb#19603

* github.com:scylladb/scylladb:
  api: Close response stream of get_compaction_history()
  api: Flush output stream in get_compaction_history() call
  api: Coroutinize get_compaction_history inner function
2024-07-04 15:08:08 +03:00
Pavel Emelyanov
941ec80a00 api: Close response stream of get_compaction_history()
The function must close the stream even if it throws along the way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b4f9387a9d)
2024-07-03 18:30:17 +00:00
Pavel Emelyanov
ab5041cb03 api: Flush output stream in get_compaction_history() call
It's currently implicitly flushed on its close, but in that case close
can throw while flushing. The next patch wants close not to throw, and that's
possible if the stream is flushed in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 6d4ba98796)
2024-07-03 18:30:17 +00:00
Pavel Emelyanov
009f5eb69e api: Coroutinize get_compaction_history inner function
The handler returns a function which is then invoked with output_stream
argument to render the json into. This function is converted into
coroutine. It has yet another inner lambda that's passed into
compaction_manager::get_compaction_history() as consumer lambda. It's
coroutinized too.

The indentation looks weird as preparation for future patching.
Hopefully it's still possible to understand what's going on.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit acb351f4ee)
2024-07-03 18:30:17 +00:00
Tzach Livyatan
c9cd171f42 Docs: Fix a typo in sstable-corruption.rst
(cherry picked from commit a7115124ce)

Closes scylladb/scylladb#19591
2024-07-03 10:24:44 +02:00
Piotr Dulikowski
8b9e62e107 Merge '[Backport 6.0] cql3/statement/select_statement: do not parallelize single-partition aggregations' from Michał Jadwiszczak
This patch adds a check whether an aggregation query is doing a single-partition read and, if so, makes the query not use forward_service and not parallelize the request.

Fixes scylladb/scylladb#19349

(cherry picked from commit e9ace7c203)

(cherry picked from commit 8eb5ca8202)

Refs scylladb/scylladb#19350

Closes scylladb/scylladb#19499

* github.com:scylladb/scylladb:
  test/boost/cql_query_test: add test for single-partition aggregation
  cql3/select_statement: do not parallelize single-partition aggregations
2024-07-02 21:03:24 +02:00
Kamil Braun
4e21421ddc Merge '[Backport 6.0] Do not expire local addres in raft address map since the local node cannot disappear' from ScyllaDB
A node may wait in the topology coordinator queue for a while before being
joined. Since the local address is added as an expiring entry to the raft
address map, it may expire in the meantime and the bootstrap will fail. The
series makes the entry non-expiring.

Fixes  scylladb/scylladb#19523

Needs to be backported to 6.0 since the bug may cause bootstrap to fail.

(cherry picked from commit 5d8f08c0d7)

(cherry picked from commit 3f136cf2eb)

 Refs #19557

Closes scylladb/scylladb#19574

* github.com:scylladb/scylladb:
  test: add test that checks that local address cannot expire between join request placemen and its processing
  storage_service: make node's entry non expiring in raft address map
2024-07-01 16:20:17 +02:00
Gleb Natapov
724ec62e22 test: add test that checks that local address cannot expire between join request placemen and its processing
(cherry picked from commit 3f136cf2eb)
2024-07-01 10:44:31 +00:00
Gleb Natapov
a6c5f8192d storage_service: make node's entry non expiring in raft address map
Local address map entry should never expire in the address map.

(cherry picked from commit 5d8f08c0d7)
2024-07-01 10:44:31 +00:00
Pavel Emelyanov
20b99246fd Merge '[Backport 6.0] Close output stream in task manager's API get_tasks handler' from ScyllaDB
If client stops reading response early, the server-side stream throws but must be closed anyway. Seen in another endpoint and fixed by #19541

(cherry picked from commit 4897d8f145)

(cherry picked from commit 986a04cb11)

(cherry picked from commit 1be8b2fd25)

 Refs #19542

Closes scylladb/scylladb#19562

* github.com:scylladb/scylladb:
  api: Fix indentation after previous patch
  api: Close response stream on error
  api: Flush response output stream before closing
2024-07-01 10:47:30 +03:00
Pavel Emelyanov
8e74ac5140 Merge '[Backport 6.0] Close output_stream in get_snapshot_details() API handler' from ScyllaDB
All streams used by httpd handlers are to be closed by the handler itself, caller doesn't take care of that.

fixes: #19494

(cherry picked from commit d1fd886608)

(cherry picked from commit a0c1552cea)

(cherry picked from commit 1839030e3b)

 Refs #19541

Closes scylladb/scylladb#19563

* github.com:scylladb/scylladb:
  api: Fix indentation after previous patch
  api: Close output_stream on error
  api: Flush response output stream before closing
2024-07-01 10:47:08 +03:00
Pavel Emelyanov
4e17a5a1c2 api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1839030e3b)
2024-06-30 19:20:11 +00:00
Pavel Emelyanov
c5c168a1db api: Close output_stream on error
If the get_snapshot_details() lambda throws, the output stream remains
non-closed which is bad. Close it regardless of what.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a0c1552cea)
2024-06-30 19:20:10 +00:00
Pavel Emelyanov
09272d2478 api: Flush response output stream before closing
Otherwise close() may throw and this is what next patch will want not to
happen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit d1fd886608)
2024-06-30 19:20:10 +00:00
Pavel Emelyanov
1e7f377b0a api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1be8b2fd25)
2024-06-30 19:19:52 +00:00
Pavel Emelyanov
b038177f19 api: Close response stream on error
The handler's lambda is called with && stream object and must close the
stream on its own regardless of what.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 986a04cb11)
2024-06-30 19:19:52 +00:00
Pavel Emelyanov
426bc6a4e1 api: Flush response output stream before closing
The .close() method flushes the stream, but it may throw doing it. Next
patch will want .close() not to throw, for that stream must be flushed
explicitly before closing.
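
The flush-then-close ordering can be sketched in Python as follows. This is only an analogy for the handler's control flow, not Seastar code; `ToyStream`, `render` and `failing_items` are hypothetical names:

```python
class ToyStream:
    """Hypothetical output stream that records what happened to it."""

    def __init__(self) -> None:
        self.chunks = []
        self.flushed = False
        self.closed = False

    def write(self, chunk: str) -> None:
        self.chunks.append(chunk)

    def flush(self) -> None:
        self.flushed = True

    def close(self) -> None:
        self.closed = True


def render(stream: ToyStream, items) -> None:
    try:
        for item in items:
            stream.write(item)
        stream.flush()   # flush errors surface here, not inside close()
    finally:
        stream.close()   # always runs, whether or not the body raised


def failing_items():
    yield "first"
    raise RuntimeError("producer failed")


ok = ToyStream()
render(ok, ["a", "b"])

broken = ToyStream()
try:
    render(broken, failing_items())
except RuntimeError:
    pass  # the error still propagates; the stream is closed regardless
```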

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4897d8f145)
2024-06-30 19:19:52 +00:00
Piotr Smaron
6a1e0489c6 cql: forbid switching from tablets to vnodes in ALTER KS
This check is already in place, but isn't fully working, i.e.
switching from a vnode KS to a tablets KS is not allowed, but
this check doesn't work in the other direction. To fix the
latter, `ks_prop_defs::get_initial_tablets()` has been changed
to handle 3 states: (1) init_tablets is set, (2) it was skipped,
(3) tablets are disabled. These couldn't fit into std::optional,
so a new local struct to hold these states has been introduced.
Callers of this function have been adjusted to set init_tablets
to an appropriate value according to the circumstances, i.e. if
tablets are globally enabled, but have been skipped in the CQL,
init_tablets is automatically set to 0, but if someone executes
ALTER KS and doesn't provide tablets options, they're inherited
from the old KS.
I tried various approaches and this one resulted in the least
lines of code changed. I also provided testcases to explain how
the code behaves.
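
A rough Python sketch of the three states and how callers might resolve them. All names here are hypothetical, and the branch for "skipped while tablets are globally disabled" is an assumption, not something stated above:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class TabletsOption(Enum):
    SET = auto()       # (1) init_tablets given explicitly
    SKIPPED = auto()   # (2) tablets options omitted in the statement
    DISABLED = auto()  # (3) tablets explicitly disabled


@dataclass
class InitTablets:
    option: TabletsOption
    value: int = 0  # meaningful only when option is SET


def resolve(parsed: InitTablets, *, tablets_enabled: bool,
            is_alter: bool = False,
            old_value: Optional[int] = None) -> Optional[int]:
    """Return the initial tablet count, or None for a vnode keyspace."""
    if parsed.option is TabletsOption.DISABLED:
        return None
    if parsed.option is TabletsOption.SET:
        return parsed.value
    if is_alter:
        return old_value  # ALTER KS without tablets options inherits
    # Assumption: with tablets globally disabled, skipping means vnodes.
    return 0 if tablets_enabled else None
```

A plain `Optional[int]` cannot distinguish "skipped" from "disabled", which is why a separate state is carried alongside the value.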

Fixes: #18795
(cherry picked from commit 758139c8b2)

Closes scylladb/scylladb#19540
2024-06-28 17:58:35 +03:00
Yaron Kaikov
1577765a20 .github/scripts/label_promoted_commits.py: fix adding labels when PR is closed
`prs = response.json().get("items", [])` will return empty when there are no merged PRs, and this will just skip the all-label replacement process.

This is a regression following the work done in #19442

Adding another part to handle closed PRs (which is the majority of the cases we have in Scylla core)

Fixes: https://github.com/scylladb/scylladb/issues/19441
(cherry picked from commit 2eb8344b9a)

Closes scylladb/scylladb#19527
2024-06-27 19:35:18 +03:00
Botond Dénes
c4f1f129c3 Merge '[Backport 6.0] batchlog replay: bypass tombstones generated by past replays' from ScyllaDB
The `system.batchlog` table has a partition for each batch that failed to complete. After finally applying the batch, the partition is deleted. Although the table has gc_grace_seconds = 0, tombstones can still accumulate in memory, because we don't purge partition tombstones from either the memtable or the cache. This can lead to the cache and memtable of this table accumulating many thousands or even millions of tombstones, making batchlog replay very slow. We didn't notice this before, because we would only replay all failed batches on unbootstrap, which is rare and a heavy and slow operation in its own right already.
With repair-based tombstone-gc however, we do a full batchlog replay at the beginning of each repair, and now this extra delay is noticeable.
Fix this by making sure batchlog replays don't have to scan through all the tombstones generated by previous replays:
* flush the `system.batchlog` memtable at the end of each batchlog replay, so it is cleared of tombstones
* bypass the cache

Fixes: https://github.com/scylladb/scylladb/issues/19376

Although this is not a regression -- replay was like this since forever -- now that repair calls into batchlog replay, every release which uses repair-based tombstone-gc should get this fix

(cherry picked from commit 4e96e320b4)

(cherry picked from commit 2dd057c96d)

(cherry picked from commit 29f610d861)

(cherry picked from commit 31c0fa07d8)

 Refs #19377

Closes scylladb/scylladb#19502

* github.com:scylladb/scylladb:
  db/batchlog_manager: bypass cache when scanning batchlog table
  db/batchlog_manager: replace open-coded paging with internal one
  db/batchlog_manager: implement cleanup after all batchlog replay
  cql3/query_processor: for_each_cql_result(): move func to the coro frame
2024-06-27 14:46:50 +03:00
Botond Dénes
fa644c6269 Merge '[Backport 6.0] tasks: fix tasks abort' from Aleksandra Martyniuk
Currently if task_manager::task::impl::abort preempts before children are recursively aborted and then the task gets unregistered, we hit use after free since abort uses children vector which is no longer alive.

Modify abort method so that it goes over all tasks in task manager and aborts those with the given parent.
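
The lookup-by-parent approach can be sketched with a toy Python registry (hypothetical names, not the task_manager API). Children are found by scanning the manager's own table, so no per-task children list has to stay alive during the abort:

```python
class ToyTaskManager:
    """Toy registry: task id -> {'parent': id or None, 'aborted': bool}."""

    def __init__(self) -> None:
        self.tasks = {}

    def register(self, task_id, parent=None) -> None:
        self.tasks[task_id] = {"parent": parent, "aborted": False}

    def unregister(self, task_id) -> None:
        self.tasks.pop(task_id, None)

    def abort(self, task_id) -> None:
        task = self.tasks.get(task_id)
        if task is not None:
            task["aborted"] = True
        # Scan all registered tasks instead of using a children vector
        # owned by the (possibly already unregistered) parent.
        for child_id, child in list(self.tasks.items()):
            if child["parent"] == task_id and not child["aborted"]:
                self.abort(child_id)


tm = ToyTaskManager()
tm.register("root")
tm.register("child", parent="root")
tm.register("grandchild", parent="child")
tm.register("unrelated")
tm.unregister("root")  # the parent disappears before the abort runs
tm.abort("root")       # descendants are still found and aborted
```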

Fixes: https://github.com/scylladb/scylladb/issues/19304.

Requires backport to all versions containing task manager

(cherry picked from commit 3463f495b1)

(cherry picked from commit 50cb797d95)

Refs https://github.com/scylladb/scylladb/pull/19305

Closes scylladb/scylladb#19437

* github.com:scylladb/scylladb:
  test: add test for abort while a task is being unregistered
  tasks: fix tasks abort
2024-06-27 14:45:34 +03:00
Botond Dénes
cb4b4fe678 Merge '[Backport 6.0] test_tablets: add test_tablet_storage_freeing' from ScyllaDB
Before work on tablets was completed, it was noticed that — due to some missing pieces of implementation — Scylla doesn't properly close sstables for migrated-away tablets. Because of this, disk space wasn't being reclaimed properly.

Since the missing pieces of implementation were added, the problem should be gone now. This patch adds a test which was used to reproduce the problem earlier. It's expected to pass now, validating that the issue was fixed.

Should be backported to branch-6.0, because the tested problem was also affecting that branch.

Fixes #16946

(cherry picked from commit 7741491b47)

(cherry picked from commit 823da140dd)

 Refs #18906

Closes scylladb/scylladb#19295

* github.com:scylladb/scylladb:
  test_tablets: add test_tablet_storage_freeing
  test: pylib: add get_sstables_disk_usage()
2024-06-27 14:40:06 +03:00
Kamil Braun
aca08bb1d1 Merge '[Backport 6.0] join_token_ring, gossip topology: recalculate sync nodes in wait_alive' from ScyllaDB
The node booting in gossip topology waits until all NORMAL
nodes are UP. If we removed a different node just before,
the booting node could still see it as NORMAL and wait for
it to be UP, which would time out and fail the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
`wait_alive` loop.
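
A minimal Python sketch of the difference (the function and callback names are hypothetical, and the real loop waits on time rather than a step count): the set of nodes to wait for is queried inside the loop, so a node removed in the meantime stops blocking the wait.

```python
def wait_alive(get_normal_nodes, is_up, max_steps: int) -> bool:
    for _ in range(max_steps):
        sync_nodes = get_normal_nodes()  # recalculated on every step
        if all(is_up(node) for node in sync_nodes):
            return True
    return False


# Node "c" is removed from the topology after the first step and never comes up.
views = [{"a", "b", "c"}, {"a", "b"}]


def get_normal_nodes():
    return views.pop(0) if len(views) > 1 else views[0]


up = {"a", "b"}
result = wait_alive(get_normal_nodes, lambda node: node in up, max_steps=3)
```

Computing the set once before the loop would keep waiting for "c" until the step budget ran out.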

Although the issue fixed by this PR caused only test flakiness,
it could also manifest in real clusters. It's best to backport this
PR to 5.4 and 6.0.

Fixes scylladb/scylladb#17526

(cherry picked from commit 017134fd38)

(cherry picked from commit 7735bd539b)

(cherry picked from commit bcc0a352b7)

 Refs #19387

Closes scylladb/scylladb#19419

* github.com:scylladb/scylladb:
  join_token_ring, gossip topology: update obsolete comment
  join_token_ring, gossip topology: fix indendation after previous patch
  join_token_ring, gossip topology: recalculate sync nodes in wait_alive
2024-06-26 12:38:06 +02:00
Yaron Kaikov
9f31426ead .github/workflow: close and replace label when backport promoted
Today, after Mergify opens a backport PR, it stays open until someone manually closes it. We also can't use labels to track which backports were done, since there is no indication of that except digging into the PR and looking for a comment or a commit ref.

The following changes were made in this PR:
* trigger add-label-when-promoted.yaml also when the push was made to `branch-x.y`
* Replace label `backport/x.y` with `backport/x.y-done` in the original PR (this will automatically update the original Issue as well)
* Add a comment on the backport PR and close it

Fixes: https://github.com/scylladb/scylladb/issues/19441
(cherry picked from commit 394cba3e4b)

Closes scylladb/scylladb#19496
2024-06-26 12:42:34 +03:00
Botond Dénes
22622a94ca db/batchlog_manager: bypass cache when scanning batchlog table
Scans should not pollute the cache with cold data, in general. In the
case of the batchlog table, there is another reason to bypass the cache:
this table can have a lot of partition tombstones, which currently are
not purged from the cache. So in certain cases, using the cache can make
batch replay very slow, because it has to scan past tombstones of
already replayed batches.

(cherry picked from commit 31c0fa07d8)
2024-06-26 09:05:14 +00:00
Botond Dénes
35a64856b0 db/batchlog_manager: replace open-coded paging with internal one
query_processor has built-in paging support, no need to open-code paging
in batchlog manager code.

(cherry picked from commit 29f610d861)
2024-06-26 09:05:13 +00:00
Botond Dénes
4e66b3c9ce db/batchlog_manager: implement cleanup after all batchlog replay
We have a commented code snippet from Origin with cleanup and a FIXME to
implement it. Origin flushes the memtables and kicks a compaction. We
only implement the flush here -- the flush will trigger a compaction
check and we leave it up to the compaction manager to decide when a
compaction is worthwhile.
This method used to be called only from unbootstrap, so a cleanup was
not really needed. Now it is also called at the end of repair, if the
table is using repair-based tombstone-gc. If the memtable is filled with
tombstones, this can add a lot of time to the runtime of each repair. So
flush the memtable at the end, so the tombstones can be purged (they
aren't purged from memtables yet).

(cherry picked from commit 2dd057c96d)
2024-06-26 09:05:13 +00:00
Botond Dénes
5e422ceefb cql3/query_processor: for_each_cql_result(): move func to the coro frame
Said method has a func parameter (called just f), which it receives as
an rvalue ref and just uses as a reference. This means that if the caller
doesn't keep the func alive, for_each_cql_result() will run into
use-after-free after the first suspension point. This is unexpected for
callers, who don't expect to have to keep something alive, which they
passed in with std::move().
Adjust the signature to take a value instead; value parameters are moved
to the coro frame and survive suspension points.
Adjust internal callers (query_internal()) the same way.

There are no known vulnerable external callers.

(cherry picked from commit 4e96e320b4)
2024-06-26 09:05:13 +00:00
Michał Jadwiszczak
29c6a4cf44 test/boost/cql_query_test: add test for single-partition aggregation
(cherry picked from commit 8eb5ca8202)
2024-06-25 23:56:49 +02:00
Dawid Medrek
7201efc2f2 db/hints: Initialize endpoint managers only for valid hint directories
Before these changes, it could happen that Scylla initialized
endpoint managers for hint directories representing

* host IDs before migrating hinted handoff to using host IDs,
* IP addresses after the migration.

One scenario looked like this:

1. Start Scylla and upgrade the cluster to using host IDs.
2. Create, by hand, a hint directory representing an IP address.
3. Trigger changing the host filter in hinted handoff; it could
   be achieved by, for example, restricting the set of data
   centers Scylla is allowed to save hints for.

When changing the host filter, we browse the hint directories
and create endpoint managers if we can send hints towards
the node corresponding to a given hint directory. We only
accepted hint directories representing IP addresses
and host IDs. However, we didn't check whether the local node
has already been upgraded to host-ID-based hinted handoff
or not. As a result, endpoint managers were created for
both IP addresses and host IDs, no matter whether we were
before or after the migration.

These changes make sure that any time we browse the hint
directories, we take that into account.

Fixes scylladb/scylladb#19172

(cherry picked from commit c9bb0a4da6)

Closes scylladb/scylladb#19426
2024-06-23 19:32:57 +03:00
Kefu Chai
1b2f10a4e7 sstables: use "me" sstable format by default
in 7952200c, we changed the `selected_format` from `mc` to `me`,
but to be backward compatible the cluster starts with "md", so
when the nodes in cluster agree on the "ME_SSTABLE_FORMAT" feature,
the format selector believes that the node is already using "ME",
which is specified by `_selected_format`, even though it is actually still
using "md", which is specified by `sstable_manager::_format`, as
changed by 54d49c04. as explained above, it was specified to "md"
in hope to be backward compatible when upgrading from an existing
installation which might be still using "md". but after a second
thought, since we are able to read sstables persisted with older
formats, this concern is not valid.

in other words, 7952200c introduced a regression which changed the
"default" sstable format from `me` to `md`.

to address this, we just change `sstable_manager::_format` to "me",
so that all sstables are created using "me" format.

a test is added accordingly.

Fixes #18995
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 5a0d30f345)

Closes scylladb/scylladb#19422
2024-06-23 19:26:53 +03:00
Jenkins Promoter
1f2bbf52cc Update ScyllaDB version to: 6.0.2 2024-06-23 15:15:46 +03:00
Aleksandra Martyniuk
169dfaf037 test: add test for abort while a task is being unregistered
(cherry picked from commit 50cb797d95)
2024-06-22 15:47:03 +02:00
Botond Dénes
cfac9d8bef Merge '[Backport 6.0] Reduce TWCS off-strategy space overhead' from ScyllaDB
Normally, the space overhead for TWCS is 1/N, where N is the number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier.

Reshaping a TWCS table that takes ~50% of available space can result in system running out of space.

That's fixed by restricting every TWCS off-strategy job to 10% of free space in disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.
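
The arithmetic can be spelled out in a few lines of Python. The helper names and the example figures (20 windows, 500 GiB free) are illustrative assumptions, not values taken from the patch:

```python
def twcs_space_overhead(num_windows: int) -> float:
    # Regular TWCS compaction works on one window at a time: overhead is 1/N.
    return 1 / num_windows


def offstrategy_job_limit(free_space_bytes: int) -> int:
    # Each off-strategy job is capped at 10% of the free disk space.
    return free_space_bytes // 10


GIB = 1024 ** 3
overhead = twcs_space_overhead(20)          # 5% with 20 windows
limit = offstrategy_job_limit(500 * GIB)    # 50 GiB per job with 500 GiB free
```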

Fixes #16514.

(cherry picked from commit b8bd4c51c2)

(cherry picked from commit 51c7ee889e)

(cherry picked from commit 0ce8ee03f1)

(cherry picked from commit ace4e5111e)

 Refs #18137

Closes scylladb/scylladb#19404

* github.com:scylladb/scylladb:
  compaction: Reduce twcs off-strategy space overhead to 10% of free space
  compaction: wire storage free space into reshape procedure
  sstables: Allow to get free space from underlying storage
  replica: don't expose compaction_group to reshape task
2024-06-21 20:00:10 +03:00
Anna Stuchlik
aca9d657ca doc: remove the link to Scylladb Google group
The group is no longer active and should be removed from resources.

(cherry picked from commit 027cf3f47d)

Closes scylladb/scylladb#19402
2024-06-21 19:59:35 +03:00
Anna Stuchlik
5fe531fcb2 doc: separate Entrprise- from OSS-only content
This commit adds files that contain Open Source-specific information
and includes these files with the .. scylladb_include_flag:: directive.
The files include a) a link and b) Table of Contents.

The purpose of this update is to enable adding
Open Source/Enterprise-specific information in the Reference section.

(cherry picked from commit 680405b465)

Closes scylladb/scylladb#19395
2024-06-21 19:58:14 +03:00
Botond Dénes
673d49dba3 docs: nodetool status: document keyspace and table arguments
Also fix the example nodetool status invocation.

Fixes: #17840

Closes scylladb/scylladb#18037

(cherry picked from commit 6e3b997e04)

Closes scylladb/scylladb#19394
2024-06-21 19:57:37 +03:00
Botond Dénes
fe9435924a Merge '[Backport 6.0] schema: Make "describe" use extensions to string' from ScyllaDB
Fixes #19334

Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.

Note: required to make enterprise CI happy again.

(cherry picked from commit d27620e146)

(cherry picked from commit 73abc56d79)

 Refs #19337

Closes scylladb/scylladb#19359

* github.com:scylladb/scylladb:
  schema: Make "describe" use extensions to string
  schema_extensions: Add an option to string method
2024-06-21 19:56:02 +03:00
Nadav Har'El
0715038dbe test: unflake test test_alternator_ttl_scheduling_group
This test in topology_experimental_raft/test_alternator.py wants to
check that during Alternator TTL's expiration scans, ALL of the CPU was
used in the "streaming" scheduling group and not in the "statement"
scheduling group. But to allow for some fluke requests (e.g., from the
driver), the test actually allows work in the statement group to be
up to 1% of the work.

Unfortunately, in one test run - a very slow debug+aarch64 run - we
saw the work on the statement group reach 1.4%, failing the test.
I don't know exactly where this work comes from, perhaps the driver,
but before this bug was fixed we saw more than 58% of the work in the
wrong scheduling group, so neither 1% nor 1.4% is a sign that the bug
came back. In fact, let's just change the threshold in the test to 10%,
which is also much lower than the pre-fix value of 58%, so is still a
valid regression test.

Fixes #19307

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9fc70a28ca)

Closes scylladb/scylladb#19333
2024-06-21 19:55:09 +03:00
Patryk Jędrzejczak
e129c4ad43 join_token_ring, gossip topology: update obsolete comment
The code mentioned in the comment has already been added. We change
the comment to prevent confusion.

(cherry picked from commit bcc0a352b7)
2024-06-21 12:05:43 +00:00
Patryk Jędrzejczak
37bb6e0a43 join_token_ring, gossip topology: fix indentation after previous patch
(cherry picked from commit 7735bd539b)
2024-06-21 12:05:43 +00:00
Patryk Jędrzejczak
e5e8b970ed join_token_ring, gossip topology: recalculate sync nodes in wait_alive
Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.

This issue caused scylladb/scylladb#17526.

Fix it by recalculating the nodes to wait for in every step of the
`wait_alive` loop.
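
The difference between a one-time and a per-step calculation can be sketched as follows. This is a toy model and not the ScyllaDB code: `wait_alive` here is a plain Python function, and the `get_normal_nodes` and `is_up` callbacks are hypothetical stand-ins for the node's view of the topology and of liveness.

```python
def wait_alive(get_normal_nodes, is_up, max_steps):
    """Return True once every node we should wait for is up."""
    for _ in range(max_steps):
        # Recompute the set on every step, so a node that was removed in
        # the meantime drops out instead of being waited for forever.
        sync_nodes = get_normal_nodes()
        if all(is_up(node) for node in sync_nodes):
            return True
        # Real code would sleep between steps; omitted in this sketch.
    return False


# Node "b" was removed and never comes up; the view catches up on step 3.
views = iter([{"a", "b"}, {"a", "b"}, {"a"}])
recalculated = wait_alive(lambda: next(views), lambda n: n == "a", max_steps=3)

# With a stale set computed once, the wait can only time out.
stale = wait_alive(lambda: {"a", "b"}, lambda n: n == "a", max_steps=3)
```

In this model the recalculating loop succeeds as soon as the removed node disappears from the view, while the stale variant exhausts its steps.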

(cherry picked from commit 017134fd38)
2024-06-21 12:05:42 +00:00
Michał Jadwiszczak
b275ee9a1c cql3/select_statement: do not parallelize single-partition aggregations
Currently reads with WHERE clause which limits them to be
single-partition reads, are unnecessarily parallelized.

This commit checks this condition and the query doesn't use
forward_service in single-partition reads.

(cherry picked from commit e9ace7c203)
2024-06-21 09:31:39 +00:00
Raphael S. Carvalho
3d9aa9d49e compaction: Reduce twcs off-strategy space overhead to 10% of free space
TWCS off-strategy suffers with 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.

To not penalize TWCS tables, that take a small percentage of disk,
with increased write ampl, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.
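
A minimal sketch of the idea, not the actual compaction code: the function name `plan_reshape_jobs`, the plain integer sizes, and the greedy batching are assumptions made for illustration. It shows why a small table still compacts in one round while a large one is split into several jobs.

```python
def plan_reshape_jobs(sstable_sizes, free_space, fraction=0.10):
    """Greedily batch sstables so each job's input stays within the budget."""
    budget = free_space * fraction
    jobs, current, current_size = [], [], 0
    for size in sstable_sizes:
        # Close the current job if adding this sstable would exceed the budget.
        if current and current_size + size > budget:
            jobs.append(current)
            current, current_size = [], 0
        # An sstable larger than the budget still gets a job of its own.
        current.append(size)
        current_size += size
    if current:
        jobs.append(current)
    return jobs


small_table = plan_reshape_jobs([1, 2, 3], free_space=1000)    # budget is 100
big_table = plan_reshape_jobs([60, 60, 60], free_space=1000)   # budget is 100
```

Here `small_table` is a single job containing all three inputs, and `big_table` needs three rounds because no two inputs fit in the budget together.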

Fixes #16514.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ace4e5111e)
2024-06-20 20:41:41 +00:00
Raphael S. Carvalho
ef72075920 compaction: wire storage free space into reshape procedure
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0ce8ee03f1)
2024-06-20 20:41:41 +00:00
Raphael S. Carvalho
37f1af2646 sstables: Allow to get free space from underlying storage
That will be used in turn to restrict reshape to 10% of available space
in underlying storage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 51c7ee889e)
2024-06-20 20:41:41 +00:00
Raphael S. Carvalho
56f551f740 replica: don't expose compaction_group to reshape task
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b8bd4c51c2)
2024-06-20 20:41:41 +00:00
Calle Wilund
fd59176a73 main/minio_server.py: Respect any preexisting AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY vars
Fixes scylladb/scylla-pkg#3845

Don't overwrite (or rather change) AWS credentials variables if already set in
enclosing environment. Ensures EAR tests for AWS KMS can run properly in CI.
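
One way to express "don't overwrite if already set" is `dict.setdefault`. The sketch below is an assumption about the approach, not the contents of minio_server.py; `build_env` and its arguments are hypothetical names.

```python
def build_env(base_env, default_key, default_secret):
    """Copy the environment, filling in credentials only when absent."""
    env = dict(base_env)
    # setdefault() leaves an existing value untouched.
    env.setdefault("AWS_ACCESS_KEY_ID", default_key)
    env.setdefault("AWS_SECRET_ACCESS_KEY", default_secret)
    return env


ci_env = build_env({"AWS_ACCESS_KEY_ID": "real-key"}, "test-key", "test-secret")
local_env = build_env({}, "test-key", "test-secret")
```

In `ci_env` the preexisting access key survives and only the missing secret is filled in; in `local_env` both defaults apply.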

v2:
* Allow environment variables in reading obj storage config - allows CI to
  use real credentials in env without risking putting them into less secure
  files
* Don't write credentials info from miniserver into config, instead use said
  environment vars to propagate creds.

v3:
* Fix python launch scripts to not clear environment, thus retaining above aws envs.

(cherry picked from commit 5056a98289)

Closes scylladb/scylladb#19330
2024-06-20 18:08:51 +03:00
Aleksandra Martyniuk
634c0d44ef tasks: fix tasks abort
Currently if task_manager::task::impl::abort preempts before children
are recursively aborted and then the task gets unregistered, we hit
use after free since abort uses children vector which is no
longer alive.

Modify abort method so that it goes over all tasks in task manager
and aborts those with the given parent.
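
The modified traversal can be sketched like this. It is a simplified model under stated assumptions: `Task`, the `all_tasks` registry dict, and the `abort` function are hypothetical stand-ins, and preemption is not modelled.

```python
class Task:
    def __init__(self, task_id, parent_id=None):
        self.task_id = task_id
        self.parent_id = parent_id
        self.aborted = False


def abort(all_tasks, task_id):
    """Abort a task and, recursively, every registered task it is the parent of."""
    task = all_tasks.get(task_id)
    if task is None or task.aborted:
        return
    task.aborted = True
    # Children are looked up in the manager's registry rather than in a
    # per-task children list, so an unregistered task is simply not found.
    for child in list(all_tasks.values()):
        if child.parent_id == task_id:
            abort(all_tasks, child.task_id)


tasks = {t.task_id: t for t in [Task(1), Task(2, 1), Task(3, 2), Task(4)]}
abort(tasks, 1)
```

Tasks 1, 2 and 3 end up aborted, while the unrelated task 4 is left alone.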

Fixes: #19304.
(cherry picked from commit 3463f495b1)
2024-06-20 14:47:14 +00:00
Botond Dénes
3baa68f8af Merge '[Backport 6.0] doc: add 6.x.y to 6.x.z and remove 5.x.y to 5.x.z upgrade guide' from ScyllaDB
This PR removes the 5.x.y to 5.x.z upgrade guide and adds the 6.x.y to 6.x.z upgrade guide.

The previous maintenance upgrade guides, such as from 5.x.y to 5.x.z, consisted of several documents - separate for each platform.
The new 6.x.y to 6.x.z upgrade guide is one document - there are tabs to include platform-specific information (we've already done it for other upgrade guides as one generic document is more convenient to use and maintain).

I did not modify the procedures. At some point, they have been reviewed for previous upgrade guides.

Fixes https://github.com/scylladb/scylladb/issues/19322

-  This PR must be backported to branch-6.0, as it adds 6.x specific content.

(cherry picked from commit ead201496d)

(cherry picked from commit ea35982764)

 Refs #19340

Closes scylladb/scylladb#19360

* github.com:scylladb/scylladb:
  doc: remove the 5.x.y to 5.x.z upgrade guide
  doc: add the 6.x.y to 6.x.z upgrade guide-6
2024-06-19 14:36:54 +03:00
Gleb Natapov
0e49180cef topology coordinator: add more trace level logging for debugging
Add more logging that provide more visibility into what happens during
topology loading.

Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com>

(cherry picked from commit fb764720d3)
2024-06-18 16:38:51 +02:00
Anna Stuchlik
a97a074813 doc: remove the 5.x.y to 5.x.z upgrade guide
This commit removes the upgrade guide from 5.x.y to 5.x.z.
It is redundant in version 6.x.

(cherry picked from commit ea35982764)
2024-06-18 14:13:57 +00:00
Anna Stuchlik
e869eae5fa doc: add the 6.x.y to 6.x.z upgrade guide-6
This commit adds the upgrade guide from 6.x.y to 6.x.z.

(cherry picked from commit ead201496d)
2024-06-18 14:13:57 +00:00
Calle Wilund
dd4f483668 schema: Make "describe" use extensions to string
Fixes #19334

Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.

(cherry picked from commit 73abc56d79)
2024-06-18 14:13:51 +00:00
Calle Wilund
d18be9a7dc schema_extensions: Add an option to string method
Allow an extension to describe itself as the CQL property
string that created it (and is serialized to schema tables)

Only paxos extension requires override.

(cherry picked from commit d27620e146)
2024-06-18 14:13:51 +00:00
Botond Dénes
6682b50868 Merge '[Backport 6.0] doc: document keyspace and table for nodetool ring' from ScyllaDB
these two arguments are critical when tablets are enabled.

Fixes https://github.com/scylladb/scylladb/issues/19296

---

6.0 is the first release with tablets support. and `nodetool ring` is an important tool to understand the data distribution. so we need to backport this document change to 6.0

(cherry picked from commit aef1718833)

(cherry picked from commit ea3b8c5e4f)

 Refs #19297

Closes scylladb/scylladb#19309

* github.com:scylladb/scylladb:
  doc: document `keyspace` and `table` for `nodetool ring`
  doc: replace tab with space
2024-06-17 09:33:34 +03:00
Wojciech Mitros
d70cf46af0 mv: replicate the gossiped backlog to all shards
On each shard of each node we store the view update backlogs of
other nodes to, depending on their size, delay responses to incoming
writes, lowering the load on these nodes and helping them get their
backlog to normal if it were too high.

These backlogs are propagated between nodes in two ways: the first
one is adding them to replica write responses. The second one
is gossiping any changes to the node's backlog every 1s. The gossip
becomes useful when writes stop to some node for some time and we
stop getting the backlog using the first method, but we still want
to be able to select a proper delay for new writes coming to this
node. It will also be needed for the mv admission control.

Currently, the backlog is gossiped from shard 0, as expected.
However, we also receive the backlog only on shard 0 and only
update this shard's backlogs for the other node. Instead, we'd
want to have the backlogs updated on all shards, allowing us
to use proper delays also when requests are received on shards
different than 0.

This patch changes the backlog update code, so that the backlogs
on all shards are updated instead. This will only be performed
up to once per second for each other node, and is done with
a lower priority, so it won't severely impact other work.

Fixes: scylladb/scylladb#19232
(cherry picked from commit d31437b589)

Closes scylladb/scylladb#19302
2024-06-17 09:32:29 +03:00
Botond Dénes
869f2637b8 Merge '[Backport 6.0] Fix usage of utils/chunked_vector::reserve_partial' from ScyllaDB
utils/chunked_vector::reserve_partial: fix usage in callers

The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.

Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.

This PR updates the method comment and all the callers to use the
right way.

Fixes #19254

(cherry picked from commit 64768b58e5)

(cherry picked from commit 29f036a777)

(cherry picked from commit 0a22759c2a)

(cherry picked from commit d4f8b91bd6)

(cherry picked from commit 310c5da4bb)

(cherry picked from commit 83190fa075)

(cherry picked from commit c49f6391ab)

 Refs #19279

Closes scylladb/scylladb#19310

* github.com:scylladb/scylladb:
  utils/large_bitset: remove unused includes identified by clangd
  utils/large_bitset: use thread::maybe_yield()
  test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
  utils/lsa/chunked_managed_vector: fix reserve_partial()
  utils/chunked_vector: return void from reserve_partial and make_room
  test/boost/chunked_vector_test: fix testcase tests_reserve_partial
  utils/chunked_vector::reserve_partial: fix usage in callers
2024-06-17 09:31:28 +03:00
Israel Fruchter
d79f9156ed Update tools/cqlsh submodule v6.0.20
* tools/cqlsh c8158555...0d58e5ce (6):
  > cqlsh.py: fix server side describe after login command
  > cqlsh: try server-side DESCRIBE, then client-side
  > Refactor tests to accept both client and server side describe
  > github actions: support testing with enterprise release
  > Add the tab-completion support of SERVICE_LEVEL statements
  > reloc/build_reloc.sh: don't use `--no-build-isolation`

Closes: scylladb/scylladb#18989
(cherry picked from commit 1fd600999b)

Closes scylladb/scylladb#19132
2024-06-17 09:05:46 +03:00
Lakshmi Narayanan Sreethar
2fa4cb69b6 utils/large_bitset: remove unused includes identified by clangd
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit c49f6391ab)
2024-06-14 15:48:57 +00:00
Lakshmi Narayanan Sreethar
87397f43f6 utils/large_bitset: use thread::maybe_yield()
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 83190fa075)
2024-06-14 15:48:57 +00:00
Lakshmi Narayanan Sreethar
e64e659ef1 test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
Update the maximum size tested by the testcase. The test always created
only one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (12.8 KB). So, use twice the
max_chunk_capacity as the test size distribution upper limit to verify
that partial_reserve can reserve multiple chunks.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 310c5da4bb)
2024-06-14 15:48:57 +00:00
Kefu Chai
fb9a1b4e38 doc: document keyspace and table for nodetool ring
these two arguments are critical when tablets are enabled.

Fixes #19296
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit ea3b8c5e4f)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
397b04b2a4 utils/lsa/chunked_managed_vector: fix reserve_partial()
Fix the method comment and return types of chunked_managed_vector's
reserve_partial() similar to chunked_vector's reserve_partial() as it
has the same issues mentioned in #19254. Also update the usage in the
chunked_managed_vector_test.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d4f8b91bd6)
2024-06-14 15:48:56 +00:00
Kefu Chai
8f3d693c8a doc: replace tab with space
more consistent this way, also easier to format in a regular editor
without additional setup.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit aef1718833)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
8dc662ebde utils/chunked_vector: return void from reserve_partial and make_room
Since reserve_partial does not depend on the number of remaining
capacity to be reserved, there is no need to return anything from it and
the make_room method.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 0a22759c2a)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
4e68599b17 test/boost/chunked_vector_test: fix testcase tests_reserve_partial
Fix the usage of reserve_partial in the testcase. Also update the
maximum chunk size used by the testcase. The test always created only
one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (128 KB). So, use smaller chunk size,
512 bytes, to verify that partial_reserve can reserve multiple chunks.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 29f036a777)
2024-06-14 15:48:56 +00:00
Lakshmi Narayanan Sreethar
7072f7e706 utils/chunked_vector::reserve_partial: fix usage in callers
The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.

Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.
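
The corrected calling pattern can be illustrated with a toy model. `ChunkedVector` below is not utils::chunked_vector; the "at most one chunk per call" behaviour and the `reserve_fully` helper are assumptions made only to show the loop shape.

```python
class ChunkedVector:
    """Toy model: each reserve_partial() call adds at most one chunk of capacity."""

    def __init__(self, max_chunk_capacity):
        self.max_chunk_capacity = max_chunk_capacity
        self._capacity = 0

    def capacity(self):
        return self._capacity

    def reserve_partial(self, n):
        # Move towards the target n by at most one chunk; nothing is returned.
        if self._capacity < n:
            self._capacity += min(self.max_chunk_capacity, n - self._capacity)


def reserve_fully(vec, n):
    """Call reserve_partial() with the intended size until capacity reaches it."""
    calls = 0
    while vec.capacity() < n:
        vec.reserve_partial(n)
        calls += 1  # real code would give the scheduler a chance to run here
    return calls


v = ChunkedVector(max_chunk_capacity=4)
calls = reserve_fully(v, 10)
```

The loop condition depends only on the vector's capacity, which is why the method does not need to return a remaining count.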

This commit updates the method comment and all the callers to use the
right way.

Fixes #19254

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 64768b58e5)
2024-06-14 15:48:56 +00:00
Kefu Chai
57980b77d3 test: test_topology_ops: adapt to tablets
in e7d4e080, we reenabled the background writes in this test, but
when running with tablets enabled, background writes are still
disabled because of #17025, which was fixed last week. so we can
enable background writes with tablets.

in this change,

* background writes are enabled with tablets.
* increase the number of nodes by 1 so that we have enough nodes
  to fulfill the needs of tablets, which enforces that the number
  of replicas should always satisfy RF.
* pass rf to `start_writes()` explicitly, so we have less
  magic numbers in the test, and make the data dependencies
  more obvious.

Fixes #17589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 77f0259a63)

Closes scylladb/scylladb#19184
2024-06-14 15:54:36 +03:00
Botond Dénes
5139e74058 Merge '[Backport 6.0] Improve handling of outdated --experimental-features' from ScyllaDB
Some time ago it turned out that if an unrecognized feature name is met in scylla.yaml, the whole experimental features list is ignored, but scylla continues to boot. There's an UNUSED feature, which is the proper way to deprecate a feature, and this PR improves its handling in several ways.

1. The recently removed "tablets" feature is partially brought back, but marked as UNUSED
2. Any UNUSED features met while parsing are printed into logs
3. The enum_option<> helper is enlightened along the way

refs: #18968

(cherry picked from commit f56cdb1cac)

(cherry picked from commit 0c0a7d9b9a)

(cherry picked from commit b85a02a3fe)

(cherry picked from commit b2520b8185)

 Refs #19230

Closes scylladb/scylladb#19266

* github.com:scylladb/scylladb:
  config: Mark tablets feature as unused
  main: Warn unused features
  enum_option: Carry optional key on board
  enum_option: Remove on-board _map member
2024-06-14 15:43:17 +03:00
Michał Chojnowski
ddcaefefdc test_tablets: add test_tablet_storage_freeing
Tests that tablet storage is freed after it is migrated away.

Fixes #16946

(cherry picked from commit 823da140dd)
2024-06-14 10:19:32 +00:00
Michał Chojnowski
f466dcfa5f test: pylib: add get_sstables_disk_usage()
Adds a utility for measuring the disk usage of the given table on the given
node.
Will be used in a follow-up patch for testing that sstables are freed
properly.

(cherry picked from commit 7741491b47)
2024-06-14 10:19:32 +00:00
Benny Halevy
6122f9454d storage_service: join_token_ring: reject replace on different dc or rack
Do not allow replacing a node on one dc/rack
with a node on a different dc/rack as this violates
the assumption of replace node operation that
all token ranges previously owned by the dead
node would be rebuilt on the new node.

Fixes #16858
Refs scylladb/scylla-enterprise#3518

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 34dfa4d3a3)

Closes scylladb/scylladb#19281
2024-06-14 07:43:58 +03:00
Botond Dénes
b18d9e5d0d Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB
This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken.
This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure around the failure injection API, namely the ability to exfiltrate the value of an internal variable. This infrastructure is also added in this series.

Fixes: https://github.com/scylladb/scylladb/issues/18674

- [x] This patch has to be backported because it fixes broken functionality

(cherry picked from commit dbccb61636)

(cherry picked from commit 4590026b38)

(cherry picked from commit feea609e37)

(cherry picked from commit 0c61b1822c)

(cherry picked from commit 8ef4fbdb87)

 Refs #18705

Closes scylladb/scylladb#19240

* github.com:scylladb/scylladb:
  test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
  test/pylib: rest_client: add get_injection()
  api/error_injection: add getter for error_injection
  utils/error_injection: add set_parameter()
  replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
2024-06-13 12:45:23 +03:00
Kamil Braun
cb6a97d0dc raft: fsm: add details to on_internal_error_noexcept message
If we receive a message in the same term but from a different leader
than we expect, we print:
```
Got append request/install snapshot/read_quorum from an unexpected leader
```
For some reason the message did not include the details (who the leader
was and who the sender was) which requires almost zero effort and might
be useful for debugging. So let's include them.

Ref: scylladb/scylla-enterprise#4276
(cherry picked from commit 99a0599e1e)

Closes scylladb/scylladb#19265
2024-06-13 11:25:11 +02:00
Wojciech Mitros
813fef44d3 exceptions: make view update timeouts inherit from timed_out_error
Currently, when generating and propagating view updates, if we notice
that we've already exceeded the time limit, we throw an exception
inheriting from `request_timeout_exception`, to later catch and
log it when finishing request handling. However, when catching, we
only check timeouts by matching the `timed_out_error` exception,
so the exception thrown in the view update code is not registered
as a timeout exception, but an unknown one. This can cause tests
which were based on the log output to start failing, as in the past
we were noticing the timeout at the end of the request handling
and using the `timed_out_error` to keep processing it and now, even
though we do notice the timeout even earlier, due to its type we
log an error to the log, instead of treating it as a regular timeout.
In this patch we make the error thrown on timeout during view updates
inherit from `timed_out_error` instead of the `request_timeout_exception`
(it is also moved from the "exceptions" directory, where we define
exceptions returned to the user).
Aside from helping with the issue described above, we also improve our
metrics, as the `request_timeout_exception` is also not checked for
in the `is_timeout_exception` method, and because we're using it to
check whether we should update write timeout metrics, they will only
start getting updated after this patch.
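
The effect of changing the base class can be sketched with stand-in exception types. The Python classes below only mirror the names in this message; `classify` and the two view-update classes are hypothetical, and the real check lives in C++.

```python
class TimedOutError(Exception):
    """Stand-in for timed_out_error."""


class RequestTimeoutException(Exception):
    """Stand-in for request_timeout_exception; unrelated to TimedOutError."""


class ViewUpdateTimeoutOld(RequestTimeoutException):
    """How the view update timeout was derived before the patch."""


class ViewUpdateTimeoutNew(TimedOutError):
    """How the view update timeout is derived after the patch."""


def is_timeout_exception(exc):
    # Only the timed_out_error family is recognized as a timeout.
    return isinstance(exc, TimedOutError)


def classify(exc):
    return "timeout" if is_timeout_exception(exc) else "unknown error"


before = classify(ViewUpdateTimeoutOld())
after = classify(ViewUpdateTimeoutNew())
```

With the old base class the handler falls through to the error branch; with the new one the same condition is treated as a regular timeout.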

Fixes #19261
(cherry picked from commit 4aa7ada771)

Closes scylladb/scylladb#19262
2024-06-13 12:01:12 +03:00
Botond Dénes
1c67c6cf78 Merge '[Backport 6.0] test: memtable_test: increase unspooled_dirty_soft_limit ' from ScyllaDB
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessarily the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadic test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.

after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.

Fixes https://github.com/scylladb/scylladb/issues/19034

---

the issue applies to both 5.4 and 6.0, and this issue hurts the CI stability, hence we should backport it.

(cherry picked from commit 2df4e9cfc2)

(cherry picked from commit 223fba3243)

 Refs #19252

Closes scylladb/scylladb#19258

* github.com:scylladb/scylladb:
  test: memtable_test: increase unspooled_dirty_soft_limit
  test: memtable_test: replace BOOST_ASSERT with BOOST_REQUIRE
2024-06-13 07:26:43 +03:00
Pavel Emelyanov
5811df4d4b config: Mark tablets feature as unused
This feature used to be there for a while, but then it was removed by
83d491af02. This patch partially takes it
back, but maps to UNUSED, so that if met in config, it's warned, but
other features are parsed as well.

refs: #18968

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b2520b8185)
2024-06-12 18:35:32 +00:00
Pavel Emelyanov
cb9d6e080c main: Warn unused features
When seeing an UNUSED feature -- print it into log. This is where the
enum_option::key is in use. The thing is that experimental features map
different unused feature names into the single UNUSED feature enum
value, so once the feature is parsed its configured name only persists
in the option's key member (saved by previous patch).
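
Why the key has to be carried along can be sketched as follows. This is an illustrative model, not the enum_option<> template: the `Feature` enum, `FEATURE_MAP`, the "retired-feature" name, and the warning text are all hypothetical.

```python
from enum import Enum


class Feature(Enum):
    UNUSED = 0
    UDF = 1


# Hypothetical name-to-value map: several retired names share one UNUSED value.
FEATURE_MAP = {
    "udf": Feature.UDF,
    "tablets": Feature.UNUSED,
    "retired-feature": Feature.UNUSED,
}


class EnumOption:
    def __init__(self, key):
        self.key = key                 # the name as written in the config
        self.value = FEATURE_MAP[key]  # the enum value it maps to


def unused_feature_warnings(configured):
    options = [EnumOption(k) for k in configured]
    # The value alone cannot say which name was configured; the key can.
    return [f"Ignoring unused feature '{o.key}'"
            for o in options if o.value is Feature.UNUSED]


warnings = unused_feature_warnings(["udf", "tablets"])
```

Because both retired names collapse to the same value, the warning can only name the offending entry by reading the stored key.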

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit b85a02a3fe)
2024-06-12 18:35:32 +00:00
Pavel Emelyanov
86068790ec enum_option: Carry optional key on board
It facilitates option formatting, but the main purpose is to be able to
find out the exact keys, not values, later (see next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 0c0a7d9b9a)
2024-06-12 18:35:31 +00:00
Pavel Emelyanov
3501ede024 enum_option: Remove on-board _map member
The map in question is immutable and can be obtained from the Mapper type
at any time, there's no need in keeping its copy on each enum_option

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit f56cdb1cac)
2024-06-12 18:35:31 +00:00
Anna Stuchlik
bc89aac9d0 doc: reorganize ToC of the Reference section
This commit adds a proper ToC to the Reference section to improve
how it renders.

(cherry picked from commit 63084c6798)

Closes scylladb/scylladb#19257
2024-06-12 19:12:53 +02:00
Kefu Chai
b39c0a1d15 test: memtable_test: increase unspooled_dirty_soft_limit
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessarily the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadic test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.

after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.

Fixes #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 223fba3243)
2024-06-12 15:44:11 +00:00
Kefu Chai
548fd01bd4 test: memtable_test: replace BOOST_ASSERT with BOOST_REQUIRE
before this change, we verify the behavior of design under test using
`BOOST_ASSERT()`, which is a wrapper around `assert()`, so if a test
fails, the test just aborts. this is not very helpful for postmortem
debugging.

after this change, we use `BOOST_REQUIRE` macro for verifying the
behavior, so that Boost.Test prints out the condition if it does not
hold when we test it.

Refs #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 2df4e9cfc2)
2024-06-12 15:44:11 +00:00
Pavel Emelyanov
2306c3b522 test: Reduce failure detector timeout for failed tablets migration test
This test spends most of its time waiting for a node to die. Reducing the timeout makes it almost 3x faster.

Was
  real	9m21,950s
  user	1m11,439s
  sys	1m26,022s

Now
  real	3m37,780s
  user	0m58,439s
  sys	1m13,698s

refs: #17764

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a4e8f9340a)

Closes scylladb/scylladb#19233
2024-06-12 10:02:45 +03:00
Tomasz Grabiec
6d90ff84d9 Merge '[Backport 6.0] tablets: Filter-out left nodes in get_natural_endpoints()' from ScyllaDB
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".

Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:

  storage_proxy - No mapping for :: in the passed effective replication map

It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so no one should need to contact it.
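
The argument can be checked with a small calculation. The sketch assumes the usual quorum formula of floor(RF / 2) + 1; `quorum` and `natural_endpoints` are hypothetical helpers, not the storage_proxy API.

```python
def quorum(rf):
    """Quorum derived from the keyspace replication factor."""
    return rf // 2 + 1


def natural_endpoints(replicas, left_nodes):
    """Replica list with nodes in the left state filtered out."""
    return [r for r in replicas if r not in left_nodes]


replicas = ["n1", "n2", "n3"]
live = natural_endpoints(replicas, left_nodes={"n3"})
required = quorum(rf=3)
```

The filtered list has two entries, yet the required count is still computed from RF = 3, so dropping the left node does not weaken the quorum.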

Users which need replica list stability should use the host_id-based API.

Fixes #18843

(cherry picked from commit 3e1ba4c859)

(cherry picked from commit 0d596a425c)

 Refs #18955

Closes scylladb/scylladb#19143

* github.com:scylladb/scylladb:
  tablets: Filter-out left nodes in get_natural_endpoints()
  test: pylib: Extract start_writes() load generator utility
2024-06-12 01:31:38 +02:00
Botond Dénes
0d13c51dd4 test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
Prevent the live-update feature of this config item from breaking
silently.

(cherry picked from commit 8ef4fbdb87)
2024-06-11 17:32:37 +00:00
Botond Dénes
d4563e2b28 test/pylib: rest_client: add get_injection()
The /v2/error_injection/{injection} endpoint now has a GET method too,
expose this.

(cherry picked from commit 0c61b1822c)
2024-06-11 17:32:37 +00:00
Botond Dénes
bb18a8152e api/error_injection: add getter for error_injection
Allow external code to obtain information about an error injection
point, including whether it is enabled, and importantly, what its
parameters are. Together with the `set_parameter()` added in the
previous patch, this allows tests to read out the values of internal
parameters, via a set_parameter() injection point.

(cherry picked from commit feea609e37)
2024-06-11 17:32:37 +00:00
Botond Dénes
1947290c74 utils/error_injection: add set_parameter()
Allow injection points to write values into the parameter map, which
external code can then examine. This allows exfiltrating the values of
internal variables, to be examined by tests, without exposing these
variables via an "official" path.

(cherry picked from commit 4590026b38)
2024-06-11 17:32:36 +00:00
Botond Dénes
d121fc1264 replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
This config item is propagated to the table object via table::config.
Although the field in table::config, used to propagate the value, was
utils::updateable_value<T>, it was assigned a constant and so the
live-update chain was broken.
This patch fixes this.
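
How assigning a constant breaks the chain can be modelled in a few lines. `UpdateableValue` below is a simplified stand-in for utils::updateable_value<T>, and the `config` dict with its `enable_compacting` key is hypothetical.

```python
class UpdateableValue:
    """Toy model: wraps a callable that yields the current value."""

    def __init__(self, source):
        self._source = source

    def get(self):
        return self._source()


config = {"enable_compacting": True}

# Broken: the value is captured once, so later config changes are invisible.
snapshot = config["enable_compacting"]
broken = UpdateableValue(lambda: snapshot)

# Fixed: the value is read from the live config on every access.
live = UpdateableValue(lambda: config["enable_compacting"])

config["enable_compacting"] = False
broken_value = broken.get()
live_value = live.get()
```

After the config change, the copy initialized from a constant still reports the old value, while the one wired to the live source follows the update.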

(cherry picked from commit dbccb61636)
2024-06-11 17:32:36 +00:00
Michał Chojnowski
80ac0da11c storage_proxy: avoid infinite growth of _throttled_writes
storage_proxy has a throttling mechanism which attempts to limit the number
of background writes by forcefully raising CL to ALL
(it's not implemented exactly like that, but that's the effect) when
the amount of background and queued writes is above some fixed threshold.
If this is applied to a write, it becomes "throttled",
and its ID is appended to _throttled_writes.

Whenever the amount of background and queued writes falls below the threshold,
writes are "unthrottled" — some IDs are popped from _throttled_writes
and the writes represented by these IDs — if their handlers still exist —
have their CL lowered back.

The problem here is that IDs are only ever removed from _throttled_writes
if the number of queued and background writes falls below the threshold.
But this doesn't have to happen in any finite time, if there's constant write
pressure. And in fact, in one load test, it hasn't happened in 3 hours,
eventually causing the buffer to grow into gigabytes and trigger OOM.
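
The growth pattern described above can be reproduced with a toy model. This is
a minimal sketch, not the storage_proxy code: `ToyProxy`, `THRESHOLD` and the
method names are hypothetical, and the model only captures the rule that IDs
leave the queue solely when load drops below the threshold.

```python
from collections import deque

THRESHOLD = 10  # hypothetical limit on background + queued writes


class ToyProxy:
    """Toy model of the throttling bookkeeping (not ScyllaDB code)."""

    def __init__(self):
        self.in_flight = 0               # background + queued writes
        self.throttled_writes = deque()  # IDs of throttled writes
        self.next_id = 0

    def start_write(self):
        write_id = self.next_id
        self.next_id += 1
        self.in_flight += 1
        if self.in_flight > THRESHOLD:
            # The write is throttled and its ID is remembered.
            self.throttled_writes.append(write_id)
        return write_id

    def finish_write(self):
        self.in_flight -= 1
        # IDs are popped only once load is below the threshold.
        while self.in_flight < THRESHOLD and self.throttled_writes:
            self.throttled_writes.popleft()


def simulate_constant_pressure(steps):
    """Keep load above the threshold and return the queue length."""
    proxy = ToyProxy()
    for _ in range(THRESHOLD + 1):   # push load just above the threshold
        proxy.start_write()
    for _ in range(steps):           # one write in, one write out, forever
        proxy.start_write()
        proxy.finish_write()
    return len(proxy.throttled_writes)
```

In this model every step under sustained pressure adds one ID and removes
none, so the queue length grows linearly with the number of writes.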

This patch is intended to be a good-enough-in-practice fix for the problem.

Fixes #17476
Fixes #1834

(cherry picked from commit fee48f67ef)

Closes scylladb/scylladb#19180
2024-06-11 18:33:38 +03:00
Raphael S. Carvalho
d4c3a43b34 replica: Refresh mutation source when allocating tablet replicas
Consider the following:

1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to 2.
3) migration is at write_both_read_old stage
4) coordinator will push writes to both nodes (pending and leaving)
5) A has view, so writes to it will also result in reads (table::push_view_replica_updates())
6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in)
7) so the read in step 5 is not able to find the sstable set for the tablet migrating in

Causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"

which means loss of write on pending replica.

The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions for example.

Fixes #19052.
Fixes #19033.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 7b41630299)

Closes scylladb/scylladb#19229
2024-06-11 18:12:43 +03:00
Kefu Chai
31ba5561e7 build: remove coverage compiling options from the cxx_flags
in 44e85c7d, we remove coverage compiling options from the cflags
when building abseil. but in 535f2b21, these options were brought
back as parts of cxx_flags.

so we need to remove them again from cxx_flags.
Fixes #19219
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit d05db52d11)

Closes scylladb/scylladb#19237
2024-06-11 18:11:35 +03:00
Tomasz Grabiec
7479167af2 tablets: Filter-out left nodes in get_natural_endpoints()
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".

Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:

  storage_proxy - No mapping for :: in the passed effective replication map

It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so no one should need to contact it.
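
The point about quorum can be sketched as follows. This is an illustration
only: `quorum` uses the usual majority formula, and `required_acks` is a
hypothetical helper, not a storage_proxy function.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas for a given replication factor."""
    return replication_factor // 2 + 1


def required_acks(keyspace_rf: int, replica_list: list) -> int:
    """Number of acknowledgements needed for a QUORUM operation.

    The replica list is deliberately ignored: a left node that is
    missing from the list must not lower the requirement.
    """
    return quorum(keyspace_rf)
```

With RF=3 and a replica list that omits a left node, `required_acks(3, ["n1", "n2"])`
still evaluates to 2, so filtering the list does not weaken consistency.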

Users which need replica list stability should use the host_id-based API.

Fixes #18843

(cherry picked from commit 0d596a425c)
2024-06-11 12:18:17 +02:00
Tomasz Grabiec
e35ab96f8b test: pylib: Extract start_writes() load generator utility
(cherry picked from commit 3e1ba4c859)
2024-06-11 12:18:17 +02:00
Guilherme Nogueira
1ace370ecd Remove comma that breaks CQL DML on tablets.rst
The current sample reads:

```cql
CREATE KEYSPACE my_keyspace
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'replication_factor': 3,
} AND tablets = {
    'enabled': false
};
```

The additional comma after `'replication_factor': 3` breaks the query execution.

(cherry picked from commit cf157e4423)

Closes scylladb/scylladb#19194
2024-06-10 20:24:22 +03:00
Kefu Chai
3e7de910ab docs: correct the link pointing to Scylla U
before this change it points to
https://university.scylladb.com/courses/scylla-operations/lessons/change-data-capture-cdc/
which then redirects the browser to
https://university.scylladb.com/courses/scylla-operations/,
but it should have pointed to
https://university.scylladb.com/courses/data-modeling/lessons/change-data-capture-cdc/

in this change, the hyperlink is corrected.

Fixes #19163
Refs 6e97b83b60
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit b5dce7e3d0)

Closes scylladb/scylladb#19198
2024-06-10 20:23:08 +03:00
Kefu Chai
9cf0d618d0 build: populate cxxflags to abseil
before this change, when building abseil, we don't pass cxxflags
to the compiler, and abseil libraries are built with the default
optimization level. in the case of clang, the default optimization
level is `-O0`, which compiles the fastest but does not optimize
the emitted code for runtime performance. but we
expect good performance for the release build. a typical command line
for building abseil looks like
```
clang++  -I/home/kefu/dev/scylladb/master/abseil -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -std=gnu++20 -Wall -Wextra -Wcast-qual -Wconversion -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wfor-loop-analysis -Wformat-security -Wgnu-redeclared-enum -Winfinite-recursion -Winvalid-constexpr -Wliteral-conversion -Wmissing-declarations -Woverlength-strings -Wpointer-arith -Wself-assign -Wshadow-all -Wshorten-64-to-32 -Wsign-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-zero-compare -Wundef -Wuninitialized -Wunreachable-code -Wunused-comparison -Wunused-local-typedefs -Wunused-result -Wvla -Wwrite-strings -Wno-float-conversion -Wno-implicit-float-conversion -Wno-implicit-int-float-conversion -Wno-unknown-warning-option -DNOMINMAX -MD -MT absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o -MF absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o.d -o absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o -c /home/kefu/dev/scylladb/master/abseil/absl/base/internal/scoped_set_env.cc
```

so, in this change, we populate cxxflags to abseil, so that the
per-mode `-O` option can be populated when building abseil.

after this change, the command line building abseil in release mode
looks like

```
clang++  -I/home/kefu/dev/scylladb/master/abseil -ffunction-sections -fdata-sections  -O3 -mllvm -inline-threshold=2500 -fno-slp-vectorize -DSCYLLA_BUILD_MODE=release -g -gz -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -std=gnu++20 -Wall -Wextra -Wcast-qual -Wconversion -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wfor-loop-analysis -Wformat-security -Wgnu-redeclared-enum -Winfinite-recursion -Winvalid-constexpr -Wliteral-conversion -Wmissing-declarations -Woverlength-strings -Wpointer-arith -Wself-assign -Wshadow-all -Wshorten-64-to-32 -Wsign-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-zero-compare -Wundef -Wuninitialized -Wunreachable-code -Wunused-comparison -Wunused-local-typedefs -Wunused-result -Wvla -Wwrite-strings -Wno-float-conversion -Wno-implicit-float-conversion -Wno-implicit-int-float-conversion -Wno-unknown-warning-option -DNOMINMAX -MD -MT absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o -MF absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o.d -o absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o -c /home/kefu/dev/scylladb/master/abseil/absl/flags/internal/commandlineflag.cc
```

Refs 0b0e661a85
Fixes #19161
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 535f2b2134)

Closes scylladb/scylladb#19200
2024-06-10 20:22:00 +03:00
Nadav Har'El
4810937ddf test/alternator: fix flaky test test_item_latency
The Alternator test test_metrics.py::test_item_latency confirms that
for several operation types (PutItem, GetItem, DeleteItem, UpdateItem)
we did not forget to measure their latencies.

The test checked that a latency was updated by checking that two metrics
increase:
    scylla_alternator_op_latency_count
    scylla_alternator_op_latency_sum

However, it turns out that the "sum" is only an approximate sum of all
latencies, and when the total sum grows large it sometimes does *not*
increase when a short latency is added to the statistics. When this
happens, this test fails on the assertion that the "sum" increases after
an operation. We saw this happening sometimes in CI runs.

The simple fix is to stop checking _sum at all, and only verify that
the _count increases - this is really an integer counter that
unconditionally increases when a latency is added to the histogram.
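
The difference between the two metrics can be illustrated with a toy
accumulator. This is a sketch under a loud assumption: the message above does
not say why the sum is approximate, and floating-point absorption is used here
only as one well-known way an accumulated sum can fail to increase.
`ToyLatencyStats` is a hypothetical class, not the Alternator histogram.

```python
class ToyLatencyStats:
    """Toy latency accumulator (assumption: the sum is a float)."""

    def __init__(self, count: int = 0, total: float = 0.0):
        self.count = count   # integer counter, always exact
        self.total = total   # floating-point sum, only approximate

    def add(self, latency: float) -> None:
        self.count += 1
        # When total is huge, a small latency can be lost to rounding.
        self.total += latency


stats = ToyLatencyStats(count=10, total=1e17)
total_before = stats.total
stats.add(1.0)
sum_increased = stats.total > total_before
```

Here `sum_increased` is False while `stats.count` has moved from 10 to 11,
which is why an assertion on the count is the robust one.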

Don't worry that the strength of this test is reduced - this test was
never meant to check the accuracy or correctness of the histograms -
we should have different (and better) tests for that, unrelated to
Alternator. The purpose of *this* test is only to verify that for some
specific operation like PutItem, Alternator didn't forget to measure its
latency and update the histogram. We want to avoid a bug like we had
in counters in the past (#9406).

Fixes #18847.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 13cf6c543d)

Closes scylladb/scylladb#19193
2024-06-10 20:20:54 +03:00
Tomasz Grabiec
a3e4dc7b6c test: tablets: Fix flakiness of test_removenode_with_ignored_node due to read timeout
The check query may be executed on a node which doesn't yet see that
the downed server is down, as it is not shut down gracefully. The
query coordinator can choose the down node as a CL=1 replica for read
and time out.

To fix, wait for all nodes to notice the node is down before executing
the checking query.

Fixes #17938

(cherry picked from commit c8f71f4825)

Closes scylladb/scylladb#19199
2024-06-10 20:12:56 +03:00
Botond Dénes
7a6ff12ace Merge '[Backport 6.0] alternator: keep TTL work in the maintenance scheduling group' from ScyllaDB
Alternator has a custom TTL implementation. This is based on a loop, which scans existing rows in the table, then decides whether each row has reached its end-of-life and deletes it if it did. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads and negatively affecting their latencies. This was found to be caused by the reads and writes done on behalf of the alternator TTL, which lose their maintenance scheduling group when they have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group when statement verbs like reads or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and system (default scheduling group), as we used to have only user-initiated operations and system (internal) ones. With alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background ones. The former should use the system tenant while the latter will use the new maintenance tenant (streaming scheduling group).
This series adds a streaming tenant to the messaging service configuration and it adds a test which confirms that with this change, alternator TTL is entirely contained in the maintenance scheduling group.

Fixes: #18719

- [x] Scans executed on behalf of alternator TTL are running in the statement group, disturbing user-workloads, this PR has to be backported to fix this.

(cherry picked from commit 5d3f7c13f9)

(cherry picked from commit 1fe8f22d89)

 Refs #18729

Closes scylladb/scylladb#19196

* github.com:scylladb/scylladb:
  alternator, scheduler: test reproducing RPC scheduling group bug
  main: add maintenance tenant to messaging_service's scheduling config
2024-06-10 19:58:38 +03:00
Anna Stuchlik
e38d675cb9 doc: mark tablets as GA in the CREATE KEYSPACE section
This commit removes the information that tablets are an experimental feature
from the CREATE KEYSPACE section.

In addition, it removes the notes and cautions that are redundant when
a feature is GA, especially the information and warnings about the future
plans.

Fixes https://github.com/scylladb/scylladb/issues/18670

Closes scylladb/scylladb#19063

(cherry picked from commit 55ed18db07)
2024-06-10 18:53:47 +03:00
Gleb Natapov
45ff4d2c41 group0, topology coordinator: run group0 and the topology coordinator in gossiper scheduling group
Currently they both run in the streaming group, which may become busy during
repair/mv building and affect group0 functionality. Move them to the
gossiper group, where they should have more time to run.

Fixes #18863

(cherry picked from commit a74fbab99a)

Closes scylladb/scylladb#19175
2024-06-10 10:34:29 +02:00
Nadav Har'El
0662e80917 alternator, scheduler: test reproducing RPC scheduling group bug
This patch adds a test for issue #18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.

Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.

The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.

The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before #18719 was fixed, a lot of work
was done there (more than half of the work done in the right group).
After the issue was fixed in the previous patch, the work on the wrong
scheduling group went down to zero.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 1fe8f22d89)
2024-06-10 07:42:23 +00:00
Botond Dénes
5b546ad4b1 main: add maintenance tenant to messaging_service's scheduling config
Currently only the user tenant (statement scheduling group) and system
(default scheduling group) tenants exist, as we used to have only
user-initiated operations and system (internal) ones. Now there is a need
to distinguish between two kinds of system operation: foreground and
background ones. The former should use the system tenant while the
latter will use the new maintenance tenant (streaming scheduling group).

(cherry picked from commit 5d3f7c13f9)
2024-06-10 07:42:22 +00:00
Piotr Dulikowski
e04378fdf0 Merge ' [Backport 6.0] db/hints: Use host ID to IP mappings to choose the ep manager to drain when node is leaving' from Dawid Mędrek
In [d0f5873](d0f58736c8), we introduced mappings IP–host ID between hint directories and the hint endpoint managers managing them. As a consequence, it may happen that one hint directory stores hints towards multiple nodes at the same time. If any of those nodes leaves the cluster, we should drain the hint directory. However, before these changes that doesn't happen – we only drain it when the node of the same host ID as the hint endpoint manager leaves the cluster.

This PR fixes that draining issue in the pre-host-ID-based hinted handoff. Now no matter which of the nodes corresponding to a hint directory leaves the cluster, the directory will be drained.

We also introduce error injections to be able to test that it indeed happens.

Fixes scylladb/scylladb#18761

(cherry picked from commit [745a9c6](745a9c6ab8))

(cherry picked from commit [e855794](e855794327))

Refs scylladb/scylladb#18764

Closes scylladb/scylladb#19114

* github.com:scylladb/scylladb:
  db/hints: Introduce an error injection to test draining
  db/hints: Ensure that draining happens
2024-06-10 09:11:07 +02:00
Tomasz Grabiec
f8243cbf19 Merge '[Backport 6.0] Serialize repair with tablet migration' from ScyllaDB
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

A blocked tablet migration (e.g. due to down node) will block repair,
whereas before it would fail. Once admin resolves the cause of blocked migration,
repair will continue.

Fixes #17658.
Fixes #18561.

(cherry picked from commit 6c64cf33df)

(cherry picked from commit 1513d6f0b0)

(cherry picked from commit 476c076a21)

(cherry picked from commit c45ce41330)

(cherry picked from commit e97acf4e30)

(cherry picked from commit 98323be296)

(cherry picked from commit 5ca54a6e88)

 Refs #18641

Closes scylladb/scylladb#19144

* github.com:scylladb/scylladb:
  test: pylib: Do not block async reactor while removing directories
  repair: Exclude tablet migrations with tablet repair
  repair_service: Propagate topology_state_machine to repair_service
  main, storage_service: Move topology_state_machine outside storage_service
  storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
  tablet_scheduler: Make disabling of balancing interrupt shuffle mode
  tablet_scheduler: Log whether balancing is considered as enabled
2024-06-09 00:20:44 +02:00
Tomasz Grabiec
27f01bf4e3 test: pylib: Do not block async reactor while removing directories
This fixes a problem where suite cleanup schedules lots of uninstall()
tasks for servers started in the suite, which schedules lots of tasks,
which synchronously call rmtree(). These take over a minute to finish,
which blocks other tasks for tests which are still executing.

In particular, this was observed to cause
ManagerClient.server_stop_gracefully() to time-out. It has a timeout
of 60 seconds. The server was stopped quickly, but the RESTful API
response was not processed in time and the call timed out when it got
the async reactor.
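
One common way to keep an asyncio event loop responsive during such cleanup is
to push the synchronous call into a worker thread. The sketch below shows that
general technique; it is an assumption that the actual patch works this way,
and `remove_dir_without_blocking` is a hypothetical name.

```python
import asyncio
import shutil


async def remove_dir_without_blocking(path: str) -> None:
    """Remove a directory tree without stalling the event loop."""
    loop = asyncio.get_running_loop()
    # shutil.rmtree is synchronous; run_in_executor hands it to the
    # default thread pool so other tasks keep running meanwhile.
    await loop.run_in_executor(None, shutil.rmtree, path)
```

While the removal runs in the thread pool, other coroutines, such as one
waiting for a REST response, can still be scheduled.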

(cherry picked from commit 5ca54a6e88)
2024-06-08 16:31:18 +02:00
Tomasz Grabiec
ded9aca6ee repair: Exclude tablet migrations with tablet repair
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

Fixes #17658.
Fixes #18561.

(cherry picked from commit 98323be296)
2024-06-08 16:31:18 +02:00
Tomasz Grabiec
ccd441a4de repair_service: Propagate topology_state_machine to repair_service
(cherry picked from commit e97acf4e30)
2024-06-08 16:31:15 +02:00
Jenkins Promoter
79e4e411b3 Update ScyllaDB version to: 6.0.1 2024-06-07 09:31:05 +03:00
Kefu Chai
f8ba94a960 doc: document "enable_tablets" option
it sets the cluster feature of tablets, and is a prerequisite for
using tablets.

Refs #18670
Fixes #19157
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit bac7e1e942)

Closes scylladb/scylladb#19158
2024-06-07 07:03:30 +03:00
Tzach Livyatan
dfe89157c6 Docs: fix start command in Update replace-dead-node.rst
Fix #18920

(cherry picked from commit c30f81c389)

Closes scylladb/scylladb#19142
2024-06-07 07:02:02 +03:00
Kefu Chai
50d8fa6b77 topology_coordinator: handle/wait futures when stopping topology_coordinator
before this change, unlike other services in scylla,
topology_coordinator is not properly stopped when it is aborted,
because the scylla instance is no longer a leader or is being shut down.
its `run()` method just stops the grand loop and bails out before
topology_coordinator is destroyed. but we are tracking the migration
state of tablets using a bunch of futures, which might not be
handled yet, and some of them could carry failures. in that case,
when the `future` instances with failure state get destroyed,
seastar calls `report_failed_future`. and seastar considers this
practice a source of bugs -- as one just fails to handle an error.
that's why we have following error:

```
WARN  2024-05-19 23:00:42,895 [shard 0:strm] seastar - Exceptional future ignored: seastar::rpc::unknown_verb_error (unknown verb), backtrace: /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56c14e /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56c770 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56ca58 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x38c6ad 0x29cdd07 0x29b376b 0x29a5b65 0x108105a /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff1df /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x400367 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff838 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36de58 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36d092 0x1017cba 0x1055080 0x1016ba7 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27b89 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27c4a 0x1015524
```
and the backtrace looks like:
```
seastar::current_backtrace_tasklocal() at ??:?
seastar::current_tasktrace() at ??:?
seastar::current_backtrace() at ??:?
seastar::report_failed_future(seastar::future_state_base::any&&) at ??:?
service::topology_coordinator::tablet_migration_state::~tablet_migration_state() at topology_coordinator.cc:?
service::topology_coordinator::~topology_coordinator() at topology_coordinator.cc:?
service::run_topology_coordinator(seastar::sharded<db::system_distributed_keyspace>&, gms::gossiper&, netw::messaging_service&, locator::shared_token_metadata&, db::system_keyspace&, replica::database&, service::raft_group0&, service::topology_state_machine&, seastar::abort_source&, raft::server&, seastar::noncopyable_function<seastar::future<service::raft_topology_cmd_result> (utils::tagged_tagged_integer<raft::internal::non_final, raft::term_tag, unsigned long>, unsigned long, service::raft_topology_cmd const&)>, service::tablet_allocator&, std::chrono::duration<long, std::ratio<1l, 1000l> >, service::endpoint_lifecycle_notifier&) [clone .resume] at topology_coordinator.cc:?
seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at main.cc:?
seastar::reactor::run_some_tasks() at ??:?
seastar::reactor::do_run() at ??:?
seastar::reactor::run() at ??:?
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ??:?
```

and even worse, these futures are indirectly owned by `topology_coordinator`.
so there are chances that they could be used even after `topology_coordinator`
is destroyed. this is a use-after-free issue. because the
`run_topology_coordinator` fiber exits when the scylla instance retires
from the leader's role, this could be fatal to a running instance,
since use-after-free is undefined behavior.

so, in this change, we handle the futures in `_tablets`, and note
down the failures carried by them if any.

Fixes #18745
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 4a36918989)

Closes scylladb/scylladb#19139
2024-06-07 07:00:25 +03:00
Jenkins Promoter
a77615adf3 Update ScyllaDB version to: 6.0.0 2024-06-06 16:03:39 +03:00
Tomasz Grabiec
e518bb68b2 main, storage_service: Move topology_state_machine outside storage_service
It will be propagated to repair_service to avoid cyclic dependency:

storage_service <-> repair_service

(cherry picked from commit c45ce41330)
2024-06-06 13:01:19 +00:00
Tomasz Grabiec
af2caeb2de storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
Will be used later in a place which doesn't have access to storage_service
but does have access to topology_state_machine.

It's not necessary to start group0 operation around polling because
the busy() state can be checked atomically and if it's false it means
the topology is no longer busy.

(cherry picked from commit 476c076a21)
2024-06-06 13:01:19 +00:00
Tomasz Grabiec
d5ebfea1ff tablet_scheduler: Make disabling of balancing interrupt shuffle mode
Tests will rely on that: they will run in shuffle mode, and disable
balancing around sections which would otherwise be blocked indefinitely
by ongoing shuffling (like repair).

(cherry picked from commit 1513d6f0b0)
2024-06-06 13:01:18 +00:00
Tomasz Grabiec
3fec9e1344 tablet_scheduler: Log whether balancing is considered as enabled
(cherry picked from commit 6c64cf33df)
2024-06-06 13:01:18 +00:00
Kamil Braun
5d3dde50f4 Merge '[Backport 6.0] Fail bootstrap if ip mapping is missing during double write stage' from ScyllaDB
If a node restarts just before it stores the bootstrapping node's IP, it will
not have the ID to IP mapping for the bootstrapping node, which may cause a failure
on the write path. Detect this and fail bootstrapping if it happens.

(cherry picked from commit 1faef47952)

(cherry picked from commit 27445f5291)

(cherry picked from commit 6853b02c00)

(cherry picked from commit f91db0c1e4)

 Refs #18927

Closes scylladb/scylladb#19118

* github.com:scylladb/scylladb:
  raft topology: fix indentation after previous commit
  raft topology: do not add bootstrapping node without IP as pending
  test: add test of bootstrap where the coordinator crashes just before storing IP mapping
  schema_tables: remove unused code
2024-06-06 11:35:13 +02:00
Tomasz Grabiec
b7fe4412d0 test: pylib: Fetch all pages by default in run_async
Fetching only the first page is not the intuitive behavior expected by users.

This causes flakiness in some tests which generate variable amount of
keys depending on execution speed and verify later that all keys were
written using a single SELECT statement. When the amount of keys
becomes larger than page size, the test fails.
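
The failure mode can be shown with a small in-memory pager. This is a sketch
only: `make_fetcher`, `fetch_first_page` and `fetch_all_pages` are hypothetical
helpers that stand in for a paged query API, not the pylib or driver code.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher takes an opaque paging state and returns (rows, next_state);
# next_state is None when there are no more pages.
PageFetcher = Callable[[Optional[int]], Tuple[List[int], Optional[int]]]


def make_fetcher(rows: List[int], page_size: int) -> PageFetcher:
    """Build a fake paged source over an in-memory list."""
    def fetch(state: Optional[int]) -> Tuple[List[int], Optional[int]]:
        start = state or 0
        end = start + page_size
        next_state = end if end < len(rows) else None
        return rows[start:end], next_state
    return fetch


def fetch_first_page(fetch: PageFetcher) -> List[int]:
    """Old behavior: silently returns at most one page."""
    rows, _ = fetch(None)
    return rows


def fetch_all_pages(fetch: PageFetcher) -> List[int]:
    """New behavior: keep following the paging state until exhausted."""
    rows: List[int] = []
    state: Optional[int] = None
    while True:
        page, state = fetch(state)
        rows.extend(page)
        if state is None:
            return rows
```

With 12 keys and a page size of 5, the first-page variant sees only 5 keys,
so a check that all keys were written passes or fails depending on how many
keys the test happened to generate.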

Fixes #18774

(cherry picked from commit 2c3f7c996f)

Closes scylladb/scylladb#19130
2024-06-06 08:22:45 +03:00
Benny Halevy
fd7284ec06 gms: endpoint_state: get_dc_rack: do not assign to uninitialized memory
Assigning to a member of an uninitialized optional
does not initialize the object before assigning to it.
This resulted in the AddressSanitizer detecting an attempted
double-free when the uninitialized string apparently contained
a bogus pointer.

The change emplaces the returned optional when needed
without resorting to the copy-assignment operator.
So it's not susceptible to assigning to uninitialized
memory, and it's more efficient as well...

Fixes scylladb/scylladb#19041

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b2fa954d82)

Closes scylladb/scylladb#19117
2024-06-06 08:21:05 +03:00
Botond Dénes
8d12eeee62 Merge '[Backport 6.0] tasks: introduce task manager's task folding' from Aleksandra Martyniuk
Task manager's tasks stay in memory after they are finished.
Moreover, even if a child task is unregistered from task manager,
it is still alive since its parent keeps a foreign pointer to it. Also,
when a task has finished successfully there is no point in keeping
all of its descendants in memory.

The patch introduces folding of task manager's tasks. Whenever
a task which has a parent is finished it is unregistered from task
manager and foreign_ptr to it (kept in its parent) is replaced
with its status. Children's statuses of the task are dropped unless
they or one of their descendants failed. So for each operation we
keep a tree of tasks which contains:
- a root task and its direct children (status if they are finished, a task
  otherwise);
- running tasks and their direct children (same as above);
- a statuses path from root to failed tasks.
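
The folding rule above can be sketched on a toy tree. This is an illustration
under assumptions, not the task manager implementation: `Task`, `Status`,
`fold` and `finish_child` are hypothetical names, and any child still held as
a task when its parent finishes is simply folded recursively.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Status:
    """What remains of a finished task."""
    name: str
    failed: bool
    children: List["Status"] = field(default_factory=list)


@dataclass
class Task:
    """A task that is still held in memory."""
    name: str
    failed: bool = False
    children: List[Union["Task", Status]] = field(default_factory=list)


def has_failure(status: Status) -> bool:
    """True if the status or any of its descendants failed."""
    return status.failed or any(has_failure(c) for c in status.children)


def fold(task: Task) -> Status:
    """Turn a finished task into a status, keeping only failing subtrees."""
    kept = []
    for child in task.children:
        child_status = child if isinstance(child, Status) else fold(child)
        if has_failure(child_status):
            kept.append(child_status)
    return Status(task.name, task.failed, kept)


def finish_child(parent: Task, child: Task) -> None:
    """Replace the parent's reference to a finished child with its status."""
    for index, candidate in enumerate(parent.children):
        if candidate is child:
            parent.children[index] = fold(child)
            return
    raise ValueError("child does not belong to parent")
```

After all children of a task finish and the task itself is folded, only the
path leading to a failed descendant survives under it; statuses of successful
descendants are dropped.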

/task_manager/wait_task/ does not unregister tasks anymore.

Refs: https://github.com/scylladb/scylladb/issues/16694.

- [ ] ** Backport reason (please explain below if this patch should be backported or not) **
Requires backport to 6.0 as task number exploded with tablets.

(cherry picked from commit 6add9edf8a)

(cherry picked from commit 319e799089)

(cherry picked from commit e6c50ad2d0)

(cherry picked from commit a82a2f0624)

(cherry picked from commit c1b2b8cb2c)

(cherry picked from commit 30f97ea133)

(cherry picked from commit fc0796f684)

(cherry picked from commit d7e80a6520)

(cherry picked from commit beef77a778)

Refs https://github.com/scylladb/scylladb/pull/18735

Closes scylladb/scylladb#19104

* github.com:scylladb/scylladb:
  docs: describe task folding
  test: rest_api: add test for task tree structure
  test: rest_api: modify new_test_module
  tasks: test: modify test_task methods
  api: task_manager: do not unregister task in /task_manager/wait_task/
  tasks: unregister tasks with parents when they are finished
  tasks: fold finished tasks info their parents
  tasks: make task_manager::task::impl::finish_failed noexcept
  tasks: change _children type
2024-06-06 07:56:12 +03:00
Gleb Natapov
e11827f37e raft topology: fix indentation after previous commit
(cherry picked from commit f91db0c1e4)
2024-06-05 13:55:29 +00:00
Gleb Natapov
0acfc223ab raft topology: do not add bootstrapping node without IP as pending
If there is no mapping from host id to ip while a node is in bootstrap
state there is no point adding it to pending endpoint since write
handler will not be able to map it back to host id anyway. If the
transition state requires double writes though we still want to fail.
In case the state is write_both_read_old we fail the barrier that will
cause topology operation to rollback and in case of write_both_read_new
we assert but this should not happen since the mapping is persisted by
this point (or we failed in write_both_read_old state).

Fixes: scylladb/scylladb#18676
(cherry picked from commit 6853b02c00)
2024-06-05 13:55:28 +00:00
Gleb Natapov
c53cd98a41 test: add test of bootstrap where the coordinator crashes just before storing IP mapping
On the next boot there is no host ID to IP mapping which causes node to
crash again with "No mapping for :: in the passed effective replication map"
assertion.

(cherry picked from commit 27445f5291)
2024-06-05 13:55:28 +00:00
Gleb Natapov
fa6a7cf144 schema_tables: remove unused code
(cherry picked from commit 1faef47952)
2024-06-05 13:55:28 +00:00
Patryk Jędrzejczak
65021c4b1c [Backport 6.0] test: test_topology_ops: run correctly without tablets
The values of `tablets_enabled` were nonempty strings, so they
always evaluated to `True` in the if statement responsible for
enabling writing workers only if tablets are disabled. Hence, the
writing workers were always disabled.
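
The pitfall described above can be reproduced in isolation. The following is a minimal sketch, not the test's actual code: the value `"false"` and the helper `parse_flag` are hypothetical and only illustrate why a nonempty string always passes an `if` check.

```python
def parse_flag(value: str) -> bool:
    """Hypothetical helper: interpret a textual flag explicitly."""
    return value.strip().lower() in {"true", "1", "yes"}


# Hypothetical value; any nonempty string behaves the same way.
tablets_enabled = "false"

# Truthiness of a string depends only on whether it is empty.
print(bool(tablets_enabled))        # True
# Explicit parsing yields the intended meaning.
print(parse_flag(tablets_enabled))  # False
```

Comparing against an explicit set of accepted spellings keeps the condition from silently depending on string emptiness.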

The original commit, ea4717da65,
contains one more change, which is not needed (and conflicting)
in 6.0 because scylladb/scylladb#18898 has been backported first.

Closes scylladb/scylladb#19111
2024-06-05 15:15:00 +02:00
Botond Dénes
341c29bd74 Merge '[Backport 6.0] storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id fed into it is lower than its current tablet count.

Fixes https://github.com/scylladb/scylladb/issues/18085.

(cherry picked from commit abcc68dbe7)

(cherry picked from commit 551bf9dd58)

(cherry picked from commit e7246751b6)

Refs https://github.com/scylladb/scylladb/pull/18287

Closes scylladb/scylladb#19095

* github.com:scylladb/scylladb:
  topology_experimental_raft/test_tablets: restore usage of check_with_down
  test: Fix flakiness in topology_experimental_raft/test_tablets
  service: Use tablet read selector to determine which replica to account table stats
  storage_service: Fix race between tablet split and stats retrieval
2024-06-05 13:06:32 +03:00
Aleksandra Martyniuk
e963631859 docs: describe task folding
(cherry picked from commit beef77a778)
2024-06-05 10:09:13 +02:00
Jenkins Promoter
c6f0a3267e Update ScyllaDB version to: 6.0.0-rc3 2024-06-05 10:03:47 +03:00
Marcin Maliszkiewicz
f02f2fef40 docs: remove note about performance degradation with default superuser
This doesn't apply for auth-v2 as we improved data placement and
removed cassandra quirk which was setting different CL for some
default superuser involved operations.

Fixes #18773

(cherry picked from commit 9adf74ae6c)

Closes scylladb/scylladb#18860
2024-06-05 09:04:45 +03:00
Benny Halevy
f8ae38a68c data_dictionary: keyspace_metadata: format: print also initial_tablets
Currently, there is no indication of tablets in the logged KSMetaData.
Print the tablets configuration: either the `initial` number of tablets,
if enabled, or {'enabled':false} otherwise.

For example:
```
migration_manager - Create new Keyspace: KSMetaData{name=tablets_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"initial":0}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004d446a8}

migration_manager - Create new Keyspace: KSMetaData{name=vnodes_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"enabled":false}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004c33ea8}
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4fe700a962)

Closes scylladb/scylladb#19009
2024-06-05 08:31:21 +03:00
Botond Dénes
8a064daccf Update tools/java submodule
* tools/java 4ee15fd9...6dfc187a (1):
  > Update Scylla Java driver to 3.11.5.3.

[botond: regenerate frozen toolchain]

Closes scylladb/scylladb#18999
2024-06-05 08:00:19 +03:00
Botond Dénes
7f540407c9 Merge '[Backport 6.0] repair: Introduce new primary replica selection algorithm for tablets' from ScyllaDB
Tablet allocation does not guarantee fairness of
the first replica in the replicas set across dcs.
The lack of this fix causes the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc

Use the tablet_map get_primary_replica or get_primary_replica_within_dc,
respectively to see if this node is the primary replica for each tablet
or not.

Fixes https://github.com/scylladb/scylladb/issues/17752

No backport is required before 6.0 as tablets (and tablet repair) are introduced in 6.0

(cherry picked from commit c52f70f92c)

(cherry picked from commit 2de79c39dc)

(cherry picked from commit 84761acc31)

(cherry picked from commit 009767455d)

(cherry picked from commit 18df36d920)

 Refs #18784

Closes scylladb/scylladb#19068

* github.com:scylladb/scylladb:
  repair: repair_tablets: use get_primary_replica
  repair: repair_tablets: no need to check ranges_specified per tablet
  locator: tablet_map: add get_primary_replica_within_dc
  locator: tablet_map: get_primary_replica: do not copy tablet info
  locator: tablet_map: get_primary_replica: return tablet_replica
2024-06-05 07:47:24 +03:00
Aleksandra Martyniuk
50e1369d1d test: rest_api: add test for task tree structure
Add test which checks whether the tasks are folded into their parent
as expected.

(cherry picked from commit d7e80a6520)
2024-06-04 14:42:10 +00:00
Aleksandra Martyniuk
21e860453c test: rest_api: modify new_test_module
Remove remaining test tasks when a test module is removed, so that
a node could shutdown even if a test fails.

(cherry picked from commit fc0796f684)
2024-06-04 14:42:10 +00:00
Dawid Medrek
fc3d2d8fde db/hints: Introduce an error injection to test draining
We want to verify that a hint directory is drained
when any of the nodes corresponding to it leaves
the cluster. The test scenario should happen before
the whole cluster has been migrated to
the host-ID-based hinted handoff, so when we still
rely on the mappings between hint endpoint managers
and the hint directories managed by them.

To make such a test possible, in these changes we
introduce an error injection rejecting incoming
hints. We want to test a scenario when:

1. hints are saved towards a given node -- node N1,
2. N1 changes its IP to a different one,
3. some other node -- node N2 -- changes its IP
   to the original IP of N1,
4. hints are saved towards N2 and they are stored
   in the same directory as the hints saved towards
   N1 before,
5. we start draining N2.

Because at some point N2 needs to be stopped,
it may happen that some mutations towards
a distributed system table generate a hint
to N2 BEFORE it has finished changing its IP,
effectively creating another hint directory
where ALL of the hints towards the node
will be stored from there on. That would disturb
the test scenario. Hence, this error injection is
necessary to ensure that all of the steps in the
test proceed as expected.

(cherry picked from commit e855794327)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
1d34da21a9 tasks: test: modify test_task methods
Wait until the task is done in test_task::finish_failed and
test_task::finish to ensure that it is folded into its parent.

(cherry picked from commit 30f97ea133)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
377bc345f1 api: task_manager: do not unregister task in /task_manager/wait_task/
If /task_manager/wait_task/ unregisters the task, then there is no
way to examine children failures, since their statuses can be checked
only through their parent.

(cherry picked from commit c1b2b8cb2c)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
607be221b8 tasks: unregister tasks with parents when they are finished
Unregister children that are finished from task manager. They can be
examined through their parents.

(cherry picked from commit a82a2f0624)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
cb242ad48c tasks: fold finished tasks info their parents
Currently, when a child task is unregistered, it is still kept by its parent. This leads
to excessive memory usage, especially when the tasks are configured to be kept in task
manager after they are finished (task_ttl_in_seconds).

Introduce task_essentials struct which keeps only data necessary for task manager API.
When a task which has a parent is finished, a foreign pointer to it in its parent is replaced
with respective task_essentials. Once a parent task is finished it is also folded into
its parent (if it has one). Children details of a folded task are lost, unless they
(or some of their subtrees) failed. That is, when a task is finished, we keep:
- a root task (until it is unregistered);
- task_essentials of root's direct children;
- a path (of task_essentials) from root to each failed task (so that the reason
  of a failure could be examined).
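
The folding rule above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the C++ implementation: `Task`, `TaskEssentials`, `fold` and `finish_child` are hypothetical names, and the only behavior modeled is that a finished child is replaced in its parent by a summary that keeps paths to failed descendants.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TaskEssentials:
    """Hypothetical stand-in for the task_essentials struct."""
    name: str
    failed: bool
    # Only children whose subtree contains a failure survive folding.
    failed_children: List["TaskEssentials"] = field(default_factory=list)


@dataclass
class Task:
    """Hypothetical live task that still owns its children."""
    name: str
    failed: bool = False
    children: List[Union["Task", TaskEssentials]] = field(default_factory=list)


def subtree_failed(node: TaskEssentials) -> bool:
    """True if the node or any kept descendant failed."""
    return node.failed or any(subtree_failed(c) for c in node.failed_children)


def fold(task: Task) -> TaskEssentials:
    """Summarize a finished task, dropping children without failures."""
    kept = []
    for child in task.children:
        summary = child if isinstance(child, TaskEssentials) else fold(child)
        if subtree_failed(summary):
            kept.append(summary)
    return TaskEssentials(task.name, task.failed, kept)


def finish_child(parent: Task, child: Task) -> None:
    """Replace the finished child in its parent with its folded summary."""
    index = parent.children.index(child)
    parent.children[index] = fold(child)


leaf_ok = Task("leaf_ok")
leaf_bad = Task("leaf_bad", failed=True)
middle = Task("middle", children=[leaf_ok, leaf_bad])
root = Task("root", children=[middle])

finish_child(middle, leaf_ok)
finish_child(middle, leaf_bad)
finish_child(root, middle)

print([c.name for c in root.children])                     # ['middle']
print([c.name for c in root.children[0].failed_children])  # ['leaf_bad']
```

In this model the root keeps a summary of every direct child, while deeper levels keep only the path leading to `leaf_bad`.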

(cherry picked from commit e6c50ad2d0)
2024-06-04 14:42:09 +00:00
Aleksandra Martyniuk
7258f4f73c tasks: make task_manager::task::impl::finish_failed noexcept
(cherry picked from commit 319e799089)
2024-06-04 14:42:09 +00:00
Dawid Medrek
82d635b6a7 db/hints: Ensure that draining happens
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.

Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.

Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.

(cherry picked from commit 745a9c6ab8)
2024-06-04 14:42:08 +00:00
Aleksandra Martyniuk
baf0385728 tasks: change _children type
Keep task children in a map. It's a preparation for further changes.

(cherry picked from commit 6add9edf8a)
2024-06-04 14:42:08 +00:00
Raphael S. Carvalho
a373ed52a5 topology_experimental_raft/test_tablets: restore usage of check_with_down
e7246751b6 incorrectly dropped its usage in
test_tablet_missing_data_repair.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-04 11:01:22 -03:00
Raphael S. Carvalho
9a341a65af replica: Only consume memtable of the tablet intersecting with range read
storage_proxy is responsible for intersecting the range of the read
with tablets, and calling replica with a single tablet range, therefore
it makes sense to avoid touching memtables of tablets that don't
intersect with a particular range.

Note this is a performance issue, not a correctness one, as memtable
readers that don't intersect with current range won't produce any
data, but cpu is wasted until that's realized (they're added to list
of readers in mutation_reader_merger, more allocations, more data
sources to peek into, etc).

That's also important for streaming e.g. after decommission, that
will consume one tablet at a time through a reader, so we don't want
memtables of streamed tablets (that weren't cleaned up yet) to
be consumed.

Refs #18904.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 832fb43fb4)

Closes scylladb/scylladb#18983
2024-06-04 16:17:47 +03:00
Kefu Chai
35b4b47d74 build: add sanitizer compiling options directly
before this change, in order to avoid repeating/hardwiring the
compiling options set by Seastar, we just inherit the compiling
options of Seastar for building Abseil, as the former exposes the
options to enable sanitizers.

this works fine, despite that, strictly speaking, not all options
are necessary for building abseil, as abseil is not a Seastar
application -- it is just a C++ library.

but when we introduce dependencies which are only generated at
build time, and these dependencies are passed to the compiler
at build time, this breaks the build of Abseil. because these
dependencies are exposed by the Seastar's .pc file, and consumed
by Abseil. when building Abseil, apparently, the building process
driven by ninja is not started yet, so we are not able to build
Abseil with these settings due to missing dependencies.

so instead of inheriting the compiling options from Seastar, just
set the sanitizer related compiling options directly, to avoid
referencing these missing dependencies.

the upside is that we pass a much smaller set of compiling options
to compiler when building Abseil, the downside is that we hardwire
these options related to sanitizer manually, they are also detected
by Seastar's building system. but fortunately, these options are
relatively stable across the building environments we support.

Fixes #19055
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit c436dfd2db)

Closes scylladb/scylladb#19064
2024-06-04 15:00:43 +03:00
Benny Halevy
6d7388c689 repair: repair_tablets: use get_primary_replica
Tablet allocation does not guarantee fairness of
the first replica in the replicas set across dcs.
The lack of this fix causes the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc

Use the tablet_map get_primary_replica* functions to get
the primary replica for each tablet, possibly within a dc.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 18df36d920)
2024-06-03 19:50:40 +00:00
Benny Halevy
6ac34f7acf repair: repair_tablets: no need to check ranges_specified per tablet
The code already turns off `primary_replica_only`
if `!ranges_specified.empty()`, so there's no need to
check it again inside the per-tablet loop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 009767455d)
2024-06-03 19:50:40 +00:00
Benny Halevy
bdf3e71f62 locator: tablet_map: add get_primary_replica_within_dc
Will be needed by repair in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 84761acc31)
2024-06-03 19:50:40 +00:00
Benny Halevy
ec30bdc483 locator: tablet_map: get_primary_replica: do not copy tablet info
Currently, the function needlessly copies the tablet_info
(all tablet replicas in particular) to a local variable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2de79c39dc)
2024-06-03 19:50:40 +00:00
Benny Halevy
21f87c9cfa locator: tablet_map: get_primary_replica: return tablet_replica
This is required by repair when it will start using get_primary_replica
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c52f70f92c)
2024-06-03 19:50:39 +00:00
Botond Dénes
a38d5463ef Merge '[Backport 6.0] tablets: load balancer: Use random selection of candidates when moving tablets' from ScyllaDB
In order to avoid per-table tablet load imbalance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.

Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.

For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions.  One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2.  Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. Overcommit is calculated as max load
(tablet count per node) divided by perfect average load (all tablets / nodes):

  Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
  Overcommit       : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit       : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
  Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}

The worst state before the patch had the following distribution of tablets for the smaller table:

  Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
  Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
  Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
  Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
  Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
  Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
  Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33

One node has as many as 171 tablets of that table and another one has as few as 3.

After the patch, the worst distribution looks like this:

  Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
  Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
  Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
  Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
  Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
  Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
  Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65

Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.
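
The node-level overcommit figures quoted above can be recomputed from the per-host totals in the two listings. The snippet below is a small worked example of the stated formula (max load divided by perfect average load); the function name `overcommit` is chosen here for illustration.

```python
def overcommit(loads):
    """Max per-node load divided by the perfect average load."""
    perfect_average = sum(loads) / len(loads)
    return max(loads) / perfect_average


# Per-host totals for the smaller table, copied from the listings above.
before_patch = [171, 102, 89, 63, 57, 27, 3]
after_patch = [121, 87, 80, 77, 66, 47, 34]

# Both states hold 512 tablet replicas (256 tablets with RF=2).
print(sum(before_patch), sum(after_patch))  # 512 512
print(f"{overcommit(before_patch):.2f}")    # 2.34
print(f"{overcommit(after_patch):.2f}")     # 1.65
```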

Refs #16824

(cherry picked from commit 3be6120e3b)

(cherry picked from commit c9bcb5e400)

(cherry picked from commit 7b1eea794b)

(cherry picked from commit 603abddca9)

 Refs #18885

Closes scylladb/scylladb#19036

* github.com:scylladb/scylladb:
  tablets: load balancer: Use random selection of candidates when moving tablets
  test: perf: Add test for tablet load balancer effectiveness
  load_sketch: Extract get_shard_minmax()
  load_sketch: Allow populating only for a given table
2024-06-03 12:25:05 +03:00
Raphael S. Carvalho
3cb71c5b88 replica: Fix race of tablet snapshot with compaction
tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss because the
error is propagated back into the coordinator that decides to
retry streaming stage. So the consequence is delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so
shouldn't have drastic consequences.

Fixes #18977.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b396b05e20)

Closes scylladb/scylladb#19008
2024-06-03 12:21:52 +03:00
Lakshmi Narayanan Sreethar
85805f6472 db/config.cc: increment components_memory_reclaim_threshold config default
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.

Fixes #18607

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)

Closes scylladb/scylladb#19014
2024-06-03 12:19:16 +03:00
Pavel Emelyanov
62a23fd86a config: Remove experimental TABLETS feature
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.

The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)

Closes scylladb/scylladb#19012
2024-06-03 12:16:41 +03:00
Tomasz Grabiec
b9c88fdf4b tablets: load balancer: Use random selection of candidates when moving tablets
In order to avoid per-table tablet load imbalance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.

Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.

For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions.  One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2.  Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. Overcommit is calculated as max load
(tablet count per node) divided by perfect average load (all tablets / nodes):

  Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
  Overcommit       : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit       : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
  Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}

The worst state before the patch had the following distribution of tablets for the smaller table:

  Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
  Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
  Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
  Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
  Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
  Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
  Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33

One node has as many as 171 tablets of that table and another one has as few as 3.

After the patch, the worst distribution looks like this:

  Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
  Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
  Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
  Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
  Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
  Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
  Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65

Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.

Refs #16824

(cherry picked from commit 603abddca9)
2024-06-02 22:40:46 +00:00
Tomasz Grabiec
0c1b6fed16 test: perf: Add test for tablet load balancer effectiveness
(cherry picked from commit 7b1eea794b)
2024-06-02 22:40:45 +00:00
Tomasz Grabiec
fb7a33be13 load_sketch: Extract get_shard_minmax()
(cherry picked from commit c9bcb5e400)
2024-06-02 22:40:44 +00:00
Tomasz Grabiec
b208953e07 load_sketch: Allow populating only for a given table
(cherry picked from commit 3be6120e3b)
2024-06-02 22:40:44 +00:00
Michał Jadwiszczak
803662351d docs/procedures/backup-restore: use DESC SCHEMA WITH INTERNALS
Update docs for backup procedure to use `DESC SCHEMA WITH INTERNALS`
instead of plain `DESC SCHEMA`.
Add a note to use cqlsh in a proper version (at least 6.0.19).

Closes scylladb/scylladb#18953

(cherry picked from commit 5b4e688668)
2024-06-02 23:15:49 +02:00
Marcin Maliszkiewicz
cbf47319c1 db: auth: move auth tables to system keyspace
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.

Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769

(cherry picked from commit 2ab143fb40)

[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
2024-06-02 21:41:14 +03:00
Jenkins Promoter
64388bcf22 Update ScyllaDB version to: 6.0.0-rc2 2024-06-02 15:35:58 +03:00
Anna Stuchlik
83dfe6bfd6 doc: add support for Ubuntu 24.04
(cherry picked from commit e81afa05ea)

Closes scylladb/scylladb#19010
2024-05-31 18:33:57 +03:00
Wojciech Mitros
3c47ab9851 mv: handle different ERMs for base and view table
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs are synchronized while
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709

(cherry-picked from 519317dc58)

This commit also includes the follow-up patch that removes the
flakiness from the test that is introduced by the commit above.
The flakiness was caused by enabling the
delay_before_get_view_natural_endpoint injection on a node
and not disabling it before the node is shut down. The patch
removes the enabling of the injection on the node in the first
place.
By squashing the commits, we won't introduce a place in the
commit history where a potential bisect could mistakenly fail.

Fixes: https://github.com/scylladb/scylladb/issues/18941

(cherry-picked from 0de3a5f3ff)

Closes scylladb/scylladb#18974
2024-05-30 09:13:31 +02:00
Anna Stuchlik
bef3777a5f doc: add the tablets information to the nodetool describering command
This commit adds an explanation of how the `nodetool describering` command
works if tablets are enabled.

(cherry picked from commit 888d7601a2)

Closes scylladb/scylladb#18981
2024-05-30 09:22:49 +03:00
Pavel Emelyanov
b25dd2696f Backport Merge 'tablets: alter keyspace' from Piotr Smaron
This change supports changing replication factor in tablets-enabled keyspaces.
This covers both increasing and decreasing the number of tablets replicas through
first building topology mutations (`alter_keyspace_statement.cc`) and then
tablets/topology/schema mutations (`topology_coordinator.cc`).
For the limitations of the current solution, please see the docs changes attached to this PR.

refs: scylladb/scylladb#16723

* br-backport-alter-ks-tablets:
  test: Do not check tablets mutations on nodes that don't have them
  test: Fix the way tablets RF-change test parses mutation_fragments
  test/tablets: Unmark RF-changing test with xfail
  docs: document ALTER KEYSPACE with tablets
  Return response only when tablets are reallocated
  cql-pytest: Verify RF is changes by at most 1 when tablets on
  cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
  Reject ALTER with 'replication_factor' tag
  Implement ALTER tablets KEYSPACE statement support
  Parameterize migration_manager::announce by type to allow executing different raft commands
  Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks
  Extend system.topology with 3 new columns to store data required to process alter ks global topo req
  Allow query_processor to check if global topo queue is empty
  Introduce new global topo `keyspace_rf_change` req
  New raft cmd for both schema & topo changes
  Add storage service to query processor
  tablets: tests for adding/removing replicas
  tablet_allocator: make load_balancer_stats_manager configurable by name
2024-05-30 08:33:58 +03:00
Pavel Emelyanov
57d267a97e test: Do not check tablets mutations on nodes that don't have them
The check is performed by selecting from mutation_fragments(table), but
it's known that this query crashes Scylla when there's no tablet replica
on that node.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Pavel Emelyanov
5b8523273b test: Fix the way tablets RF-change test parses mutation_fragments
When the test changes RF from 2 to 3, the extra node executes "rebuild"
transition which means that it streams tablets replicas from two other
peers. When doing it, the node receives two sets of sstables with
mutations from the given tablet. The test part that checks if the extra
node received the mutations notices two mutation fragments on the new
replica and erroneously fails by seeing that RF=3 is not equal to the
number of mutations found, which is 4.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Pavel Emelyanov
6497ed68ed test/tablets: Unmark RF-changing test with xfail
Now the scaling works and the test must check that it does

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Piotr Smaron
39c1237e25 docs: document ALTER KEYSPACE with tablets 2024-05-30 08:33:26 +03:00
Piotr Smaron
e04964ba17 Return response only when tablets are reallocated
Up until now we waited until mutations are in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.
2024-05-30 08:33:26 +03:00
Dawid Medrek
fb5b9012e6 cql-pytest: Verify RF is changes by at most 1 when tablets on
This commit adds a test verifying that we can only
change the RF of a keyspace for any DC by at most 1
when using tablets.

Fixes #18029
2024-05-30 08:33:26 +03:00
Dawid Medrek
749197f0a4 cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
We want to ensure that when the replication factor
of a keyspace changes, it changes by at most 1 per DC
if it uses tablets. The rationale for that is to make
sure that the old and new quorums overlap by at least
one node.

After these changes, attempts to change the RF of
a keyspace in any DC by more than 1 will fail.
2024-05-30 08:33:26 +03:00
Piotr Smaron
1f4428153f Reject ALTER with 'replication_factor' tag
This patch removes the support for the "wildcard" replication_factor
option for ALTER KEYSPACE when the keyspace supports tablets.
It will still be supported for CREATE KEYSPACE so that a user doesn't
have to know all datacenter names when creating the keyspace,
but ALTER KEYSPACE will require that and the user will have to
specify the exact change in replication factors they wish to make by
explicitly specifying the datacenter names.
Expanding the replication_factor option in the ALTER case is
unintuitive and it's a trap many users fell into.

See #8881, #15391, #16115
2024-05-30 08:33:26 +03:00
Piotr Smaron
544c424e89 Implement ALTER tablets KEYSPACE statement support
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request is generated with data attached to the system.topology
table. Then, once the topology state machine is ready, it starts to handle
this global topology event, which results in producing the mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response once the
mutations are sent.
2024-05-30 08:33:25 +03:00
Piotr Smaron
73b59b244d Parameterize migration_manager::announce by type to allow executing different raft commands
Since ALTER KS requires creating topology_change raft command, some
functions need to be extended to handle it. RAFT commands are recognized
by types, so some functions are just going to be parameterized by type,
i.e. made into templates.
These templates are instantiated already, so that only 1 instance of
each template exists across the whole code base, to avoid compiling it
in each translation unit.
2024-05-30 08:33:15 +03:00
Piotr Smaron
5afa3028a3 Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks 2024-05-30 08:33:15 +03:00
Piotr Smaron
885c7309ee Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through system.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.
2024-05-30 08:33:15 +03:00
Piotr Smaron
adfad686b3 Allow query_processor to check if global topo queue is empty
With current implementation only 1 global topo req can be executed at a
time, so when ALTER KS is executed, we'll have to check if any other
global topo req is ongoing and fail the req if that's the case.
2024-05-30 08:33:15 +03:00
Piotr Smaron
1a70db17a6 Introduce new global topo keyspace_rf_change req
It will be used when processing ALTER KS statement, but also to
create a separate processing path for a KS with tablets (as opposed to
a vnode KS).
2024-05-30 08:33:15 +03:00
Piotr Smaron
bd4b781dc8 New raft cmd for both schema & topo changes
Allows executing combined topology & schema mutations under a single RAFT command
2024-05-30 08:33:15 +03:00
Piotr Smaron
51b8b04d97 Add storage service to query processor
Query processor needs to access storage service to check if global
topology request is still ongoing and to be able to wait until it
completes.
2024-05-30 08:33:15 +03:00
Paweł Zakrzewski
242caa14fe tablets: tests for adding/removing replicas
Note we're suppressing a UBSanitizer overflow error in UTs. That's
because our linter complains about a possible overflow, which never
happens, but tests are still failing because of it.
2024-05-30 08:33:15 +03:00
Paweł Zakrzewski
cedb47d843 tablet_allocator: make load_balancer_stats_manager configurable by name
This is needed because the same name cannot be used for 2 separate
entities: doing so triggers a double-metrics-registration error. The
names therefore have to be configurable, not hardcoded.
2024-05-30 08:33:15 +03:00
Pavel Emelyanov
da816bf50c test/tablets: Check that after RF change data is replicated properly
There's a test that checks system.tablets contents to see that after
changing ks replication factor via ALTER KEYSPACE the tablet map is
updated properly. This patch extends this test so that it also validates that
mutations themselves are replicated according to the desired replication
factor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18644
2024-05-30 08:31:48 +03:00
Botond Dénes
8bff078a89 Merge '[Backport 6.0] cdc, raft topology: fix and test cdc in the recovery mode' from ScyllaDB
This PR ensures that CDC keeps working correctly in the recovery
mode after leaving the raft-based topology.

We update `system.cdc_local` in `topology_state_load` to ensure
a node restarting in the recovery mode sees the last CDC generation
created by the topology coordinator.

Additionally, we extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator.

Fixes scylladb/scylladb#17409
Fixes scylladb/scylladb#17819

(cherry picked from commit 4351eee1f6)

(cherry picked from commit 68b6e8e13e)

(cherry picked from commit 388db33dec)

(cherry picked from commit 2111cb01df)

 Refs #18820

Closes scylladb/scylladb#18938

* github.com:scylladb/scylladb:
  test: test_topology_recovery_basic: test CDC during recovery
  test: util: start_writes_to_cdc_table: add FIXME to increase CL
  test: util: start_writes_to_cdc_table: allow restarting with new cql
  storage_service: update system.cdc_local in topology_state_load
2024-05-29 16:14:02 +03:00
Botond Dénes
68d12daa7b Merge '[Backport 6.0] Fix parsing of initial tablets by ALTER' from ScyllaDB
If the user wants to change the default initial tablets value, they use the ALTER KEYSPACE statement. However, specifying `WITH tablets = { initial: $value }` has no effect, because the statement analyzer only applies the `tablets` parameters together with the `replication` ones, so the working statement has to be `WITH replication = $old_parameters AND tablets = ...`, which is not very convenient.

This PR changes the analyzer so that altering `tablets` happens independently from `replication`. Test included.

fixes: #18801

(cherry picked from commit 8a612da155)

(cherry picked from commit a172ef1bdf)

(cherry picked from commit 1003391ed6)

 Refs #18899

Closes scylladb/scylladb#18918

* github.com:scylladb/scylladb:
  cql-pytest: Add validation of ALTER KEYSPACE WITH TABLETS
  cql3: Fix parsing of ALTER KEYSPACE's tablets parameters
  cql3: Remove unused ks_prop_defs/prepare_options() argument
2024-05-29 16:13:26 +03:00
Patryk Jędrzejczak
e1616a2970 test: test_topology_ops: stop a write worker after the first error
`test_topology_ops` is flaky, which has been uncovered by gating
in scylladb/scylladb#18707. However, debugging it is harder than it
should be because write workers can flood the logs. They may send
a lot of failed writes before the test fails. Then, the log file
can become huge, even up to 20 GB.

Fix this issue by stopping a write worker after the first error.

This test is important for 6.0, so we can backport this change.

(cherry picked from commit 7c1e6ba8b3)

Closes scylladb/scylladb#18914
2024-05-29 16:11:46 +03:00
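The fix in e1616a2970 can be pictured with a minimal asyncio sketch. It is not the actual test helper: `write_worker` and `do_write` are hypothetical names, and the real worker issues CQL writes.

```python
import asyncio


async def write_worker(do_write, stop_event: asyncio.Event) -> int:
    """Issue writes until asked to stop, or until the first failure.

    `do_write` is a hypothetical coroutine function taking the write index.
    The first exception is logged once and re-raised, so a broken cluster
    produces one log line instead of a flood of failed writes.
    """
    written = 0
    while not stop_event.is_set():
        try:
            await do_write(written)
        except Exception as exc:
            print(f"write {written} failed: {exc!r}; stopping worker")
            raise
        written += 1
    return written
```

Re-raising (rather than swallowing) the error keeps the failure visible to whoever awaits the worker, while the loop exit bounds the log size.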
Kefu Chai
62f5171a55 docs: fix typos in upgrade document
s/Montioring/Monitoring/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit f1f3f009e7)

Closes scylladb/scylladb#18912
2024-05-29 16:11:08 +03:00
Piotr Smaron
fd928601ad cql: fix a crash lurking in ks_prop_defs::get_initial_tablets
`tablets_options->erase(it);` invalidates `it`, but it's still referred
to later in the code in the last `else`, and when that code is invoked,
we get a `heap-use-after-free` crash.

Fixes: #18926
(cherry picked from commit 8a77a74d0e)

Closes scylladb/scylladb#18949
2024-05-29 16:10:24 +03:00
Aleksandra Martyniuk
ae474f6897 test: fix test_tombstone_gc.py
Tests in test_tombstone_gc.py are parametrized with string instead
of bool values. Fix that. Use the value to create a keyspace with
or without tablets.

Fixes: #18888.
(cherry picked from commit b7ae7e0b0e)

Closes scylladb/scylladb#18948
2024-05-29 16:09:45 +03:00
Anna Stuchlik
099338b766 doc: add the tablet limitation to the manual recovery procedure
This commit adds the information that the manual recovery procedure
is not supported if tablets are enabled.

In addition, the content in the Manual Recovery Procedure is reorganized
by adding the Prerequisites and Procedure subsections - in this way,
we can limit the number of Note and Warning boxes that made the page
hard to follow.

Fixes https://github.com/scylladb/scylladb/issues/18895

(cherry picked from commit cfa3cd4c94)

Closes scylladb/scylladb#18947
2024-05-29 16:08:44 +03:00
Anna Stuchlik
375610ace8 doc: document RF limitation
This commit adds the information that the Replication Factor
must be the same or lower than the number of nodes.

(cherry picked from commit 87f311e1e0)

Closes scylladb/scylladb#18946
2024-05-29 16:08:05 +03:00
Botond Dénes
1b64e80393 Merge '[Backport 6.0] Harden the repair_service shutdown path' from ScyllaDB
This series ignores errors in `load_history()` to prevent `abort_requested_exception` coming from `get_repair_module().check_in_shutdown()` from escaping during `repair_service::stop()`, causing
```
repair_service::~repair_service(): Assertion `_stopped' failed.
```

Fixes https://github.com/scylladb/scylladb/issues/18889

Backport to 6.0 required due to 523895145d

(cherry picked from commit 38845754c4)

(cherry picked from commit c32c418cd5)

 Refs #18890

Closes scylladb/scylladb#18944

* github.com:scylladb/scylladb:
  repair: load_history: warn and ignore all errors
  repair_service: debug stop
2024-05-29 16:07:40 +03:00
Benny Halevy
fa330a6a4d repair: load_history: warn and ignore all errors
Currently, the call to `get_repair_module().check_in_shutdown()`
may throw `abort_requested_exception` that causes
`repair_service::stop()` to fail, and trigger assertion
failure in `~repair_service`.

We already ignore failure from `update_repair_time`,
so expand the logic to cover the whole function body.

Fixes scylladb/scylladb#18889

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c32c418cd5)
2024-05-28 17:07:26 +00:00
Benny Halevy
68544d5bb3 repair_service: debug stop
Seen the following unexplained assertion failure with
pytest -s -v --scylla-version=local_tarball --tablets repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc
```
INFO  2024-05-27 11:18:05,081 [shard 0:main] init - Shutting down repair service
INFO  2024-05-27 11:18:05,081 [shard 0:main] task_manager - Stopping module repair
INFO  2024-05-27 11:18:05,081 [shard 0:main] task_manager - Unregistered module repair
INFO  2024-05-27 11:18:05,081 [shard 1:main] task_manager - Stopping module repair
INFO  2024-05-27 11:18:05,081 [shard 1:main] task_manager - Unregistered module repair
scylla: repair/row_level.cc:3230: repair_service::~repair_service(): Assertion `_stopped' failed.
Aborting on shard 0.
Backtrace:
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3f040c
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x41c7a1
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x3dbaf
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x8e883
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x3dafd
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x2687e
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x2679a
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x36186
  0x26f2428
  0x10fb373
  0x10fc8b8
  0x10fc809
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x456c6d
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x456bcf
  0x10fc65b
  0x10fc5bc
  0x10808d0
  0x1080800
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff22f
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x4003b7
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff888
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36dea8
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36d0e2
  0x101cefa
  0x105a390
  0x101bde7
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27b89
  /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27c4a
  0x101a764
```

Decoded:
```
~repair_service at ./repair/row_level.cc:3230
~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:491
 (inlined by) ~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:491
~shared_ptr at ././seastar/include/seastar/core/shared_ptr.hh:569
 (inlined by) seastar::shared_ptr<repair_service>::operator=(seastar::shared_ptr<repair_service>&&) at ././seastar/include/seastar/core/shared_ptr.hh:582
 (inlined by) seastar::shared_ptr<repair_service>::operator=(decltype(nullptr)) at ././seastar/include/seastar/core/shared_ptr.hh:588
 (inlined by) operator() at ././seastar/include/seastar/core/sharded.hh:727
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&>(seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&) at ././seastar/include/seastar/core/future.hh:2035
 (inlined by) seastar::futurize<std::invoke_result<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>::type>::type seastar::smp::submit_to<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::smp_submit_to_options, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) at ././seastar/include/seastar/core/smp.hh:367
seastar::futurize<std::invoke_result<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>::type>::type seastar::smp::submit_to<seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>(unsigned int, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&&) at ././seastar/include/seastar/core/smp.hh:394
 (inlined by) operator() at ././seastar/include/seastar/core/sharded.hh:725
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int>(std::__invoke_other, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<seastar::future<void>, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int>, seastar::future<void> >::type std::__invoke_r<seastar::future<void>, seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int>(seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}&, unsigned int&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:114
 (inlined by) std::_Function_handler<seastar::future<void> (unsigned int), seastar::sharded<repair_service>::stop()::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda(unsigned int)#1}>::_M_invoke(std::_Any_data const&, unsigned int&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:290
```

FWIW, gdb crashed when opening the coredump.

This commit will help catch the issue earlier
when repair_service::stop() fails (and it must never fail)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 38845754c4)
2024-05-28 17:07:26 +00:00
Piotr Dulikowski
bc711a169d Merge '[Backport 6.0] qos/raft_service_level_distributed_data_accessor: print correct error message when trying to modify a service level in recovery mode' from ScyllaDB
Raft service levels are read-only in recovery mode. This patch adds a check and a proper error message for when a user tries to modify service levels in recovery mode.

Fixes https://github.com/scylladb/scylladb/issues/18827

(cherry picked from commit 2b56158d13)

(cherry picked from commit ee08d7fdad)

(cherry picked from commit af0b6bcc56)

 Refs #18841

Closes scylladb/scylladb#18913

* github.com:scylladb/scylladb:
  test/auth_cluster/test_raft_service_levels: try to create sl in recovery
  service/qos/raft_sl_dda: reject changes to service levels in recovery mode
  service/qos/raft_sl_dda: extract raft_sl_dda steps to common function
2024-05-28 16:45:52 +02:00
Patryk Jędrzejczak
0d0c037e1d test: test_topology_recovery_basic: test CDC during recovery
In topology on raft, management of CDC generations is moved to the
topology coordinator. We extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator. A node restarting in the recovery mode
should learn about the active generation from `system.cdc_local`
(or from gossip, but we don't want to rely on it). Then, it should
load its data from `system.cdc_generations_v3`.

Fixes scylladb/scylladb#17409

(cherry picked from commit 2111cb01df)
2024-05-28 14:02:00 +00:00
Patryk Jędrzejczak
4d616ccb8c test: util: start_writes_to_cdc_table: add FIXME to increase CL
(cherry picked from commit 388db33dec)
2024-05-28 14:02:00 +00:00
Patryk Jędrzejczak
25d3398b93 test: util: start_writes_to_cdc_table: allow restarting with new cql
This patch allows us to restart writing (to the same table with
CDC enabled) with a new CQL session. It is useful when we want to
continue writing after closing the first CQL session, which
happens during the `reconnect_driver` call. We must stop writing
before calling `reconnect_driver`. If a write started just before
the first CQL session was closed, it would time out on the client.

We rename `finish_and_verify` - `stop_and_verify` is a better
name after introducing `restart`.

(cherry picked from commit 68b6e8e13e)
2024-05-28 14:02:00 +00:00
Patryk Jędrzejczak
ed3ac1eea4 storage_service: update system.cdc_local in topology_state_load
When the node with CDC enabled and with the topology on raft
disabled bootstraps, it reads system.cdc_local for the last
generation. Nodes with both enabled use group0 to get the last
generation.

In the following scenario with a cluster of one node:
1. the node is created with CDC and the topology on raft enabled
2. the user creates table T
3. the node is restarted in the recovery mode
4. the CDC log of T is extended with new entries
5. the node restarts in normal mode
The generation created in step 3 is seen in
system_distributed.cdc_generation_timestamps but not in
system.cdc_generations_v3, thus there are used streams that the CDC
based on raft doesn't know about. Instead of creating a new
generation, the node should use the generation already committed
to group0.

Save the last CDC generation in system.cdc_local while loading
the topology state so that it is visible to CDC not based on raft.

Fixes scylladb/scylladb#17819

(cherry picked from commit 4351eee1f6)
2024-05-28 14:01:59 +00:00
Anna Stuchlik
7229c820cf doc: describe Tablets in ScyllaDB
This commit adds the main description of tablets and their
benefits.
The article can be used as a reference in other places
across the docs where we mention tablets.

(cherry picked from commit b5c006aadf)

Closes scylladb/scylladb#18916
2024-05-28 11:27:53 +02:00
Pavel Emelyanov
67878af591 cql-pytest: Add validation of ALTER KEYSPACE WITH TABLETS
There's a test that checks how ALTER changes the initial tablets value,
but it equips the statement with `replication` parameters because of
limitations that parser used to impose. Now the `tablets` parameters can
come on their own, so add a new test. The old one is kept for
compatibility reasons.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1003391ed6)
2024-05-28 02:07:58 +00:00
Pavel Emelyanov
2dbc555933 cql3: Fix parsing of ALTER KEYSPACE's tablets parameters
When the `WITH` doesn't include the `replication` parameters, the
`tablets` one is ignored, even if it's present in the statement. That's
not great, those two parameter sets are pretty much independent and
should be parsed individually.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a172ef1bdf)
2024-05-28 02:07:58 +00:00
Pavel Emelyanov
3b9c86dcf5 cql3: Remove unused ks_prop_defs/prepare_options() argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 8a612da155)
2024-05-28 02:07:58 +00:00
Raphael S. Carvalho
b6f3891282 test: Fix flakiness in topology_experimental_raft/test_tablets
One source of flakiness is in test_tablet_metadata_propagates_with_schema_changes_in_snapshot_mode
due to gossiper being aborted prematurely, and causing reconnection
storm.

Another is test_tablet_missing_data_repair, which is flaky due to an issue
in the python driver where a session might not reconnect on rolling restart
(tracked by https://github.com/scylladb/python-driver/issues/230)

Refs #15356.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit e7246751b6)
2024-05-27 18:21:21 +00:00
Raphael S. Carvalho
46220bd839 service: Use tablet read selector to determine which replica to account table stats
Since we introduced the ability to revert migrations, we can no longer
rely on ordering of transition stages to determine whether to account
pending or leaving replica. Let's use the read selector instead, which
correctly indicates which replica type holds valid stats.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 551bf9dd58)
2024-05-27 18:21:21 +00:00
Raphael S. Carvalho
55a45e3486 storage_service: Fix race between tablet split and stats retrieval
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. It can either cause incorrect behavior or
crash if id is not available.

It's worked around by feeding the local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (latter should complete pretty fast, so it shouldn't block
the former for any significant time).

Fixes #18085.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit abcc68dbe7)
2024-05-27 18:21:21 +00:00
Michał Jadwiszczak
1dd522edc8 test/auth_cluster/test_raft_service_levels: try to create sl in recovery
(cherry picked from commit af0b6bcc56)
2024-05-27 18:20:36 +00:00
Michał Jadwiszczak
6d655e6766 service/qos/raft_sl_dda: reject changes to service levels in recovery
mode

When a cluster goes into recovery mode and service levels were migrated
to raft, service levels become temporarily read-only.

This commit adds a proper error message in case a user tries to do any
changes.

(cherry picked from commit ee08d7fdad)
2024-05-27 18:20:36 +00:00
Michał Jadwiszczak
54b9fdab03 service/qos/raft_sl_dda: extract raft_sl_dda steps to common function
When setting/dropping a service level using raft data accessor, the same
validation steps are executed (this_shard_id = 0 and guard is present).
To not duplicate the calls in both functions, they can be extracted to a
helper function.

(cherry picked from commit 2b56158d13)
2024-05-27 18:20:36 +00:00
Raphael S. Carvalho
13f8486cd7 replica: Fix tablet's compaction_groups_for_token_range() with unowned range
File-based tablet streaming calls every shard to return data of every
group that intersects with a given range.
After dynamic group allocation, that breaks as the tablet range will
only be present in a single shard, so an exception is thrown causing
migration to halt during streaming phase.
Ideally, only one shard is invoked, but that's out of the scope of this
fix and compaction_groups_for_token_range() should return empty result
if none of the local groups intersect with the range.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit eb8ef38543)

Closes scylladb/scylladb#18859
2024-05-27 15:20:04 +03:00
Kefu Chai
747ffd8776 migration_manager: do not reference moved-away smart pointer
this change is inspired by clang-tidy. it warns like:
```
[752/852] Building CXX object service/CMakeFiles/service.dir/migration_manager.cc.o
Warning: /home/runner/work/scylladb/scylladb/service/migration_manager.cc:891:71: warning: 'view' used after it was moved [bugprone-use-after-move]
  891 |             db.get_notifier().before_create_column_family(*keyspace, *view, mutations, ts);
      |                                                                       ^
/home/runner/work/scylladb/scylladb/service/migration_manager.cc:886:86: note: move occurred here
  886 |             auto mutations = db::schema_tables::make_create_view_mutations(keyspace, std::move(view), ts);
      |                                                                                      ^
```
in which,  `view` is an instance of view_ptr which is a type with the
semantics of shared pointer, it's backed by a member variable of
`seastar::lw_shared_ptr<const schema>`, whose move-ctor actually resets
the original instance. so we are actually accessing the moved-away
pointer in

```c++
db.get_notifier().before_create_column_family(*keyspace, *view, mutations, ts)
```

so, in this change, instead of moving away from `view`, we create
a copy, and pass the copy to
`db::schema_tables::make_create_view_mutations()`. this should be fine,
as the behavior of `db::schema_tables::make_create_view_mutations()`
does not depend on whether the `view` passed to it was moved into it or copied.

the change which introduced this use-after-move was 88a5ddabce

Refs 88a5ddabce
Fixes #18837
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 125464f2d9)

Closes scylladb/scylladb#18873
2024-05-27 15:18:29 +03:00
Anna Stuchlik
a87683c7be doc: remove outdated MV error from Troubleshooting
This commit removes the MV error message, which only
affects older versions of ScyllaDB, from the Troubleshooting section.

Fixes https://github.com/scylladb/scylladb/issues/17205

(cherry picked from commit 92bc8053e2)

Closes scylladb/scylladb#18855
2024-05-27 15:12:22 +03:00
Anna Stuchlik
eff7b0d42d doc: replace Raft-disabled with Raft-enabled procedure
This commit fixes the incorrect Raft-related information on the Handling Cluster Membership Change Failures page
introduced with https://github.com/scylladb/scylladb/pull/17500.

The page describes the procedure for when Raft is disabled. Since Raft for consistent schema management
is enabled and mandatory (cannot be disabled) as of 6.0, this commit adds the procedure for Raft-enabled setups.

(cherry picked from commit 6626d72520)

Closes scylladb/scylladb#18858
2024-05-27 15:11:09 +03:00
David Garcia
7dbcfe5a39 docs: docs: autogenerate metrics
Autogenerates metrics documentation using the scripts/get_description.py script introduced in #17479

docs: add beta
(cherry picked from commit 9eef3d6139)

Closes scylladb/scylladb#18857
2024-05-27 15:10:48 +03:00
Jenkins Promoter
d078bafa00 Update ScyllaDB version to: 6.0.0-rc1 2024-05-23 15:35:32 +03:00
Yaron Kaikov
1b4d5d02ef Update ScyllaDB version to: 6.0.0-rc0 2024-05-22 14:07:45 +03:00
303 changed files with 6942 additions and 2102 deletions

View File

@@ -1 +0,0 @@
**Please replace this line with justification for the backport/\* labels added to this PR**

View File

@@ -1,4 +1,3 @@
import requests
from github import Github
import argparse
import re
@@ -23,36 +22,65 @@ def parser():
'commit, exclusive).')
parser.add_argument('--update_issue', type=bool, default=False, help='Set True to update issues when backport was '
'done')
parser.add_argument('--label', type=str, required=True, help='Label to use')
parser.add_argument('--ref', type=str, required=True, help='PR target branch')
return parser.parse_args()
def add_comment_and_close_pr(pr, comment):
if pr.state == 'open':
pr.create_issue_comment(comment)
pr.edit(state="closed")
def mark_backport_done(repo, ref_pr_number, branch):
pr = repo.get_pull(int(ref_pr_number))
label_to_remove = f'backport/{branch}'
label_to_add = f'{label_to_remove}-done'
current_labels = [label.name for label in pr.get_labels()]
if label_to_remove in current_labels:
pr.remove_from_labels(label_to_remove)
if label_to_add not in current_labels:
pr.add_to_labels(label_to_add)
def main():
# This script is triggered by a push event to either the master branch or a branch named branch-x.y (where x and y represent version numbers). Based on the pushed branch, the script performs the following actions:
# - When ref branch is `master`, it will add the `promoted-to-master` label, which we need later for the auto backport process
# - When ref branch is `branch-x.y` (which means we backported a patch), it will replace in the original PR the `backport/x.y` label with `backport/x.y-done` and will close the backport PR (since GitHub automatically closes only PRs that target the default branch)
args = parser()
pr_pattern = re.compile(r'Closes .*#([0-9]+)')
target_branch = re.search(r'branch-(\d+\.\d+)', args.ref)
g = Github(github_token)
repo = g.get_repo(args.repository, lazy=False)
commits = repo.compare(head=args.commit_after_merge, base=args.commit_before_merge)
processed_prs = set()
# Print commit information
for commit in commits.commits:
print(commit.sha)
print(f'Commit sha is: {commit.sha}')
match = pr_pattern.search(commit.commit.message)
if match:
pr_number = match.group(1)
url = f'https://api.github.com/repos/{args.repository}/issues/{pr_number}/labels'
data = {
"labels": [f'{args.label}']
}
headers = {
"Authorization": f"token {github_token}",
"Accept": "application/vnd.github.v3+json"
}
response = requests.post(url, headers=headers, json=data)
if response.ok:
print(f"Label added successfully to {url}")
pr_number = int(match.group(1))
if pr_number in processed_prs:
continue
if target_branch:
pr = repo.get_pull(pr_number)
branch_name = target_branch[1]
refs_pr = re.findall(r'Refs (?:#|https.*?)(\d+)', pr.body)
if refs_pr:
print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
# 1. change the backport label of the parent PR to note that
# we've merged the corresponding backport PR
# 2. close the backport PR and leave a comment on it to note
# that it has been merged with a certain git commit
ref_pr_number = refs_pr[0]
mark_backport_done(repo, ref_pr_number, branch_name)
comment = f'Closed via {commit.sha}'
add_comment_and_close_pr(pr, comment)
else:
print(f"No label was added to {url}")
print(f'master branch, pr number is: {pr_number}')
pr = repo.get_pull(pr_number)
pr.add_to_labels('promoted-to-master')
processed_prs.add(pr_number)
if __name__ == "__main__":
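The two regular expressions used in the script above can be exercised in isolation. The following is a minimal sketch and not part of the script: the helper names `closed_pr_number` and `referenced_pr_numbers` are hypothetical, and the guard against an empty PR body is an assumption about what the API object may return.

```python
import re

# Patterns copied from the script above.
PR_PATTERN = re.compile(r'Closes .*#([0-9]+)')
REFS_PATTERN = re.compile(r'Refs (?:#|https.*?)(\d+)')


def closed_pr_number(commit_message):
    """Return the PR number from a 'Closes ...#N' line, or None if absent."""
    match = PR_PATTERN.search(commit_message)
    return int(match.group(1)) if match else None


def referenced_pr_numbers(pr_body):
    """Return every number that follows a 'Refs' marker in a PR body."""
    # Assumption: the body can be None for a PR without a description,
    # so fall back to an empty string before calling findall().
    return [int(n) for n in REFS_PATTERN.findall(pr_body or '')]
```

Note that the lazy `https.*?` part stops at the first run of digits after `https`, so a URL that contains digits before the PR number would yield that earlier number instead.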

View File

@@ -4,6 +4,10 @@ on:
push:
branches:
- master
- branch-*.*
env:
DEFAULT_BRANCH: 'master'
jobs:
check-commit:
@@ -15,6 +19,8 @@ jobs:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
fetch-depth: 0 # Fetch all history for all tags and branches
- name: Install dependencies
@@ -23,4 +29,4 @@ jobs:
- name: Run python script
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --label promoted-to-master
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --ref ${{ github.ref }}
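For a push event, `github.ref` carries the fully qualified ref, for example `refs/heads/master` or `refs/heads/branch-6.0`. A minimal sketch of how the script's `branch-(\d+\.\d+)` search could translate that value into label names follows. The function `labels_for_ref` is hypothetical; it only mirrors the decision described in the script's comments and does not talk to GitHub.

```python
import re


def labels_for_ref(ref):
    """Map a pushed ref to (label_to_add, label_to_remove).

    Hypothetical helper: it reproduces the branch check from the script
    and the label names from its comments, nothing more.
    """
    target_branch = re.search(r'branch-(\d+\.\d+)', ref)
    if target_branch:
        version = target_branch.group(1)
        # Backport branch: backport/x.y on the parent PR becomes backport/x.y-done.
        return f'backport/{version}-done', f'backport/{version}'
    # Any ref without a branch-x.y component takes the master path,
    # matching the script's else branch.
    return 'promoted-to-master', None
```

Because the workflow is triggered only for `master` and `branch-*.*`, the fallback path is reached only for `master` in practice.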

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=5.5.0-dev
VERSION=6.0.3
if test -f version
then

View File

@@ -4576,7 +4576,7 @@ static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_vie
// used by default on new Alternator tables. Change this initialization
// to 0 to enable tablets by default, with automatic number of tablets.
std::optional<unsigned> initial_tablets;
if (sp.get_db().local().get_config().check_experimental(db::experimental_features_t::feature::TABLETS)) {
if (sp.get_db().local().get_config().enable_tablets()) {
auto it = tags_map.find(INITIAL_TABLETS_TAG_KEY);
if (it != tags_map.end()) {
// Tag set. If it's a valid number, use it. If not - e.g., it's

View File

@@ -211,7 +211,10 @@ protected:
sstring local_dc = topology.get_datacenter();
std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
if (_gossiper.is_alive(ip)) {
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We need the node to be in normal state. See #19694.
if (_gossiper.is_normal(ip)) {
rjson::push_back(results, rjson::from_string(fmt::to_string(ip)));
}
}

View File

@@ -63,6 +63,28 @@
"paramType":"path"
}
]
},
{
"method":"GET",
"summary":"Read the state of an injection from all shards",
"type":"array",
"items":{
"type":"error_injection_info"
},
"nickname":"read_injection",
"produces":[
"application/json"
],
"parameters":[
{
"name":"injection",
"description":"injection name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
@@ -152,5 +174,39 @@
}
}
}
},
"models":{
"mapper":{
"id":"mapper",
"description":"A key value mapping",
"properties":{
"key":{
"type":"string",
"description":"The key"
},
"value":{
"type":"string",
"description":"The value"
}
}
},
"error_injection_info":{
"id":"error_injection_info",
"description":"Information about an error injection",
"properties":{
"enabled":{
"type":"boolean",
"description":"Is the error injection enabled"
},
"parameters":{
"type":"array",
"items":{
"type":"mapper"
},
"description":"The parameter values"
}
},
"required":["enabled"]
}
}
}
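A client could consume a response shaped like the `error_injection_info` model roughly as follows. This is a sketch only: the payload is a made-up example built from the model (one entry per shard, `parameters` as a list of `mapper` objects), and `summarize_injection` is a hypothetical helper rather than part of any existing client.

```python
import json


def summarize_injection(payload):
    """Return (enabled_on_all_shards, parameters_per_shard) for a JSON payload."""
    shards = json.loads(payload)
    # "enabled" is the only required property, so "parameters" may be absent.
    params = [
        {p["key"]: p["value"] for p in shard.get("parameters", [])}
        for shard in shards
    ]
    return all(shard["enabled"] for shard in shards), params


# Made-up two-shard response, for illustration only.
example = (
    '[{"enabled": true, "parameters": [{"key": "value", "value": "1"}]},'
    ' {"enabled": false}]'
)
print(summarize_injection(example))
```

Flattening each `mapper` list into a dictionary makes per-shard parameters easy to compare, at the cost of dropping duplicate keys if a shard ever reported any.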

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include "compaction_manager.hh"
#include "compaction/compaction_manager.hh"
@@ -153,10 +154,13 @@ void set_compaction_manager(http_context& ctx, routes& r) {
});
cm::get_compaction_history.set(r, [&ctx] (std::unique_ptr<http::request> req) {
std::function<future<>(output_stream<char>&&)> f = [&ctx](output_stream<char>&& s) {
return do_with(output_stream<char>(std::move(s)), true, [&ctx] (output_stream<char>& s, bool& first){
return s.write("[").then([&ctx, &s, &first] {
return ctx.db.local().get_compaction_manager().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable {
std::function<future<>(output_stream<char>&&)> f = [&ctx] (output_stream<char>&& out) -> future<> {
auto s = std::move(out);
bool first = true;
std::exception_ptr ex;
try {
co_await s.write("[");
co_await ctx.db.local().get_compaction_manager().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {
cm::history h;
h.id = fmt::to_string(entry.id);
h.ks = std::move(entry.ks);
@@ -170,18 +174,21 @@ void set_compaction_manager(http_context& ctx, routes& r) {
e.value = it.second;
h.rows_merged.push(std::move(e));
}
auto fut = first ? make_ready_future<>() : s.write(", ");
if (!first) {
co_await s.write(", ");
}
first = false;
return fut.then([&s, h = std::move(h)] {
return formatter::write(s, h);
});
}).then([&s] {
return s.write("]").then([&s] {
return s.close();
});
co_await formatter::write(s, h);
});
});
});
co_await s.write("]");
co_await s.flush();
} catch (...) {
ex = std::current_exception();
}
co_await s.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
return make_ready_future<json::json_return_type>(std::move(f));
});

View File

@@ -64,6 +64,32 @@ void set_error_injection(http_context& ctx, routes& r) {
});
});
hf::read_injection.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {
const sstring injection = req->get_path_param("injection");
std::vector<error_injection_json::error_injection_info> error_injection_infos(smp::count, error_injection_json::error_injection_info{});
co_await smp::invoke_on_all([&] {
auto& info = error_injection_infos[this_shard_id()];
auto& errinj = utils::get_local_injector();
const auto enabled = errinj.is_enabled(injection);
info.enabled = enabled;
if (!enabled) {
return;
}
std::vector<error_injection_json::mapper> parameters;
for (const auto& p : errinj.get_injection_parameters(injection)) {
error_injection_json::mapper param;
param.key = p.first;
param.value = p.second;
parameters.push_back(std::move(param));
}
info.parameters = std::move(parameters);
});
co_return json::json_return_type(error_injection_infos);
});
hf::disable_on_all.set(r, [](std::unique_ptr<request> req) {
auto& errinj = utils::get_local_injector();
return errinj.disable_on_all().then([] {

View File

@@ -61,17 +61,31 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
co_return json_void{};
});
r::get_leader_host.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {
return smp::submit_to(0, [&] {
auto& srv = std::invoke([&] () -> raft::server& {
if (req->query_parameters.contains("group_id")) {
raft::group_id id{utils::UUID{req->get_query_param("group_id")}};
return raft_gr.local().get_server(id);
} else {
return raft_gr.local().group0();
}
if (!req->query_parameters.contains("group_id")) {
const auto leader_id = co_await raft_gr.invoke_on(0, [] (service::raft_group_registry& raft_gr) {
auto& srv = raft_gr.group0();
return srv.current_leader();
});
return json_return_type(srv.current_leader().to_sstring());
co_return json_return_type{leader_id.to_sstring()};
}
const raft::group_id gid{utils::UUID{req->get_query_param("group_id")}};
std::atomic<bool> found_srv{false};
std::atomic<raft::server_id> leader_id = raft::server_id::create_null_id();
co_await raft_gr.invoke_on_all([gid, &found_srv, &leader_id] (service::raft_group_registry& raft_gr) {
if (raft_gr.find_server(gid)) {
found_srv = true;
leader_id = raft_gr.get_server(gid).current_leader();
}
return make_ready_future<>();
});
if (!found_srv) {
throw bad_param_exception{fmt::format("Server for group ID {} not found", gid)};
}
co_return json_return_type(leader_id.load().to_sstring());
});
}

View File

@@ -36,6 +36,7 @@
#include <seastar/http/exception.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/exception.hh>
#include "repair/row_level.hh"
#include "locator/snitch_base.hh"
#include "column_family.hh"
@@ -1685,32 +1686,41 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
ss::get_snapshot_details.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto result = co_await snap_ctl.local().get_snapshot_details();
co_return std::function([res = std::move(result)] (output_stream<char>&& o) -> future<> {
auto result = std::move(res);
std::exception_ptr ex;
output_stream<char> out = std::move(o);
bool first = true;
try {
auto result = std::move(res);
bool first = true;
co_await out.write("[");
for (auto& [name, details] : result) {
if (!first) {
co_await out.write(", ");
co_await out.write("[");
for (auto& [name, details] : result) {
if (!first) {
co_await out.write(", ");
}
std::vector<ss::snapshot> snapshot;
for (auto& cf : details) {
ss::snapshot snp;
snp.ks = cf.ks;
snp.cf = cf.cf;
snp.live = cf.details.live;
snp.total = cf.details.total;
snapshot.push_back(std::move(snp));
}
ss::snapshots all_snapshots;
all_snapshots.key = name;
all_snapshots.value = std::move(snapshot);
co_await all_snapshots.write(out);
first = false;
}
std::vector<ss::snapshot> snapshot;
for (auto& cf : details) {
ss::snapshot snp;
snp.ks = cf.ks;
snp.cf = cf.cf;
snp.live = cf.details.live;
snp.total = cf.details.total;
snapshot.push_back(std::move(snp));
}
ss::snapshots all_snapshots;
all_snapshots.key = name;
all_snapshots.value = std::move(snapshot);
co_await all_snapshots.write(out);
first = false;
co_await out.write("]");
co_await out.flush();
} catch (...) {
ex = std::current_exception();
}
co_await out.write("]");
co_await out.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
});
});

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/exception.hh>
#include <seastar/http/exception.hh>
#include "task_manager.hh"
@@ -23,6 +24,8 @@ namespace tm = httpd::task_manager_json;
using namespace json;
using namespace seastar::httpd;
using task_variant = std::variant<tasks::task_manager::foreign_task_ptr, tasks::task_manager::task::task_essentials>;
inline bool filter_tasks(tasks::task_manager::task_ptr task, std::unordered_map<sstring, sstring>& query_params) {
return (!query_params.contains("keyspace") || query_params["keyspace"] == task->get_status().keyspace) &&
(!query_params.contains("table") || query_params["table"] == task->get_status().table);
@@ -102,13 +105,14 @@ future<full_task_status> retrieve_status(const tasks::task_manager::foreign_task
s.module = task->get_module_name();
s.progress.completed = progress.completed;
s.progress.total = progress.total;
std::vector<std::string> ct{task->get_children().size()};
boost::transform(task->get_children(), ct.begin(), [] (const auto& child) {
std::vector<std::string> ct = co_await task->get_children().map_each_task<std::string>([] (const tasks::task_manager::foreign_task_ptr& child) {
return child->id().to_sstring();
}, [] (const tasks::task_manager::task::task_essentials& child) {
return child.task_status.id.to_sstring();
});
s.children_ids = std::move(ct);
co_return s;
}
};
void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>& tm, db::config& cfg) {
tm::get_modules.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
@@ -138,19 +142,28 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
std::exception_ptr ex;
try {
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
}
}
co_await s.write("]");
co_await s.flush();
} catch (...) {
ex = std::current_exception();
}
co_await s.write("]");
co_await s.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
co_return std::move(f);
});
@@ -179,7 +192,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
if (!task->is_abortable()) {
co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));
}
co_await task->abort();
task->abort();
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
@@ -193,7 +206,6 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
try {
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
task->unregister_task();
// done() is called only because we want the task to be complete before getting its status.
// The future should be ignored here as the result does not matter.
f.ignore_ready_future();
@@ -210,7 +222,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
tm::get_task_status_recursively.set(r, [&_tm = tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& tm = _tm;
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
std::queue<tasks::task_manager::foreign_task_ptr> q;
std::queue<task_variant> q;
utils::chunked_vector<full_task_status> res;
tasks::task_manager::foreign_task_ptr task;
@@ -230,10 +242,33 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
q.push(co_await task.copy()); // Task cannot be moved since we need it to be alive during the whole loop execution.
while (!q.empty()) {
auto& current = q.front();
res.push_back(co_await retrieve_status(current));
for (auto& child: current->get_children()) {
q.push(co_await child.copy());
}
co_await std::visit(overloaded_functor {
[&] (const tasks::task_manager::foreign_task_ptr& task) -> future<> {
res.push_back(co_await retrieve_status(task));
co_await task->get_children().for_each_task([&q] (const tasks::task_manager::foreign_task_ptr& child) -> future<> {
q.push(co_await child.copy());
}, [&] (const tasks::task_manager::task::task_essentials& child) {
q.push(child);
return make_ready_future();
});
},
[&] (const tasks::task_manager::task::task_essentials& task) -> future<> {
res.push_back(full_task_status{
.task_status = task.task_status,
.type = task.type,
.progress = task.task_progress,
.parent_id = task.parent_id,
.abortable = task.abortable,
.children_ids = boost::copy_range<std::vector<std::string>>(task.failed_children | boost::adaptors::transformed([] (auto& child) {
return child.task_status.id.to_sstring();
}))
});
for (auto& child: task.failed_children) {
q.push(child);
}
return make_ready_future();
}
}, current);
q.pop();
}

View File

@@ -89,14 +89,13 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man
std::string error = fail ? it->second : "";
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) {
co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
if (fail) {
test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
co_await test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
test_task.finish();
co_await test_task.finish();
}
return make_ready_future<>();
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());

View File

@@ -24,7 +24,6 @@
#include "service/raft/group0_state_machine.hh"
#include "timeout_config.hh"
#include "db/config.hh"
#include "db/system_auth_keyspace.hh"
#include "utils/error_injection.hh"
namespace auth {
@@ -41,14 +40,14 @@ constinit const std::string_view AUTH_PACKAGE_NAME("org.apache.cassandra.auth.")
static logging::logger auth_log("auth");
bool legacy_mode(cql3::query_processor& qp) {
return qp.auth_version < db::system_auth_keyspace::version_t::v2;
return qp.auth_version < db::system_keyspace::auth_version_t::v2;
}
std::string_view get_auth_ks_name(cql3::query_processor& qp) {
if (legacy_mode(qp)) {
return meta::legacy::AUTH_KS;
}
return db::system_auth_keyspace::NAME;
return db::system_keyspace::NAME;
}
// Func must support being invoked more than once.
@@ -123,7 +122,7 @@ static future<> announce_mutations_with_guard(
::service::raft_group0_client& group0_client,
std::vector<canonical_mutation> muts,
::service::group0_guard group0_guard,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_cmd = group0_client.prepare_command(
::service::write_mutations{
@@ -139,7 +138,7 @@ future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
start_operation_func_t start_operation_func,
std::function<mutations_generator(api::timestamp_type& t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
// account for the command's overhead; it's better to use a smaller threshold than to constantly bounce off the limit
size_t memory_threshold = group0_client.max_command_size() * 0.75;
@@ -190,7 +189,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_guard = co_await group0_client.start_operation(as, timeout);
auto timestamp = group0_guard.write_timestamp();

View File

@@ -84,7 +84,7 @@ future<> create_legacy_metadata_table_if_missing(
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
// Use this function when you need to perform a read before write on a single guard, or if
// you have more than one mutation and could exceed the single command size limit.
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source*)>;
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source&)>;
using mutations_generator = coroutine::experimental::generator<mutation>;
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
@@ -93,7 +93,7 @@ future<> announce_mutations_with_batching(
// function here
start_operation_func_t start_operation_func,
std::function<mutations_generator(api::timestamp_type& t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
@@ -102,7 +102,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
}

View File

@@ -9,7 +9,7 @@
*/
#include "auth/default_authorizer.hh"
#include "db/system_auth_keyspace.hh"
#include "db/system_keyspace.hh"
extern "C" {
#include <crypt.h>
@@ -203,7 +203,7 @@ default_authorizer::modify(
cql3::query_processor::cache_internal::no).discard_result();
}
co_return co_await announce_mutations(_qp, _group0_client, query,
{permissions::to_strings(set), sstring(role_name), resource.name()}, &_as, ::service::raft_timeout{});
{permissions::to_strings(set), sstring(role_name), resource.name()}, _as, ::service::raft_timeout{});
}
@@ -256,7 +256,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name) {
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, _as, ::service::raft_timeout{});
}
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
@@ -346,9 +346,9 @@ future<> default_authorizer::revoke_all(const resource& resource) {
const auto timeout = ::service::raft_timeout{};
co_await announce_mutations_with_batching(
_group0_client,
[this, timeout](abort_source* as) { return _group0_client.start_operation(as, timeout); },
[this, timeout](abort_source& as) { return _group0_client.start_operation(as, timeout); },
std::move(gen),
&_as,
_as,
timeout);
} catch (exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", name, e);

View File

@@ -136,7 +136,7 @@ future<> password_authenticator::create_default_if_missing() {
plogger.info("Created default superuser authentication record.");
} else {
co_await announce_mutations(_qp, _group0_client, query,
{salted_pwd, _superuser}, &_as, ::service::raft_timeout{});
{salted_pwd, _superuser}, _as, ::service::raft_timeout{});
plogger.info("Created default superuser authentication record.");
}
}
@@ -271,7 +271,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, &_as, ::service::raft_timeout{});
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, _as, ::service::raft_timeout{});
}
}
@@ -294,7 +294,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, &_as, ::service::raft_timeout{});
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}, _as, ::service::raft_timeout{});
}
}
@@ -311,7 +311,7 @@ future<> password_authenticator::drop(std::string_view name) {
{sstring(name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(name)}, _as, ::service::raft_timeout{});
}
}

View File

@@ -28,7 +28,6 @@
#include "db/config.hh"
#include "db/consistency_level_type.hh"
#include "db/functions/function_name.hh"
#include "db/system_auth_keyspace.hh"
#include "log.hh"
#include "schema/schema_fwd.hh"
#include <seastar/core/future.hh>
@@ -644,7 +643,7 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
}
auto muts = co_await qp.get_mutations_internal(
format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_auth_keyspace::NAME,
db::system_keyspace::NAME,
cf_name,
col_names_str,
val_binders_str),
@@ -659,12 +658,12 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
}
}
co_yield co_await sys_ks.make_auth_version_mutation(ts,
db::system_auth_keyspace::version_t::v2);
db::system_keyspace::auth_version_t::v2);
};
co_await announce_mutations_with_batching(g0,
start_operation_func,
std::move(gen),
&as,
as,
std::nullopt);
}

View File

@@ -190,7 +190,7 @@ future<> standard_role_manager::create_default_role_if_missing() {
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, _as, ::service::raft_timeout{});
}
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {
@@ -285,7 +285,7 @@ future<> standard_role_manager::create_or_replace(std::string_view role_name, co
{sstring(role_name), c.is_superuser, c.can_login},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name), c.is_superuser, c.can_login}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name), c.is_superuser, c.can_login}, _as, ::service::raft_timeout{});
}
}
@@ -333,7 +333,7 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
return announce_mutations(_qp, _group0_client, std::move(query), {sstring(role_name)}, &_as, ::service::raft_timeout{});
return announce_mutations(_qp, _group0_client, std::move(query), {sstring(role_name)}, _as, ::service::raft_timeout{});
}
});
}
@@ -383,7 +383,7 @@ future<> standard_role_manager::drop(std::string_view role_name) {
co_await _qp.execute_internal(query, {sstring(role_name)},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, _as, ::service::raft_timeout{});
}
};
// Finally, delete the role itself.
@@ -401,7 +401,7 @@ future<> standard_role_manager::drop(std::string_view role_name) {
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {sstring(role_name)}, _as, ::service::raft_timeout{});
}
};
@@ -434,7 +434,7 @@ standard_role_manager::modify_membership(
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, std::move(query),
{role_set{sstring(role_name)}, sstring(grantee_name)}, &_as, ::service::raft_timeout{});
{role_set{sstring(role_name)}, sstring(grantee_name)}, _as, ::service::raft_timeout{});
}
};
@@ -453,7 +453,7 @@ standard_role_manager::modify_membership(
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_return co_await announce_mutations(_qp, _group0_client, insert_query,
{sstring(role_name), sstring(grantee_name)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(grantee_name)}, _as, ::service::raft_timeout{});
}
}
@@ -470,7 +470,7 @@ standard_role_manager::modify_membership(
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_return co_await announce_mutations(_qp, _group0_client, delete_query,
{sstring(role_name), sstring(grantee_name)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(grantee_name)}, _as, ::service::raft_timeout{});
}
}
}
@@ -644,7 +644,7 @@ future<> standard_role_manager::set_attribute(std::string_view role_name, std::s
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, _as, ::service::raft_timeout{});
}
}
@@ -659,7 +659,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query,
{sstring(role_name), sstring(attribute_name)}, &_as, ::service::raft_timeout{});
{sstring(role_name), sstring(attribute_name)}, _as, ::service::raft_timeout{});
}
}
}

View File

@@ -489,7 +489,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -514,7 +514,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -629,7 +629,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -855,12 +855,11 @@ void compaction_task_executor::finish_compaction(state finish_state) noexcept {
_compaction_state.compaction_done.signal();
}
future<> compaction_task_executor::abort(abort_source& as) noexcept {
void compaction_task_executor::abort(abort_source& as) noexcept {
if (!as.abort_requested()) {
as.request_abort();
stop_compaction("user requested abort");
}
return make_ready_future();
}
void compaction_task_executor::stop_compaction(sstring reason) noexcept {
@@ -1181,7 +1180,7 @@ public:
, regular_compaction_task_impl(mgr._task_manager_module, tasks::task_id::create_random_id(), mgr._task_manager_module->new_sequence_number(), t.schema()->ks_name(), t.schema()->cf_name(), "", tasks::task_id::create_null_id())
{}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -1352,7 +1351,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:
@@ -1379,13 +1378,20 @@ private:
}));
};
auto get_next_job = [&] () -> std::optional<sstables::compaction_descriptor> {
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), sstables::reshape_mode::strict);
return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
auto get_next_job = [&] () -> future<std::optional<sstables::compaction_descriptor>> {
auto candidates = get_reshape_candidates();
if (candidates.empty()) {
co_return std::nullopt;
}
// all sstables added to maintenance set share the same underlying storage.
auto& storage = candidates.front()->get_storage();
sstables::reshape_config cfg = co_await sstables::make_reshape_config(storage, sstables::reshape_mode::strict);
auto desc = t.get_compaction_strategy().get_reshaping_job(get_reshape_candidates(), t.schema(), cfg);
co_return desc.sstables.size() ? std::make_optional(std::move(desc)) : std::nullopt;
};
std::exception_ptr err;
while (auto desc = get_next_job()) {
while (auto desc = co_await get_next_job()) {
auto compacting = compacting_sstable_registration(_cm, _cm.get_compaction_state(&t), desc->sstables);
auto on_replace = compacting.update_on_sstable_replacement();
@@ -1755,7 +1761,7 @@ public:
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
}
virtual future<> abort() noexcept override {
virtual void abort() noexcept override {
return compaction_task_executor::abort(_as);
}
protected:

View File

@@ -594,7 +594,7 @@ public:
return _compaction_data.abort.abort_requested();
}
future<> abort(abort_source& as) noexcept;
void abort(abort_source& as) noexcept;
void stop_compaction(sstring reason) noexcept;

View File

@@ -83,7 +83,7 @@ reader_consumer_v2 compaction_strategy_impl::make_interposer_consumer(const muta
}
compaction_descriptor
compaction_strategy_impl::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
compaction_strategy_impl::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
return compaction_descriptor();
}
@@ -728,8 +728,8 @@ compaction_backlog_tracker compaction_strategy::make_backlog_tracker() const {
}
sstables::compaction_descriptor
compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, mode);
compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
return _compaction_strategy_impl->get_reshaping_job(std::move(input), schema, cfg);
}
uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr schema) const {
@@ -767,6 +767,13 @@ compaction_strategy make_compaction_strategy(compaction_strategy_type strategy,
return compaction_strategy(std::move(impl));
}
future<reshape_config> make_reshape_config(const sstables::storage& storage, reshape_mode mode) {
co_return sstables::reshape_config{
.mode = mode,
.free_storage_space = co_await storage.free_space() / smp::count,
};
}
}
namespace compaction {
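
The `make_reshape_config` helper above divides the free space reported by the storage by `smp::count`, so each shard budgets against its own share of the disk. A minimal standalone sketch of that arithmetic follows, assuming plain integers in place of the asynchronous `storage.free_space()` call and of `smp::count`; the function name `make_reshape_config_sketch` and both of its numeric parameters are hypothetical and not part of the change.

```cpp
#include <cstdint>

enum class reshape_mode { strict, relaxed };

// Same two fields as the struct introduced in the diff (const dropped for brevity).
struct reshape_config {
    reshape_mode mode;
    uint64_t free_storage_space;
};

// Hypothetical synchronous stand-in: total_free_bytes replaces storage.free_space(),
// shard_count replaces smp::count. Assumes shard_count > 0.
reshape_config make_reshape_config_sketch(uint64_t total_free_bytes, unsigned shard_count, reshape_mode mode) {
    return reshape_config{ mode, total_free_bytes / shard_count };
}
```

With 800 free bytes and 8 shards, each shard would see a budget of 100 bytes.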


@@ -30,6 +30,7 @@ class compaction_strategy_impl;
class sstable;
class sstable_set;
struct compaction_descriptor;
class storage;
class compaction_strategy {
::shared_ptr<compaction_strategy_impl> _compaction_strategy_impl;
@@ -121,11 +122,13 @@ public:
//
// The caller should also pass a maximum number of SSTables which is the maximum amount of
// SSTables that can be added into a single job.
compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const;
compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const;
};
// Creates a compaction_strategy object from one of the strategies available.
compaction_strategy make_compaction_strategy(compaction_strategy_type strategy, const std::map<sstring, sstring>& options);
future<reshape_config> make_reshape_config(const sstables::storage& storage, reshape_mode mode);
}


@@ -76,6 +76,6 @@ public:
return false;
}
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const;
};
}


@@ -8,6 +8,8 @@
#pragma once
#include <cstdint>
namespace sstables {
enum class compaction_strategy_type {
@@ -18,4 +20,10 @@ enum class compaction_strategy_type {
};
enum class reshape_mode { strict, relaxed };
struct reshape_config {
reshape_mode mode;
const uint64_t free_storage_space;
};
}


@@ -146,7 +146,8 @@ int64_t leveled_compaction_strategy::estimated_pending_compactions(table_state&
}
compaction_descriptor
leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::array<std::vector<shared_sstable>, leveled_manifest::MAX_LEVELS> level_info;
auto is_disjoint = [schema] (const std::vector<shared_sstable>& sstables, unsigned tolerance) -> std::tuple<bool, unsigned> {
@@ -203,7 +204,7 @@ leveled_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input
if (level_info[0].size() > offstrategy_threshold) {
size_tiered_compaction_strategy stcs(_stcs_options);
return stcs.get_reshaping_job(std::move(level_info[0]), schema, mode);
return stcs.get_reshaping_job(std::move(level_info[0]), schema, cfg);
}
for (unsigned level = leveled_manifest::MAX_LEVELS - 1; level > 0; --level) {


@@ -74,7 +74,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
};
}


@@ -298,8 +298,9 @@ size_tiered_compaction_strategy::most_interesting_bucket(const std::vector<sstab
}
compaction_descriptor
size_tiered_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const
size_tiered_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const
{
auto mode = cfg.mode;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));


@@ -96,7 +96,7 @@ public:
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
friend class ::size_tiered_backlog_tracker;
};


@@ -595,28 +595,35 @@ future<> table_reshaping_compaction_task_impl::run() {
future<> shard_reshaping_compaction_task_impl::run() {
auto& table = _db.local().find_column_family(_status.keyspace, _status.table);
auto holder = table.async_gate().hold();
tasks::task_info info{_status.id, _status.shard};
std::unordered_map<size_t, std::unordered_set<sstables::shared_sstable>> sstables_grouped_by_compaction_group;
std::unordered_map<compaction::table_state*, std::unordered_set<sstables::shared_sstable>> sstables_grouped_by_compaction_group;
for (auto& sstable : _dir.get_unshared_local_sstables()) {
auto compaction_group_id = table.get_compaction_group_id_for_sstable(sstable);
sstables_grouped_by_compaction_group[compaction_group_id].insert(sstable);
auto& t = table.table_state_for_sstable(sstable);
sstables_grouped_by_compaction_group[&t].insert(sstable);
}
// reshape sstables individually within the compaction groups
for (auto& sstables_in_cg : sstables_grouped_by_compaction_group) {
co_await reshape_compaction_group(sstables_in_cg.first, sstables_in_cg.second, table, info);
co_await reshape_compaction_group(*sstables_in_cg.first, sstables_in_cg.second, table, info);
}
}
future<> shard_reshaping_compaction_task_impl::reshape_compaction_group(size_t compaction_group_id, std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info) {
future<> shard_reshaping_compaction_task_impl::reshape_compaction_group(compaction::table_state& t, std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info) {
while (true) {
auto reshape_candidates = boost::copy_range<std::vector<sstables::shared_sstable>>(sstables_in_cg
| boost::adaptors::filtered([&filter = _filter] (const auto& sst) {
return filter(sst);
}));
auto desc = table.get_compaction_strategy().get_reshaping_job(std::move(reshape_candidates), table.schema(), _mode);
if (reshape_candidates.empty()) {
break;
}
// all sstables were found in the same sstable_directory instance, so they share the same underlying storage.
auto& storage = reshape_candidates.front()->get_storage();
auto cfg = co_await sstables::make_reshape_config(storage, _mode);
auto desc = table.get_compaction_strategy().get_reshaping_job(std::move(reshape_candidates), table.schema(), cfg);
if (desc.sstables.empty()) {
break;
}
@@ -635,7 +642,6 @@ future<> shard_reshaping_compaction_task_impl::reshape_compaction_group(size_t c
desc.creator = _creator;
try {
auto& t = table.get_compaction_group(compaction_group_id)->as_table_state();
co_await table.get_compaction_manager().run_custom_job(t, sstables::compaction_type::Reshape, "Reshape compaction", [&dir = _dir, sstlist = std::move(sstlist), desc = std::move(desc), &sstables_in_cg, &t] (sstables::compaction_data& info, sstables::compaction_progress_monitor& progress_monitor) mutable -> future<> {
sstables::compaction_result result = co_await sstables::compact_sstables(std::move(desc), info, t, progress_monitor);
// update the sstables_in_cg set with new sstables and remove the reshaped ones


@@ -606,7 +606,7 @@ private:
std::function<bool (const sstables::shared_sstable&)> _filter;
uint64_t& _total_shard_size;
future<> reshape_compaction_group(size_t compaction_group_id, std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info);
future<> reshape_compaction_group(compaction::table_state& t, std::unordered_set<sstables::shared_sstable>& sstables_in_cg, replica::column_family& table, const tasks::task_info& info);
public:
shard_reshaping_compaction_task_impl(tasks::task_manager::module_ptr module,
std::string keyspace,


@@ -226,12 +226,14 @@ reader_consumer_v2 time_window_compaction_strategy::make_interposer_consumer(con
}
compaction_descriptor
time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const {
time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::vector<shared_sstable> single_window;
std::vector<shared_sstable> multi_window;
size_t offstrategy_threshold = std::max(schema->min_compaction_threshold(), 4);
size_t max_sstables = std::max(schema->max_compaction_threshold(), int(offstrategy_threshold));
const uint64_t target_job_size = cfg.free_storage_space * reshape_target_space_overhead;
if (mode == reshape_mode::relaxed) {
offstrategy_threshold = max_sstables;
@@ -263,22 +265,40 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
multi_window.size(), !multi_window.empty() && sstable_set_overlapping_count(schema, multi_window) == 0,
single_window.size(), !single_window.empty() && sstable_set_overlapping_count(schema, single_window) == 0);
auto need_trimming = [max_sstables, schema, &is_disjoint] (const std::vector<shared_sstable>& ssts) {
// All sstables can be compacted at once if they're disjoint, given that partitioned set
// will incrementally open sstables which translates into bounded memory usage.
return ssts.size() > max_sstables && !is_disjoint(ssts);
auto get_job_size = [] (const std::vector<shared_sstable>& ssts) {
return boost::accumulate(ssts | boost::adaptors::transformed(std::mem_fn(&sstable::bytes_on_disk)), uint64_t(0));
};
// Targets a space overhead of 10%. All disjoint sstables can be compacted together as long as they won't
// cause an overhead above target. Otherwise, the job targets a maximum of #max_threshold sstables.
auto need_trimming = [&] (const std::vector<shared_sstable>& ssts, const uint64_t job_size, bool is_disjoint) {
const size_t min_sstables = 2;
auto is_above_target_size = job_size > target_job_size;
return (ssts.size() > max_sstables && !is_disjoint) ||
(ssts.size() > min_sstables && is_above_target_size);
};
auto maybe_trim_job = [&need_trimming] (std::vector<shared_sstable>& ssts, uint64_t job_size, bool is_disjoint) {
while (need_trimming(ssts, job_size, is_disjoint)) {
auto sst = ssts.back();
ssts.pop_back();
job_size -= sst->bytes_on_disk();
}
};
if (!multi_window.empty()) {
auto disjoint = is_disjoint(multi_window);
auto job_size = get_job_size(multi_window);
// Everything that spans multiple windows will need reshaping
if (need_trimming(multi_window)) {
if (need_trimming(multi_window, job_size, disjoint)) {
// When trimming, let's keep sstables with overlapping time window, so as to reduce write amplification.
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
boost::partial_sort(multi_window, multi_window.begin() + max_sstables, [](const shared_sstable &a, const shared_sstable &b) {
return a->get_stats_metadata().max_timestamp < b->get_stats_metadata().max_timestamp;
});
multi_window.resize(max_sstables);
maybe_trim_job(multi_window, job_size, disjoint);
}
compaction_descriptor desc(std::move(multi_window));
desc.options = compaction_type_options::make_reshape();
@@ -297,15 +317,17 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
std::copy(ssts.begin(), ssts.end(), std::back_inserter(single_window));
continue;
}
// reuse STCS reshape logic which will only compact similar-sized files, to increase overall efficiency
// when reshaping time buckets containing a huge amount of files
auto desc = size_tiered_compaction_strategy(_stcs_options).get_reshaping_job(std::move(ssts), schema, mode);
auto desc = size_tiered_compaction_strategy(_stcs_options).get_reshaping_job(std::move(ssts), schema, cfg);
if (!desc.sstables.empty()) {
return desc;
}
}
}
if (!single_window.empty()) {
maybe_trim_job(single_window, get_job_size(single_window), all_disjoint);
compaction_descriptor desc(std::move(single_window));
desc.options = compaction_type_options::make_reshape();
return desc;
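
The trimming rule introduced in this hunk can be sketched in isolation. The version below is an illustration under stated assumptions, not the code from the change: `fake_sstable`, `job_size_of` and `trim_job` are hypothetical names, sstables are reduced to their on-disk size, and the 10% target is computed with integer division instead of the `0.1f` constant used in the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an sstable: only its on-disk size matters here.
struct fake_sstable {
    uint64_t bytes_on_disk;
};

// Sum of the on-disk sizes of all candidates (plays the role of get_job_size).
uint64_t job_size_of(const std::vector<fake_sstable>& ssts) {
    return std::accumulate(ssts.begin(), ssts.end(), uint64_t(0),
        [] (uint64_t acc, const fake_sstable& s) { return acc + s.bytes_on_disk; });
}

// Drops candidates from the back while either limit is exceeded:
//  - more than max_sstables inputs that are not disjoint, or
//  - more than two inputs whose total size exceeds 10% of the free space.
void trim_job(std::vector<fake_sstable>& ssts, uint64_t free_space, size_t max_sstables, bool is_disjoint) {
    const size_t min_sstables = 2;
    const uint64_t target_job_size = free_space / 10; // integer approximation of the 10% target
    uint64_t job_size = job_size_of(ssts);
    auto need_trimming = [&] {
        return (ssts.size() > max_sstables && !is_disjoint)
            || (ssts.size() > min_sstables && job_size > target_job_size);
    };
    while (need_trimming()) {
        job_size -= ssts.back().bytes_on_disk;
        ssts.pop_back();
    }
}
```

Because the size check only applies above two inputs, a job is never trimmed below two sstables on size grounds alone, so reshape can still make progress when free space is very low.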


@@ -76,6 +76,7 @@ public:
// To prevent an explosion in the number of sstables we cap it.
// Better co-locate some windows into the same sstables than OOM.
static constexpr uint64_t max_data_segregation_window_count = 100;
static constexpr float reshape_target_space_overhead = 0.1f;
using bucket_t = std::vector<shared_sstable>;
enum class bucket_compaction_mode { none, size_tiered, major };
@@ -168,7 +169,7 @@ public:
return true;
}
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_mode mode) const override;
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
};
}


@@ -618,3 +618,6 @@ maintenance_socket: ignore
# replication_strategy_warn_list:
# - SimpleStrategy
# replication_strategy_fail_list:
# This enables tablets on newly created keyspaces
enable_tablets: true


@@ -1015,7 +1015,6 @@ scylla_core = (['message/messaging_service.cc',
'cql3/result_set.cc',
'cql3/prepare_context.cc',
'db/consistency_level.cc',
'db/system_auth_keyspace.cc',
'db/system_keyspace.cc',
'db/virtual_table.cc',
'db/virtual_tables.cc',
@@ -1358,6 +1357,7 @@ scylla_perfs = ['test/perf/perf_alternator.cc',
'test/perf/perf_simple_query.cc',
'test/perf/perf_sstable.cc',
'test/perf/perf_tablets.cc',
'test/perf/tablet_load_balancing.cc',
'test/perf/perf.cc',
'test/lib/alternator_test_env.cc',
'test/lib/cql_test_env.cc',
@@ -1753,33 +1753,32 @@ def configure_seastar(build_dir, mode, mode_config):
def configure_abseil(build_dir, mode, mode_config):
# for sanitizer cflags
seastar_flags = query_seastar_flags(f'{outdir}/{mode}/seastar/seastar.pc',
mode_config['build_seastar_shared_libs'],
args.staticcxx)
seastar_cflags = seastar_flags['seastar_cflags']
abseil_cflags = mode_config['lib_cflags']
cxx_flags = mode_config['cxxflags']
if '-DSANITIZE' in cxx_flags:
abseil_cflags += ' -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr'
abseil_build_dir = os.path.join(build_dir, mode, 'abseil')
abseil_cflags = seastar_cflags + ' ' + modes[mode]['cxx_ld_flags']
# We want to "undo" coverage for abseil if we have it enabled, as we are not
# interested in the coverage of the abseil library. these flags were previously
# added to cxx_ld_flags
if args.coverage:
for flag in COVERAGE_INST_FLAGS:
abseil_cflags = abseil_cflags.replace(f' {flag}', '')
cxx_flags = cxx_flags.replace(f' {flag}', '')
cxx_flags += ' ' + abseil_cflags.strip()
cmake_mode = mode_config['cmake_build_type']
abseil_cmake_args = [
'-DCMAKE_BUILD_TYPE={}'.format(cmake_mode),
'-DCMAKE_INSTALL_PREFIX={}'.format(build_dir + '/inst'), # just to avoid a warning from absl
'-DCMAKE_C_COMPILER={}'.format(args.cc),
'-DCMAKE_CXX_COMPILER={}'.format(args.cxx),
'-DCMAKE_CXX_FLAGS_{}={}'.format(cmake_mode.upper(), abseil_cflags),
'-DCMAKE_CXX_FLAGS_{}={}'.format(cmake_mode.upper(), cxx_flags),
'-DCMAKE_EXPORT_COMPILE_COMMANDS=ON',
'-DCMAKE_CXX_STANDARD=20',
'-DABSL_PROPAGATE_CXX_STD=ON',
]
abseil_build_dir = os.path.join(build_dir, mode, 'abseil')
abseil_cmd = ['cmake', '-G', 'Ninja', real_relpath('abseil', abseil_build_dir)] + abseil_cmake_args
os.makedirs(abseil_build_dir, exist_ok=True)


@@ -14,9 +14,11 @@
#include <seastar/coroutine/parallel_for_each.hh>
#include "service/storage_proxy.hh"
#include "service/topology_mutation.hh"
#include "service/migration_manager.hh"
#include "service/forward_service.hh"
#include "service/raft/raft_group0_client.hh"
#include "service/storage_service.hh"
#include "cql3/CqlParser.hpp"
#include "cql3/statements/batch_statement.hh"
#include "cql3/statements/modification_statement.hh"
@@ -42,16 +44,22 @@ const sstring query_processor::CQL_VERSION = "3.3.1";
const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono::minutes(60);
struct query_processor::remote {
remote(service::migration_manager& mm, service::forward_service& fwd, service::raft_group0_client& group0_client)
: mm(mm), forwarder(fwd), group0_client(group0_client) {}
remote(service::migration_manager& mm, service::forward_service& fwd,
service::storage_service& ss, service::raft_group0_client& group0_client)
: mm(mm), forwarder(fwd), ss(ss), group0_client(group0_client) {}
service::migration_manager& mm;
service::forward_service& forwarder;
service::storage_service& ss;
service::raft_group0_client& group0_client;
seastar::gate gate;
};
bool query_processor::topology_global_queue_empty() {
return remote().first.get().ss.topology_global_queue_empty();
}
static service::query_state query_state_for_internal_call() {
return {service::client_state::for_internal_calls(), empty_service_permit()};
}
@@ -498,8 +506,8 @@ query_processor::~query_processor() {
}
void query_processor::start_remote(service::migration_manager& mm, service::forward_service& forwarder,
service::raft_group0_client& group0_client) {
_remote = std::make_unique<struct remote>(mm, forwarder, group0_client);
service::storage_service& ss, service::raft_group0_client& group0_client) {
_remote = std::make_unique<struct remote>(mm, forwarder, ss, group0_client);
}
future<> query_processor::stop_remote() {
@@ -835,7 +843,7 @@ bool query_processor::has_more_results(cql3::internal_query_state& state) const
future<> query_processor::for_each_cql_result(
cql3::internal_query_state& state,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)>&& f) {
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)> f) {
do {
auto msg = co_await execute_paged_internal(state);
for (auto& row : *msg) {
@@ -1018,16 +1026,29 @@ query_processor::execute_schema_statement(const statements::schema_altering_stat
cql3::cql_warnings_vec warnings;
auto request_id = guard->new_group0_state_id();
stmt.global_req_id = request_id;
auto [ret, m, cql_warnings] = co_await stmt.prepare_schema_mutations(*this, options, guard->write_timestamp());
warnings = std::move(cql_warnings);
ce = std::move(ret);
if (!m.empty()) {
auto description = format("CQL DDL statement: \"{}\"", stmt.raw_cql_statement);
co_await remote_.get().mm.announce(std::move(m), std::move(*guard), description);
if (ce && ce->target == cql_transport::event::schema_change::target_type::TABLET_KEYSPACE) {
co_await remote_.get().mm.announce<service::topology_change>(std::move(m), std::move(*guard), description);
// TODO: eliminate timeout from alter ks statement on the cqlsh/driver side
auto error = co_await remote_.get().ss.wait_for_topology_request_completion(request_id);
co_await remote_.get().ss.wait_for_topology_not_busy();
if (!error.empty()) {
log.error("CQL statement \"{}\" with topology request_id \"{}\" failed with error: \"{}\"", stmt.raw_cql_statement, request_id, error);
throw exceptions::request_execution_exception(exceptions::exception_code::INVALID, error);
}
} else {
co_await remote_.get().mm.announce<service::schema_change>(std::move(m), std::move(*guard), description);
}
}
ce = std::move(ret);
// If an IF [NOT] EXISTS clause was used, this may not result in an actual schema change. To avoid doing
// extra work in the drivers to handle schema changes, we return an empty message in this case. (CASSANDRA-7600)
::shared_ptr<messages::result_message> result;
@@ -1158,14 +1179,14 @@ future<> query_processor::query_internal(
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
auto query_state = create_paged_state(query_string, cl, values, page_size);
co_return co_await for_each_cql_result(query_state, std::move(f));
}
future<> query_processor::query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
}


@@ -31,7 +31,6 @@
#include "lang/wasm.hh"
#include "service/raft/raft_group0_client.hh"
#include "types/types.hh"
#include "db/system_auth_keyspace.hh"
namespace service {
@@ -151,7 +150,8 @@ public:
~query_processor();
void start_remote(service::migration_manager&, service::forward_service&, service::raft_group0_client&);
void start_remote(service::migration_manager&, service::forward_service&,
service::storage_service& ss, service::raft_group0_client&);
future<> stop_remote();
data_dictionary::database db() {
@@ -176,7 +176,7 @@ public:
wasm::manager& wasm() { return _wasm; }
db::system_auth_keyspace::version_t auth_version;
db::system_keyspace::auth_version_t auth_version;
statements::prepared_statement::checked_weak_ptr get_prepared(const std::optional<auth::authenticated_user>& user, const prepared_cache_key_type& key) {
if (user) {
@@ -315,7 +315,7 @@ public:
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
/*
* \brief iterate over all cql results using paging
@@ -330,7 +330,7 @@ public:
*/
future<> query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
class cache_internal_tag;
using cache_internal = bool_class<cache_internal_tag>;
@@ -461,6 +461,8 @@ public:
void reset_cache();
bool topology_global_queue_empty();
private:
// Keep the holder until you stop using the `remote` services.
std::pair<std::reference_wrapper<remote>, gate::holder> remote();
@@ -499,7 +501,7 @@ private:
*/
future<> for_each_cql_result(
cql3::internal_query_state& state,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
/*!
* \brief check, based on the state if there are additional results


@@ -8,11 +8,15 @@
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#include <boost/range/algorithm.hpp>
#include <fmt/format.h>
#include <seastar/core/coroutine.hh>
#include <stdexcept>
#include "alter_keyspace_statement.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
#include "service/topology_mutation.hh"
#include "db/system_keyspace.hh"
#include "data_dictionary/data_dictionary.hh"
#include "data_dictionary/keyspace_metadata.hh"
@@ -21,6 +25,8 @@
#include "create_keyspace_statement.hh"
#include "gms/feature_service.hh"
static logging::logger mylogger("alter_keyspace");
bool is_system_keyspace(std::string_view keyspace);
cql3::statements::alter_keyspace_statement::alter_keyspace_statement(sstring name, ::shared_ptr<ks_prop_defs> attrs)
@@ -36,6 +42,20 @@ future<> cql3::statements::alter_keyspace_statement::check_access(query_processo
return state.has_keyspace_access(_name, auth::permission::ALTER);
}
static bool validate_rf_difference(const std::string_view curr_rf, const std::string_view new_rf) {
auto to_number = [] (const std::string_view rf) {
int result;
// We assume the passed string view represents a valid decimal number,
// so we don't need the error code.
(void) std::from_chars(rf.begin(), rf.end(), result);
return result;
};
// We want to ensure that each DC's RF is going to change by at most 1
// because in that case the old and new quorums must overlap.
return std::abs(to_number(curr_rf) - to_number(new_rf)) <= 1;
}
void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, const service::client_state& state) const {
auto tmp = _name;
std::transform(tmp.begin(), tmp.end(), tmp.begin(), ::tolower);
@@ -61,6 +81,17 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
}
auto new_ks = _attrs->as_ks_metadata_update(ks.metadata(), *qp.proxy().get_token_metadata_ptr(), qp.proxy().features());
if (ks.get_replication_strategy().uses_tablets()) {
const std::map<sstring, sstring>& current_rfs = ks.metadata()->strategy_options();
for (const auto& [new_dc, new_rf] : _attrs->get_replication_options()) {
auto it = current_rfs.find(new_dc);
if (it != current_rfs.end() && !validate_rf_difference(it->second, new_rf)) {
throw exceptions::invalid_request_exception("Cannot modify replication factor of any DC by more than 1 at a time.");
}
}
}
locator::replication_strategy_params params(new_ks->strategy_options(), new_ks->initial_tablets());
auto new_rs = locator::abstract_replication_strategy::create_replication_strategy(new_ks->strategy_name(), params);
if (new_rs->is_per_table() != ks.get_replication_strategy().is_per_table()) {
@@ -83,20 +114,63 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>>
cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_processor& qp, const query_options&, api::timestamp_type ts) const {
using namespace cql_transport;
try {
auto old_ksm = qp.db().find_keyspace(_name).metadata();
event::schema_change::target_type target_type = event::schema_change::target_type::KEYSPACE;
auto ks = qp.db().find_keyspace(_name);
auto ks_md = ks.metadata();
const auto& tm = *qp.proxy().get_token_metadata_ptr();
const auto& feat = qp.proxy().features();
auto ks_md_update = _attrs->as_ks_metadata_update(ks_md, tm, feat);
std::vector<mutation> muts;
std::vector<sstring> warnings;
auto ks_options = _attrs->get_all_options_flattened(feat);
auto m = service::prepare_keyspace_update_announcement(qp.db().real_database(), _attrs->as_ks_metadata_update(old_ksm, tm, feat), ts);
// we only want to run the tablets path if there are actually any tablets changes, not only schema changes
if (ks.get_replication_strategy().uses_tablets() && !_attrs->get_replication_options().empty()) {
if (!qp.topology_global_queue_empty()) {
return make_exception_future<std::tuple<::shared_ptr<::cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>>(
exceptions::invalid_request_exception("Another global topology request is ongoing, please retry."));
}
if (_attrs->get_replication_options().contains(ks_prop_defs::REPLICATION_FACTOR_KEY)) {
return make_exception_future<std::tuple<::shared_ptr<::cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>>(
exceptions::invalid_request_exception("'replication_factor' tag is not allowed when executing ALTER KEYSPACE with tablets, please list the DCs explicitly"));
}
qp.db().real_database().validate_keyspace_update(*ks_md_update);
service::topology_mutation_builder builder(ts);
builder.set_global_topology_request(service::global_topology_request::keyspace_rf_change);
builder.set_global_topology_request_id(this->global_req_id);
builder.set_new_keyspace_rf_change_data(_name, ks_options);
service::topology_change change{{builder.build()}};
auto topo_schema = qp.db().find_schema(db::system_keyspace::NAME, db::system_keyspace::TOPOLOGY);
boost::transform(change.mutations, std::back_inserter(muts), [topo_schema] (const canonical_mutation& cm) {
return cm.to_mutation(topo_schema);
});
service::topology_request_tracking_mutation_builder rtbuilder{utils::UUID{this->global_req_id}};
rtbuilder.set("done", false)
.set("start_time", db_clock::now());
service::topology_change req_change{{rtbuilder.build()}};
auto topo_req_schema = qp.db().find_schema(db::system_keyspace::NAME, db::system_keyspace::TOPOLOGY_REQUESTS);
boost::transform(req_change.mutations, std::back_inserter(muts), [topo_req_schema] (const canonical_mutation& cm) {
return cm.to_mutation(topo_req_schema);
});
target_type = event::schema_change::target_type::TABLET_KEYSPACE;
} else {
auto schema_mutations = service::prepare_keyspace_update_announcement(qp.db().real_database(), ks_md_update, ts);
muts.insert(muts.begin(), schema_mutations.begin(), schema_mutations.end());
}
using namespace cql_transport;
auto ret = ::make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,
event::schema_change::target_type::KEYSPACE,
target_type,
keyspace());
return make_ready_future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>>(std::make_tuple(std::move(ret), std::move(m), std::vector<sstring>()));
return make_ready_future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>>(std::make_tuple(std::move(ret), std::move(muts), warnings));
} catch (data_dictionary::no_such_keyspace& e) {
return make_exception_future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>>(exceptions::invalid_request_exception("Unknown keyspace " + _name));
}
@@ -107,7 +181,6 @@ cql3::statements::alter_keyspace_statement::prepare(data_dictionary::database db
return std::make_unique<prepared_statement>(make_shared<alter_keyspace_statement>(*this));
}
static logging::logger mylogger("alter_keyspace");
future<::shared_ptr<cql_transport::messages::result_message>>
cql3::statements::alter_keyspace_statement::execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const {
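
The comment in `validate_rf_difference` states that changing a DC's replication factor by at most 1 keeps the old and new quorums overlapping. The sketch below illustrates that reasoning under assumptions that are not in the change itself: quorum is taken as `rf / 2 + 1`, the smaller replica set is assumed to be a subset of the larger one, and replication factors are assumed to be at least 1. The names `parse_rf`, `rf_change_allowed`, `quorum` and `quorums_overlap` are hypothetical. Unlike the helper in the diff, which assumes valid input and ignores the error code, this version rejects strings that are not decimal numbers.

```cpp
#include <algorithm>
#include <charconv>
#include <cstdlib>
#include <optional>
#include <string_view>
#include <system_error>

// Parses a decimal replication factor; returns nullopt on malformed input.
std::optional<int> parse_rf(std::string_view s) {
    int value = 0;
    const char* end = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(s.data(), end, value);
    if (ec != std::errc{} || ptr != end) {
        return std::nullopt;
    }
    return value;
}

// True when both values parse and differ by at most 1.
bool rf_change_allowed(std::string_view curr_rf, std::string_view new_rf) {
    auto a = parse_rf(curr_rf);
    auto b = parse_rf(new_rf);
    return a && b && std::abs(*a - *b) <= 1;
}

// Assumed quorum definition: a strict majority of the replicas.
int quorum(int rf) {
    return rf / 2 + 1;
}

// With nested replica sets, any old quorum and any new quorum share a node
// when their sizes add up to more than the larger replica set.
bool quorums_overlap(int old_rf, int new_rf) {
    return quorum(old_rf) + quorum(new_rf) > std::max(old_rf, new_rf);
}
```

Going from RF 3 to RF 4 gives quorums of 2 and 3 over 4 replicas, which must intersect; going from 3 to 5 gives quorums of 2 and 3 over 5 replicas, which can miss each other.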


@@ -24,7 +24,6 @@ static std::map<sstring, sstring> prepare_options(
const sstring& strategy_class,
const locator::token_metadata& tm,
std::map<sstring, sstring> options,
std::optional<unsigned>& initial_tablets,
const std::map<sstring, sstring>& old_options = {}) {
options.erase(ks_prop_defs::REPLICATION_STRATEGY_CLASS_KEY);
@@ -72,6 +71,35 @@ static std::map<sstring, sstring> prepare_options(
return options;
}
ks_prop_defs::ks_prop_defs(std::map<sstring, sstring> options) {
std::map<sstring, sstring> replication_opts, storage_opts, tablets_opts, durable_writes_opts;
auto read_property_into = [] (auto& map, const sstring& name, const sstring& value, const sstring& tag) {
map[name.substr(sstring(tag).size() + 1)] = value;
};
for (const auto& [name, value] : options) {
if (name.starts_with(KW_DURABLE_WRITES)) {
read_property_into(durable_writes_opts, name, value, KW_DURABLE_WRITES);
} else if (name.starts_with(KW_REPLICATION)) {
read_property_into(replication_opts, name, value, KW_REPLICATION);
} else if (name.starts_with(KW_TABLETS)) {
read_property_into(tablets_opts, name, value, KW_TABLETS);
} else if (name.starts_with(KW_STORAGE)) {
read_property_into(storage_opts, name, value, KW_STORAGE);
}
}
if (!replication_opts.empty())
add_property(KW_REPLICATION, replication_opts);
if (!storage_opts.empty())
add_property(KW_STORAGE, storage_opts);
if (!tablets_opts.empty())
add_property(KW_TABLETS, tablets_opts);
if (!durable_writes_opts.empty())
add_property(KW_DURABLE_WRITES, durable_writes_opts.begin()->second);
}
void ks_prop_defs::validate() {
// Skip validation if the strategy class is already set as it means we've already
// prepared (and redoing it would set strategyClass back to null, which we don't want)
@@ -110,38 +138,37 @@ data_dictionary::storage_options ks_prop_defs::get_storage_options() const {
return opts;
}
std::optional<unsigned> ks_prop_defs::get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const {
ks_prop_defs::init_tablets_options ks_prop_defs::get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const {
// FIXME -- this should be ignored somehow else
init_tablets_options ret{ .enabled = false, .specified_count = std::nullopt };
if (locator::abstract_replication_strategy::to_qualified_class_name(strategy_class) != "org.apache.cassandra.locator.NetworkTopologyStrategy") {
return std::nullopt;
return ret;
}
auto tablets_options = get_map(KW_TABLETS);
if (!tablets_options) {
return enabled_by_default ? std::optional<unsigned>(0) : std::nullopt;
return enabled_by_default ? init_tablets_options{ .enabled = true } : ret;
}
std::optional<unsigned> ret;
auto it = tablets_options->find("enabled");
if (it != tablets_options->end()) {
auto enabled = it->second;
tablets_options->erase(it);
if (enabled == "true") {
ret = 0; // even if 'initial' is not set, it'll start with auto-detection
ret = init_tablets_options{ .enabled = true, .specified_count = 0 }; // even if 'initial' is not set, it'll start with auto-detection
} else if (enabled == "false") {
assert(!ret.has_value());
assert(!ret.enabled);
return ret;
} else {
throw exceptions::configuration_exception(sstring("Tablets enabled value must be true or false; found ") + it->second);
throw exceptions::configuration_exception(sstring("Tablets enabled value must be true or false; found: ") + enabled);
}
}
it = tablets_options->find("initial");
if (it != tablets_options->end()) {
try {
ret = std::stol(it->second);
ret = init_tablets_options{ .enabled = true, .specified_count = std::stol(it->second)};
} catch (...) {
throw exceptions::configuration_exception(sstring("Initial tablets value should be numeric; found ") + it->second);
}
@@ -159,29 +186,55 @@ std::optional<sstring> ks_prop_defs::get_replication_strategy_class() const {
return _strategy_class;
}
bool ks_prop_defs::get_durable_writes() const {
return get_boolean(KW_DURABLE_WRITES, true);
}
std::map<sstring, sstring> ks_prop_defs::get_all_options_flattened(const gms::feature_service& feat) const {
std::map<sstring, sstring> all_options;
auto ingest_flattened_options = [&all_options](const std::map<sstring, sstring>& options, const sstring& prefix) {
for (auto& option: options) {
all_options[prefix + ":" + option.first] = option.second;
}
};
ingest_flattened_options(get_replication_options(), KW_REPLICATION);
ingest_flattened_options(get_storage_options().to_map(), KW_STORAGE);
ingest_flattened_options(get_map(KW_TABLETS).value_or(std::map<sstring, sstring>{}), KW_TABLETS);
ingest_flattened_options({{sstring(KW_DURABLE_WRITES), to_sstring(get_boolean(KW_DURABLE_WRITES, true))}}, KW_DURABLE_WRITES);
return all_options;
}
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(sstring ks_name, const locator::token_metadata& tm, const gms::feature_service& feat) {
auto sc = get_replication_strategy_class().value();
std::optional<unsigned> initial_tablets = get_initial_tablets(sc, feat.tablets);
auto options = prepare_options(sc, tm, get_replication_options(), initial_tablets);
auto initial_tablets = get_initial_tablets(sc, feat.tablets);
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0
if (initial_tablets.enabled && !initial_tablets.specified_count) {
initial_tablets.specified_count = 0;
}
auto options = prepare_options(sc, tm, get_replication_options());
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
std::move(options), initial_tablets.specified_count, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_update(lw_shared_ptr<data_dictionary::keyspace_metadata> old, const locator::token_metadata& tm, const gms::feature_service& feat) {
std::map<sstring, sstring> options;
const auto& old_options = old->strategy_options();
auto sc = get_replication_strategy_class();
std::optional<unsigned> initial_tablets;
if (sc) {
initial_tablets = get_initial_tablets(*sc, old->initial_tablets().has_value());
options = prepare_options(*sc, tm, get_replication_options(), initial_tablets, old_options);
options = prepare_options(*sc, tm, get_replication_options(), old_options);
} else {
sc = old->strategy_name();
options = old_options;
initial_tablets = old->initial_tablets();
}
auto initial_tablets = get_initial_tablets(*sc, old->initial_tablets().has_value());
// if tablets options have not been specified, inherit them if it's tablets-enabled KS
if (initial_tablets.enabled && !initial_tablets.specified_count) {
initial_tablets.specified_count = old->initial_tablets();
}
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets.specified_count, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}
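
The prefix scheme used by `get_all_options_flattened` (keys such as `replication:class` or `tablets:initial`) and undone by the `ks_prop_defs(std::map<sstring, sstring>)` constructor can be tried in isolation. The following is only a minimal sketch under stated assumptions: `std::string` and `std::map` stand in for `sstring`, the helper names `flatten_options` and `unflatten_options` are hypothetical, and unlike the constructor above it matches on the full `tag:` prefix rather than on the tag alone.

```cpp
#include <cassert>
#include <map>
#include <string>

using option_map = std::map<std::string, std::string>;

// Hypothetical helper: copy every entry of `options` into `out`,
// prefixing each key with "<tag>:".
void flatten_options(option_map& out, const option_map& options, const std::string& tag) {
    assert(!tag.empty()); // an empty tag would make the prefix ambiguous
    for (const auto& [name, value] : options) {
        out[tag + ":" + name] = value;
    }
}

// Hypothetical helper: collect the entries of `flat` whose key starts
// with "<tag>:" and strip that prefix from the returned keys.
option_map unflatten_options(const option_map& flat, const std::string& tag) {
    assert(!tag.empty());
    option_map result;
    const std::string prefix = tag + ":";
    for (const auto& [name, value] : flat) {
        if (name.compare(0, prefix.size(), prefix) == 0) {
            result[name.substr(prefix.size())] = value;
        }
    }
    return result;
}
```

Flattening two groups and then asking for one tag returns only that group with its original keys, which is the round trip the constructor relies on.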

View File

@@ -49,11 +49,21 @@ public:
private:
std::optional<sstring> _strategy_class;
public:
struct init_tablets_options {
bool enabled;
std::optional<unsigned> specified_count;
};
ks_prop_defs() = default;
explicit ks_prop_defs(std::map<sstring, sstring> options);
void validate();
std::map<sstring, sstring> get_replication_options() const;
std::optional<sstring> get_replication_strategy_class() const;
std::optional<unsigned> get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const;
init_tablets_options get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const;
data_dictionary::storage_options get_storage_options() const;
bool get_durable_writes() const;
std::map<sstring, sstring> get_all_options_flattened(const gms::feature_service& feat) const;
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata(sstring ks_name, const locator::token_metadata&, const gms::feature_service&);
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata_update(lw_shared_ptr<data_dictionary::keyspace_metadata> old, const locator::token_metadata&, const gms::feature_service&);
};
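
The point of `init_tablets_options` is that "tablets enabled" and "tablet count explicitly given" are now separate facts, which a bare `std::optional<unsigned>` could not express. A rough sketch of that distinction follows. It is not the Scylla implementation: `parse_tablets_options` is a hypothetical name, `std::string` replaces `sstring`, `std::invalid_argument` replaces `exceptions::configuration_exception`, the range check on `initial` is an added assumption, and the replication-strategy check is left out.

```cpp
#include <limits>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

using tablets_map = std::map<std::string, std::string>;

struct init_tablets_options {
    bool enabled = false;
    std::optional<unsigned> specified_count;
};

// Hypothetical stand-in for the parsing done in get_initial_tablets().
init_tablets_options parse_tablets_options(const std::optional<tablets_map>& opts, bool enabled_by_default) {
    if (!opts) {
        // No tablets options at all: follow the default, with no explicit count.
        return {enabled_by_default, std::nullopt};
    }
    init_tablets_options ret;
    if (auto it = opts->find("enabled"); it != opts->end()) {
        if (it->second == "true") {
            ret = {true, 0u}; // 0 means "auto-detect the initial count"
        } else if (it->second == "false") {
            return ret;
        } else {
            throw std::invalid_argument("Tablets enabled value must be true or false; found: " + it->second);
        }
    }
    if (auto it = opts->find("initial"); it != opts->end()) {
        long count = 0;
        try {
            count = std::stol(it->second);
        } catch (const std::exception&) {
            throw std::invalid_argument("Initial tablets value should be numeric; found " + it->second);
        }
        // Assumption of this sketch: reject values that do not fit in unsigned.
        if (count < 0 || static_cast<unsigned long>(count) > std::numeric_limits<unsigned>::max()) {
            throw std::invalid_argument("Initial tablets value out of range: " + it->second);
        }
        ret = {true, static_cast<unsigned>(count)};
    }
    return ret;
}
```

A caller can then tell "enabled by default, no count given" (`enabled == true`, `specified_count` empty) apart from "enabled with `initial` set", which is what lets `as_ks_metadata_update` inherit the old count only in the former case.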

View File

@@ -63,6 +63,7 @@ protected:
public:
virtual future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>> prepare_schema_mutations(query_processor& qp, const query_options& options, api::timestamp_type) const = 0;
mutable utils::UUID global_req_id;
};
}

View File

@@ -2032,7 +2032,10 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
&& !restrictions->need_filtering() // No filtering
&& group_by_cell_indices->empty() // No GROUP BY
&& db.get_config().enable_parallelized_aggregation()
&& !is_local_table();
&& !is_local_table()
&& !( // Do not parallelize the request if it's single partition read
restrictions->partition_key_restrictions_is_all_eq()
&& restrictions->partition_key_restrictions_size() == schema->partition_key_size());
};
if (_parameters->is_prune_materialized_view()) {

View File

@@ -390,6 +390,12 @@ struct fmt::formatter<data_dictionary::user_types_metadata> {
};
auto fmt::formatter<data_dictionary::keyspace_metadata>::format(const data_dictionary::keyspace_metadata& m, fmt::format_context& ctx) const -> decltype(ctx.out()) {
return fmt::format_to(ctx.out(), "KSMetaData{{name={}, strategyClass={}, strategyOptions={}, cfMetaData={}, durable_writes={}, userTypes={}}}",
m.name(), m.strategy_name(), m.strategy_options(), m.cf_meta_data(), m.durable_writes(), m.user_types());
fmt::format_to(ctx.out(), "KSMetaData{{name={}, strategyClass={}, strategyOptions={}, cfMetaData={}, durable_writes={}, tablets=",
m.name(), m.strategy_name(), m.strategy_options(), m.cf_meta_data(), m.durable_writes());
if (m.initial_tablets()) {
fmt::format_to(ctx.out(), "{{\"initial\":{}}}", m.initial_tablets().value());
} else {
fmt::format_to(ctx.out(), "{{\"enabled\":false}}");
}
return fmt::format_to(ctx.out(), ", userTypes={}}}", m.user_types());
}
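
In the new formatter, the doubled braces (`{{` and `}}`) are fmt's escape for literal braces, so the tablets part is printed as a small JSON-like object. What that fragment looks like can be reproduced without fmt. This is a sketch using only the standard library, and `describe_tablets` is a hypothetical helper name, not part of the change:

```cpp
#include <optional>
#include <string>

// Hypothetical helper mirroring the tablets fragment of the formatter:
// an explicit count prints as {"initial":N}, no count as {"enabled":false}.
std::string describe_tablets(const std::optional<unsigned>& initial_tablets) {
    if (initial_tablets) {
        return "{\"initial\":" + std::to_string(*initial_tablets) + "}";
    }
    return "{\"enabled\":false}";
}
```

Note that a keyspace whose initial count is 0 (auto-detection) still prints as `{"initial":0}`, since the optional holds a value.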

View File

@@ -2,7 +2,6 @@ add_library(db STATIC)
target_sources(db
PRIVATE
consistency_level.cc
system_auth_keyspace.cc
system_keyspace.cc
virtual_table.cc
virtual_tables.cc

View File

@@ -133,7 +133,7 @@ future<> db::batchlog_manager::stop() {
}
future<size_t> db::batchlog_manager::count_all_batches() const {
sstring query = format("SELECT count(*) FROM {}.{}", system_keyspace::NAME, system_keyspace::BATCHLOG);
sstring query = format("SELECT count(*) FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG);
return _qp.execute_internal(query, cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> rs) {
return size_t(rs->one().get_as<int64_t>("count"));
});
@@ -152,26 +152,26 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
auto limiter = make_lw_shared<utils::rate_limiter>(throttle);
auto batch = [this, limiter](const cql3::untyped_result_set::row& row) {
auto batch = [this, limiter](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto timeout = get_batch_log_timeout();
if (db_clock::now() < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
return make_ready_future<>();
return make_ready_future<stop_iteration>(stop_iteration::no);
}
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Skipping logged batch because of unknown version");
return make_ready_future<>();
return make_ready_future<stop_iteration>(stop_iteration::no);
}
auto version = row.get_as<int32_t>("version");
if (version != netw::messaging_service::current_version) {
blogger.warn("Skipping logged batch because of incorrect version");
return make_ready_future<>();
return make_ready_future<stop_iteration>(stop_iteration::no);
}
auto data = row.get_blob("data");
@@ -253,49 +253,20 @@ future<> db::batchlog_manager::replay_all_failed_batches() {
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
});
}).then([] { return make_ready_future<stop_iteration>(stop_iteration::no); });
};
return seastar::with_gate(_gate, [this, batch = std::move(batch)] {
return seastar::with_gate(_gate, [this, batch = std::move(batch)] () mutable {
blogger.debug("Started replayAllFailedBatches (cpu {})", this_shard_id());
typedef ::shared_ptr<cql3::untyped_result_set> page_ptr;
sstring query = format("SELECT id, data, written_at, version FROM {}.{} LIMIT {:d}", system_keyspace::NAME, system_keyspace::BATCHLOG, page_size);
return _qp.execute_internal(query, cql3::query_processor::cache_internal::yes).then([this, batch = std::move(batch)](page_ptr page) {
return do_with(std::move(page), [this, batch = std::move(batch)](page_ptr & page) mutable {
return repeat([this, &page, batch = std::move(batch)]() mutable {
if (page->empty()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
auto id = page->back().get_as<utils::UUID>("id");
return parallel_for_each(*page, batch).then([this, &page, id]() {
if (page->size() < page_size) {
return make_ready_future<stop_iteration>(stop_iteration::yes); // we've exhausted the batchlog, next query would be empty.
}
sstring query = format("SELECT id, data, written_at, version FROM {}.{} WHERE token(id) > token(?) LIMIT {:d}",
system_keyspace::NAME,
system_keyspace::BATCHLOG,
page_size);
return _qp.execute_internal(query, {id}, cql3::query_processor::cache_internal::yes).then([&page](auto res) {
page = std::move(res);
return make_ready_future<stop_iteration>(stop_iteration::no);
});
});
});
});
}).then([] {
// TODO FIXME : cleanup()
#if 0
ColumnFamilyStore cfs = Keyspace.open(SystemKeyspace.NAME).getColumnFamilyStore(SystemKeyspace.BATCHLOG);
cfs.forceBlockingFlush();
Collection<Descriptor> descriptors = new ArrayList<>();
for (SSTableReader sstr : cfs.getSSTables())
descriptors.add(sstr.descriptor);
if (!descriptors.isEmpty()) // don't pollute the logs if there is nothing to compact.
CompactionManager.instance.submitUserDefined(cfs, descriptors, Integer.MAX_VALUE).get();
#endif
return _qp.query_internal(
format("SELECT id, data, written_at, version FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG),
db::consistency_level::ONE,
{},
page_size,
std::move(batch)).then([this] {
// Replaying batches could have generated tombstones, flush to disk,
// where they can be compacted away.
return replica::database::flush_table_on_all_shards(_qp.proxy().get_db(), system_keyspace::NAME, system_keyspace::BATCHLOG);
}).then([] {
blogger.debug("Finished replayAllFailedBatches");
});

View File

@@ -991,7 +991,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, unspooled_dirty_soft_limit(this, "unspooled_dirty_soft_limit", value_status::Used, 0.6, "Soft limit of unspooled dirty memory expressed as a portion of the hard limit.")
, sstable_summary_ratio(this, "sstable_summary_ratio", value_status::Used, 0.0005, "Enforces that 1 byte of summary is written for every N (2000 by default)"
"bytes written to data file. Value must be between 0 and 1.")
, components_memory_reclaim_threshold(this, "components_memory_reclaim_threshold", liveness::LiveUpdate, value_status::Used, .1, "Ratio of available memory for all in-memory components of SSTables in a shard beyond which the memory will be reclaimed from components until it falls back under the threshold. Currently, this limit is only enforced for bloom filters.")
, components_memory_reclaim_threshold(this, "components_memory_reclaim_threshold", liveness::LiveUpdate, value_status::Used, .2, "Ratio of available memory for all in-memory components of SSTables in a shard beyond which the memory will be reclaimed from components until it falls back under the threshold. Currently, this limit is only enforced for bloom filters.")
, large_memory_allocation_warning_threshold(this, "large_memory_allocation_warning_threshold", value_status::Used, size_t(1) << 20, "Warn about memory allocations above this size; set to zero to disable.")
, enable_deprecated_partitioners(this, "enable_deprecated_partitioners", value_status::Used, false, "Enable the byteordered and random partitioners. These partitioners are deprecated and will be removed in a future version.")
, enable_keyspace_column_family_metrics(this, "enable_keyspace_column_family_metrics", value_status::Used, false, "Enable per keyspace and per column family metrics reporting.")
@@ -1031,6 +1031,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Start serializing reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, reader_concurrency_semaphore_kill_limit_multiplier(this, "reader_concurrency_semaphore_kill_limit_multiplier", liveness::LiveUpdate, value_status::Used, 4,
"Start killing reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, reader_concurrency_semaphore_cpu_concurrency(this, "reader_concurrency_semaphore_cpu_concurrency", liveness::LiveUpdate, value_status::Used, 1,
"Admit new reads while there are less than this number of requests that need CPU.")
, twcs_max_window_count(this, "twcs_max_window_count", liveness::LiveUpdate, value_status::Used, 50,
"The maximum number of compaction windows allowed when making use of TimeWindowCompactionStrategy. A setting of 0 effectively disables the restriction.")
, initial_sstable_loading_concurrency(this, "initial_sstable_loading_concurrency", value_status::Used, 4u,
@@ -1157,6 +1159,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, service_levels_interval(this, "service_levels_interval_ms", liveness::LiveUpdate, value_status::Used, 10000, "Controls how often service levels module polls configuration table")
, error_injections_at_startup(this, "error_injections_at_startup", error_injection_value_status, {}, "List of error injections that should be enabled on startup.")
, topology_barrier_stall_detector_threshold_seconds(this, "topology_barrier_stall_detector_threshold_seconds", value_status::Used, 2, "Report sites blocking topology barrier if it takes longer than this.")
, enable_tablets(this, "enable_tablets", value_status::Used, false, "Enable tablets for newly created keyspaces")
, default_log_level(this, "default_log_level", value_status::Used)
, logger_log_level(this, "logger_log_level", value_status::Used)
, log_to_stdout(this, "log_to_stdout", value_status::Used)
@@ -1347,7 +1350,7 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"consistent-topology-changes", feature::UNUSED},
{"broadcast-tables", feature::BROADCAST_TABLES},
{"keyspace-storage-options", feature::KEYSPACE_STORAGE_OPTIONS},
{"tablets", feature::TABLETS},
{"tablets", feature::UNUSED},
};
}

View File

@@ -111,7 +111,6 @@ struct experimental_features_t {
ALTERNATOR_STREAMS,
BROADCAST_TABLES,
KEYSPACE_STORAGE_OPTIONS,
TABLETS,
};
static std::map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();
@@ -390,6 +389,7 @@ public:
named_value<uint64_t> max_memory_for_unlimited_query_hard_limit;
named_value<uint32_t> reader_concurrency_semaphore_serialize_limit_multiplier;
named_value<uint32_t> reader_concurrency_semaphore_kill_limit_multiplier;
named_value<uint32_t> reader_concurrency_semaphore_cpu_concurrency;
named_value<uint32_t> twcs_max_window_count;
named_value<unsigned> initial_sstable_loading_concurrency;
named_value<bool> enable_3_1_0_compatibility_mode;
@@ -495,6 +495,7 @@ public:
named_value<std::vector<error_injection_at_startup>> error_injections_at_startup;
named_value<double> topology_barrier_stall_detector_threshold_seconds;
named_value<bool> enable_tablets;
static const sstring default_tls_priority;
private:

View File

@@ -278,7 +278,7 @@ sync_point::shard_rps manager::calculate_current_sync_point(std::span<const gms:
auto it = _ep_managers.find(*hid);
if (it != _ep_managers.end()) {
const hint_endpoint_manager& ep_man = it->second;
rps[addr] = ep_man.last_written_replay_position();
rps[*hid] = ep_man.last_written_replay_position();
}
}
@@ -316,10 +316,14 @@ future<> manager::wait_for_sync_point(abort_source& as, const sync_point::shard_
hid_rps.reserve(rps.size());
for (const auto& [addr, rp] : rps) {
const auto maybe_hid = tmptr->get_host_id_if_known(addr);
// Ignore the IPs we cannot map.
if (maybe_hid) [[likely]] {
hid_rps.emplace(*maybe_hid, rp);
if (std::holds_alternative<gms::inet_address>(addr)) {
const auto maybe_hid = tmptr->get_host_id_if_known(std::get<gms::inet_address>(addr));
// Ignore the IPs we cannot map.
if (maybe_hid) [[likely]] {
hid_rps.emplace(*maybe_hid, rp);
}
} else {
hid_rps.emplace(std::get<locator::host_id>(addr), rp);
}
}
@@ -409,6 +413,12 @@ bool manager::have_ep_manager(const std::variant<locator::host_id, gms::inet_add
bool manager::store_hint(endpoint_id host_id, gms::inet_address ip, schema_ptr s, lw_shared_ptr<const frozen_mutation> fm,
tracing::trace_state_ptr tr_state) noexcept
{
if (utils::get_local_injector().enter("reject_incoming_hints")) {
manager_logger.debug("Rejecting a hint to {} / {} due to an error injection", host_id, ip);
++_stats.dropped;
return false;
}
if (stopping() || draining_all() || !started() || !can_hint_for(host_id)) {
manager_logger.trace("Can't store a hint to {}", host_id);
++_stats.dropped;
@@ -554,10 +564,16 @@ future<> manager::change_host_filter(host_filter filter) {
const auto maybe_host_id_and_ip = std::invoke([&] () -> std::optional<pair_type> {
try {
locator::host_id_or_endpoint hid_or_ep{de.name};
if (hid_or_ep.has_host_id()) {
// If hinted handoff is host-ID-based, hint directories representing IP addresses must've
// been created by mistake and they're invalid. The same for pre-host-ID hinted handoff
// -- hint directories representing host IDs are NOT valid.
if (hid_or_ep.has_host_id() && _uses_host_id) {
return std::make_optional(pair_type{hid_or_ep.id(), hid_or_ep.resolve_endpoint(*tmptr)});
} else {
} else if (hid_or_ep.has_endpoint() && !_uses_host_id) {
return std::make_optional(pair_type{hid_or_ep.resolve_id(*tmptr), hid_or_ep.endpoint()});
} else {
return std::nullopt;
}
} catch (...) {
return std::nullopt;
@@ -565,6 +581,8 @@ future<> manager::change_host_filter(host_filter filter) {
});
if (!maybe_host_id_and_ip) {
manager_logger.warn("Encountered a hint directory of invalid name while changing the host filter: {}. "
"Hints stored in it won't be replayed.", de.name);
co_return;
}
@@ -618,12 +636,12 @@ bool manager::check_dc_for(endpoint_id ep) const noexcept {
}
}
future<> manager::drain_for(endpoint_id endpoint) noexcept {
future<> manager::drain_for(endpoint_id host_id, gms::inet_address ip) noexcept {
if (!started() || stopping() || draining_all()) {
co_return;
}
manager_logger.trace("on_leave_cluster: {} is removed/decommissioned", endpoint);
manager_logger.trace("on_leave_cluster: {} is removed/decommissioned", host_id);
const auto holder = seastar::gate::holder{_draining_eps_gate};
// As long as we hold on to this lock, no migration of hinted handoff to host IDs
@@ -642,7 +660,7 @@ future<> manager::drain_for(endpoint_id endpoint) noexcept {
std::exception_ptr eptr = nullptr;
if (_proxy.local_db().get_token_metadata().get_topology().is_me(endpoint)) {
if (_proxy.local_db().get_token_metadata().get_topology().is_me(host_id)) {
set_draining_all();
try {
@@ -657,28 +675,45 @@ future<> manager::drain_for(endpoint_id endpoint) noexcept {
_ep_managers.clear();
_hint_directory_manager.clear();
} else {
auto it = _ep_managers.find(endpoint);
if (it != _ep_managers.end()) {
try {
co_await drain_ep_manager(it->second);
} catch (...) {
eptr = std::current_exception();
const auto maybe_host_id = std::invoke([&] () -> std::optional<locator::host_id> {
if (_uses_host_id) {
return host_id;
}
// Before the whole cluster is migrated to the host-ID-based hinted handoff,
// one hint directory may correspond to multiple target nodes. If *any* of them
// leaves the cluster, we should drain the hint directory. This is why we need
// to rely on this mapping here.
const auto maybe_mapping = _hint_directory_manager.get_mapping(host_id, ip);
if (maybe_mapping) {
return maybe_mapping->first;
}
return std::nullopt;
});
// We can't provide the function with `it` here because we co_await above,
// so iterators could have been invalidated.
// This never throws.
_ep_managers.erase(endpoint);
_hint_directory_manager.remove_mapping(endpoint);
if (maybe_host_id) {
auto it = _ep_managers.find(*maybe_host_id);
if (it != _ep_managers.end()) {
try {
co_await drain_ep_manager(it->second);
} catch (...) {
eptr = std::current_exception();
}
// We can't provide the function with `it` here because we co_await above,
// so iterators could have been invalidated.
// This never throws.
_ep_managers.erase(*maybe_host_id);
_hint_directory_manager.remove_mapping(*maybe_host_id);
}
}
}
if (eptr) {
manager_logger.error("Exception when draining {}: {}", endpoint, eptr);
manager_logger.error("Exception when draining {}: {}", host_id, eptr);
}
manager_logger.trace("drain_for: finished draining {}", endpoint);
manager_logger.trace("drain_for: finished draining {}", host_id);
}
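
The `wait_for_sync_point` hunk earlier in this file accepts keys that are either host IDs or IP addresses: IPs are translated through token metadata and dropped when unknown, while host IDs pass through unchanged. A self-contained sketch of that normalization is below. The types `host_id` and `ip_address`, the function `normalize_to_host_ids`, and the plain `std::map` lookup that stands in for `get_host_id_if_known` are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Simplified stand-ins for locator::host_id and gms::inet_address.
struct host_id { std::string value; };
struct ip_address { std::string value; };
using endpoint_key = std::variant<host_id, ip_address>;

// Hypothetical helper: re-key replay positions by host ID only.
std::map<std::string, std::uint64_t> normalize_to_host_ids(
        const std::vector<std::pair<endpoint_key, std::uint64_t>>& rps,
        const std::map<std::string, std::string>& ip_to_host_id) {
    std::map<std::string, std::uint64_t> out;
    for (const auto& [key, rp] : rps) {
        if (std::holds_alternative<ip_address>(key)) {
            // Old-format entry: map the IP, ignoring addresses we cannot resolve.
            auto it = ip_to_host_id.find(std::get<ip_address>(key).value);
            if (it != ip_to_host_id.end()) {
                out.emplace(it->second, rp);
            }
        } else {
            // New-format entry: already a host ID.
            out.emplace(std::get<host_id>(key).value, rp);
        }
    }
    return out;
}
```

Keeping both alternatives in one variant is what allows sync points in the older IP-based formats and in the host-ID-based format to share the same waiting code.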
void manager::update_backlog(size_t backlog, size_t max_backlog) {
@@ -700,8 +735,6 @@ future<> manager::with_file_update_mutex_for(const std::variant<locator::host_id
return _ep_managers.at(host_id).with_file_update_mutex(std::move(func));
}
// The function assumes that if `_uses_host_id == true`, then there are no directories that represent IP addresses,
// i.e. every directory is either valid and represents a host ID, or is invalid (so it should be ignored anyway).
future<> manager::initialize_endpoint_managers() {
auto maybe_create_ep_mgr = [this] (const locator::host_id& host_id, const gms::inet_address& ip) -> future<> {
if (!check_dc_for(host_id)) {
@@ -729,16 +762,29 @@ future<> manager::initialize_endpoint_managers() {
// The directory is invalid, so there's nothing more to do.
if (!maybe_host_id_or_ep) {
manager_logger.warn("Encountered a hint directory of invalid name while initializing endpoint managers: {}. "
"Hints stored in it won't be replayed", de.name);
co_return;
}
if (_uses_host_id) {
// If hinted handoff is host-ID-based but the directory doesn't represent a host ID,
// it's invalid. Ignore it.
if (!maybe_host_id_or_ep->has_host_id()) {
co_return;
}
// If hinted handoff is host-ID-based, `get_ep_manager` will NOT use the passed IP address,
// so we simply pass the default value there.
co_return co_await maybe_create_ep_mgr(maybe_host_id_or_ep->id(), gms::inet_address{});
}
// If we have got to this line, hinted handoff is still IP-based and we need to map the IP.
if (!maybe_host_id_or_ep->has_endpoint()) {
// If the directory name doesn't represent an IP, it's invalid. We ignore it.
co_return;
}
const auto maybe_host_id = std::invoke([&] () -> std::optional<locator::host_id> {
try {
return maybe_host_id_or_ep->resolve_id(*tmptr);

View File

@@ -317,11 +317,16 @@ public:
/// In both cases - removes the corresponding hints' directories after all hints have been drained and erases the
/// corresponding hint_endpoint_manager objects.
///
/// \param endpoint node that left the cluster
future<> drain_for(endpoint_id endpoint) noexcept;
/// \param host_id host ID of the node that left the cluster
/// \param ip the IP of the node that left the cluster
future<> drain_for(endpoint_id host_id, gms::inet_address ip) noexcept;
void update_backlog(size_t backlog, size_t max_backlog);
bool uses_host_id() const noexcept {
return _uses_host_id;
}
private:
bool stopping() const noexcept {
return _state.contains(state::stopping);

View File

@@ -148,10 +148,16 @@ void space_watchdog::on_timer() {
auto maybe_variant = std::invoke([&] () -> std::optional<std::variant<locator::host_id, gms::inet_address>> {
try {
const auto hid_or_ep = locator::host_id_or_endpoint{de.name};
if (hid_or_ep.has_host_id()) {
// If hinted handoff is host-ID-based, hint directories representing IP addresses must've
// been created by mistake and they're invalid. The same for pre-host-ID hinted handoff
// -- hint directories representing host IDs are NOT valid.
if (hid_or_ep.has_host_id() && shard_manager.uses_host_id()) {
return std::variant<locator::host_id, gms::inet_address>(hid_or_ep.id());
} else {
} else if (hid_or_ep.has_endpoint() && !shard_manager.uses_host_id()) {
return std::variant<locator::host_id, gms::inet_address>(hid_or_ep.endpoint());
} else {
return std::nullopt;
}
} catch (...) {
return std::nullopt;
@@ -173,6 +179,8 @@ void space_watchdog::on_timer() {
// Case 3: The directory isn't managed by an endpoint manager, and it represents neither an IP address,
// nor a host ID.
else {
// We use trace here to prevent flooding logs with unnecessary information.
resource_manager_logger.trace("Encountered a hint directory of invalid name while scanning: {}", de.name);
return scan_one_ep_dir(dir / de.name, shard_manager, {});
}
}).get();

View File

@@ -26,52 +26,63 @@ namespace hints {
//
// Format V1 (encoded in base64):
// uint8_t 0x01 - version of format
// sync_point_v1 - encoded using IDL
// sync_point_v1_or_v2 - encoded using IDL
//
// Format V2 (encoded in base64):
// uint8_t 0x02 - version of format
// sync_point_v1 - encoded using IDL
// sync_point_v1_or_v2 - encoded using IDL
// uint64_t - checksum computed using the xxHash algorithm
//
// sync_point_v1:
// Format V3 (encoded in base64):
// uint8_t 0x03 - version of format
// sync_point_v3 - encoded using IDL
// uint64_t - checksum computed using the xxHash algorithm
//
// sync_point_v1_or_v2:
// UUID host_id - ID of the host which created the sync point
// uint16_t shard_count - the number of shards in this sync point
// per_manager_sync_point_v1 regular_sp - replay positions for regular mutation hint queues
// per_manager_sync_point_v1 mv_sp - replay positions for materialized view hint queues
// per_manager_sync_point_v1_or_v2 regular_sp - replay positions for regular mutation hint queues
// per_manager_sync_point_v1_or_v2 mv_sp - replay positions for materialized view hint queues
//
// per_manager_sync_point_v1:
// std::vector<gms::inet_address> addresses - addresses for which this sync point defines replay positions
// per_manager_sync_point_v1_or_v2:
// std::vector<gms::inet_address> endpoints - addresses for which this sync point defines replay positions
// std::vector<db::replay_position> flattened_rps:
// A flattened collection of replay positions for all addresses and shards.
// A flattened collection of replay positions for all endpoints and shards.
// Replay positions are grouped by address, in the same order as in
// the `addresses` field, and there is one replay position for each of
// the `endpoints` field, and there is one replay position for each of
// the shards (shard count is defined by the `shard_count` field).
// Flattened representation was chosen in order to save space on
// vector lengths etc.
//
// sync_point_v3:
// similar to sync_point_v1_or_v2 except it uses per_manager_sync_point_v3 instead
// of per_manager_sync_point_v1_or_v2, which has locator::host_id instead of
// gms::inet_address.
static constexpr size_t version_size = sizeof(uint8_t);
static constexpr size_t checksum_size = sizeof(uint64_t);
static std::vector<sync_point::shard_rps> decode_one_type_v1(uint16_t shard_count, const per_manager_sync_point_v1& v1) {
template <typename PerManagerType>
static std::vector<sync_point::shard_rps> decode_one_type(uint16_t shard_count, const PerManagerType& v) {
std::vector<sync_point::shard_rps> ret;
if (size_t(shard_count) * v1.addresses.size() != v1.flattened_rps.size()) {
if (size_t(shard_count) * v.endpoints.size() != v.flattened_rps.size()) {
throw std::runtime_error(format("Could not decode the sync point - there should be {} rps in flattened_rps, but there are only {}",
size_t(shard_count) * v1.addresses.size(), v1.flattened_rps.size()));
size_t(shard_count) * v.endpoints.size(), v.flattened_rps.size()));
}
ret.resize(std::max(unsigned(shard_count), smp::count));
auto rps_it = v1.flattened_rps.begin();
for (const auto addr : v1.addresses) {
auto rps_it = v.flattened_rps.begin();
for (const auto ep : v.endpoints) {
uint16_t shard;
for (shard = 0; shard < shard_count; shard++) {
ret[shard].emplace(addr, *rps_it++);
ret[shard].emplace(ep, *rps_it++);
}
// Fill missing shards with zero replay positions so that segments
// which were moved across shards will be correctly waited on
for (; shard < smp::count; shard++) {
ret[shard].emplace(addr, db::replay_position());
ret[shard].emplace(ep, db::replay_position());
}
}
@@ -94,50 +105,62 @@ sync_point sync_point::decode(sstring_view s) {
seastar::simple_memory_input_stream in{raw_s.data(), raw_s.size()};
uint8_t version = ser::serializer<uint8_t>::read(in);
if (version == 2) {
if (version == 2 || version == 3) {
if (raw_s.size() < version_size + checksum_size) {
throw std::runtime_error("Could not decode the sync point encoded in the V2 format - serialized blob is too short");
throw std::runtime_error("Could not decode the sync point encoded in the V2/V3 format - serialized blob is too short");
}
seastar::simple_memory_input_stream in_checksum{raw_s.end() - checksum_size, checksum_size};
uint64_t checksum = ser::serializer<uint64_t>::read(in_checksum);
if (checksum != calculate_checksum(raw_s.substr(0, raw_s.size() - checksum_size))) {
throw std::runtime_error("Could not decode the sync point encoded in the V2 format - wrong checksum");
throw std::runtime_error("Could not decode the sync point encoded in the V2/V3 format - wrong checksum");
}
}
else if (version != 1) {
throw std::runtime_error(format("Unsupported sync point format version: {}", int(version)));
}
sync_point_v1 v1 = ser::serializer<sync_point_v1>::read(in);
if (version == 1 || version == 2) {
sync_point_v1_or_v2 v = ser::serializer<sync_point_v1_or_v2>::read(in);
return sync_point{
v.host_id,
decode_one_type(v.shard_count, v.regular_sp),
decode_one_type(v.shard_count, v.mv_sp),
};
}
// version == 3
sync_point_v3 v3 = ser::serializer<sync_point_v3>::read(in);
return sync_point{
v1.host_id,
decode_one_type_v1(v1.shard_count, v1.regular_sp),
decode_one_type_v1(v1.shard_count, v1.mv_sp),
v3.host_id,
decode_one_type(v3.shard_count, v3.regular_sp),
decode_one_type(v3.shard_count, v3.mv_sp),
};
}
static per_manager_sync_point_v1 encode_one_type_v1(unsigned shards, const std::vector<sync_point::shard_rps>& rps) {
per_manager_sync_point_v1 ret;
static per_manager_sync_point_v3 encode_one_type_v3(unsigned shards, const std::vector<sync_point::shard_rps>& rps) {
per_manager_sync_point_v3 ret;
// Gather all addresses, from all shards
std::unordered_set<gms::inet_address> all_addrs;
// Gather all endpoints, from all shards
std::unordered_set<locator::host_id> all_eps;
for (const auto& shard_rps : rps) {
for (const auto& p : shard_rps) {
all_addrs.insert(p.first);
// New sync points are created with host_id only
all_eps.insert(std::get<locator::host_id>(p.first));
}
}
ret.flattened_rps.reserve(size_t(shards) * all_addrs.size());
ret.flattened_rps.reserve(size_t(shards) * all_eps.size());
// Encode into v1 struct
// For each address, we encode a replay position for all shards.
// Encode into v3 struct
// For each endpoint, we encode a replay position for all shards.
// If there is no replay position for a shard, we use a zero replay position.
for (const auto addr : all_addrs) {
ret.addresses.push_back(addr);
for (const auto ep : all_eps) {
ret.endpoints.push_back(ep);
for (const auto& shard_rps : rps) {
auto it = shard_rps.find(addr);
auto it = shard_rps.find(ep);
if (it != shard_rps.end()) {
ret.flattened_rps.push_back(it->second);
} else {
@@ -154,24 +177,24 @@ static per_manager_sync_point_v1 encode_one_type_v1(unsigned shards, const std::
}
sstring sync_point::encode() const {
// Encode as v1 structure
sync_point_v1 v1;
v1.host_id = this->host_id;
v1.shard_count = std::max(this->regular_per_shard_rps.size(), this->mv_per_shard_rps.size());
v1.regular_sp = encode_one_type_v1(v1.shard_count, this->regular_per_shard_rps);
v1.mv_sp = encode_one_type_v1(v1.shard_count, this->mv_per_shard_rps);
// Encode as v3 structure
sync_point_v3 v3;
v3.host_id = this->host_id;
v3.shard_count = std::max(this->regular_per_shard_rps.size(), this->mv_per_shard_rps.size());
v3.regular_sp = encode_one_type_v3(v3.shard_count, this->regular_per_shard_rps);
v3.mv_sp = encode_one_type_v3(v3.shard_count, this->mv_per_shard_rps);
// Measure how much space we need
seastar::measuring_output_stream measure;
ser::serializer<sync_point_v1>::write(measure, v1);
ser::serializer<sync_point_v3>::write(measure, v3);
// Reserve version_size bytes for the version and checksum_size bytes for the checksum
bytes serialized{bytes::initialized_later{}, version_size + measure.size() + checksum_size};
// Encode using V2 format
// Encode using V3 format
seastar::simple_memory_output_stream out{reinterpret_cast<char*>(serialized.data()), serialized.size()};
ser::serializer<uint8_t>::write(out, 2);
ser::serializer<sync_point_v1>::write(out, v1);
ser::serializer<uint8_t>::write(out, 3);
ser::serializer<sync_point_v3>::write(out, v3);
sstring_view serialized_s(reinterpret_cast<const char*>(serialized.data()), version_size + measure.size());
uint64_t checksum = calculate_checksum(serialized_s);
ser::serializer<uint64_t>::write(out, checksum);
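
Reviewer note: the encode/decode hunks above frame the serialized sync point as one version byte, the payload, and a trailing 64-bit checksum computed over the version byte plus payload. The following is a minimal Python sketch of that framing, not the Scylla implementation. `frame`, `unframe` and `checksum` are hypothetical names, `zlib.crc32` is only a stand-in because the diff does not show what `calculate_checksum()` computes, and little-endian byte order is an assumption.

```python
import struct
import zlib

VERSION_SIZE = 1   # mirrors sizeof(uint8_t) in the diff
CHECKSUM_SIZE = 8  # mirrors sizeof(uint64_t) in the diff


def checksum(data: bytes) -> int:
    # Stand-in only: the diff does not show what calculate_checksum() does.
    return zlib.crc32(data)


def frame(version: int, payload: bytes) -> bytes:
    """Return version byte + payload + checksum over (version byte + payload)."""
    body = struct.pack("<B", version) + payload
    return body + struct.pack("<Q", checksum(body))


def unframe(blob: bytes):
    """Validate the framing and return (version, payload)."""
    if not blob:
        raise ValueError("empty blob")
    version = blob[0]
    if version in (2, 3):
        if len(blob) < VERSION_SIZE + CHECKSUM_SIZE:
            raise ValueError("serialized blob is too short")
        (stored,) = struct.unpack("<Q", blob[-CHECKSUM_SIZE:])
        if stored != checksum(blob[:-CHECKSUM_SIZE]):
            raise ValueError("wrong checksum")
        return version, blob[VERSION_SIZE:-CHECKSUM_SIZE]
    if version == 1:
        # The decoder shown above does not verify a checksum for V1 blobs.
        return version, blob[VERSION_SIZE:]
    raise ValueError(f"unsupported sync point format version: {version}")
```

In this sketch, V2 and V3 share the same checksum validation and differ only in how the payload is interpreted afterwards, which matches the `version == 2 || version == 3` branch in the hunk.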

@@ -22,7 +22,8 @@ namespace hints {
// A sync point is a collection of positions in hint queues which can be waited on.
// The sync point encompasses one type of hints manager only.
struct sync_point {
using shard_rps = std::unordered_map<gms::inet_address, db::replay_position>;
using host_id_or_addr = std::variant<locator::host_id, gms::inet_address>;
using shard_rps = std::unordered_map<host_id_or_addr, db::replay_position>;
// ID of the host which created this sync point
locator::host_id host_id;
std::vector<shard_rps> regular_per_shard_rps;
@@ -40,21 +41,41 @@ struct sync_point {
// IDL type
// Contains per-endpoint and per-shard information about replay positions
// for a particular type of hint queues (regular mutation hints or MV update hints)
struct per_manager_sync_point_v1 {
std::vector<gms::inet_address> addresses;
struct per_manager_sync_point_v1_or_v2 {
std::vector<gms::inet_address> endpoints;
std::vector<db::replay_position> flattened_rps;
};
// IDL type
struct sync_point_v1 {
struct sync_point_v1_or_v2 {
locator::host_id host_id;
uint16_t shard_count;
// Sync point information for regular mutation hints
db::hints::per_manager_sync_point_v1 regular_sp;
db::hints::per_manager_sync_point_v1_or_v2 regular_sp;
// Sync point information for materialized view hints
db::hints::per_manager_sync_point_v1 mv_sp;
db::hints::per_manager_sync_point_v1_or_v2 mv_sp;
};
// IDL type
// same as per_manager_sync_point_v1_or_v2 except that it stores the
// endpoints as host_id instead of address
struct per_manager_sync_point_v3 {
std::vector<locator::host_id> endpoints;
std::vector<db::replay_position> flattened_rps;
};
// IDL type
struct sync_point_v3 {
locator::host_id host_id;
uint16_t shard_count;
// Sync point information for regular mutation hints
db::hints::per_manager_sync_point_v3 regular_sp;
// Sync point information for materialized view hints
db::hints::per_manager_sync_point_v3 mv_sp;
};
}
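
Reviewer note: both per-manager structs above store replay positions as one flat vector with `shard_count` entries per endpoint. The sketch below illustrates that layout and the padding rule from `decode_one_type` in Python. It is an illustration under assumptions: replay positions are modelled as plain integers with `0` standing in for a default `db::replay_position`, endpoints are strings, `flatten`/`unflatten` are hypothetical names, and endpoints are sorted only to make the output deterministic (the C++ code iterates an unordered set).

```python
ZERO = 0  # stand-in for a default-constructed db::replay_position


def flatten(shards, per_shard):
    """Return (endpoints, flattened_rps) with `shards` entries per endpoint."""
    endpoints = sorted({ep for shard in per_shard for ep in shard})
    flattened = []
    for ep in endpoints:
        for shard in range(shards):
            rps = per_shard[shard] if shard < len(per_shard) else {}
            # A shard without a position for this endpoint contributes ZERO.
            flattened.append(rps.get(ep, ZERO))
    return endpoints, flattened


def unflatten(shard_count, endpoints, flattened, local_shards):
    """Rebuild per-shard maps, padding up to `local_shards` with ZERO."""
    expected = shard_count * len(endpoints)
    if expected != len(flattened):
        raise ValueError(
            f"there should be {expected} rps in flattened_rps, "
            f"but there are {len(flattened)}")
    ret = [dict() for _ in range(max(shard_count, local_shards))]
    it = iter(flattened)
    for ep in endpoints:
        for shard in range(shard_count):
            ret[shard][ep] = next(it)
        # Shards that did not exist when the sync point was created.
        for shard in range(shard_count, local_shards):
            ret[shard][ep] = ZERO
    return ret
```

Here `local_shards` plays the role of `smp::count`: when the decoding node has more shards than the encoding one, the extra shards receive zero positions for every endpoint.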

@@ -55,6 +55,10 @@ public:
return ser::serialize_to_buffer<bytes>(_paxos_gc_sec);
}
std::string options_to_string() const override {
return std::to_string(_paxos_gc_sec);
}
static int32_t deserialize(const bytes_view& buffer) {
return ser::deserialize_from_buffer(buffer, boost::type<int32_t>());
}

@@ -14,7 +14,6 @@
#include "gms/feature_service.hh"
#include "partition_slice_builder.hh"
#include "dht/i_partitioner.hh"
#include "system_auth_keyspace.hh"
#include "system_keyspace.hh"
#include "query-result-set.hh"
#include "query-result-writer.hh"
@@ -235,7 +234,6 @@ future<> save_system_schema(cql3::query_processor& qp) {
co_await save_system_schema_to_keyspace(qp, schema_tables::NAME);
// #2514 - make sure "system" is written to system_schema.keyspaces.
co_await save_system_schema_to_keyspace(qp, system_keyspace::NAME);
co_await save_system_schema_to_keyspace(qp, system_auth_keyspace::NAME);
}
namespace v3 {
@@ -1296,7 +1294,6 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, shar
schema_ptr s = keyspaces();
// compare before/after schemas of the affected keyspaces only
std::set<sstring> keyspaces;
std::set<table_id> column_families;
std::unordered_map<keyspace_name, table_selector> affected_tables;
bool has_tablet_mutations = false;
for (auto&& mutation : mutations) {
@@ -1311,7 +1308,6 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, shar
}
keyspaces.emplace(std::move(keyspace_name));
column_families.emplace(mutation.column_family_id());
// We must force recalculation of schema version after the merge, since the resulting
// schema may be a mix of the old and new schemas, with the exception of entries
// that originate from group 0.

@@ -1,141 +0,0 @@
/*
* Modified by ScyllaDB
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#include "system_auth_keyspace.hh"
#include "system_keyspace.hh"
#include "db/schema_tables.hh"
#include "schema/schema_builder.hh"
#include "types/set.hh"
namespace db {
// all system auth tables use schema commitlog
namespace {
const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if (ks_name == system_auth_keyspace::NAME) {
props.enable_schema_commitlog();
}
});
} // anonymous namespace
namespace system_auth_keyspace {
// use the same gc setting as system_schema tables
using days = std::chrono::duration<int, std::ratio<24 * 3600>>;
// FIXME: in some cases time-based gc may cause data resurrection,
// for more info see https://github.com/scylladb/scylladb/issues/15607
static constexpr auto auth_gc_grace = std::chrono::duration_cast<std::chrono::seconds>(days(7)).count();
schema_ptr roles() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLES), NAME, ROLES,
// partition key
{{"role", utf8_type}},
// clustering key
{},
// regular columns
{
{"can_login", boolean_type},
{"is_superuser", boolean_type},
{"member_of", set_type_impl::get_instance(utf8_type, true)},
{"salted_hash", utf8_type}
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"roles for authentication and RBAC"
);
builder.set_gc_grace_seconds(auth_gc_grace);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
schema_ptr role_members() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLE_MEMBERS), NAME, ROLE_MEMBERS,
// partition key
{{"role", utf8_type}},
// clustering key
{{"member", utf8_type}},
// regular columns
{},
// static columns
{},
// regular column name type
utf8_type,
// comment
"joins users and their granted roles in RBAC"
);
builder.set_gc_grace_seconds(auth_gc_grace);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
schema_ptr role_attributes() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLE_ATTRIBUTES), NAME, ROLE_ATTRIBUTES,
// partition key
{{"role", utf8_type}},
// clustering key
{{"name", utf8_type}},
// regular columns
{
{"value", utf8_type}
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"role permissions in RBAC"
);
builder.set_gc_grace_seconds(auth_gc_grace);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
schema_ptr role_permissions() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLE_PERMISSIONS), NAME, ROLE_PERMISSIONS,
// partition key
{{"role", utf8_type}},
// clustering key
{{"resource", utf8_type}},
// regular columns
{
{"permissions", set_type_impl::get_instance(utf8_type, true)}
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"role permissions for CassandraAuthorizer"
);
builder.set_gc_grace_seconds(auth_gc_grace);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
std::vector<schema_ptr> all_tables() {
return {roles(), role_members(), role_attributes(), role_permissions()};
}
} // namespace system_auth_keyspace
} // namespace db

@@ -1,38 +0,0 @@
/*
* Modified by ScyllaDB
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
*/
#pragma once
#include "schema/schema_fwd.hh"
#include <vector>
namespace db {
namespace system_auth_keyspace {
enum class version_t: int64_t {
v1 = 1,
v2 = 2,
};
static constexpr auto NAME = "system_auth_v2";
// tables
static constexpr auto ROLES = "roles";
static constexpr auto ROLE_MEMBERS = "role_members";
static constexpr auto ROLE_ATTRIBUTES = "role_attributes";
static constexpr auto ROLE_PERMISSIONS = "role_permissions";
schema_ptr roles();
schema_ptr role_members();
schema_ptr role_attributes();
schema_ptr role_permissions();
std::vector<schema_ptr> all_tables();
}; // namespace system_auth_keyspace
} // namespace db

@@ -18,7 +18,6 @@
#include <seastar/core/on_internal_error.hh>
#include "system_keyspace.hh"
#include "cql3/untyped_result_set.hh"
#include "db/system_auth_keyspace.hh"
#include "thrift/server.hh"
#include "cql3/query_processor.hh"
#include "partition_slice_builder.hh"
@@ -88,6 +87,10 @@ namespace {
system_keyspace::SCYLLA_LOCAL,
system_keyspace::COMMITLOG_CLEANUPS,
system_keyspace::SERVICE_LEVELS_V2,
system_keyspace::ROLES,
system_keyspace::ROLE_MEMBERS,
system_keyspace::ROLE_ATTRIBUTES,
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::v3::CDC_LOCAL
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
@@ -233,12 +236,15 @@ schema_ptr system_keyspace::topology() {
.with_column("request_id", timeuuid_type)
.with_column("ignore_nodes", set_type_impl::get_instance(uuid_type, true), column_kind::static_column)
.with_column("new_cdc_generation_data_uuid", timeuuid_type, column_kind::static_column)
.with_column("new_keyspace_rf_change_ks_name", utf8_type, column_kind::static_column)
.with_column("new_keyspace_rf_change_data", map_type_impl::get_instance(utf8_type, utf8_type, false), column_kind::static_column)
.with_column("version", long_type, column_kind::static_column)
.with_column("fence_version", long_type, column_kind::static_column)
.with_column("transition_state", utf8_type, column_kind::static_column)
.with_column("committed_cdc_generations", set_type_impl::get_instance(cdc_generation_ts_id_type, true), column_kind::static_column)
.with_column("unpublished_cdc_generations", set_type_impl::get_instance(cdc_generation_ts_id_type, true), column_kind::static_column)
.with_column("global_topology_request", utf8_type, column_kind::static_column)
.with_column("global_topology_request_id", timeuuid_type, column_kind::static_column)
.with_column("enabled_features", set_type_impl::get_instance(utf8_type, true), column_kind::static_column)
.with_column("session", uuid_type, column_kind::static_column)
.with_column("tablet_balancing_enabled", boolean_type, column_kind::static_column)
@@ -1139,6 +1145,103 @@ schema_ptr system_keyspace::service_levels_v2() {
return schema;
}
schema_ptr system_keyspace::roles() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLES), NAME, ROLES,
// partition key
{{"role", utf8_type}},
// clustering key
{},
// regular columns
{
{"can_login", boolean_type},
{"is_superuser", boolean_type},
{"member_of", set_type_impl::get_instance(utf8_type, true)},
{"salted_hash", utf8_type}
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"roles for authentication and RBAC"
);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
schema_ptr system_keyspace::role_members() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLE_MEMBERS), NAME, ROLE_MEMBERS,
// partition key
{{"role", utf8_type}},
// clustering key
{{"member", utf8_type}},
// regular columns
{},
// static columns
{},
// regular column name type
utf8_type,
// comment
"joins users and their granted roles in RBAC"
);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
schema_ptr system_keyspace::role_attributes() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLE_ATTRIBUTES), NAME, ROLE_ATTRIBUTES,
// partition key
{{"role", utf8_type}},
// clustering key
{{"name", utf8_type}},
// regular columns
{
{"value", utf8_type}
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"role permissions in RBAC"
);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
schema_ptr system_keyspace::role_permissions() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, ROLE_PERMISSIONS), NAME, ROLE_PERMISSIONS,
// partition key
{{"role", utf8_type}},
// clustering key
{{"resource", utf8_type}},
// regular columns
{
{"permissions", set_type_impl::get_instance(utf8_type, true)}
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"role permissions for CassandraAuthorizer"
);
builder.with_version(system_keyspace::generate_schema_version(builder.uuid()));
return builder.build();
}();
return schema;
}
schema_ptr system_keyspace::legacy::hints() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, HINTS), NAME, HINTS,
@@ -2130,10 +2233,16 @@ future<> system_keyspace::set_bootstrap_state(bootstrap_state state) {
});
}
std::vector<schema_ptr> system_keyspace::auth_tables() {
return {roles(), role_members(), role_attributes(), role_permissions()};
}
std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
std::vector<schema_ptr> r;
auto schema_tables = db::schema_tables::all_tables(schema_features::full());
std::copy(schema_tables.begin(), schema_tables.end(), std::back_inserter(r));
auto auth_tables = system_keyspace::auth_tables();
std::copy(auth_tables.begin(), auth_tables.end(), std::back_inserter(r));
r.insert(r.end(), { built_indexes(), hints(), batchlog(), paxos(), local(),
peers(), peer_events(), range_xfers(),
compactions_in_progress(), compaction_history(),
@@ -2149,14 +2258,11 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
topology(), cdc_generations_v3(), topology_requests(), service_levels_v2(),
});
auto auth_tables = db::system_auth_keyspace::all_tables();
std::copy(auth_tables.begin(), auth_tables.end(), std::back_inserter(r));
if (cfg.check_experimental(db::experimental_features_t::feature::BROADCAST_TABLES)) {
r.insert(r.end(), {broadcast_kv_store()});
}
if (cfg.check_experimental(db::experimental_features_t::feature::TABLETS)) {
if (cfg.enable_tablets()) {
r.insert(r.end(), {tablets()});
}
@@ -2691,17 +2797,17 @@ future<std::optional<mutation>> system_keyspace::get_group0_schema_version() {
static constexpr auto AUTH_VERSION_KEY = "auth_version";
future<system_auth_keyspace::version_t> system_keyspace::get_auth_version() {
future<system_keyspace::auth_version_t> system_keyspace::get_auth_version() {
auto str_opt = co_await get_scylla_local_param(AUTH_VERSION_KEY);
if (!str_opt) {
co_return db::system_auth_keyspace::version_t::v1;
co_return auth_version_t::v1;
}
auto& str = *str_opt;
if (str == "" || str == "1") {
co_return db::system_auth_keyspace::version_t::v1;
co_return auth_version_t::v1;
}
if (str == "2") {
co_return db::system_auth_keyspace::version_t::v2;
co_return auth_version_t::v2;
}
on_internal_error(slogger, fmt::format("unexpected auth_version in scylla_local got {}", str));
}
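
Reviewer note: the version lookup in `get_auth_version()` above maps the stored string to an enum and treats a missing or empty value as v1. A minimal Python sketch of the same mapping follows. `AuthVersion` and `parse_auth_version` are hypothetical names, and `ValueError` stands in for `on_internal_error`.

```python
from enum import IntEnum
from typing import Optional


class AuthVersion(IntEnum):
    V1 = 1
    V2 = 2


def parse_auth_version(value: Optional[str]) -> AuthVersion:
    """Map the stored auth_version string to an enum value."""
    # A missing key, an empty string and "1" all mean version 1.
    if value is None or value in ("", "1"):
        return AuthVersion.V1
    if value == "2":
        return AuthVersion.V2
    raise ValueError(f"unexpected auth_version in scylla_local got {value}")
```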
@@ -2719,7 +2825,7 @@ static service::query_state& internal_system_query_state() {
return qs;
};
future<mutation> system_keyspace::make_auth_version_mutation(api::timestamp_type ts, db::system_auth_keyspace::version_t version) {
future<mutation> system_keyspace::make_auth_version_mutation(api::timestamp_type ts, db::system_keyspace::auth_version_t version) {
static sstring query = format("INSERT INTO {}.{} (key, value) VALUES (?, ?);", db::system_keyspace::NAME, db::system_keyspace::SCYLLA_LOCAL);
auto muts = co_await _qp.get_mutations_internal(query, internal_system_query_state(), ts, {AUTH_VERSION_KEY, std::to_string(int64_t(version))});
if (muts.size() != 1) {
@@ -2967,6 +3073,11 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
ret.committed_cdc_generations = decode_cdc_generations_ids(deserialize_set_column(*topology(), some_row, "committed_cdc_generations"));
}
if (some_row.has("new_keyspace_rf_change_data")) {
ret.new_keyspace_rf_change_ks_name = some_row.get_as<sstring>("new_keyspace_rf_change_ks_name");
ret.new_keyspace_rf_change_data = some_row.get_map<sstring,sstring>("new_keyspace_rf_change_data");
}
if (!ret.committed_cdc_generations.empty()) {
// Sanity check for CDC generation data consistency.
auto gen_id = ret.committed_cdc_generations.back();
@@ -2998,6 +3109,10 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
ret.global_request.emplace(req);
}
if (some_row.has("global_topology_request_id")) {
ret.global_request_id = some_row.get_as<utils::UUID>("global_topology_request_id");
}
if (some_row.has("enabled_features")) {
ret.enabled_features = decode_features(deserialize_set_column(*topology(), some_row, "enabled_features"));
}

@@ -14,7 +14,6 @@
#include <unordered_map>
#include <utility>
#include <vector>
#include "db/system_auth_keyspace.hh"
#include "gms/gossiper.hh"
#include "schema/schema_fwd.hh"
#include "utils/UUID.hh"
@@ -180,6 +179,12 @@ public:
static constexpr auto TABLETS = "tablets";
static constexpr auto SERVICE_LEVELS_V2 = "service_levels_v2";
// auth
static constexpr auto ROLES = "roles";
static constexpr auto ROLE_MEMBERS = "role_members";
static constexpr auto ROLE_ATTRIBUTES = "role_attributes";
static constexpr auto ROLE_PERMISSIONS = "role_permissions";
struct v3 {
static constexpr auto BATCHES = "batches";
static constexpr auto PAXOS = "paxos";
@@ -267,6 +272,12 @@ public:
static schema_ptr tablets();
static schema_ptr service_levels_v2();
// auth
static schema_ptr roles();
static schema_ptr role_members();
static schema_ptr role_attributes();
static schema_ptr role_permissions();
static table_schema_version generate_schema_version(table_id table_id, uint16_t offset = 0);
future<> build_bootstrap_info();
@@ -310,7 +321,9 @@ public:
template <typename T>
future<std::optional<T>> get_scylla_local_param_as(const sstring& key);
static std::vector<schema_ptr> auth_tables();
static std::vector<schema_ptr> all_tables(const db::config& cfg);
future<> make(
locator::effective_replication_map_factory&,
replica::database&);
@@ -577,11 +590,16 @@ public:
// returns the corresponding mutation. Otherwise returns nullopt.
future<std::optional<mutation>> get_group0_schema_version();
enum class auth_version_t: int64_t {
v1 = 1,
v2 = 2,
};
// If the `auth_version` key in `system.scylla_local` is present (either live or tombstone),
// returns the corresponding mutation. Otherwise returns nullopt.
future<std::optional<mutation>> get_auth_version_mutation();
future<mutation> make_auth_version_mutation(api::timestamp_type ts, db::system_auth_keyspace::version_t version);
future<system_auth_keyspace::version_t> get_auth_version();
future<mutation> make_auth_version_mutation(api::timestamp_type ts, auth_version_t version);
future<auth_version_t> get_auth_version();
future<> sstables_registry_create_entry(sstring location, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc);
future<> sstables_registry_update_entry_status(sstring location, sstables::generation_type gen, sstring status);

@@ -1625,25 +1625,26 @@ get_view_natural_endpoint(
}
}
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
for (auto&& view_endpoint : view_erm->get_replicas(view_token)) {
if (use_legacy_self_pairing) {
auto it = std::find(base_endpoints.begin(), base_endpoints.end(),
view_endpoint);
// If this base replica is also one of the view replicas, we use
// ourselves as the view replica.
if (view_endpoint == me) {
if (view_endpoint == me && it != base_endpoints.end()) {
return topology.my_address();
}
// We have to remove any endpoint which is shared between the base
// and the view, as it will select itself and throw off the counts
// otherwise.
auto it = std::find(base_endpoints.begin(), base_endpoints.end(),
view_endpoint);
if (it != base_endpoints.end()) {
base_endpoints.erase(it);
} else if (!network_topology || topology.get_datacenter(view_endpoint) == my_datacenter) {
} else if (!network_topology || view_topology.get_datacenter(view_endpoint) == my_datacenter) {
view_endpoints.push_back(view_endpoint);
}
} else {
if (!network_topology || topology.get_datacenter(view_endpoint) == my_datacenter) {
if (!network_topology || view_topology.get_datacenter(view_endpoint) == my_datacenter) {
view_endpoints.push_back(view_endpoint);
}
}
@@ -1658,7 +1659,7 @@ get_view_natural_endpoint(
return {};
}
auto replica = view_endpoints[base_it - base_endpoints.begin()];
return topology.get_node(replica).endpoint();
return view_topology.get_node(replica).endpoint();
}
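
Reviewer note: the pairing change in `get_view_natural_endpoint` is easier to follow with a small model. The Python sketch below is a simplification under assumptions. It ignores datacenter filtering entirely, represents replicas as strings, and uses the hypothetical name `pair_view_replica`. The two early `None` returns are assumptions about code the hunk does not show.

```python
def pair_view_replica(me, base_replicas, view_replicas, legacy_self_pairing=True):
    """Return the view replica paired with base replica `me`, or None."""
    base = list(base_replicas)
    view = []
    for v in view_replicas:
        if legacy_self_pairing:
            shared = v in base
            # Post-fix condition: self-pair only if this node is also
            # one of the base replicas.
            if v == me and shared:
                return me
            if shared:
                # Shared endpoints pair with themselves, so drop them
                # from the positional pairing.
                base.remove(v)
            else:
                view.append(v)
        else:
            view.append(v)
    if me not in base:
        return None  # assumption: not a base replica, nothing to pair
    idx = base.index(me)
    if idx >= len(view):
        return None  # assumption: no view replica at the same position
    return view[idx]
```

In this model, a node that is a view replica but not a base replica no longer returns itself, which is the case the added `it != base_endpoints.end()` check guards.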
static future<> apply_to_remote_endpoints(service::storage_proxy& proxy, locator::effective_replication_map_ptr ermp,
@@ -1715,6 +1716,7 @@ future<> view_update_generator::mutate_MV(
{
auto base_ermp = base->table().get_effective_replication_map();
static constexpr size_t max_concurrent_updates = 128;
co_await utils::get_local_injector().inject("delay_before_get_view_natural_endpoint", 8000ms);
co_await max_concurrent_for_each(view_updates, max_concurrent_updates,
[this, base_token, &stats, &cf_stats, tr_state, &pending_view_updates, allow_hints, wait_for_all, base_ermp] (frozen_mutation_and_schema mut) mutable -> future<> {
auto view_token = dht::get_token(*mut.s, mut.fm.key());

@@ -7,7 +7,7 @@
*/
#include "db/view/view_update_backlog.hh"
#include "exceptions/exceptions.hh"
#include <seastar/core/timed_out_error.hh>
#include "gms/inet_address.hh"
#include <seastar/util/defer.hh>
#include <boost/range/adaptor/map.hpp>
@@ -370,6 +370,17 @@ future<> view_update_generator::populate_views(const replica::table& table,
}
}
// Generating view updates for a single client request can take a long time and might not finish before the timeout is
// reached. In such case this exception is thrown.
// "Generating a view update" means creating a view update and scheduling it to be sent later.
// This exception isn't thrown if the sending times out; it's only concerned with generating.
struct view_update_generation_timeout_exception : public seastar::timed_out_error {
const char* what() const noexcept override {
return "Request timed out - couldn't prepare materialized view updates in time";
}
};
/**
* Given some updates on the base table and the existing values for the rows affected by that update, generates the
* mutations to be applied to the base table's views, and sends them to the paired view replicas.
@@ -446,7 +457,7 @@ future<> view_update_generator::generate_and_propagate_view_updates(const replic
}
if (db::timeout_clock::now() > timeout) {
err = std::make_exception_ptr(exceptions::view_update_generation_timeout_exception());
err = std::make_exception_ptr(view_update_generation_timeout_exception());
break;
}
}
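
Reviewer note: the hunks above define the timeout exception as a local subclass of `seastar::timed_out_error` with its own message, raised when the deadline passes during generation. A rough Python analogue is sketched below. `ViewUpdateGenerationTimeout` and `generate_updates` are hypothetical names, the injectable `clock` exists only to make the example testable, and appending the batch is a placeholder for the real generation work.

```python
import time


class ViewUpdateGenerationTimeout(TimeoutError):
    """Raised when preparing view updates exceeds the deadline."""

    def __str__(self):
        return ("Request timed out - couldn't prepare materialized "
                "view updates in time")


def generate_updates(batches, deadline, clock=time.monotonic):
    """Process batches, checking the deadline after each one."""
    generated = []
    for batch in batches:
        generated.append(batch)  # placeholder for generating one update batch
        if clock() > deadline:
            raise ViewUpdateGenerationTimeout()
    return generated
```

Because the sketch subclasses `TimeoutError`, callers that already handle generic timeouts keep working, while the message still identifies view update generation as the source.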

@@ -325,6 +325,8 @@ WantedBy=local-fs.target
os.chown(dpath, uid, gid)
if is_debian_variant():
if not shutil.which('update-initramfs'):
pkg_install('initramfs-tools')
run('update-initramfs -u', shell=True, check=True)
if not udev_info.uuid_link:

@@ -85,7 +85,7 @@ redirects: setup
# Preview commands
.PHONY: preview
preview: setup
$(POETRY) run sphinx-autobuild -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml --host $(PREVIEW_HOST) --port 5500 --ignore *.csv --ignore *.yaml
$(POETRY) run sphinx-autobuild -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml --host $(PREVIEW_HOST) --port 5500 --ignore *.csv --ignore *.json --ignore *.yaml
.PHONY: multiversionpreview
multiversionpreview: multiversion

@@ -1,23 +1,19 @@
import os
import re
import yaml
from typing import Any, Dict, List
import jinja2
from sphinx import addnodes
from sphinx.application import Sphinx
from sphinx.directives import ObjectDescription
from sphinx.util import logging, ws_re
from sphinx.util.display import status_iterator
from sphinx.util.docfields import Field
from sphinx.util.docutils import switch_source_input, SphinxDirective
from sphinx.util.nodes import make_id, nested_parse_with_titles
from sphinx.jinja2glue import BuiltinTemplateLoader
from docutils import nodes
from docutils.parsers.rst import directives
from docutils.statemachine import StringList
from utils import maybe_add_filters
logger = logging.getLogger(__name__)
class DBConfigParser:
@@ -152,51 +148,6 @@ class DBConfigParser:
return DBConfigParser.all_properties[name]
def readable_desc(description: str) -> str:
"""
This function is deprecated and maintained only for backward compatibility
with previous versions. Use ``readable_desc_rst`` instead.
"""
return (
description.replace("\\n", "")
.replace('<', '&lt;')
.replace('>', '&gt;')
.replace("\n", "<br>")
.replace("\\t", "- ")
.replace('"', "")
)
def readable_desc_rst(description):
indent = ' ' * 3
lines = description.split('\n')
cleaned_lines = []
for line in lines:
cleaned_line = line.replace('\\n', '\n')
if line.endswith('"'):
cleaned_line = cleaned_line[:-1] + ' '
cleaned_line = cleaned_line.lstrip()
cleaned_line = cleaned_line.replace('"', '')
if cleaned_line != '':
cleaned_line = indent + cleaned_line
cleaned_lines.append(cleaned_line)
return ''.join(cleaned_lines)
def maybe_add_filters(builder):
env = builder.templates.environment
if 'readable_desc' not in env.filters:
env.filters['readable_desc'] = readable_desc
if 'readable_desc_rst' not in env.filters:
env.filters['readable_desc_rst'] = readable_desc_rst
class ConfigOption(ObjectDescription):
has_content = True
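
Reviewer note: for readers unfamiliar with the `readable_desc` filter that this hunk moves out of the file, the snippet below reproduces its replacement chain from the removed lines and applies it to a made-up description string. The sample input is an assumption chosen to exercise each replacement; it is not taken from the real configuration descriptions.

```python
def readable_desc(description: str) -> str:
    # Same replacement chain as the helper removed in the hunk above.
    return (
        description.replace("\\n", "")   # drop literal backslash-n sequences
        .replace('<', '&lt;')            # escape angle brackets for HTML
        .replace('>', '&gt;')
        .replace("\n", "<br>")           # real newlines become line breaks
        .replace("\\t", "- ")            # literal backslash-t becomes a bullet
        .replace('"', "")                # strip double quotes
    )


# Hypothetical sample input, for illustration only.
sample = 'Sets <size> of cache.\\nDefault: "4"\nSee docs.'
result = readable_desc(sample)
```

The order of replacements matters: angle brackets are escaped before newlines are turned into `<br>`, so the inserted tags are not escaped themselves.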

@@ -0,0 +1,188 @@
import os
import sys
import json
from sphinx import addnodes
from sphinx.directives import ObjectDescription
from sphinx.util.docfields import Field
from sphinx.util.docutils import switch_source_input
from sphinx.util.nodes import make_id
from sphinx.util import logging, ws_re
from docutils.parsers.rst import Directive, directives
from docutils.statemachine import StringList
from sphinxcontrib.datatemplates.directive import DataTemplateJSON
from utils import maybe_add_filters
sys.path.insert(0, os.path.abspath("../../scripts"))
import scripts.get_description as metrics
LOGGER = logging.getLogger(__name__)
class MetricsProcessor:
MARKER = "::description"
def _create_output_directory(self, app, metrics_directory):
output_directory = os.path.join(app.builder.srcdir, metrics_directory)
os.makedirs(output_directory, exist_ok=True)
return output_directory
def _process_single_file(self, file_path, destination_path, metrics_config_path):
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read()
if self.MARKER in content and not os.path.exists(destination_path):
try:
metrics_file = metrics.get_metrics_from_file(file_path, "scylla", metrics.get_metrics_information(metrics_config_path))
with open(destination_path, 'w+', encoding='utf-8') as f:
json.dump(metrics_file, f, indent=4)
except SystemExit:
LOGGER.info(f'Skipping file: {file_path}')
except Exception as error:
LOGGER.info(error)
def _process_metrics_files(self, repo_dir, output_directory, metrics_config_path):
for root, _, files in os.walk(repo_dir):
for file in files:
if file.endswith(".cc"):
file_path = os.path.join(root, file)
file_name = os.path.splitext(file)[0] + ".json"
destination_path = os.path.join(output_directory, file_name)
self._process_single_file(file_path, destination_path, metrics_config_path)
def run(self, app, exception=None):
repo_dir = os.path.abspath(os.path.join(app.srcdir, ".."))
metrics_config_path = os.path.join(repo_dir, app.config.scylladb_metrics_config_path)
output_directory = self._create_output_directory(app, app.config.scylladb_metrics_directory)
self._process_metrics_files(repo_dir, output_directory, metrics_config_path)
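
Reviewer note: the selection logic in `MetricsProcessor` above can be summarized as "every `.cc` file that contains the marker and has no JSON output yet". The sketch below isolates that selection in Python without calling the metrics script. `find_metric_sources` is a hypothetical helper, and sorting the file names is added only for deterministic output.

```python
import os

MARKER = "::description"


def find_metric_sources(repo_dir, output_dir):
    """Yield (source_path, destination_path) pairs that would be processed."""
    for root, _, files in os.walk(repo_dir):
        for name in sorted(files):
            if not name.endswith(".cc"):
                continue
            source = os.path.join(root, name)
            # The destination is derived from the base name only.
            destination = os.path.join(
                output_dir, os.path.splitext(name)[0] + ".json")
            with open(source, "r", encoding="utf-8") as f:
                has_marker = MARKER in f.read()
            if has_marker and not os.path.exists(destination):
                yield source, destination
```

One consequence visible in both the sketch and the original: because the destination uses only the base name, two source files with the same name in different directories map to the same JSON file, and the existence check means whichever is processed first wins.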
class MetricsTemplateDirective(DataTemplateJSON):
option_spec = DataTemplateJSON.option_spec.copy()
option_spec["title"] = lambda x: x
def _make_context(self, data, config, env):
context = super()._make_context(data, config, env)
context["title"] = self.options.get("title")
return context
def run(self):
return super().run()


class MetricsOption(ObjectDescription):
    has_content = True
    required_arguments = 1
    optional_arguments = 0
    final_argument_whitespace = False
    option_spec = {
        'type': directives.unchanged,
        'component': directives.unchanged,
        'key': directives.unchanged,
        'source': directives.unchanged,
    }

    doc_field_types = [
        Field('type', label='Type', has_arg=False, names=('type',)),
        Field('component', label='Component', has_arg=False, names=('component',)),
        Field('key', label='Key', has_arg=False, names=('key',)),
        Field('source', label='Source', has_arg=False, names=('source',)),
    ]

    def handle_signature(self, sig: str, signode: addnodes.desc_signature):
        signode.clear()
        signode += addnodes.desc_name(sig, sig)
        return ws_re.sub(' ', sig)

    @property
    def env(self):
        return self.state.document.settings.env

    def _render(self, name, option_type, component, key, source):
        item = {'name': name, 'type': option_type, 'component': component, 'key': key, 'source': source}
        template = self.config.scylladb_metrics_option_template
        return self.env.app.builder.templates.render(template, item)

    def transform_content(self, contentnode: addnodes.desc_content) -> None:
        name = self.arguments[0]
        option_type = self.options.get('type', '')
        component = self.options.get('component', '')
        key = self.options.get('key', '')
        source_file = self.options.get('source', '')
        _, lineno = self.get_source_info()
        source = f'scylladb_metrics:{lineno}:<{name}>'
        fields = StringList(self._render(name, option_type, component, key, source_file).splitlines(), source=source, parent_offset=lineno)
        with switch_source_input(self.state, fields):
            self.state.nested_parse(fields, 0, contentnode)

    def add_target_and_index(self, name: str, sig: str, signode: addnodes.desc_signature) -> None:
        node_id = make_id(self.env, self.state.document, self.objtype, name)
        signode['ids'].append(node_id)
        self.state.document.note_explicit_target(signode)
        entry = f'{name}; metrics option'
        self.indexnode['entries'].append(('pair', entry, node_id, '', None))
        self.env.get_domain('std').note_object(self.objtype, name, node_id, location=signode)


class MetricsDirective(Directive):
    TEMPLATE = 'metrics.tmpl'
    required_arguments = 0
    optional_arguments = 1
    option_spec = {'template': directives.path}
    has_content = True

    def _process_file(self, file, relative_path_from_current_rst):
        data_directive = MetricsTemplateDirective(
            name=self.name,
            arguments=[os.path.join(relative_path_from_current_rst, file)],
            options=self.options,
            content=self.content,
            lineno=self.lineno,
            content_offset=self.content_offset,
            block_text=self.block_text,
            state=self.state,
            state_machine=self.state_machine,
        )
        data_directive.options["template"] = self.options.get('template', self.TEMPLATE)
        data_directive.options["title"] = file.replace('_', ' ').replace('.json','').capitalize()
        return data_directive.run()

    def _get_relative_path(self, output_directory, app, docname):
        current_rst_path = os.path.join(app.builder.srcdir, docname + ".rst")
        return os.path.relpath(output_directory, os.path.dirname(current_rst_path))

    def run(self):
        maybe_add_filters(self.state.document.settings.env.app.builder)
        app = self.state.document.settings.env.app
        docname = self.state.document.settings.env.docname
        metrics_directory = os.path.join(app.builder.srcdir, app.config.scylladb_metrics_directory)
        output = []
        try:
            relative_path_from_current_rst = self._get_relative_path(metrics_directory, app, docname)
            files = os.listdir(metrics_directory)
            for _, file in enumerate(files):
                output.extend(self._process_file(file, relative_path_from_current_rst))
        except Exception as error:
            LOGGER.info(error)
        return output


def setup(app):
    app.add_config_value("scylladb_metrics_directory", default="_data/metrics", rebuild="html")
    app.add_config_value("scylladb_metrics_config_path", default='scripts/metrics-config.yml', rebuild="html")
    app.add_config_value('scylladb_metrics_option_template', default='metrics_option.tmpl', rebuild='html', types=[str])
    app.connect("builder-inited", MetricsProcessor().run)
    app.add_object_type(
        'metrics_option',
        'metrics_option',
        objname='metrics option')
    app.add_directive_to_domain('std', 'metrics_option', MetricsOption, override=True)
    app.add_directive("metrics_option", MetricsOption)
    app.add_directive("scylladb_metrics", MetricsDirective)

    return {
        "version": "0.1",
        "parallel_read_safe": True,
        "parallel_write_safe": True,
    }

44
docs/_ext/utils.py Normal file
View File

@@ -0,0 +1,44 @@
def readable_desc(description: str) -> str:
    """
    This function is deprecated and maintained only for backward compatibility
    with previous versions. Use ``readable_desc_rst`` instead.
    """
    return (
        description.replace("\\n", "")
        .replace('<', '&lt;')
        .replace('>', '&gt;')
        .replace("\n", "<br>")
        .replace("\\t", "- ")
        .replace('"', "")
    )


def readable_desc_rst(description):
    indent = ' ' * 3
    lines = description.split('\n')
    cleaned_lines = []

    for line in lines:
        cleaned_line = line.replace('\\n', '\n')

        if line.endswith('"'):
            cleaned_line = cleaned_line[:-1] + ' '

        cleaned_line = cleaned_line.lstrip()
        cleaned_line = cleaned_line.replace('"', '')

        if cleaned_line != '':
            cleaned_line = indent + cleaned_line
            cleaned_lines.append(cleaned_line)

    return ''.join(cleaned_lines)


def maybe_add_filters(builder):
    env = builder.templates.environment
    if 'readable_desc' not in env.filters:
        env.filters['readable_desc'] = readable_desc

    if 'readable_desc_rst' not in env.filters:
        env.filters['readable_desc_rst'] = readable_desc_rst
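
To make the behavior of the new filter easier to see, here is a minimal standalone sketch. It copies `readable_desc_rst` from the diff above, assuming the `append` call sits inside the final `if` (the flattened source suggests this but does not prove it). The sample description strings are made up for illustration.

```python
def readable_desc_rst(description):
    # Copy of the filter from docs/_ext/utils.py, with indentation reconstructed.
    indent = ' ' * 3
    lines = description.split('\n')
    cleaned_lines = []

    for line in lines:
        # Turn a literal backslash-n sequence into a real newline.
        cleaned_line = line.replace('\\n', '\n')

        # A trailing quote marks a C++ string continuation; replace it with a space.
        if line.endswith('"'):
            cleaned_line = cleaned_line[:-1] + ' '

        cleaned_line = cleaned_line.lstrip()
        cleaned_line = cleaned_line.replace('"', '')

        # Empty lines are dropped; everything else gets a 3-space RST indent.
        if cleaned_line != '':
            cleaned_line = indent + cleaned_line
            cleaned_lines.append(cleaned_line)

    return ''.join(cleaned_lines)


# Hypothetical description split across two C++ string literals.
sample = 'first line"\n    "second line'
result = readable_desc_rst(sample)
print(repr(result))
```

Each non-empty source line is emitted with a three-space indent, which is what lets the rendered text nest under the `metrics_option` directive in `metrics.tmpl`.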

View File

@@ -41,6 +41,6 @@ dl dt:hover > a.headerlink {
visibility: visible;
}
dl.confval {
dl.confval, dl.metrics_option {
border-bottom: 1px solid #cacaca;
}

19
docs/_templates/metrics.tmpl vendored Normal file
View File

@@ -0,0 +1,19 @@
.. -*- mode: rst -*-

{{title}}
{{ '-' * title|length }}

{% if data %}
{% for key, value in data.items() %}
.. _metricsprop_{{ key }}:

.. metrics_option:: {{ key }}
   :type: {{value[0]}}
   :source: {{value[4]}}
   :component: {{value[2]}}
   :key: {{value[3]}}

{{value[1] | readable_desc_rst}}

{% endfor %}
{% endif %}

3
docs/_templates/metrics_option.tmpl vendored Normal file
View File

@@ -0,0 +1,3 @@
{% if type %}* **Type:** ``{{ type }}``{% endif %}
{% if component %}* **Component:** ``{{ component }}``{% endif %}
{% if key %}* **Key:** ``{{ key }}``{% endif %}

View File

@@ -21,6 +21,9 @@
# remove the Open Source vs. Enterprise Matrix from the Open Source docs
/stable/reference/versions-matrix-enterprise-oss.html: https://enterprise.docs.scylladb.com/stable/reference/versions-matrix-enterprise-oss.html
# Remove the outdated Troubleshooting article
/stable/troubleshooting/error-messages/create-mv.html: /stable/troubleshooting/index.html
# Remove the Learn page (replaced with a link to a page in a different repo)

View File

@@ -117,9 +117,9 @@ request. Alternator can then validate the authenticity and authorization of
each request using a known list of authorized key pairs.
In the current implementation, the user stores the list of allowed key pairs
in the `system_auth_v2.roles` table: The access key ID is the `role` column, and
in the `system.roles` table: The access key ID is the `role` column, and
the secret key is the `salted_hash`, i.e., the secret key can be found by
`SELECT salted_hash from system_auth_v2.roles WHERE role = ID;`.
`SELECT salted_hash from system.roles WHERE role = ID;`.
<!--- REMOVE IN FUTURE VERSIONS - Remove the note below in version 6.1 -->

Binary file not shown (size: 18 KiB).

Binary file not shown (size: 21 KiB).

View File

@@ -4,6 +4,7 @@ ScyllaDB Architecture
:titlesonly:
:hidden:
Data Distribution with Tablets </architecture/tablets>
ScyllaDB Ring Architecture <ringarchitecture/index/>
ScyllaDB Fault Tolerance <architecture-fault-tolerance>
Consistency Level Console Demo <console-CL-full-demo>
@@ -13,6 +14,7 @@ ScyllaDB Architecture
Raft Consensus Algorithm in ScyllaDB </architecture/raft>
* :doc:`Data Distribution with Tablets </architecture/tablets/>` - Tablets in ScyllaDB
* :doc:`ScyllaDB Ring Architecture </architecture/ringarchitecture/index/>` - High-Level view of ScyllaDB Ring Architecture
* :doc:`ScyllaDB Fault Tolerance </architecture/architecture-fault-tolerance>` - Deep dive into ScyllaDB Fault Tolerance
* :doc:`Consistency Level Console Demo </architecture/console-CL-full-demo>` - Console Demos of Consistency Level Settings

View File

@@ -0,0 +1,131 @@
=========================================
Data Distribution with Tablets
=========================================
A ScyllaDB cluster is a group of interconnected nodes. The data of the entire
cluster has to be distributed as evenly as possible across those nodes.
ScyllaDB is designed to ensure a balanced distribution of data by storing data
in tablets. When you add or remove nodes to scale your cluster, add or remove
a datacenter, or replace a node, tablets are moved between the nodes to keep
the same number on each node. In addition, tablets are balanced across shards
in each node.
This article explains the concept of tablets and how they let you scale your
cluster quickly and seamlessly.
Data Distribution
-------------------
ScyllaDB distributes data by splitting tables into tablets. Each tablet has
its replicas on different nodes, depending on the RF (replication factor). Each
partition of a table is mapped to a single tablet in a deterministic way. When you
query or update the data, ScyllaDB can quickly identify the tablet that stores
the relevant partition. 
The following example shows a 3-node cluster with a replication factor (RF) of
3. The data is stored in a table (Table 1) with two rows. Both rows are mapped
to one tablet (T1) with replicas on all three nodes.
.. image:: images/tablets-cluster.png
.. TODO - Add a section about tablet splitting when there are more triggers,
like throughput. In 6.0, tablets only split when reaching a threshold size
(the threshold is based on the average tablet data size).
Load Balancing
==================
ScyllaDB autonomously moves tablets to balance the load. This process
is managed by a load balancer mechanism and happens independently of
the administrator. The tablet load balancer decides where to migrate
the tablets, either within the same node to balance the shards or across
the nodes to balance the global load in the cluster.
As a table grows, each tablet can split into two, creating a new tablet.
The load balancer can migrate the split halves independently to different nodes
or shards.
The load-balancing process takes place in the background and is performed
without any service interruption.
Scaling Out
=============
A tablet can be dynamically migrated to an existing node or a newly added
empty node. Paired with consistent topology updates with Raft, tablets allow
you to add multiple nodes simultaneously. After nodes are added to the cluster,
existing nodes stream data to the new ones, and the system load eventually
converges to an even distribution as the process completes.
With tablets enabled, manual cleanup is not required.
Cleanup is performed automatically per tablet,
making tablets-based streaming user-independent and safer.
In addition, tablet cleanup is lightweight and efficient, as it doesn't
involve rewriting SSTables on the existing nodes, which makes data ownership
changes faster. This dramatically reduces
the impact of cleanup on the performance of user queries.
The following diagrams show migrating tablets from heavily loaded nodes A and B
to a new node.
.. image:: images/tablets-load-balancing.png
.. _tablets-enable-tablets:
Enabling Tablets
-------------------
ScyllaDB now uses tablets by default for data distribution. This functionality is
controlled by the :confval:`enable_tablets` option. However, tablets only work if
enabled on all nodes within the cluster.
When creating a new keyspace with tablets enabled (the default), you can still disable
them on a per-keyspace basis. The recommended ``NetworkTopologyStrategy`` for keyspaces
remains *required* when using tablets.
You can create a keyspace with tablets
disabled with the ``tablets = {'enabled': false}`` option:

.. code:: cql

    CREATE KEYSPACE my_keyspace
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'replication_factor': 3
    } AND tablets = {
        'enabled': false
    };

.. warning::

    You cannot ALTER a keyspace to enable or disable tablets.
    The only way to update the tablet support for a keyspace is to DROP it
    (losing the schema and data) and then recreate it after redefining
    the keyspace schema with ``tablets = { 'enabled': false }`` or
    ``tablets = { 'enabled': true }``.

Limitations and Unsupported Features
--------------------------------------
The following ScyllaDB features are not supported if a keyspace has tablets
enabled:
* Counters
* Change Data Capture (CDC)
* Lightweight Transactions (LWT)
* Alternator (as it uses LWT)
If you plan to use any of the above features, CREATE your keyspace
:ref:`with tablets disabled <tablets-enable-tablets>`.
Resharding in keyspaces with tablets enabled has the following limitations:
* ScyllaDB does not support reducing the number of shards after node restart.
* ScyllaDB does not reshard data on node restart. Tablet replicas remain
allocated to the old shards on restart and are subject to background
load-balancing to additional shards after restart completes and the node
starts serving CQL.

View File

@@ -44,7 +44,8 @@ extensions = [
"scylladb_gcp_images",
"scylladb_include_flag",
"scylladb_dynamic_substitutions",
"scylladb_swagger"
"scylladb_swagger",
"scylladb_metrics"
]
# The suffix(es) of source filenames.
@@ -127,6 +128,10 @@ scylladb_swagger_origin_api = "../api"
scylladb_swagger_template = "swagger.tmpl"
scylladb_swagger_inc_template = "swagger_inc.tmpl"
# -- Options for scylladb_metrics
scylladb_metrics_directory = "_data/opensource/metrics"
# -- Options for HTML output
# The theme to use for pages.

View File

@@ -107,12 +107,6 @@ For example:
WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 3}
AND durable_writes = true;
.. TODO Add a link to the description of minimum_keyspace_rf when the ScyllaDB options section is added to the docs.
You can configure the minimum acceptable replication factor using the ``minimum_keyspace_rf`` option.
Attempting to create a keyspace with a replication factor lower than the value set with
``minimum_keyspace_rf`` will return an error (the default value is 0).
The supported ``options`` are:
=================== ========== =========== ========= ===================================================================
@@ -122,7 +116,7 @@ name kind mandatory default description
details below).
``durable_writes`` *simple* no true Whether to use the commit log for updates on this keyspace
(disable this option at your own risk!).
``tablets`` *map* no Experimental - enables tablets for this keyspace (see :ref:`tablets<tablets>`)
``tablets`` *map* no Enables or disables tablets for the keyspace (see :ref:`tablets<tablets>`)
=================== ========== =========== ========= ===================================================================
The ``replication`` property is mandatory and must at least contain the ``'class'`` sub-option, which defines the
@@ -142,7 +136,12 @@ query latency. For a production ready strategy, see *NetworkTopologyStrategy* .
========================= ====== ======= =============================================
sub-option type since description
========================= ====== ======= =============================================
``'replication_factor'`` int all The number of replicas to store per range
``'replication_factor'`` int all The number of replicas to store per range.
The replication factor should be equal to
or lower than the number of nodes.
Configuring a higher RF may prevent
creating tables in that keyspace.
========================= ====== ======= =============================================
.. note:: Using NetworkTopologyStrategy is recommended. Using SimpleStrategy will make it harder to add a data center in the future.
@@ -166,6 +165,11 @@ sub-option type description
definitions or explicit datacenter settings.
For example, to have three replicas per
datacenter, supply this with a value of 3.
The replication factor configured for a DC
should be equal to or lower than the number
of nodes in that DC. Configuring a higher RF
may prevent creating tables in that keyspace.
===================================== ====== =============================================
Note that when ``ALTER`` ing keyspaces and supplying ``replication_factor``,
@@ -213,39 +217,30 @@ An example that excludes a datacenter while using ``replication_factor``::
.. _tablets:
The ``tablets`` property :label-caution:`Experimental`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``tablets`` property
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``tablets`` property is used to make keyspace replication tablets-based.
It is only valid when ``experimental_features: tablets`` is specified in ``scylla.yaml`` (which
in turn requires ``consistent_cluster_management: true``); it must be a power of two.
The ``tablets`` property enables or disables tablets-based distribution
for a keyspace.
Options:
===================================== ====== =============================================
sub-option type description
===================================== ====== =============================================
``'enabled'`` bool Whether or not to enable tablets for keyspace
``'enabled'`` bool Whether or not to enable tablets for a keyspace
``'initial'`` int The number of tablets to start with
===================================== ====== =============================================
By default if tablets cluster feature is enabled, any keyspace will be created with tablets
enabled. The ``tablets`` option is used to opt-out a keyspace from tablets replication.
By default, a keyspace is created with tablets enabled. The ``tablets`` option
is used to opt out a keyspace from tablets-based distribution; see :ref:`Enabling Tablets <tablets-enable-tablets>`
for details.
A good rule of thumb to calculate initial tablets is to divide the expected total storage used
by tables in this keyspace by (``replication_factor`` * 5GB). For example, if you expect a 30TB
table and have a replication factor of 3, divide 30TB by (3*5GB) for a result of 2000. Since the
value must be a power of two, round up to 2048.
.. note::
The calculation applies to every table in the keyspace independently; so it can only realistically be
used for a keyspace containing a single table. It is expected that per-table controls will be available
in the future.
.. caution::
The ``initial`` option may change its definition or be completely removed as it is part
of an experimental feature.
The calculation applies to every table in the keyspace.
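
The rule of thumb above can be sketched in a few lines of Python. This is an illustration only: the function name and parameters are hypothetical, sizes are taken in decimal gigabytes (so 30TB is 30,000GB, matching the worked example), and the 5GB target per tablet comes from the text.

```python
import math


def initial_tablets(expected_storage_gb, replication_factor, target_gb_per_tablet=5):
    """Estimate the 'initial' tablets value using the rule of thumb from the docs.

    Hypothetical helper: divides the expected storage by
    (replication_factor * target size) and rounds up to a power of two.
    """
    raw = math.ceil(expected_storage_gb / (replication_factor * target_gb_per_tablet))
    raw = max(raw, 1)  # always start with at least one tablet
    # Round up to the next power of two (a value that already is one stays unchanged).
    return 1 << (raw - 1).bit_length()


# The example from the text: a 30TB table with a replication factor of 3.
print(initial_tablets(30_000, 3))
```

For the inputs above, the intermediate quotient is 2000, which rounds up to 2048, the value used in the keyspace example.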
An example that creates a keyspace with 2048 tablets per table::
@@ -257,6 +252,9 @@ An example that creates a keyspace with 2048 tablets per table::
'initial': 2048
};
See :doc:`Data Distribution with Tablets </architecture/tablets>` for more information about tablets.
.. _use-statement:
USE
@@ -289,6 +287,17 @@ For instance::
The supported options are the same as :ref:`creating a keyspace <create-keyspace-statement>`.
ALTER KEYSPACE with Tablets :label-caution:`Experimental`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Modifying a keyspace with tablets enabled is possible and doesn't require any special CQL syntax. However, there are some limitations:
- The replication factor (RF) can be increased or decreased by at most 1 at a time. To reach the desired RF value, modify the RF repeatedly.
- The ``ALTER`` statement rejects the ``replication_factor`` tag. List the DCs explicitly when altering a keyspace. See :ref:`NetworkTopologyStrategy <replication-strategy>`.
- If there's any other ongoing global topology operation, executing the ``ALTER`` statement will fail (with an explicit and specific error) and needs to be repeated.
- The ``ALTER`` statement may take longer than the regular query timeout, and even if it times out, it will continue to execute in the background.
- The replication strategy cannot be modified, as keyspaces with tablets only support ``NetworkTopologyStrategy``.
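
Because the RF can change by at most 1 per statement, reaching a distant RF takes a sequence of ``ALTER`` statements. The sketch below generates that sequence for a single datacenter. The helper names, the keyspace name, and the datacenter name are hypothetical, and each generated statement would still need to be executed and allowed to complete before the next one is issued.

```python
def rf_steps(current, target):
    """Return the intermediate RF values needed to go from current to target,
    changing the RF by exactly 1 at each step."""
    if current < 0 or target < 0:
        raise ValueError("replication factors must be non-negative")
    if current == target:
        return []
    step = 1 if target > current else -1
    return list(range(current + step, target + step, step))


def alter_statements(keyspace, datacenter, current, target):
    """Build one ALTER KEYSPACE statement per RF step.

    The datacenter is listed explicitly because, per the limitations above,
    the ``replication_factor`` tag is rejected for keyspaces with tablets.
    """
    return [
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{datacenter}': {rf}}};"
        for rf in rf_steps(current, target)
    ]


# Hypothetical keyspace and datacenter names.
for statement in alter_statements("my_keyspace", "dc1", 1, 3):
    print(statement)
```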
.. _drop-keyspace-statement:
DROP KEYSPACE

View File

@@ -341,7 +341,7 @@ The `--authenticator` command lines option allows to provide the authenticator c
#### `--authorizer AUTHORIZER`
The `--authorizer` command-line option lets you specify the authorizer class ScyllaDB will use. By default, ScyllaDB uses the `AllowAllAuthorizer`, which allows any action to any user. The second option is the `CassandraAuthorizer`, which stores permissions in the `system_auth_v2.permissions` table.
The `--authorizer` command-line option lets you specify the authorizer class ScyllaDB will use. By default, ScyllaDB uses the `AllowAllAuthorizer`, which allows any action to any user. The second option is the `CassandraAuthorizer`, which stores permissions in the `system.permissions` table.
**Since: 2.3**

View File

@@ -6,7 +6,7 @@ There are two system tables that are used to facilitate the service level featur
### Service Level Attachment Table
```
CREATE TABLE system_auth_v2.role_attributes (
CREATE TABLE system.role_attributes (
role text,
attribute_name text,
attribute_value text,
@@ -23,7 +23,7 @@ So for example in order to find out which `service_level` is attached to role `r
one can run the following query:
```
SELECT * FROM system_auth_v2.role_attributes WHERE role='r' and attribute_name='service_level'
SELECT * FROM system.role_attributes WHERE role='r' and attribute_name='service_level'
```
@@ -157,4 +157,4 @@ The command displays a table with: option name, effective service level the valu
----------------------+-------------------------+-------------
workload_type | sl2 | batch
timeout | sl1 | 2s
```
```

63
docs/dev/task_manager.md Normal file
View File

@@ -0,0 +1,63 @@
Task manager is a tool for tracking long-running background
operations.
# Structure overview
Task manager is divided into modules, e.g. repair or compaction
module, which keep track of operations of similar nature. Operations
are tracked with tasks.
Each task covers a logical part of the operation, e.g. repair
of a keyspace or a table. Each operation is covered by a tree
of tasks, e.g. global repair task is a parent of tasks covering
a single keyspace, which are parents of table tasks.
# Time to live of a task
Root tasks are kept in task manager for `task_ttl` time after they are
finished. `task_ttl` value can be set in node configuration with
`--task-ttl-in-seconds` option or changed with task manager API
(`/task_manager/ttl`).
A task which isn't a root is unregistered immediately after it is
finished and its status is folded into its parent. When a task
is being folded into its parent, info about each of its children is
lost unless the child or any child's descendant failed.
# Internal
Tasks can be marked as `internal`, which means they are not listed
by default. A task should be marked as internal if it has a parent
or if it's supposed to be unregistered immediately after it's finished.
# Abortable
A flag which determines if a task can be aborted through API.
# Type vs scope
`type` of a task describes what operation is covered by a task,
e.g. "major compaction".
`scope` of a task describes for which part of the operation
the task is responsible, e.g. "shard".
# API
Documentation for task manager API is available under `api/api-doc/task_manager.json`.
Briefly:
- `/task_manager/list_modules` -
lists modules supported by task manager;
- `/task_manager/list_module_tasks/{module}` -
lists (by default non-internal) tasks in the module;
- `/task_manager/task_status/{task_id}` -
gets the task's status, unregisters the task if it's finished;
- `/task_manager/abort_task/{task_id}` -
aborts the task if it's abortable;
- `/task_manager/wait_task/{task_id}` -
waits for the task and gets its status;
- `/task_manager/task_status_recursive/{task_id}` -
gets statuses of the task and all its descendants in BFS
order, unregisters the task;
- `/task_manager/ttl` -
sets new ttl, returns old value.
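
As a rough illustration of how the endpoints above are addressed, the sketch below only builds request URLs; it does not send anything. The base URL (`http://localhost:10000`) is an assumption about where a node's REST API listens, the helper name is hypothetical, and the HTTP method and query parameters for each endpoint should be taken from `api/api-doc/task_manager.json`.

```python
from urllib.parse import quote

# Assumption: the node's REST API is reachable at this address.
BASE_URL = "http://localhost:10000"


def task_manager_url(endpoint, *path_params, base_url=BASE_URL):
    """Build the URL of a task manager endpoint.

    `endpoint` is the part after /task_manager/, e.g. "list_modules";
    `path_params` fill in trailing segments such as {module} or {task_id}.
    """
    parts = ["task_manager", endpoint]
    parts.extend(quote(str(param), safe="") for param in path_params)
    return base_url.rstrip("/") + "/" + "/".join(parts)


print(task_manager_url("list_modules"))
print(task_manager_url("list_module_tasks", "repair"))
# Hypothetical task ID, used only to show the URL shape.
print(task_manager_url("task_status", "0f1e2d3c-0000-0000-0000-000000000000"))
```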

View File

@@ -549,7 +549,10 @@ CREATE TABLE system.topology (
committed_cdc_generations set<tuple<timestamp, timeuuid>> static,
unpublished_cdc_generations set<tuple<timestamp, timeuuid>> static,
global_topology_request text static,
global_topology_request_id timeuuid static,
new_cdc_generation_data_uuid timeuuid static,
new_keyspace_rf_change_ks_name text static,
new_keyspace_rf_change_data frozen<map<text, text>> static,
PRIMARY KEY (key, host_id)
)
```
@@ -575,8 +578,11 @@ There are also a few static columns for cluster-global properties:
- `committed_cdc_generations` - the IDs of the committed CDC generations
- `unpublished_cdc_generations` - the IDs of the committed yet unpublished CDC generations
- `global_topology_request` - if set, contains one of the supported global topology requests
- `global_topology_request_id` - if set, contains the ID of the global topology request, which is a new group 0 state ID
- `new_cdc_generation_data_uuid` - used in `commit_cdc_generation` state, the time UUID of the generation to be committed
- `upgrade_state` - describes the progress of the upgrade to raft-based topology.
- `new_keyspace_rf_change_ks_name` - the name of the keyspace targeted by the scheduled ALTER KEYSPACE statement
- `new_keyspace_rf_change_data` - the keyspace options to use when executing the scheduled ALTER KEYSPACE statement
# Join procedure

View File

@@ -1,15 +1,15 @@
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
+----------------------------+-------------+---------------+---------------+
| Linux Distributions |Ubuntu | Debian | Rocky / |
| | | | RHEL |
+----------------------------+------+------+-------+-------+-------+-------+
| ScyllaDB Version / Version |20.04 |22.04 | 10 | 11 | 8 | 9 |
+============================+======+======+=======+=======+=======+=======+
| 6.0 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+-------+-------+-------+-------+
| 5.4 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+-------+-------+-------+-------+
+----------------------------+--------------------+-------+---------------+
| Linux Distributions |Ubuntu | Debian| Rocky / |
| | | | RHEL |
+----------------------------+------+------+------+-------+-------+-------+
| ScyllaDB Version / Version |20.04 |22.04 |24.04 | 11 | 8 | 9 |
+============================+======+======+======+=======+=======+=======+
| 6.0 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
| 5.4 | |v| | |v| | |x| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
* The recommended OS for ScyllaDB Open Source is Ubuntu 22.04.
* All releases are available as a Docker container and EC2 AMI, GCP, and Azure images.

View File

@@ -40,7 +40,6 @@ Join the ScyllaDB Open Source community:
* Contribute to the ScyllaDB Open Source `project <https://github.com/scylladb/scylladb>`_.
* Join the `ScyllaDB Community Forum <https://forum.scylladb.com/>`_.
* Join our `Slack Channel <https://slack.scylladb.com/>`_.
* Sign up for the `scylladb-users <https://groups.google.com/d/forum/scylladb-users>`_ Google group.
Learn How to Use ScyllaDB
---------------------------

View File

@@ -3,16 +3,31 @@ nodetool decommission
**decommission** - Deactivate a selected node by streaming its data to the next node in the ring.
.. note::
You cannot decommission a node if any existing node is down.
For example:
``nodetool decommission``
.. include:: /operating-scylla/_common/decommission_warning.rst
Use the ``nodetool netstats`` command to monitor the progress of the token reallocation.
.. note::
You cannot decommission a node if any existing node is down.
See :doc:`Remove a Node from a ScyllaDB Cluster (Down Scale) </operating-scylla/procedures/cluster-management/remove-node>`
for procedure details.
Before you run ``nodetool decommission``:
* Review current disk space utilization on existing nodes and make sure the amount
of data streamed from the node being removed can fit into the disk space available
on the remaining nodes. If there is not enough disk space on the remaining nodes,
the removal of a node will fail. Add more storage to remaining nodes **before**
starting the removal procedure.
* Make sure that the number of nodes remaining in the DC after you decommission a node
will be the same or higher than the Replication Factor configured for the keyspace
in this DC. If the number of remaining nodes is lower than the RF, the decommission
request may fail.
In such a case, ALTER the keyspace to reduce the RF before running ``nodetool decommission``.
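
The replication factor precondition above amounts to a simple comparison, sketched here. The function name and inputs are hypothetical; in practice the node count would come from ``nodetool status`` and the RF from the keyspace definition, and the check would be repeated for every keyspace replicated in the DC.

```python
def rf_allows_node_removal(nodes_in_dc, replication_factor):
    """Return True if, after removing one node, the DC still has at least
    as many nodes as the keyspace's replication factor for that DC."""
    remaining_nodes = nodes_in_dc - 1
    return remaining_nodes >= replication_factor


# A 3-node DC with RF=3 would drop below the RF after a decommission.
print(rf_allows_node_removal(3, 3))
# A 4-node DC with RF=3 keeps enough nodes.
print(rf_allows_node_removal(4, 3))
```

If the check returns ``False``, reduce the RF with ALTER KEYSPACE first, as described above.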
.. include:: nodetool-index.rst

View File

@@ -2,14 +2,28 @@ Nodetool describering
=====================
**describering** - :code:`<keyspace>`- Shows the partition ranges of a given keyspace.
For example:
.. code-block:: shell
nodetool describering nba
Example output (for three node cluster on AWS):
If :doc:`tablets </architecture/tablets>` are enabled for your keyspace, you
need to additionally specify the table name. The command will display the ring
of the table.
.. code:: shell
nodetool describering <keyspace> <table>
For example:
.. code-block:: shell
nodetool describering nba player_name
Example output (for a three-node cluster on AWS with tablets disabled):
.. code-block:: shell

View File

@@ -21,9 +21,16 @@ is removed from the cluster or replaced.
Prerequisites
------------------------
Using ``removenode`` requires at least a quorum of nodes in a cluster to be available.
If the quorum is lost, it must be restored before you change the cluster topology.
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
* Using ``removenode`` requires at least a quorum of nodes in a cluster to be available.
If the quorum is lost, it must be restored before you change the cluster topology.
See :doc:`Handling Node Failures </troubleshooting/handling-node-failures>` for details.
* Make sure that the number of nodes remaining in the DC after you remove a node
will be the same or higher than the Replication Factor configured for the keyspace
in this DC. If the number of remaining nodes is lower than the RF, the removenode
request may fail. In such a case, you should follow the procedure to
:doc:`replace a dead node </operating-scylla/procedures/cluster-management/replace-dead-node>`
instead of running ``nodetool removenode``.
Usage
--------

View File

@@ -1,10 +1,12 @@
Nodetool ring
=============
**ring** ``[<keyspace>]`` - The nodetool ring command displays the token
**ring** ``[<keyspace>] [<table>]`` - The nodetool ring command displays the token
ring information. The token ring is responsible for managing the
partitioning of data within the Scylla cluster. This command is
critical if a cluster is facing data consistency issues.
By default, the ``ring`` command shows all keyspaces.
For example:
.. code:: sh
@@ -16,13 +18,23 @@ tokens that are assigned to each one of them. It will also show the
status of each of the nodes.
+------------+-----+-----------+-------+--------------+-----------+---------------------------+
|Address |Rack | Status |State | Load | Owns | Token |
|Address |Rack | Status |State | Load | Owns | Token |
+============+=====+===========+=======+==============+===========+===========================+
|172.30.0.64 | 1b | Up | Normal|551.31 MB | Mykespace | 1006916943685901788 |
|172.30.0.64 | 1b | Up | Normal|551.31 MB | Mykespace | 1006916943685901788 |
+------------+-----+-----------+-------+--------------+-----------+---------------------------+
|172.30.0.62 | 1b | Up | Normal|541.59 MB | Mykespace | 1024434117767101090 |
+------------+-----+-----------+-------+--------------+-----------+---------------------------+
|172.30.0.61 | 1b | Up | Normal|541.59 MB | Mykespace | 1043327858966261499 |
|172.30.0.61 | 1b | Up | Normal|541.59 MB | Mykespace | 1043327858966261499 |
+------------+-----+-----------+-------+--------------+-----------+---------------------------+
You can specify a ``<keyspace>`` name to limit the output to
a specific keyspace. The optional ``<table>`` argument narrows
the output further. For keyspaces with :doc:`tablets </architecture/tablets>`
enabled, you must provide both ``<keyspace>`` and ``<table>``; the command
then displays the partition ranges for that specific table.
.. code:: sh
nodetool ring <keyspace> <table>
.. include:: nodetool-index.rst


@@ -2,11 +2,14 @@ Nodetool status
===============
**status** - This command prints the cluster information for a single keyspace or all keyspaces.
The keyspace argument is required to calculate effective ownership information (``Owns`` column).
For tablet keyspaces, a table is also required for effective ownership.
For example:
::
nodetool status
nodetool status my_keyspace
Example output:


@@ -29,11 +29,16 @@ With time, SSTables are compacted, but the hard link keeps a copy of each file.
| 1. Data can only be restored from a snapshot of the table schema, where data exists in a backup. Back up your schema with the following command:
| ``$: cqlsh -e "DESC SCHEMA" > <schema_name.cql>``
| ``$: cqlsh -e "DESC SCHEMA WITH INTERNALS" > <schema_name.cql>``
For example:
| ``$: cqlsh -e "DESC SCHEMA" > db_schema.cql``
| ``$: cqlsh -e "DESC SCHEMA WITH INTERNALS" > db_schema.cql``
.. warning::
To get a proper schema description, you need cqlsh version ``6.0.19`` or later. Restoring a schema backup created by
an older version of cqlsh may lead to data resurrection or data loss. To check the version of your cqlsh, you can use ``cqlsh --version``.
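The version requirement in this warning could be checked in a backup script before the schema is dumped. The sketch below is a minimal illustration: the helper names are hypothetical, and the exact text printed by ``cqlsh --version`` is an assumption, so the parser simply takes the first ``X.Y.Z`` number it finds in the string it is given.

```python
import re

MIN_CQLSH = (6, 0, 19)  # minimum version named in the warning above


def parse_version(text: str) -> tuple:
    """Return the first dotted X.Y.Z number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def cqlsh_is_new_enough(version_output: str, minimum: tuple = MIN_CQLSH) -> bool:
    """Compare the parsed version against the minimum, component by component."""
    return parse_version(version_output) >= minimum


# The argument stands in for whatever `cqlsh --version` prints (format assumed).
print(cqlsh_is_new_enough("cqlsh 6.0.19"))
print(cqlsh_is_new_enough("cqlsh 6.0.18"))
```

Comparing integer tuples avoids the pitfall of comparing version strings lexicographically, where ``6.0.9`` would sort after ``6.0.19``.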
|
| 2. Take a snapshot, including every keyspace you want to back up.


@@ -17,8 +17,8 @@ limitations while applying the procedure:
retry, or the node refuses to boot on subsequent attempts, consult the
:doc:`Handling Membership Change Failures </operating-scylla/procedures/cluster-management/handling-membership-change-failures>`
document.
* The ``system_auth`` keyspace has not been upgraded to ``system_auth_v2``.
* The ``system_auth`` keyspace has not been upgraded to ``system``.
As a result, if ``authenticator`` is set to ``PasswordAuthenticator``, you must
increase the replication factor of the ``system_auth`` keyspace. It is
recommended to set ``system_auth`` replication factor to the number of nodes
in each DC.
in each DC.


@@ -156,7 +156,9 @@ Add New DC
UN 54.160.174.243 109.54 KB 256 ? c7686ffd-7a5b-4124-858e-df2e61130aaa RACK1
UN 54.235.9.159 109.75 KB 256 ? 39798227-9f6f-4868-8193-08570856c09a RACK1
UN 54.146.228.25 128.33 KB 256 ? 7a4957a1-9590-4434-9746-9c8a6f796a0c RACK1
.. TODO possibly provide additional information WRT how ALTER works with tablets
#. When all nodes are up and running, ``ALTER`` the following keyspaces on the new nodes:
* Keyspace created by the user (which needed to replicate to the new DC).


@@ -70,11 +70,46 @@ Step One: Determining Host IDs of Ghost Members
If you cannot determine the ghost members' host ID using the suggestions above, use the method described below.
#. Make sure there are no ongoing membership changes.
#. Execute the following CQL query on one of your nodes to obtain the host IDs of all token ring members:
#. Execute the following CQL query on one of your nodes to retrieve the Raft group 0 ID:
.. code-block:: cql
select peer, host_id, up from system.cluster_status;
select value from system.scylla_local where key = 'raft_group0_id'
For example:
.. code-block:: cql
cqlsh> select value from system.scylla_local where key = 'raft_group0_id';
value
--------------------------------------
607fef80-c276-11ed-a6f6-3075f294cc65
#. Use the obtained Raft group 0 ID to query the set of all cluster members' host IDs (which includes the ghost members), by executing the following query:
.. code-block:: cql
select server_id from system.raft_state where group_id = <group0_id>
replace ``<group0_id>`` with the group 0 ID that you obtained. For example:
.. code-block:: cql
cqlsh> select server_id from system.raft_state where group_id = 607fef80-c276-11ed-a6f6-3075f294cc65;
server_id
--------------------------------------
26a9badc-6e96-4b86-a8df-5173e5ab47fe
7991e7f5-692e-45a0-8ae5-438be5bc7c4f
aff11c6d-fbe7-4395-b7ca-3912d7dba2c6
#. Execute the following CQL query to obtain the host IDs of all token ring members:
.. code-block:: cql
select peer, host_id, up from system.cluster_status;
For example:
@@ -83,25 +118,28 @@ If you cannot determine the ghost members' host ID using the suggestions above,
cqlsh> select peer, host_id, up from system.cluster_status;
peer | host_id | up
-----------+--------------------------------------+-------
127.0.0.3 | 42405b3b-487e-4759-8590-ddb9bdcebdc5 | False
127.0.0.1 | 4e3ee715-528f-4dc9-b10f-7cf294655a9e | True
127.0.0.2 | 225a80d0-633d-45d2-afeb-a5fa422c9bd5 | True
-----------+--------------------------------------+-------
127.0.0.3 | null | False
127.0.0.1 | 26a9badc-6e96-4b86-a8df-5173e5ab47fe | True
127.0.0.2 | 7991e7f5-692e-45a0-8ae5-438be5bc7c4f | True
The output of this query is similar to the output of ``nodetool status``.
We included the ``up`` column to see which nodes are down.
We included the ``up`` column to see which nodes are down and the ``peer`` column to see their IP addresses.
In this example, one of the 3 nodes tried to decommission but crashed while it was leaving the token ring. The node is in a partially left state and will refuse to restart, but other nodes still consider it as a normal member. We'll have to use ``removenode`` to clean up after it.
In this example, one of the nodes tried to decommission and crashed as soon as it left the token ring but before it left the Raft group. Its entry will show up in ``system.cluster_status`` queries with ``host_id = null``, like above, until the cluster is restarted.
#. A host ID belongs to a ghost member if it appears in the ``system.cluster_status`` query but does not correspond to any remaining node in your cluster.
#. A host ID belongs to a ghost member if:
* It appears in the ``system.raft_state`` query but not in the ``system.cluster_status`` query,
* Or it appears in the ``system.cluster_status`` query but does not correspond to any remaining node in your cluster.
In our example, the ghost member's host ID was ``aff11c6d-fbe7-4395-b7ca-3912d7dba2c6`` because it appeared in the ``system.raft_state`` query but not in the ``system.cluster_status`` query.
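The two ghost-member conditions above amount to set differences over host IDs. The following is a minimal sketch using the values from the example outputs. The function name and its parameters are hypothetical, the query results are assumed to have been copied out of ``cqlsh`` by hand, and ``live_host_ids`` is assumed to hold the host IDs collected from the nodes that actually remain in the cluster.

```python
def find_ghost_members(raft_server_ids, cluster_status_rows, live_host_ids):
    """Return host IDs that match either ghost-member condition.

    raft_server_ids:     server_id values from system.raft_state
    cluster_status_rows: (peer, host_id, up) rows from system.cluster_status
    live_host_ids:       host IDs of the nodes still in the cluster
    """
    # Rows with host_id = null carry no ID to compare, so skip them.
    ring_ids = {host_id for _peer, host_id, _up in cluster_status_rows
                if host_id is not None}
    # Condition 1: present in Raft group 0 but absent from the token ring.
    raft_only = set(raft_server_ids) - ring_ids
    # Condition 2: present in the token ring but not a remaining node.
    ring_but_gone = ring_ids - set(live_host_ids)
    return raft_only | ring_but_gone


raft_ids = [
    "26a9badc-6e96-4b86-a8df-5173e5ab47fe",
    "7991e7f5-692e-45a0-8ae5-438be5bc7c4f",
    "aff11c6d-fbe7-4395-b7ca-3912d7dba2c6",
]
status_rows = [
    ("127.0.0.3", None, False),
    ("127.0.0.1", "26a9badc-6e96-4b86-a8df-5173e5ab47fe", True),
    ("127.0.0.2", "7991e7f5-692e-45a0-8ae5-438be5bc7c4f", True),
]
live_ids = [
    "26a9badc-6e96-4b86-a8df-5173e5ab47fe",
    "7991e7f5-692e-45a0-8ae5-438be5bc7c4f",
]
print(find_ghost_members(raft_ids, status_rows, live_ids))
```

With the example data, the only ID returned is ``aff11c6d-fbe7-4395-b7ca-3912d7dba2c6``, matching the conclusion drawn in the text.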
If you're unsure whether a given row in the ``system.cluster_status`` query corresponds to a node in your cluster, you can connect to each node in the cluster and execute ``select host_id from system.local`` (or search the node's logs) to obtain that node's host ID, collecting the host IDs of all nodes in your cluster. Then check if each host ID from the ``system.cluster_status`` query appears in your collected set; if not, it's a ghost member.
A good rule of thumb is to look at the members marked as down (``up = False`` in ``system.cluster_status``): ghost members are eventually marked as down by the remaining members of the cluster. But remember that a real member might also be marked as down if it was shut down or partitioned away from the rest of the cluster. If in doubt, connect to each node and collect their host IDs, as described in the previous paragraph.
In our example, the ghost member's host ID is ``42405b3b-487e-4759-8590-ddb9bdcebdc5`` because it is the only member marked as down and we can verify that the other two rows appearing in ``system.cluster_status`` belong to the remaining 2 nodes in the cluster.
In some cases, even after a failed topology change, there may be no ghost members left - for example, if a bootstrapping node crashed very early in the procedure or a decommissioning node crashed after it committed the membership change but before it finalized its own shutdown steps.
If any ghost members are present, proceed to the next step.


@@ -190,11 +190,11 @@ In this case, the node's data will be cleaned after restart. To remedy this, you
#. Start Scylla Server
.. include:: /rst_include/scylla-commands-stop-index.rst
.. include:: /rst_include/scylla-commands-start-index.rst
Sometimes the public/private IP address of the instance changes after a restart. If so, refer to the Replace Procedure_ above.
.. _replace-node-upgrade-info:
.. scylladb_include_flag:: upgrade-warning-replace-node.rst
.. scylladb_include_flag:: upgrade-warning-replace-node.rst


@@ -31,10 +31,10 @@ Procedure
cqlsh -u cassandra -p cassandra
.. warning::
.. note::
Before proceeding to the next step, we highly recommend creating a custom superuser
to ensure security and prevent performance degradation.
Before proceeding to the next step, we recommend creating a custom superuser
to improve security.
See :doc:`Creating a Custom Superuser </operating-scylla/security/create-superuser/>` for instructions.
#. If you want to create users and roles, continue to :doc:`Enable Authorization </operating-scylla/security/enable-authorization>`.


@@ -6,12 +6,7 @@ The default ScyllaDB superuser role is ``cassandra`` with password ``cassandra``
Users with the ``cassandra`` role have full access to the database and can run
any CQL command on the database resources.
During login, the credentials for the default superuser ``cassandra`` are read with
a consistency level of QUORUM, whereas those for all other roles are read at LOCAL_ONE.
QUORUM may significantly impact performance, especially in multi-datacenter deployments.
To prevent performance degradation and ensure better security, we highly recommend creating
a custom superuser. You should:
To improve security, we recommend creating a custom superuser. You should:
#. Use the default ``cassandra`` superuser to log in.
#. Create a custom superuser.


@@ -57,13 +57,13 @@ Set a Superuser
The default ScyllaDB superuser role is ``cassandra`` with password ``cassandra``. Using the default
superuser is unsafe and may significantly impact performance.
If you haven't created a custom superuser while enablint authentication, you should create a custom superuser
If you haven't created a custom superuser while enabling authentication, you should create a custom superuser
before creating additional roles.
See :doc:`Creating a Custom Superuser </operating-scylla/security/create-superuser/>` for instructions.
.. warning::
.. note::
We highly recommend creating a custom superuser to ensure security and avoid performance degradation.
We recommend creating a custom superuser to improve security.
.. _roles:


@@ -0,0 +1 @@
`ScyllaDB Enterprise vs. Open Source Matrix <https://enterprise.docs.scylladb.com/stable/reference/versions-matrix-enterprise-oss.html>`_


@@ -0,0 +1,11 @@
.. toctree::
:maxdepth: 2
:hidden:
AWS Images </reference/aws-images>
Azure Images </reference/azure-images>
GCP Images </reference/gcp-images>
Configuration Parameters </reference/configuration-parameters>
Glossary </reference/glossary>
API Reference (BETA) </reference/api-reference>
Metrics (BETA) </reference/metrics>


@@ -2,8 +2,16 @@
Reference
===============
.. toctree::
:maxdepth: 1
:glob:
.. scylladb_include_flag:: reference-toc.rst
/reference/*
* ScyllaDB images for AWS, Azure, and GCP.
* :doc:`AWS Images </reference/aws-images>`
* :doc:`Azure Images </reference/azure-images>`
* :doc:`GCP Images </reference/gcp-images>`
* :doc:`Configuration Parameters </reference/configuration-parameters>` - ScyllaDB properties configurable in the ``scylla.yaml`` configuration file.
* :doc:`Glossary </reference/glossary>` - ScyllaDB-related terms and definitions.
* :doc:`API Reference (BETA) </reference/api-reference>`
* :doc:`Metrics (BETA) </reference/metrics>`
* .. scylladb_include_flag:: enterprise-vs-oss-matrix-link.rst


@@ -0,0 +1,6 @@
==============
Metrics (BETA)
==============
.. scylladb_metrics::
:template: metrics.tmpl


@@ -1,95 +0,0 @@
A Removed Node was not Removed Properly from the Seed Node List
===============================================================
Phenonoma
^^^^^^^^^
Failed to create :doc:`materialized view </cql/mv>` after node was removed from the cluster.
Error message:
.. code-block:: shell
InvalidRequest: Error from server: code=2200 [Invalid query] message="Can't create materialized views until the whole cluster has been upgraded"
Problem
^^^^^^^
A removed node was not removed properly from the seed node list.
Scylla Open Source 4.3 and later and Scylla Enterprise 2021.1 and later are seedless. See :doc:`Scylla Seed Nodes </kb/seed-nodes/>` for details.
This problem may occur in an earlier version of Scylla.
How to Verify
^^^^^^^^^^^^^
Scylla logs show the error message above.
To verify that the node wasn't remove properly use the :doc:`nodetool gossipinfo </operating-scylla/nodetool-commands/gossipinfo>` command
For example:
A three nodes cluster, with one node (54.62.0.101) removed.
.. code-block:: shell
nodetool gossipinfo
/54.62.0.99
generation:1172279348
heartbeat:7212
LOAD:2.0293227179E10
INTERNAL_IP:10.240.0.83
DC:E1
STATUS:NORMAL,-872190912874367364312
HOST_ID:12fdcf43-4642-53b1-a987-c0e825e4e10a
RPC_ADDRESS:10.240.0.83
RACK:R1
/54.62.0.100
generation:1657463198
heartbeat:8135
LOAD:2.0114638716E12
INTERNAL_IP:10.240.0.93
DC:E1
STATUS:NORMAL,-258152127640110957173
HOST_ID:99acbh55-1013-24a1-a987-s1w718c1e01b
RPC_ADDRESS:10.240.0.93
RACK:R1
/54.62.0.101
generation:1657463198
heartbeat:7022
LOAD:2.5173672157E48
INTERNAL_IP:10.240.0.103
DC:E1
STATUS:NORMAL,-365481201980413697284
HOST_ID:99acbh55-1301-55a1-a628-s4w254c1e01b
RPC_ADDRESS:10.240.0.103
RACK:R1
We can see that node ``54.62.0.101`` is still part of the cluster and needs to be removed.
Solution
^^^^^^^^
Remove the relevant node from the other nodes seed list (under scylla.yaml) and restart the nodes one by one.
For example:
Seed list before remove the node
.. code-block:: shell
- seeds: "10.240.0.83,10.240.0.93,10.240.0.103"
Seed list after removing the node
.. code-block:: shell
- seeds: "10.240.0.83,10.240.0.93"
Restart Scylla nodes
.. include:: /rst_include/scylla-commands-restart-index.rst

Some files were not shown because too many files have changed in this diff.