Compare commits

...

264 Commits

Author SHA1 Message Date
Yaron Kaikov
502c62d91d install-dependencies.sh: update node_exporter to 1.9.0
Update node_exporter to 1.9.0 to resolve the following CVEs:
https://github.com/advisories/GHSA-49gw-vxvf-fc2g
https://github.com/advisories/GHSA-8xfx-rj4p-23jm
https://github.com/advisories/GHSA-crqm-pwhx-j97f
https://github.com/advisories/GHSA-j7vj-rw65-4v26

Fixes: https://github.com/scylladb/scylladb/issues/22884

regenerate frozen toolchain with optimized clang from
* https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-aarch64.tar.gz
* https://devpkg.scylladb.com/clang/clang-18.1.8-Fedora-40-x86_64.tar.gz

Closes scylladb/scylladb#22987

(cherry picked from commit e6227f9a25)

Closes scylladb/scylladb#23021
2025-04-21 23:12:28 +03:00
Avi Kivity
8ea6f5824a Update seastar submodule
* seastar 6d8fccf14c...ed31c1ce82 (1):
  > Merge 'Share IO queues between mountpoints' from Pavel Emelyanov

Changes in io-queue call for a scylla-gdb update as well -- now the
reactor map of device to io-queue uses seastar::shared_ptr, not
std::unique_ptr.

Closes scylladb/scylladb#23733

Ref 70ac5828a8

Fixes #23820
2025-04-20 13:31:17 +03:00
Avi Kivity
6a8b033510 Merge '[Backport 2025.1] managed_bytes: in the copy constructor, respect the target preferred allocation size' from Scylladb[bot]
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).

In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.

2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

This is a regression fix, should be backported to all affected releases.

- (cherry picked from commit 4e2f62143b)

- (cherry picked from commit 6c1889f65c)

Parent PR: #23782

Closes scylladb/scylladb#23810

* github.com:scylladb/scylladb:
  managed_bytes_test: add a reproducer for #23781
  managed_bytes: in the copy constructor, respect the target preferred allocation size
2025-04-19 18:42:45 +03:00
Botond Dénes
e19ab12f5e Merge '[Backport 2025.1] service/storage_proxy: schedule_repair(): materialize the range into a vector' from Scylladb[bot]
Said method passes down its diff input to mutate_internal(), after some std::ranges massaging. Said massaging is destructive -- it moves items from the diff. If the output range is iterated-over multiple times, only the first time will see the actual output, further iterations will get an empty range.

When trace-level logging is enabled, this is exactly what happens: mutate_internal() iterates over the range multiple times, first to log its content, then to pass it down the stack. This ends up resulting in an empty range being passed down and write handlers being created with nullopt optionals.

Fixes: scylladb/scylladb#21907
Fixes: scylladb/scylladb#21714

A follow-up stability fix for the test is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23513
Fixes: https://github.com/scylladb/scylladb/issues/23512

Based on code inspection, all versions are vulnerable, although <=6.2 use boost::ranges, not std::ranges.

- (cherry picked from commit 7150442f6a)

Parent PR: #21910

Closes scylladb/scylladb#23791

* github.com:scylladb/scylladb:
  test/cluster/test_read_repair.py: increase read request timeout
  service/storage_proxy: schedule_repair(): materialize the range into a vector
2025-04-18 14:04:23 +03:00
Anna Stuchlik
f8615b8c53 doc: add info about Scylla Doctor Automation to the docs
Fixes https://github.com/scylladb/scylladb/issues/23642

Closes scylladb/scylladb#23745

(cherry picked from commit 0b4740f3d7)

Closes scylladb/scylladb#23776
2025-04-18 14:03:54 +03:00
Botond Dénes
34ea9af232 Merge '[Backport 2025.1] tablets: rebuild: use repair for tablet rebuild' from Scylladb[bot]
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent running out of space during streaming.

- (cherry picked from commit b80e957a40)

- (cherry picked from commit ed7b8bb787)

- (cherry picked from commit 5d6041617b)

- (cherry picked from commit 4a847df55c)

- (cherry picked from commit eb17af6143)

- (cherry picked from commit acd32b24d3)

- (cherry picked from commit 372b562f5e)

Parent PR: #23187

Closes scylladb/scylladb#23682

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-18 14:03:25 +03:00
Michał Chojnowski
e6a2d67be4 managed_bytes_test: add a reproducer for #23781
(cherry picked from commit 6c1889f65c)
2025-04-18 07:55:46 +00:00
Michał Chojnowski
d1d21d97e1 managed_bytes: in the copy constructor, respect the target preferred allocation size
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).

In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest
   power of 2.

2. With a pathological-enough allocation pattern
   (for example, one which somehow ends up placing a single 16 kiB
   memtable-owned allocation in every aligned 128 kiB span),
   memtable flushing could theoretically deadlock,
   because the allocator might be too fragmented to let the memtable
   grow by another 128 kiB segment, while keeping the sum of all
   allocations small enough to avoid triggering a flush.
   (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious
   allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether
   reclamation succeeded by checking that the number of free LSA
   segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong
   because the reclaim might have met its target by evicting the non-LSA
   allocations, in which case memory is returned directly to the
   standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing`
   to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

(cherry picked from commit 4e2f62143b)
2025-04-18 07:55:46 +00:00
Botond Dénes
440141a4ff test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode, causing the test read to time
out. Increase the timeout to 1 minute to avoid this.

Fixes: #23513

Closes scylladb/scylladb#23558

(cherry picked from commit 7bbfa5293f)
2025-04-18 06:31:42 +03:00
Nadav Har'El
6b35eea1a9 Merge '[Backport 2025.1] Alternator batch rcu' from Scylladb[bot]
This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator.
It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes.

Needs backporting to 2025.1, as RCU and WCU are not fully supported there.

Fixes #23690

- (cherry picked from commit 0eabf8b388)

- (cherry picked from commit 88095919d0)

- (cherry picked from commit 3acde5f904)

Parent PR: #23691

Closes scylladb/scylladb#23790

* github.com:scylladb/scylladb:
  test_returnconsumedcapacity.py: test RCU for batch get item
  alternator/executor: Add RCU support for batch get items
  alternator/consumed_capacity: make functionality public
2025-04-17 21:39:58 +03:00
Botond Dénes
b14ae92f4f service/storage_proxy: schedule_repair(): materialize the range into a vector
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being passed down and consequently write
handlers being created with nullopt mutations.

Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.

Fixes: scylladb/scylladb#21907
Fixes: scylladb/scylladb#21714

Closes scylladb/scylladb#21910

(cherry picked from commit 7150442f6a)
2025-04-17 11:15:19 +00:00
Benny Halevy
fd6c7c53b8 token_group_based_splitting_mutation_writer: maybe_switch_to_new_writer: prevent double close
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.

Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.

Fixes #22715

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22922

(cherry picked from commit 29b795709b)

Closes scylladb/scylladb#22965
2025-04-17 12:59:55 +02:00
Amnon Heiman
9434bd81b3 test_returnconsumedcapacity.py: test RCU for batch get item
This patch adds tests for consumed capacity in batch get item.  It tests
both the simple case and the multi-item, multi-table case that combines
consistent and non-consistent reads.

(cherry picked from commit 3acde5f904)
2025-04-17 10:30:18 +00:00
Amnon Heiman
0761eacf68 alternator/executor: Add RCU support for batch get items
This patch adds RCU support for batch get items.  With batch requests,
multiple objects are read from multiple tables. While the criterion for
adding the units is per the batch request, the units are calculated per
table—and so is the read consistency.

(cherry picked from commit 88095919d0)
2025-04-17 10:30:18 +00:00
Amnon Heiman
8bb4ee49da alternator/consumed_capacity: make functionality public
The consumed_capacity_counter is not completely applicable for batch
operations.  This patch makes some of its functionality public so that
batch get item can use the components to decide if it needs to send
consumed capacity in the reply, to get the half units used by the
metrics and returned result, and to allow an empty constructor for the
RCU counter.

(cherry picked from commit 0eabf8b388)
2025-04-17 10:30:18 +00:00
Avi Kivity
a89cdfc253 scylla-gdb: small-objects: fix for very small objects
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.

The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.

While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.

Fixes #23603

Closes scylladb/scylladb#23604

(cherry picked from commit b4d4e48381)

Closes scylladb/scylladb#23749
2025-04-16 14:37:43 +03:00
Botond Dénes
998bfe908f Merge '[Backport 2025.1] Fix EAR not applied on write to S3 (but on read).' from Scylladb[bot]
Fixes #23225
Fixes #23185

Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves
extension wrapping of file and sink objects to storage level.
(Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not).

This ensures we apply encryption on both read and write, whereas we previously only did so on read, leading to failures.
Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR.

Unit tests for io objects and a macro test for S3 encrypted storage included.

- (cherry picked from commit 98a6d0f79c)

- (cherry picked from commit e100af5280)

- (cherry picked from commit d46dcbb769)

- (cherry picked from commit e02be77af7)

- (cherry picked from commit 9ac9813c62)

- (cherry picked from commit 5c6337b887)

Parent PR: #23261

Closes scylladb/scylladb#23424

* github.com:scylladb/scylladb:
  encryption: Add "wrap_sink" to encryption sstable extension
  encrypted_file_impl: Add encrypted_data_sink
  sstables::storage: Move wrapping sstable components to storage provider
  sstables::file_io_extension: Add a "wrap_sink" method.
  sstables::file_io_extension: Make sstable argument to "wrap" const
  utils: Add "io-wrappers", useful IO helper types
2025-04-16 09:32:23 +03:00
Calle Wilund
0eed7f8f29 encryption: Add "wrap_sink" to encryption sstable extension
Creates a more efficient data_sink wrapper for encrypted output
stream (S3).

(cherry picked from commit 5c6337b887)
2025-04-15 11:00:22 +00:00
Calle Wilund
f174b419a4 encrypted_file_impl: Add encrypted_data_sink
Adds a sibling type to encrypted file, a data_sink, that
will write a data stream in the same block format as a file
object would, including end padding.

This makes encrypted data sink writing less cumbersome.

(cherry picked from commit 9ac9813c62)
2025-04-15 11:00:22 +00:00
Calle Wilund
ac4c7a7ad2 sstables::storage: Move wrapping sstable components to storage provider
Fixes #23225
Fixes #23185

Moves wrapping of component files/sinks to the storage provider. Also
ensures that data_sinks are wrapped as well as actual files, so that we
actually write encryption when it is active.

(cherry picked from commit e02be77af7)
2025-04-15 11:00:22 +00:00
Calle Wilund
6feb95ffad sstables::file_io_extension: Add a "wrap_sink" method.
Similar to wrap file, should wrap a data_sink (used for
sstable writers), in obvious write-only, simple stream
mode.

The default impl will detect if we wrap files for this component,
and if so, generate a file wrapper for the input sink, wrap
this, and then wrap it in a file_data_sink_impl.

This is obviously not efficient, so extensions used in actual
non-test code should implement the method.

(cherry picked from commit d46dcbb769)
2025-04-15 11:00:22 +00:00
Calle Wilund
b6ec0961ca sstables::file_io_extension: Make sstable argument to "wrap" const
This matches the signature of call sites. Since the only "real"
extension to actually make a marker in the sstable will do so in
the scylla component, which is writable even in a const sstable,
this is ok.

(cherry picked from commit e100af5280)
2025-04-15 10:36:47 +00:00
Calle Wilund
9a10458500 utils: Add "io-wrappers", useful IO helper types
Mainly to add a somewhat functional file-impl wrapping
a data_sink. This can implement a rudimentary, write-only,
file based on any output sink.

For testing, and because they fit there, place memory
sink and source types there as well.

(cherry picked from commit 98a6d0f79c)
2025-04-15 10:36:47 +00:00
Pavel Emelyanov
263416201c Merge '[Backport 2025.1] audit: add semaphore to audit_syslog_storage_helper' from Scylladb[bot]
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in concurrent packets overwriting
the channel's state. This, in turn, causes corruption of audit logs, as
well as assertion failures.

To work around the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.

This change:
 - Moves syslog_send_helper to audit_syslog_storage_helper
 - Coroutinizes audit_syslog_storage_helper
 - Introduces a semaphore with count=1 in audit_syslog_storage_helper.

See https://github.com/scylladb/scylla-dtest/pull/5749 for the related dtest.
Fixes: scylladb/scylladb#22973

Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1.

- (cherry picked from commit dbd2acd2be)

- (cherry picked from commit 889fd5bc9f)

- (cherry picked from commit c12f976389)

Parent PR: #23464

Closes scylladb/scylladb#23674

* github.com:scylladb/scylladb:
  audit: add semaphore to audit_syslog_storage_helper
  audit: coroutinize audit_syslog_storage_helper
  audit: moved syslog_send_helper to audit_syslog_storage_helper
2025-04-15 12:37:48 +03:00
Jenkins Promoter
42db149393 Update ScyllaDB version to: 2025.1.2 2025-04-15 12:13:36 +03:00
Pavel Emelyanov
4c382fbe7e cql: Remove unused "initial_tablets" mention from guardrails
All tablets configuration was moved into its own "with tablets" section,
so this option name can no longer appear among replication factors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23555

(cherry picked from commit d4f3a3ee4f)

Closes scylladb/scylladb#23676
2025-04-15 11:01:50 +03:00
David Garcia
7588789b02 fix: openapi not rendering in docs.scylladb.com/manual
Closes scylladb/scylladb#23686

(cherry picked from commit cf11d5eb69)

Closes scylladb/scylladb#23710
2025-04-15 10:58:59 +03:00
Jenkins Promoter
a0faf0bde0 Update pgo profiles - aarch64 2025-04-15 04:33:44 +03:00
Jenkins Promoter
a503e74bf5 Update pgo profiles - x86_64 2025-04-15 04:10:13 +03:00
Aleksandra Martyniuk
6702849f32 test: add test for rebuild with repair
(cherry picked from commit 372b562f5e)
2025-04-14 12:01:58 +02:00
Aleksandra Martyniuk
5c683449b3 locator: service: move to rebuild_v2 transition if cluster is upgraded
If the cluster is upgraded to a version containing the rebuild_v2
transition kind, move to this transition kind instead of rebuild.

(cherry picked from commit acd32b24d3)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
2436c24db7 locator: service: add transition to rebuild_repair stage for rebuild_v2
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.

(cherry picked from commit eb17af6143)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
6a251df136 locator: service: add rebuild_repair tablet transition stage
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streraming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.

The repair bypasses the tablet repair scheduler and it does not update
the repair_time.

A transition to the rebuild_repair stage will be added in the following
patches.

(cherry picked from commit 4a847df55c)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
0e152e9f2e locator: add maybe_get_primary_replica
Add maybe_get_primary_replica to choose a primary replica out of
custom replica set.

(cherry picked from commit 5d6041617b)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
5a42418a19 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transitions and stages for the rebuild_v2 transition kind will be
added in the following patches.

(cherry picked from commit ed7b8bb787)
2025-04-14 11:55:28 +02:00
Aleksandra Martyniuk
4df0ddb11f gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
(cherry picked from commit b80e957a40)
2025-04-14 11:55:28 +02:00
Botond Dénes
c1dce79847 Merge '[Backport 2025.1] Finalize tablet splits earlier' from Scylladb[bot]
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

This PR fixes the issue by updating the load balancer to not schedule any
migrations and to not make any repair plans when a resize finalization is
pending in any table.

Also added a testcase to verify the fix.

Fixes #21762

- (cherry picked from commit 8cabc66f07)

- (cherry picked from commit 5b47d84399)

- (cherry picked from commit dccce670c1)

Parent PR: #22148

Closes scylladb/scylladb#23633

* github.com:scylladb/scylladb:
  topology_coordinator: fix indentation in generate_migration_updates
  topology_coordinator: do not schedule migrations when there are pending resize finalizations
  load_balancer: make repair plans only when there is no pending resize finalization
2025-04-14 06:44:57 +03:00
Botond Dénes
251db77fcb mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.

This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to a query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.

This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.

A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.

Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
  underlying consumer's on_end_of_stream(), so when consuming multiple
  frozen mutations, the underlying's on_end_of_stream() is called for
  each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().

Fixes: #20084

Closes scylladb/scylladb#23657

(cherry picked from commit d67202972a)

Closes scylladb/scylladb#23694
2025-04-11 10:53:31 +03:00
Botond Dénes
f7761729cc Merge '[Backport 2025.1] nodetool: cluster repair: add a command to repair tablet keyspaces' from Scylladb[bot]
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.

The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.

The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/22409.

Needs backport to 2025.1 that introduces the new tablet repair API

- (cherry picked from commit cbde835792)

- (cherry picked from commit b81c81c7f4)

- (cherry picked from commit aa3973c850)

- (cherry picked from commit 8bbc5e8923)

- (cherry picked from commit 02fb71da42)

- (cherry picked from commit 9769d7a564)

Parent PR: #22905

Closes scylladb/scylladb#23672

* github.com:scylladb/scylladb:
  docs: nodetool: update repair and add tablet-repair docs
  test: nodetool: add tests for cluster repair command
  nodetool: add cluster repair command
  nodetool: repair: extract getting hosts and dcs to functions
  nodetool: repair: warn about repairing tablet keyspaces
  nodetool: repair: move keyspace_uses_tablets function
2025-04-11 10:53:03 +03:00
Raphael S. Carvalho
75cd8e9492 replica: Fix truncate and drop table after tablet migration happens
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.

Status quo (around the assert in truncate procedure):

1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.

Note: truncated_at is likely above low_mark_at due to steps 2 and 3.

The interesting part of the assert is:
    (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)

Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().

If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.

truncated_at and low_mark_at use millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during the same ms window), failing the assert. This is fragile.

Reproducer:

To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.

If a shard has 2 or more tablets, the table's highest RP refers to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).

So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.

Proposed solution:

The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.

We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).

But the problem is that we're saving low_mark in step 1, before the
preparation steps (2 and 3). When truncated_at is recorded in step 4, it's
a way of saying all data written so far is targeted for removal, whereas
low_mark refers only to data written up to step 1. So low_mark is now set
only once, right before issuing the flush, so that it also accounts for all
potentially flushed data.

Fixes #18059.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23560

(cherry picked from commit 0f59deffaa)
(cherry picked from commit 7554d4bbe09967f9b7a55575b5dfdde4f6616862)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23649
2025-04-11 10:52:37 +03:00
Raphael S. Carvalho
7007dabdf9 storage_service: Don't retry split when table is dropped
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one,
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is about detecting that the
table was dropped and skipping to the next candidate, if any.

Fixes #21859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22933

(cherry picked from commit 4d8a333a7f)

Closes scylladb/scylladb#23480
2025-04-11 10:52:05 +03:00
Aleksandra Martyniuk
636ec802c3 service: tasks: hold token_metadata_ptr in tablet_virtual_task
Hold token_metadata_ptr in tablet_virtual_task methods that iterate
over tablets, to keep the tablet_map alive.

Fixes: https://github.com/scylladb/scylladb/issues/22316.

Closes scylladb/scylladb#22740

(cherry picked from commit f8e4198e72)

Closes scylladb/scylladb#22937
2025-04-11 10:51:07 +03:00
Avi Kivity
3335557075 Merge '[Backport 2025.1] row_cache: don't garbage-collect tombstones which cover data in memtables' from Scylladb[bot]
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;

In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which was not handled at all until now.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the fix is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252

The fix will need backporting to all live releases.

- (cherry picked from commit c2518cdf1a)

- (cherry picked from commit 6b5b563ef7)

- (cherry picked from commit 7e600a0747)

- (cherry picked from commit d126ea09ba)

- (cherry picked from commit cb76cafb60)

- (cherry picked from commit df09b3f970)

- (cherry picked from commit e5afd9b5fb)

- (cherry picked from commit 34b18d7ef4)

- (cherry picked from commit f7938e3f8b)

- (cherry picked from commit 6c1f6427b3)

- (cherry picked from commit 0d39091df2)

Parent PR: #23255

Closes scylladb/scylladb#23673

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add memtable overlap check tests
  replica/table: add error injection to memtable post-flush phase
  utils/error_injection: add a way to set parameters from error injection points
  test/cluster: add test_data_resurrection_in_memtable.py
  test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
  replica/mutation_dump: don't assume cells are live
  replica/database: do_apply() add error injection point
  replica: improve memtable overlap checks for the cache
  replica/memtable: add is_merging_to_cache()
  db/row_cache: add overlap-check for cache tombstone garbage collection
  mutation/mutation_compactor: copy key passed-in to consume_new_partition()
2025-04-10 21:42:28 +03:00
Avi Kivity
6ff7927d67 sstables: store features early in write path
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know if we can rely on certain
properties in the sstables we read.

Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).

However, we happen to read features before we set them in the write path:
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it were engaged. This somehow worked so far, but fails with the libstdc++
hash table implementation.

Fix it by moving storage of the features to the sstable itself, and
setting them early in the write path.

Fixes #23484

Closes scylladb/scylladb#23485

(cherry picked from commit 73e4a3c581)

Closes scylladb/scylladb#23504
2025-04-10 21:41:09 +03:00
Pavel Emelyanov
1021a3d126 Merge '[Backport 2025.1] Allow abort during join_cluster' from Scylladb[bot]
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only at checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

* requires backport on top of https://github.com/scylladb/scylladb/pull/23184

- (cherry picked from commit 0fc196991a)

- (cherry picked from commit f269480f53)

- (cherry picked from commit 41f02c521d)

Parent PR: #23306

Closes scylladb/scylladb#23461

* github.com:scylladb/scylladb:
  main: allow abort during join_cluster
  main: add checkpoint before joining cluster
  storage_service: add start_sys_dist_ks
2025-04-10 19:03:46 +03:00
Avi Kivity
5d8bb068fa Merge '[Backport 2025.1] streaming: fix the way a reason of streaming failure is determined' from Scylladb[bot]
During streaming, the receiving node gets and processes mutation fragments.
If this operation fails, the receiver responds with a -1 status code, unless
it failed due to no_such_column_family, in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on the receiver
side may get not only data_dictionary::no_such_column_family, but also a
seastar::nested_exception wrapping two no_such_column_family exceptions.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace the try_catch clause with table_sync_and_check, which synchronizes
the schema and checks if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.

Needs backporting to all live versions, as they all contain the bug

- (cherry picked from commit 876cf32e9d)

- (cherry picked from commit faf3aa13db)

- (cherry picked from commit 44748d624d)

- (cherry picked from commit 35bc1fe276)

Parent PR: #22868

Closes scylladb/scylladb#23290

* github.com:scylladb/scylladb:
  streaming: fix the way a reason of streaming failure is determined
  streaming: save a continuation lambda
  streaming: use streaming namespace in table_check.{cc,hh}
  repair: streaming: move table_check.{cc,hh} to streaming
2025-04-10 18:22:16 +03:00
Lakshmi Narayanan Sreethar
fb069f0fbf topology_coordinator: fix indentation in generate_migration_updates
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit dccce670c1)
2025-04-10 18:39:10 +05:30
Lakshmi Narayanan Sreethar
48077b160d topology_coordinator: do not schedule migrations when there are pending resize finalizations
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

To fix this, do not schedule tablet migrations on any tables when there
are pending resize finalizations. This ensures that migrations from the
same table and other unrelated tables do not block resize finalization.

Also added a testcase to verify the fix.

Fixes #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 5b47d84399)
2025-04-10 18:39:10 +05:30
Lakshmi Narayanan Sreethar
c286fc231a load_balancer: make repair plans only when there is no pending resize finalization
Do not make repair plans if any table has a pending resize finalization.
This is to ensure that the finalization doesn't get delayed by repair
tasks.

Refs #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 8cabc66f07)
2025-04-10 18:20:00 +05:30
Botond Dénes
df4872b82a test/boost/row_cache_test: add memtable overlap check tests
Similar to test/cluster/test_data_resurrection_in_memtable.py, but works
on a single node and uses lower-level mechanisms. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.

(cherry picked from commit 0d39091df2)
2025-04-10 06:52:18 -04:00
Botond Dénes
7943db9844 replica/table: add error injection to memtable post-flush phase
After the memtable was flushed to disk, but before it is merged to the
cache. The injection point will only be active for the table specified in
the "table_name" injection parameter.

(cherry picked from commit 6c1f6427b3)
2025-04-10 06:52:18 -04:00
Botond Dénes
bd8c584a01 utils/error_injection: add a way to set parameters from error injection points
With this, it is now possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceeding.

(cherry picked from commit f7938e3f8b)
2025-04-10 06:52:18 -04:00
Botond Dénes
50c05abd14 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.

(cherry picked from commit 34b18d7ef4)
2025-04-10 06:52:18 -04:00
Aleksandra Martyniuk
3a49808707 streaming: fix the way a reason of streaming failure is determined
During streaming, the receiving node gets and processes mutation fragments.
If this operation fails, the receiver responds with a -1 status code, unless
it failed due to no_such_column_family, in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on the receiver
side may get not only data_dictionary::no_such_column_family, but also a
seastar::nested_exception wrapping two no_such_column_family exceptions.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace the try_catch clause with table_sync_and_check, which synchronizes
the schema and checks if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.
(cherry picked from commit 35bc1fe276)
2025-04-10 09:35:56 +02:00
Aleksandra Martyniuk
b57774dea6 streaming: save a continuation lambda
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.

Assign the lambda to a variable to prolong the captures' lifetime.

(cherry picked from commit 44748d624d)
2025-04-10 09:35:55 +02:00
Aleksandra Martyniuk
67b0ea99a0 streaming: use streaming namespace in table_check.{cc,hh}
(cherry picked from commit faf3aa13db)
2025-04-10 09:35:54 +02:00
Aleksandra Martyniuk
7fa0e041eb repair: streaming: move table_check.{cc,hh} to streaming
(cherry picked from commit 876cf32e9d)
2025-04-10 09:34:23 +02:00
Botond Dénes
de1d8372fa test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the returned hosts refers to the same
underlying Scylla instance as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).

(cherry picked from commit e5afd9b5fb)
2025-04-10 03:17:27 -04:00
Botond Dénes
dcc3604e02 replica/mutation_dump: don't assume cells are live
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold, of course, and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.

(cherry picked from commit df09b3f970)
2025-04-10 03:17:27 -04:00
Botond Dénes
39ca3463b3 replica/database: do_apply() add error injection point
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.

(cherry picked from commit cb76cafb60)
2025-04-10 03:17:27 -04:00
Botond Dénes
1c7a6ba140 replica: improve memtable overlap checks for the cache
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable that still has active reads against it.

To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.

(cherry picked from commit d126ea09ba)
2025-04-10 03:17:27 -04:00
Botond Dénes
4febf2a938 replica/memtable: add is_merging_to_cache()
And set it when the memtable is merged to cache.

(cherry picked from commit 7e600a0747)
2025-04-10 03:17:27 -04:00
Botond Dénes
b43d024ffb db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstones which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.

(cherry picked from commit 6b5b563ef7)
2025-04-10 03:17:27 -04:00
Botond Dénes
4bb1969a7f mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to the
additional copy.

This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start; this makes the key reference passed to the compactor
corrupt (it refers to a moved-from value). This in turn means that
subsequent GC checks done by the compactor will be done with a corrupt key
and can therefore result in tombstones being garbage-collected while they
still cover data elsewhere (data resurrection).

The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.

(cherry picked from commit c2518cdf1a)
2025-04-10 03:17:27 -04:00
Anna Stuchlik
6bcf513f11 doc: add enabling consistent topology updates to the 2025.1 upgrade guide-from-2024
This commit adds the procedure to enable consistent topology updates for upgrades
from 2024.1 to 2025.1 (or from 2024.2 to 2025.1 if the feature wasn't enabled
after upgrading from 2024.1 to 2024.2).

Fixes https://github.com/scylladb/scylladb/issues/23650

Closes scylladb/scylladb#23651

(cherry picked from commit 93a7b3ac1d)

Closes scylladb/scylladb#23670
2025-04-10 10:09:23 +03:00
Botond Dénes
b1a995b571 Merge '[Backport 2025.1] tablets: Make tablet allocation equalize per-shard load ' from Scylladb[bot]
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.

Fixes #23378

- (cherry picked from commit d6232a4f5f)

- (cherry picked from commit 6bff596fce)

Parent PR: #23478

Closes scylladb/scylladb#23635

* github.com:scylladb/scylladb:
  tablets: Make tablet allocation equalize per-shard load
  tablets: load_balancer: Fix reporting of total load per node
2025-04-10 10:08:38 +03:00
Botond Dénes
ec7da3d785 tools/scylla-nodetool: s/GetInt()/GetInt64()/
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64-bit integer type to the value, and
attempting to access it as a 32-bit integer triggers the wrong-type error,
resulting in an assert failure. This was hit in the field, where invoking
nodetool netstats resulted in nodetool crashing when the streamed byte
amounts were higher than maxint.

To avoid such bugs in the future, replace all usage of GetInt() in
nodetool with GetInt64(), just to be sure.

A reproducer is added for the nodetool netstats crash.

Fixes: scylladb/scylladb#23394

Closes scylladb/scylladb#23395

(cherry picked from commit bd8973a025)

Closes scylladb/scylladb#23476
2025-04-10 10:05:18 +03:00
Botond Dénes
02d89435a9 Merge '[Backport 2025.1] Ignore wrapped exceptions gate_closed_exception and rpc::closed_error when node shuts down.' from Scylladb[bot]
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325
Fixes scylladb/scylladb#23305
Fixes scylladb/scylladb#21815

Backport: looks like this is quite a frequent issue, therefore backport to 2025.1.

- (cherry picked from commit 6abfed9817)

- (cherry picked from commit b1e89246d4)

- (cherry picked from commit 0d9d0fe60e)

- (cherry picked from commit d448f3de77)

Parent PR: #23336

Closes scylladb/scylladb#23470

* github.com:scylladb/scylladb:
  database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`
  database: Unify exception handling in `do_apply` and `apply_with_commitlog`
  storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down.
  exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.
2025-04-10 10:04:50 +03:00
Kefu Chai
4e500bc806 gms: Fix fmt formatter for gossip_digest_sync
In commit 4812a57f, the fmt-based formatter for gossip_digest_syn had
formatting code for cluster_id, partitioner, and group0_id
accidentally commented out, preventing these fields from being included
in the output. This commit restores the formatting by uncommenting the
code, ensuring full visibility of all fields in the gossip_digest_syn
message when logging permits.

This fixes a regression introduced in 4812a57f, which obscured these
fields and reduced debugging insight. Backporting is recommended for
improved observability.

Fixes #23142
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23155

(cherry picked from commit 2a9966a20e)

Closes scylladb/scylladb#23199
2025-04-10 10:00:37 +03:00
Botond Dénes
0a86511359 Merge '[Backport 2025.1] reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Scylladb[bot]
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if the permit timed out). If the permit also happens to be waiting for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already hold an exception, and setting it again triggers an assert. Add a separate case for checking if the permit is already aborted. If so, treat it as immediate eviction: close the reader and clean up.

Fixes: scylladb/scylladb#22919

Bug is present in all live versions, backports are required.

- (cherry picked from commit 4d8eb02b8d)

- (cherry picked from commit 7ba29ec46c)

Parent PR: #23044

Closes scylladb/scylladb#23145

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
  test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
2025-04-10 09:58:19 +03:00
Asias He
a7ab9149e8 repair: Fix return type for storage_service/tablets/repair API
The API returns the repair task UUID. For example:

{"tablet_task_id":"3597e990-dc4f-11ef-b961-95d5ead302a7"}

Fixes #23032

Closes scylladb/scylladb#23050

(cherry picked from commit 3f59a89e85)

Closes scylladb/scylladb#23090
2025-04-10 09:57:45 +03:00
Piotr Dulikowski
a866dada1d test: test_mv_topology_change: increase timeout for removenode
The test `test_mv_topology_change` is a regression test for
scylladb/scylladb#19529. The problem was that CL=ANY writes issued when
all replicas were down would be kept in memory until the timeout. In
particular, MV updates are CL=ANY writes and have a 5 minute timeout.
When doing topology operations for vnodes or when migrating tablet
replicas, the cluster goes through stages where the replica sets for
writes undergo changes, and the writes started with the old replica set
need to be drained first.

Because of the aforementioned MV updates, the removenode operation could
be delayed by 5 minutes or more. Therefore, the
`test_mv_topology_change` test uses a short timeout for the removenode
operation, i.e. 30s. Apparently, this is too low for the debug mode and
the test has been observed to time out even though the removenode
operation is progressing fine.

Increase the timeout to 60s. This is the lowest timeout for the
removenode operation that we currently use among the in-repo tests, and
is lower than 5 minutes so the test will still serve its purpose.

Fixes: scylladb/scylladb#22953

Closes scylladb/scylladb#22958

(cherry picked from commit 43ae3ab703)

Closes scylladb/scylladb#23053
2025-04-10 09:56:38 +03:00
Lakshmi Narayanan Sreethar
4e51f37c76 topology_coordinator: handle_table_migration: do not continue after executing metadata barrier
Return after executing the global metadata barrier to allow the topology
handler to handle any transitions that might have started by a
concurrect transaction.

Fixes #22792

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#22793

(cherry picked from commit 0f7d08d41d)

Closes scylladb/scylladb#23019
2025-04-10 09:55:53 +03:00
Botond Dénes
c44362451c replica/database: setup_scylla_memory_diagnostics_producer() un-static semaphore dump lambda
The lambda which dumps the diagnostics for each semaphore is static.
Considering that said lambda captures a local (writeln) by reference, this
is wrong on two levels:
* The writeln captured on the shard which happens to initialize this
  static will be used on all shards.
* The writeln captured on the first dump will be used on later dumps,
  possibly triggering a segfault.

Drop the `static` to make the lambda local and resolve this problem.

Fixes: scylladb/scylladb#22756

Closes scylladb/scylladb#22776

(cherry picked from commit 820f196a49)

Closes scylladb/scylladb#22938
2025-04-10 09:54:37 +03:00
Calle Wilund
7b351682ac network_topology_strategy/alter ks: Remove dc:s from options once rf=0
Fixes #22688

If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of the keyspace will fail,
because the union of new/old options will now contain an unknown keyword.

Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit rf=0 in the nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy object's dc mapping, but on the full replica topology info
for which dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.

v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some

Closes scylladb/scylladb#22693

(cherry picked from commit 342df0b1a8)

Closes scylladb/scylladb#22877
2025-04-10 09:53:48 +03:00
Benny Halevy
eaf67dd227 main: allow abort during join_cluster
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only at checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 41f02c521d)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-10 09:02:09 +03:00
Benny Halevy
fa92b6787c main: add checkpoint before joining cluster
(cherry picked from commit f269480f53)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-10 09:01:21 +03:00
Benny Halevy
a86e7ff286 storage_service: add start_sys_dist_ks
Currently, there's a call to
`supervisor::notify("starting system distributed keyspace")`
which is misleading as it is identical to a similar
message in main() when starting the sharded service.

Change that to a storage_service log message
and be more specific that the sys_dist_ks shards are started.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0fc196991a)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-10 09:01:20 +03:00
Andrzej Jackowski
6da78533ed audit: add semaphore to audit_syslog_storage_helper
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to the syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in concurrent packets corrupting its
state. This, in turn, causes corruption of audit logs, as well as
assertion failures.

To workaround the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.

This change:
 - Introduce a semaphore with count=1 in audit_syslog_storage_helper.
 - Add a 1 hour timeout to the semaphore, so semaphore stalls are
   failed just like all other syslog auditing failures.

Fixes: scylladb#22973
(cherry picked from commit c12f976389)
2025-04-09 14:13:24 +00:00
Andrzej Jackowski
1bb35952d7 audit: coroutinize audit_syslog_storage_helper
This change:
 - Coroutinize audit_syslog_storage_helper::syslog_send_helper
 - Coroutinize audit_syslog_storage_helper::start
 - Coroutinize audit_syslog_storage_helper::write

(cherry picked from commit 889fd5bc9f)
2025-04-09 14:13:24 +00:00
Andrzej Jackowski
efb99f29bc audit: moved syslog_send_helper to audit_syslog_storage_helper
This change:
 - Make syslog_send_helper() a method of audit_syslog_storage_helper, so
   syslog_send_helper() can access private members of
   audit_syslog_storage_helper in the next commits.
 - Remove unneeded syslog_send_helper() arguments that are now class
   members.

(cherry picked from commit dbd2acd2be)
2025-04-09 14:13:24 +00:00
Aleksandra Martyniuk
c7f1e1814c docs: nodetool: update repair and add tablet-repair docs
(cherry picked from commit 9769d7a564)
2025-04-09 14:03:29 +00:00
Aleksandra Martyniuk
7bbffb53dd test: nodetool: add tests for cluster repair command
(cherry picked from commit 02fb71da42)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
c5c631f175 nodetool: add cluster repair command
Add a new nodetool cluster repair command that repairs tablet keyspaces.

Users may specify the keyspace and tables that they want to repair.
If the keyspace and tables are not specified, all tablet keyspaces
are repaired.

The command calls the new tablet repair API /storage_service/tablets/repair.

(cherry picked from commit 8bbc5e8923)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
8453d4f987 nodetool: repair: extract getting hosts and dcs to functions
(cherry picked from commit aa3973c850)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
a1b8ae57d9 nodetool: repair: warn about repairing tablet keyspaces
Warn about an attempt to repair a tablet keyspace with nodetool repair.

A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.

(cherry picked from commit b81c81c7f4)
2025-04-09 14:03:28 +00:00
Aleksandra Martyniuk
b500fa498d nodetool: repair: move keyspace_uses_tablets function
(cherry picked from commit cbde835792)
2025-04-09 14:03:28 +00:00
Nadav Har'El
c94d8e2471 Merge '[Backport 2025.1] transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Scylladb[bot]
A default timestamp (not to be confused with the timestamp passed via the 'USING TIMESTAMP' query clause) can be set using the 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of the Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

- (cherry picked from commit ca6bddef35)

- (cherry picked from commit f7e1695068)

Parent PR: #23174

Closes scylladb/scylladb#23524

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-09 14:59:13 +03:00
Kefu Chai
d7265a1bc2 storage_proxy: Prevent integer overflow in abstract_read_executor::execute
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
   modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
   last_modified timestamps

This change skips the cross-DC repair optimization when the write timestamp
is negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.

Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```

Related to previous fix 39325cf which handled negative read_timestamp cases.

Fixes #23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23359

(cherry picked from commit ebf9125728)

Closes scylladb/scylladb#23387
2025-04-09 14:56:10 +03:00
Nadav Har'El
7f19a27f4f Merge '[Backport 2025.1] main: safely check stop_signal in-between starting services' from Scylladb[bot]
To simplify aborting scylla while starting the services,
add a _ready state to stop_signal: until
main is ready to be stopped by the abort_source,
just register that the signal was caught, and
let a check() method poll that, request the abort,
and throw the respective exception only then, at
controlled points in-between starting services --
after each service has started successfully and a deferred
stop action was installed.
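The pattern can be sketched as follows (hypothetical names, not the actual scylla code): the signal handler only records that a signal arrived; check() is polled at controlled points in-between starting services and throws only there.

```cpp
#include <cassert>
#include <atomic>
#include <stdexcept>

// Minimal sketch of the deferred-abort pattern described above.
class stop_signal_sketch {
    std::atomic<bool> _caught{false};
    bool _ready = false;
public:
    void on_signal() { _caught = true; }   // safe to call from a handler
    void set_ready() { _ready = true; }    // abort_source may now be used
    bool ready() const { return _ready; }
    void check() {                         // polled between service starts
        if (_caught) {
            throw std::runtime_error("startup aborted by stop signal");
        }
    }
    bool caught() const { return _caught; }
};
```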

This patch prevents gate_closed_exception from escaping handling
when start-up is aborted early with the stop signal,
which caused https://github.com/scylladb/scylladb/issues/23153
The regression is apparently due to a25c3eaa1c

Fixes https://github.com/scylladb/scylladb/issues/23153

* Requires backport to 2025.1 due to a25c3eaa1c

- (cherry picked from commit 23433f593c)

- (cherry picked from commit 282ff344db)

- (cherry picked from commit feef7d3fa1)

- (cherry picked from commit b6705ad48b)

Parent PR: #23103

Closes scylladb/scylladb#23184

* github.com:scylladb/scylladb:
  main: add checkpoints
  main: safely check stop_signal in-between starting services
  main: move prometheus start message
  main: move per-shard database start message
2025-04-09 14:54:19 +03:00
Nadav Har'El
c6825920a6 alternator: in GetRecords, enforce Limit to be <= 1000
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentations says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.
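The enforcement can be sketched as follows (illustrative names, not the actual Alternator code):

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical sketch of the new validation: GetRecords rejects a
// Limit outside DynamoDB's documented range [1, 1000].
constexpr int max_get_records_limit = 1000;

int validate_limit(int limit) {
    if (limit < 1 || limit > max_get_records_limit) {
        throw std::invalid_argument("Limit must be between 1 and 1000");
    }
    return limit;
}
```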

The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.

Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).

Fixes #23534

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23547

(cherry picked from commit 84fd52315f)

Closes scylladb/scylladb#23643
2025-04-09 12:46:30 +03:00
Botond Dénes
bff75aa812 Merge '[Backport 2025.1] Add tablet enforcing option' from Scylladb[bot]
This series adds a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing
`enable_tablets` option. It can be set to the following values:
    disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
    enabled:  New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
    enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option

`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either case, tablets can be opted in or out using the `tablets={'enabled':...}`
keyspace option, when the keyspace is created.

`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`
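The decision logic described above can be sketched as follows (illustrative names, not the actual ScyllaDB implementation):

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

enum class tablets_mode { disabled, enabled, enforced };

// Hypothetical sketch: whether a new keyspace uses tablets, given the
// config mode and the optional tablets={'enabled':...} CREATE KEYSPACE
// option.
bool keyspace_uses_tablets(tablets_mode mode, std::optional<bool> ks_option) {
    if (mode == tablets_mode::enforced) {
        if (ks_option && !*ks_option) {
            throw std::invalid_argument("tablets cannot be disabled");
        }
        return true;
    }
    // In disabled/enabled modes, the keyspace option (when present)
    // overrides the configured default.
    return ks_option.value_or(mode == tablets_mode::enabled);
}
```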

Fixes scylladb/scylla-enterprise#4355

[Edit: changed `Refs` above to `Fixes` to appease the backport bot gods]

* Requires backport to 2025.1

- (cherry picked from commit c62865df90)

- (cherry picked from commit 62aeba759b)

- (cherry picked from commit 9fac0045d1)

Parent PR: #22273

Closes scylladb/scylladb#23602

* github.com:scylladb/scylladb:
  boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
  tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
  db/config: add tablets_mode_for_new_keyspaces option
2025-04-09 08:47:10 +03:00
Michał Chojnowski
2a74426084 table: fix a race in table::take_storage_snapshot()
`safe_foreach_sstable` doesn't do its job correctly.

It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.

The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.

Remove this function and fix its usages to obtain the set and iterate
over it under the lock.
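The ordering fix can be sketched as follows (hypothetical names, not the actual ScyllaDB code):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical sketch of the fix. The old code copied the sstable set
// first and only then took the deletion lock, so entries could be
// unlinked in between. The fix is to take the lock first and obtain
// the set under it.
struct table_sketch {
    std::mutex deletion_lock;
    std::vector<int> sstables;  // stand-in for the sstable set

    std::vector<int> snapshot_sstables() {
        // Lock first, then obtain and copy the set under the lock.
        std::lock_guard<std::mutex> guard(deletion_lock);
        return sstables;
    }
};
```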

Closes scylladb/scylladb#23397

(cherry picked from commit e23fdc0799)

Closes scylladb/scylladb#23628
2025-04-08 19:07:22 +03:00
Lakshmi Narayanan Sreethar
b7e72b3167 replica/table::do_apply : do not check for async gate's closure
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.
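The gate semantics described above can be sketched as follows (loosely modeled on seastar::gate, with illustrative names):

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical sketch: enter() fails once close() was called, but a
// holder that entered earlier stays valid and may finish its work --
// so code already running under a held gate must not treat
// is_closed() as an error.
class gate_sketch {
    int _holders = 0;
    bool _closed = false;
public:
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_holders;
    }
    void leave() { --_holders; }
    void close() { _closed = true; }  // a real gate also waits for holders
    bool is_closed() const { return _closed; }
    int holders() const { return _holders; }
};
```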

Fixes #23348

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#23579

(cherry picked from commit 750f4baf44)

Closes scylladb/scylladb#23645
2025-04-08 18:59:22 +03:00
Yaron Kaikov
98359dbfb1 .github: Make "make-pr-ready-for-review" workflow run in base repo
In 57683c1a50 we fixed the `token` error,
but removed the checkout step, which now causes the following error:
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Adding the repo checkout stage to avoid this error.

Fixes: https://github.com/scylladb/scylladb/issues/22765

Closes scylladb/scylladb#23641

(cherry picked from commit 2dc7ea366b)

Closes scylladb/scylladb#23654
2025-04-08 13:49:27 +03:00
Benny Halevy
27ca0d1812 boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9fac0045d1)
2025-04-08 08:35:26 +03:00
Benny Halevy
736f89b31a tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`.

Refs scylladb/scylla-enterprise#4355

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 62aeba759b)
2025-04-08 08:35:14 +03:00
Benny Halevy
a49e27ac8f db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced",
which will enable tablets by default for new keyspaces but
without the possibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c62865df90)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-08 08:08:47 +03:00
Tomasz Grabiec
4f4c884d5d tablets: Make tablet allocation equalize per-shard load
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogenous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378

(cherry picked from commit 6bff596fce)
2025-04-07 18:14:11 +02:00
Tomasz Grabiec
55bfbe8ea3 tablets: load_balancer: Fix reporting of total load per node
Load is now utilization, not count, so we should report average
per-shard load, which is equivalent to node's utilization.

(cherry picked from commit d6232a4f5f)
2025-04-07 15:51:08 +00:00
Botond Dénes
1a896169dc Merge '[Backport 2025.1] repair: release erm in repair_writer_impl::create_writer when possible' from Scylladb[bot]
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table can be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.

Fixes: #23453.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

- (cherry picked from commit 1dc29ddc86)

- (cherry picked from commit bae6711809)

Parent PR: #23455

Closes scylladb/scylladb#23580

* github.com:scylladb/scylladb:
  \test: add test to check concurrent migration and repair of two different tablets
  repair: release erm in repair_writer_impl::create_writer when possible
2025-04-07 10:10:20 +03:00
Kefu Chai
9ccad33e59 .github: Make "make-pr-ready-for-review" workflow run in base repo
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error.  This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
    Error: Input required and not supplied: token
```

This commit addresses the issue by:

- Running the workflow in the base repository instead of the fork. This
  grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
  control execution, as recommended in the GitHub Actions documentation
  (https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
  This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.

This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.

Fixes scylladb/scylladb#22765

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22766

(cherry picked from commit ca832dc4fb)

Closes scylladb/scylladb#23561
2025-04-07 08:10:10 +03:00
Piotr Smaron
a17dd4d4c9 [Backport 2025.1] auth: forbid modifying system ks by non-superusers
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also mark themselves as a superuser, granting them all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES are limited to non-system keyspaces.

Fixes: scylladb/scylladb#23218
(cherry picked from commit fee50f287c)

Parent PR: #23219

Closes scylladb/scylladb#23594
2025-04-06 15:10:06 +03:00
Nadav Har'El
a2a4c6e4b2 test/alternator: increase timeout in Alternator RBAC test
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.

We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests, which is configured via CQL and therefore the Alternator
test unusually uses CQL.

So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.

Fixes #23569.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23578

(cherry picked from commit a9a6f9eecc)

Closes scylladb/scylladb#23601
2025-04-06 11:49:46 +03:00
Avi Kivity
64182d9df6 Update seastar submodule (prefaulter leaving zombie threads)
* seastar a350b5d70e...6d8fccf14c (1):
  > smp: prefaulter: don't leave zombie worker threads

Fixes #23316
2025-04-05 22:28:53 +03:00
Pavel Emelyanov
8e85ef90d2 sstables_loader: Do not stop sharded<progress_monitor> unconditionally
The member in question is unconditionally .stop()-ed in the task's
release_resources() method; however, it may happen that it wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method, and there are several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.

fixes: #23231

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23483

(cherry picked from commit 832d83ae4b)

Closes scylladb/scylladb#23557
2025-04-04 17:46:20 +03:00
Aleksandra Martyniuk
b5b2ffa5df \test: add test to check concurrent migration and repair of two different tablets
(cherry picked from commit bae6711809)
2025-04-04 10:14:51 +02:00
Andrzej Jackowski
b7f067ce33 audit: fix empty query string in BATCH query
Function modification_statement::add_raw() is never called, which
makes the query string in audit_info of batch queries empty. In the enterprise
branch, add_raw is called in Cql.g, and those changes were never merged
to master.

This change:
 - Adds the missing call of add_raw() to Cql.g
 - Includes other related changes (from PR#3228 in scylla-enterprise)

Fixes scylladb#23311

Closes scylladb/scylladb#23315

(cherry picked from commit b8adbcbc84)

Closes scylladb/scylladb#23495
2025-04-03 16:46:33 +03:00
Aleksandra Martyniuk
307f00a398 repair: release erm in repair_writer_impl::create_writer when possible
Currently, repair_writer_impl::create_writer keeps erm to ensure
that a sharder is valid. If we repair a tablet, erm blocks the state
machine and no operation on any tablet of this table can be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the
operation is safe and that tablet operations on the whole table
aren't blocked.

Fixes: #23453.
(cherry picked from commit 1dc29ddc86)
2025-04-03 13:19:40 +00:00
Dawid Mędrek
c56e47f72f db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To compensate for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay the hints it might have replayed before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811

(cherry picked from commit 0a6137218a)

Closes scylladb/scylladb#23370
2025-04-03 09:09:05 +02:00
Tomasz Grabiec
51ee15f02d Merge '[Backport 2025.1] tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
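The capacity-aware metric described above can be sketched as follows (illustrative names, not the actual balancer code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: the balancer compares per-shard storage
// utilization rather than raw tablet count, still assuming every
// tablet has the same size.
double shard_utilization(uint64_t tablet_count, uint64_t avg_tablet_size,
                         uint64_t shard_capacity) {
    return double(tablet_count) * double(avg_tablet_size)
         / double(shard_capacity);
}
```

With this metric, a shard with twice the capacity and the same tablet count reports half the utilization, so it attracts more tablets until utilization equalizes.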

One consequence is that to be able to balance, the load balancer needs
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Fixes #23042

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config

Closes scylladb/scylladb#23443

* github.com:scylladb/scylladb:
  Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
2025-04-01 20:31:05 +02:00
Vlad Zolotarov
feadb781f2 CQL Tracing: set common query parameters in a single function
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.

Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.
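The consolidation can be sketched as follows (illustrative names, not the actual tracing code):

```cpp
#include <cassert>

// Hypothetical sketch: a single helper records all common parameters,
// so no opcode handler (QUERY, EXECUTE, BATCH) can forget one of them.
struct trace_params {
    int consistency = -1;
    int serial_consistency = -1;
    long long default_timestamp = -1;
};

void set_common_query_parameters(trace_params& tp, int cl, int serial_cl,
                                 long long default_ts) {
    tp.consistency = cl;
    tp.serial_consistency = serial_cl;
    tp.default_timestamp = default_ts;
}
```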

This also allows an easy addition of such parameters to this function in
the future.

(cherry picked from commit f7e1695068)
2025-04-01 11:45:54 +00:00
Vlad Zolotarov
6b71d6b9ba transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting it for EXECUTE and BATCH traces (I guess I simply forgot to
set it back then).

This patch fixes this.

Fixes #23173

(cherry picked from commit ca6bddef35)
2025-04-01 11:45:54 +00:00
Jenkins Promoter
36bb089663 Update ScyllaDB version to: 2025.1.1 2025-04-01 14:18:18 +03:00
yangpeiyu2_yewu
1661b35050 mutation_writer/multishard_writer.cc: wrap writer into futurize_invoke
Wrapped the writer in seastar::futurize_invoke to make sure that close() for the mutation_reader is executed before destruction.

Fixes scylladb/scylladb#22790

Closes scylladb/scylladb#22812

(cherry picked from commit 0de232934a)

Closes scylladb/scylladb#22943
2025-04-01 13:46:27 +03:00
Asias He
8c93a331f7 repair: Enable small table optimization for system_replicated_keys
This enterprise-only system table is replicated and small. It should be
included for small table optimization.

Fixes scylladb/scylla-enterprise#5256

Closes scylladb/scylladb#23135

Closes scylladb/scylladb#23147
2025-04-01 13:36:51 +03:00
Calle Wilund
85c161b9f1 generic_server: Update conditions for is_broken_pipe_or_connection_reset
Refs scylla-enterprise#5185
Fixes #22901

If a tls socket gets EPIPE the error is not translated to a specific
gnutls error code, but only a generic ERROR_PULL/PUSH. Since we treat
EPIPE as ignorable for plain sockets, we need to unwind the nested exception
here to detect that the error was in fact due to EPIPE, so we can suppress
log output for it.
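The unwinding can be sketched as follows (a generic std::nested_exception walk with illustrative names, standing in for the actual broken-pipe check):

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical sketch: walk a chain of std::nested_exception wrappers
// to see whether the root cause is of an ignorable type.
template <typename T>
bool caused_by(const std::exception& e) {
    if (dynamic_cast<const T*>(&e)) {
        return true;
    }
    if (auto* n = dynamic_cast<const std::nested_exception*>(&e);
        n && n->nested_ptr()) {
        try {
            std::rethrow_exception(n->nested_ptr());
        } catch (const std::exception& inner) {
            return caused_by<T>(inner);
        } catch (...) {
        }
    }
    return false;
}
```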

Closes scylladb/scylladb#22888

(cherry picked from commit e49f2046e5)

Closes scylladb/scylladb#23045
2025-04-01 13:06:29 +03:00
Patryk Jędrzejczak
d088cc8a2d Merge '[Backport 2025.1] Fix a regression that sometimes causes an internal error and demote barrier_and_drain rpc error log to a warning ' from Scylladb[bot]
The series fixes a regression and demotes a barrier_and_drain logging error to a warning since this particular condition may happen during normal operation.

We want to backport both since one is a bug fix and another is trivial and reduces CI flakiness.

- (cherry picked from commit 1da7d6bf02)

- (cherry picked from commit fe45ea505b)

Parent PR: #22650

Closes scylladb/scylladb#22923

* https://github.com/scylladb/scylladb:
  topology_coordinator: demote barrier_and_drain rpc failure to warning
  topology_coordinator: read peers table only once during topology state application
2025-04-01 11:54:56 +02:00
Patryk Jędrzejczak
39c20144e5 Merge '[Backport 2025.1] raft topology: Add support for raft topology init to happen before group0 initialization' from Scylladb[bot]
The problem discovered is that there is a time
gap between group0 creation and the raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry reads wrong values
from the disk (null) and updates the in-memory variables to wrong values.
During that time gap, the in-memory variables hold wrong values and
trigger incorrect actions.

This PR removes the variable `_manage_topology_change_kind_from_group0`,
which was used earlier as a workaround for correctly handling the
`topology_change_kind` variable; it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug is
that `_manage_topology_change_kind_from_group0` used to block reading from
disk and was enabled only after group0 initialization and starting the
raft server in the restart case. In general, it was hard to manage
`topology_change_kind` using `_manage_topology_change_kind_from_group0`
correctly in a bug-free manner.

Post `_manage_topology_change_kind_from_group0` removal, careful
management of the `topology_change_kind` variable was needed to maintain
a correct `topology_change_kind` in all scenarios. So this PR also performs
a refactoring to populate all init data to system tables even before
group0 creation (via the `raft_initialize_discovery_leader` function). Now
because `raft_initialize_discovery_leader` happens before the group 0
creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.

Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.

By moving the `raft_initialize_discovery_leader` logic to happen before
starting the group0 server, and not running it as a group0 command after
server start, we also get rid of the potential problem of the init group0
command not being the first command on the server, ensuring full integrity
as expected by the programmer.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#21114

- (cherry picked from commit 4748125a48)

- (cherry picked from commit e491950c47)

- (cherry picked from commit 623e01344b)

- (cherry picked from commit d7884cf651)

Parent PR: #22484

Closes scylladb/scylladb#22966

* https://github.com/scylladb/scylladb:
  storage_service: Remove the variable _manage_topology_change_kind_from_group0
  storage_service: fix indentation after the previous commit
  raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
  service/raft: Refactor mutation writing helper functions.
2025-04-01 11:46:15 +02:00
Jenkins Promoter
f1e7cee7a5 Update pgo profiles - aarch64 2025-04-01 04:20:56 +03:00
Jenkins Promoter
023b27312d Update pgo profiles - x86_64 2025-04-01 04:08:00 +03:00
Anna Stuchlik
2ffbc81e19 doc: remove the outdated info on seeds-info
This commit removes the outdated information about seed nodes.
We no longer need it in the docs, as a) the documentation is versioned,
and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions
mentioned in the docs are no longer supported.

In addition, some clarification has been added to the existing sections.

Fixes https://github.com/scylladb/scylladb/issues/22400

Closes scylladb/scylladb#23282

(cherry picked from commit dbbf9e19e4)

Closes scylladb/scylladb#23327
2025-03-31 12:33:59 +02:00
Yaron Kaikov
88e548ed72 .github: add action to make PR ready for review when conflicts label was removed
Moving a PR out of draft is only allowed to users with write access,
adding a github action to switch PR to `ready for review` once the
`conflicts` label was removed

Closes scylladb/scylladb#22446

(cherry picked from commit ed4bfad5c3)

Closes scylladb/scylladb#23023
2025-03-31 13:22:04 +03:00
Tomasz Grabiec
975882a489 test: tablets: Fix flakiness due to ungraceful shutdown
The test fails sporadically with:

cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

That's because a server is stopped in the middle of the workload.

The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.

Fixes #20492

Closes scylladb/scylladb#23451

(cherry picked from commit 8e506c5a8f)

Closes scylladb/scylladb#23469
2025-03-28 14:56:02 +01:00
Sergey Zolotukhin
bfb242b735 database: Pass schema_ptr as const ref in wrap_commitlog_add_error
(cherry picked from commit d448f3de77)
2025-03-27 21:28:13 +00:00
Sergey Zolotukhin
fe94b5a475 database: Unify exception handling in do_apply and apply_with_commitlog
Move exception wrapping logic from `do_apply` and `apply_with_commitlog`
to `wrap_commitlog_add_error` to ensure consistent error handling.

(cherry picked from commit 0d9d0fe60e)
2025-03-27 21:28:13 +00:00
Sergey Zolotukhin
f7b7a47404 storage_proxy: Ignore wrapped gate_closed_exception and rpc::closed_error when node shuts down.
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325

(cherry picked from commit b1e89246d4)
2025-03-27 21:28:13 +00:00
Sergey Zolotukhin
0e0d5241db exceptions: Add try_catch_nested to universally handle nested exceptions of the same type.
(cherry picked from commit 6abfed9817)
2025-03-27 21:28:12 +00:00
Evgeniy Naydanov
3653662099 test.py: random_failures: deselect topology ops for some injections
After recent changes, #18640 and #19151 started to reproduce for the
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.

The solution is the same: deselect the tests.

Fixes #23302

Closes scylladb/scylladb#23405

(cherry picked from commit 574c81eac6)

Closes scylladb/scylladb#23460
2025-03-27 13:19:59 +02:00
Anna Stuchlik
7336bb38fa doc: fix product names in the 2025.1 upgrade guides
This commit fixes the product names in the upgrade 2025.1 guides so that:

- 6.2 is preceded with "ScyllaDB Open Source"
- 2024.x is preceded with "ScyllaDB Enterprise"
- 2025.1 is preceded with "ScyllaDB"

Fixes https://github.com/scylladb/scylladb/issues/23154

Closes scylladb/scylladb#23223

(cherry picked from commit cd61f60549)

Closes scylladb/scylladb#23328
2025-03-27 11:58:01 +02:00
Avi Kivity
cff90755d8 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer needs
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats

(cherry picked from commit b1d9f80d85)
2025-03-25 23:16:35 +01:00
Tomasz Grabiec
3be469da29 test: tablets_test: Add support for auto-split mode
rebalance_tablets() was performing migrations and merges automatically
but not splits, because splits need to be acked by replicas via
load_stats. This is inconvenient in tests which want to rebalance to the
equilibrium point. This patch changes rebalance_tablets() to split
automatically by default; this can be disabled for tests which expect
otherwise.

shared_load_stats was introduced to provide a stable holder of
load_stats which can be reused across rebalance_tablets() calls.

(cherry picked from commit 5e471c6f1b)
2025-03-25 18:23:22 +01:00
Tomasz Grabiec
1895724465 test: cql_test_env: Expose db config
(cherry picked from commit f3b63bfeff)
2025-03-25 18:22:32 +01:00
Jenkins Promoter
9dca28d2b8 Update ScyllaDB version to: 2025.1.0 2025-03-25 09:19:12 +02:00
Avi Kivity
bc98301783 Merge '[Backport 2025.1] repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk
Do not hold erm during repair of a tablet that is started with tablet
repair scheduler. This way two different tablets can be repaired
and migrated concurrently. The same tablet won't be migrated while
being repaired, as this is guaranteed by the topology coordinator.

Use topology_guard to maintain safety.

Fixes: https://github.com/scylladb/scylladb/issues/22408.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#23362

* github.com:scylladb/scylladb:
  test: add test to check concurrent tablets migration and repair
  repair: do not hold erm for repair scheduled by scheduler
  repair: get total rf based on current erm
  repair: make shard_repair_task_impl::erm private
  repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
  repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
  repair: pass session_id to repair_writer_impl::create_writer
  repair: keep materialized topology guard in shard_repair_task_impl
  repair: pass session_id to repair_meta
2025-03-23 20:14:53 +02:00
Avi Kivity
220bbcf329 Merge '[Backport 2025.1] cql3: Introduce RF-rack-valid keyspaces' from Scylladb[bot]
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.

The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
  keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
  so that they never lead to RF-rack invalid keyspaces,
* verifying, during the initialization of a node, that all existing
  keyspaces are RF-rack-valid; if not, the initialization fails.

We provide tests verifying that the changes behave as intended.

---

Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.

---

Implementation strategy (going beyond the scope of this PR):

1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.
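The named requirement behind step 2 can be illustrated roughly as follows (a sketch only, assuming an RF-rack-valid keyspace is one whose replication factor in every DC is 0, 1, or equal to that DC's rack count; the `is_rf_rack_valid` helper is hypothetical):

```python
def is_rf_rack_valid(rf_per_dc, racks_per_dc):
    # A keyspace is taken to be RF-rack-valid when, for every DC it
    # replicates to, the replication factor is 0, 1, or exactly the
    # number of racks in that DC (one replica per rack).
    return all(rf in (0, 1) or rf == racks_per_dc.get(dc, 0)
               for dc, rf in rf_per_dc.items())
```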

---

Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300

---

Backport: this is part of the requirements for releasing 2025.1.

- (cherry picked from commit 32879ec0d5)

- (cherry picked from commit 41f862d7ba)

- (cherry picked from commit 0e04a6f3eb)

Parent PR: #23138

Closes scylladb/scylladb#23398

* github.com:scylladb/scylladb:
  main: Refuse to start node when RF-rack-invalid keyspace exists
  cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
  db/config: Introduce RF-rack-valid keyspaces
2025-03-23 16:16:29 +02:00
Dawid Mędrek
ecdefe801c main: Refuse to start node when RF-rack-invalid keyspace exists
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.

Fixes scylladb/scylladb#23300

(cherry picked from commit 0e04a6f3eb)
2025-03-21 12:27:04 +00:00
Dawid Mędrek
af2215c2d2 cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
In this commit, we refuse to create or alter a keyspace when that operation
would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is
enabled.

We provide two tests verifying that the changes work as intended.

Fixes scylladb/scylladb#23276

(cherry picked from commit 41f862d7ba)
2025-03-21 12:27:04 +00:00
Dawid Mędrek
864528eb9b db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
that all keyspaces remain RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in the form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356

(cherry picked from commit 32879ec0d5)
2025-03-21 12:27:04 +00:00
Aleksandra Martyniuk
5153b91514 test: add test to check concurrent tablets migration and repair
Add a test to check whether a tablet can be migrated while another
tablet is repaired.

(cherry picked from commit 20f9d7b6eb)
2025-03-19 10:15:19 +01:00
Aleksandra Martyniuk
0a0347cb4e repair: do not hold erm for repair scheduled by scheduler
Do not hold erm for tablet repair scheduled by the scheduler. Thanks to
that, one tablet repair won't exclude migration of other tablets.

Concurrent repair and migration of the same tablet isn't possible,
since a tablet can be in one type of transition only at the time.
Hence the change is safe.

Refs: https://github.com/scylladb/scylladb/issues/22408.
(cherry picked from commit 5b792bdc98)
2025-03-19 10:09:51 +01:00
Aleksandra Martyniuk
da64c02b92 repair: get total rf based on current erm
Get total rf based on erm. Currently, it does not change anything
because erm stays the same during the whole repair.

(cherry picked from commit a1375896df)
2025-03-19 10:09:30 +01:00
Aleksandra Martyniuk
39aabe5191 repair: make shard_repair_task_impl::erm private
Make shard_repair_task_impl::erm private. Access it with getter.

(cherry picked from commit 34cd485553)
2025-03-19 10:09:11 +01:00
Aleksandra Martyniuk
9eeff8573b repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
When small_table_optimization isn't enabled, put_row_diff_with_rpc_stream
does not access erm. Pass small_table_optimization_params containing erm
only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.

(cherry picked from commit 444c7eab90)
2025-03-19 10:08:22 +01:00
Aleksandra Martyniuk
4115f6f367 repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
When small_table_optimization isn't enabled, flush_rows_in_working_row_buf
does not access erm. Add small_table_optimization_params containing erm and
pass it only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.

(cherry picked from commit e56bb5b6e2)
2025-03-19 10:07:45 +01:00
Aleksandra Martyniuk
fb2c46dfbe repair: pass session_id to repair_writer_impl::create_writer
(cherry picked from commit 09c74aa294)
2025-03-19 10:07:00 +01:00
Aleksandra Martyniuk
b4e37600d6 repair: keep materialized topology guard in shard_repair_task_impl
Keep materialized topology guard in shard_repair_task_impl and check
it in check_in_abort_or_shutdown and before each range repair.

(cherry picked from commit 47bb9dcf78)
2025-03-19 10:04:17 +01:00
Aleksandra Martyniuk
6bbf20a440 repair: pass session_id to repair_meta
Pass session_id of tablet repair down the stack from the repair request
to repair_meta.

The session_id will be utilized in the following patches.

(cherry picked from commit 928f92c780)
2025-03-19 10:02:24 +01:00
Botond Dénes
b8797551eb Merge '[Backport 2025.1] Rack aware tablet merge colocation migration ' from Tomasz Grabiec
service: Introduce rack-aware co-location migrations for tablet merge

Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, so T1 will temporarily be
present at only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack.
    This is similar to how we match replicas within the same host.
    It reduces the number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack overload in case
    RF!=#racks.

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations
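The rack-matching step in point 1 can be sketched like this (an illustrative pairing only; the real code operates on Scylla's tablet replica structures, and `pair_by_rack` is a hypothetical name):

```python
def pair_by_rack(t0_replicas, t1_replicas):
    """Pair each T1 replica with a T0 replica on the same rack when one
    exists; pair the leftovers across racks only as a last resort."""
    by_rack = {}
    for host, rack in t0_replicas:
        by_rack.setdefault(rack, []).append(host)
    pairs, unmatched = [], []
    for host, rack in t1_replicas:
        if by_rack.get(rack):  # same-rack target available
            pairs.append(((host, rack), (by_rack[rack].pop(), rack)))
        else:
            unmatched.append((host, rack))
    leftovers = [(h, r) for r, hosts in by_rack.items() for h in hosts]
    for src in unmatched:  # across-rack co-location: unavoidable case
        pairs.append((src, leftovers.pop()))
    return pairs
```

For the T0/T1 example above, every T1 replica finds a same-rack T0 target, so no across-rack migration is emitted.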

(cherry picked from commit af949f3b6a)

Also backports dependent patches:
  - locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
  - locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
  - Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec

Closes scylladb/scylladb#22657
Closes scylladb/scylladb#22652

Closes scylladb/scylladb#23297

* github.com:scylladb/scylladb:
  service: Introduce rack-aware co-location migrations for tablet merge
  Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
  locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
  locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
2025-03-18 16:22:29 +02:00
Nadav Har'El
b1cf1890a9 alternator: document the state of tablet support in Alternator
In commit c24bc3b we decided that creating a new table in Alternator
will by default use vnodes - not tablets - because of all the missing
features in our tablets implementation that are important for
Alternator, namely - LWT, CDC and Alternator TTL.

We never documented this, or the fact that we support a tag
`experimental:initial_tablets` which allows overriding this decision
and creating an Alternator table using tablets. We also never documented
what exactly doesn't work when Alternator uses tablets.

This patch adds the missing documentation in docs/alternator/new-apis.md
(which is a good place for describing the `experimental:initial_tablets`
tag). The patch also adds a new test file, test_tablets.py, which
includes tests for all the statements made in the document regarding
how `experimental:initial_tablets` works and what works or doesn't
work when tablets are enabled.

Two existing tests - for TTL and Streams non-support with tablets -
are moved to the new test file.

When the tablets feature is finally completed, both the document
and the tests will need to be modified (some of the tests should be
outright deleted). But it seems this will not happen for at least
several months, and that is too long to wait without accurate
documentation.

Fixes #21629

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22462

(cherry picked from commit c0821842de)

Closes scylladb/scylladb#23298
2025-03-16 18:25:21 +02:00
Jenkins Promoter
2f0ebe9f49 Update pgo profiles - aarch64 2025-03-15 04:21:14 +02:00
Jenkins Promoter
3633fb9ff8 Update pgo profiles - x86_64 2025-03-15 04:13:25 +02:00
Raphael S. Carvalho
33b5f27057 service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, so T1 will temporarily be
present at only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack.
    This is similar to how we match replicas within the same host.
    It reduces the number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack overload in case
    RF!=#racks.

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations

(cherry picked from commit af949f3b6a)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-03-14 20:02:33 +01:00
Anna Stuchlik
11ecc886c3 doc: Remove "experimental" from ALTER KEYSPACE with Tablets
Altering a keyspace with tablets is no longer experimental.
This commit removes the "Experimental" label from the feature.

Fixes https://github.com/scylladb/scylladb/issues/23166

Closes scylladb/scylladb#23183

(cherry picked from commit 562b5db5b8)

Closes scylladb/scylladb#23274
2025-03-14 13:57:55 +01:00
Botond Dénes
eb147ec564 Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
This PR converts boost load balancer tests in preparation for load balancer changes
which add per-table tablet hints. After those changes, load balancer consults with the replication
strategy in the database, so we need to create proper schema in the
database. To do that, we need proper topology for replication
strategies which use RF > 1, otherwise keyspace creation will fail.

Topology is created in tests via group0 commands, which is abstracted by
the new `topology_builder` class.

Tests cannot modify token_metadata only in memory now, as it needs to be
consistent with the schema and on-disk metadata. That's why modifications to
tablet metadata are now made under the group0 guard, and the metadata is saved back to disk.

Closes scylladb/scylladb#22648

* github.com:scylladb/scylladb:
  test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario
  tests: tablets: Set initial tablets to 1 to exit growing mode
  test: tablets_test: Create proper schema in load balancer tests
  test: lib: Introduce topology_builder
  test: cql_test_env: Expose topology_state_machine
  topology_state_machine: Introduce lock transition

(cherry picked from commit 51a273401c)
2025-03-13 14:08:30 +01:00
Tomasz Grabiec
637e5fc9b5 locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
For example, nodes which are being decommissioned should not be
considered as available capacity for new tables. We don't allocate
tablets on such nodes.

Otherwise, this would result in higher per-shard load than planned.

Closes scylladb/scylladb#22657

(cherry picked from commit 3bb19e9ac9)
2025-03-13 14:08:27 +01:00
Tomasz Grabiec
0d77754c63 locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
In that case, new_racks will be used, but when we discover no
candidates, we try to pop from existing_racks.

Fixes #22625

Closes scylladb/scylladb#22652

(cherry picked from commit e22e3b21b1)
2025-03-13 14:00:48 +01:00
Benny Halevy
5481c9aedd docs: document the views-with-tablets experimental feature
Refs scylladb/scylladb#22217

Fixes scylladb/scylladb#22893

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22896

(cherry picked from commit 55dbf5493c)

Closes scylladb/scylladb#23024
2025-03-10 13:26:36 +01:00
Botond Dénes
59db708cba Merge '[Backport 2025.1] tablets: repair: fix hosts and dcs filters behavior for tablet repair' from Scylladb[bot]
If hosts and/or dcs filters are specified for tablet repair and
some replicas match these filters, choose the replica that will
be the repair master according to round-robin principle
(currently it's always the first replica).

If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, the repair succeeds and
the repair request is removed (currently an exception is thrown
and tablet repair scheduler reschedules the repair forever).
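Both fixes can be sketched together as follows (hypothetical names; the real logic lives in the locator and the tablet repair scheduler):

```python
def select_repair_master(replicas, hosts_filter, dcs_filter, round_robin_idx):
    # Keep only the replicas matching the requested hosts/DCs filters.
    matching = [r for r in replicas
                if (not hosts_filter or r["host"] in hosts_filter)
                and (not dcs_filter or r["dc"] in dcs_filter)]
    if not matching:
        # No replica matches: the repair is a no-op and should finish
        # successfully instead of being rescheduled forever.
        return None
    # Rotate the choice of repair master instead of always taking the
    # first matching replica.
    return matching[round_robin_idx % len(matching)]
```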

Fixes: https://github.com/scylladb/scylladb/issues/23100.

Needs backport to 2025.1 that introduces hosts and dcs filters for tablet repair

- (cherry picked from commit 9bce40d917)

- (cherry picked from commit fe4e99d7b3)

- (cherry picked from commit 2b538d228c)

- (cherry picked from commit c40eaa0577)

- (cherry picked from commit c7c6d820d7)

Parent PR: #23101

Closes scylladb/scylladb#23109

* github.com:scylladb/scylladb:
  test: add new cases to tablet_repair tests
  test: extract repiar check to function
  locator: add round-robin selection of filtered replicas
  locator: add tablet_task_info::selected_by_filters
  service: finish repair successfully if no matching replica found
2025-03-10 12:49:01 +02:00
Botond Dénes
28690f8203 Merge '[Backport 2025.1] repair: Introduce Host and DC filter support' from Scylladb[bot]
Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support.

Fixes https://github.com/scylladb/scylladb/issues/22417

New feature. No backport is needed.

- (cherry picked from commit 4c75701756)

- (cherry picked from commit 5545289bfa)

- (cherry picked from commit 1c8a41e2dd)

- (cherry picked from commit e499f7c971)

Parent PR: #22621

Closes scylladb/scylladb#23080

* github.com:scylladb/scylladb:
  test: add test to check dcs and hosts repair filter
  test: add repair dc selection to test_tablet_metadata_persistence
  repair: Introduce Host and DC filter support
  docs: locator: update the docs and formatter of tablet_task_info
2025-03-10 12:48:49 +02:00
Anna Stuchlik
235c859b98 doc: zero-token nodes and Arbiter DC
This commit adds documentation for zero-token nodes and an explanation
of how to use them to set up an arbiter DC to prevent a quorum loss
in multi-DC deployments.

The commit adds two documents:
- The one in Architecture describes zero-token nodes.
- The other in Cluster Management explains how to use them.

We need separate documents because zero-token nodes may be used
for other purposes in the future.

In addition, the documents are cross-linked, and the link is added
to the Create a ScyllaDB Cluster - Multi Data Centers (DC) document.

Refs https://github.com/scylladb/scylladb/pull/19684

Fixes https://github.com/scylladb/scylladb/issues/20294

Closes scylladb/scylladb#21348

(cherry picked from commit 9ac0aa7bba)

Closes scylladb/scylladb#23201
2025-03-10 10:59:07 +01:00
Benny Halevy
64248f2635 main: add checkpoints
Add checkpoints before starting significant services that didn't
have a corresponding call to supervisor::notify before them.

Fixes #23153

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b6705ad48b)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-10 08:52:16 +02:00
Benny Halevy
b139b09129 main: safely check stop_signal in-between starting services
To simplify aborting scylla while the services are starting,
add a _ready state to stop_signal: until main is ready to be
stopped by the abort_source, merely record that the signal was
caught, and let a check() method poll for it, request the abort,
and throw the respective exception only then, at controlled
points in between starting services, after each service has
started successfully and a deferred stop action has been
installed.
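The pattern can be sketched as follows (a simplified model; the actual stop_signal integrates with seastar's abort_source, and these method names are illustrative):

```python
class StopSignal:
    """Record a stop signal while services start; act on it only at
    controlled check() points, or immediately once marked ready."""
    def __init__(self):
        self._caught = False
        self._ready = False

    def on_signal(self):   # invoked from the signal handler
        self._caught = True
        if self._ready:
            self._abort()

    def check(self):       # polled in between starting services
        if self._caught:
            self._abort()

    def ready(self):       # main is now safe to stop directly
        self._ready = True
        self.check()

    def _abort(self):
        raise SystemExit("startup aborted by signal")
```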

Refs #23153

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit feef7d3fa1)
2025-03-10 08:42:30 +02:00
Benny Halevy
ca161900cd main: move prometheus start message
The `prometheus_server` is started only conditionally,
but the notification message is sent and logged
unconditionally.
Move it inside the conditional code block.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 282ff344db)
2025-03-10 08:16:37 +02:00
Benny Halevy
5417d4d517 main: move per-shard database start message
It is now logged out of place, so move it to right before
calling `start` on every database shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 23433f593c)
2025-03-10 08:16:37 +02:00
Anna Stuchlik
5453e85f39 doc: remove the reference to the 6.2 version
This commit removes the OSS version name, which is irrelevant
and confusing for 2025.1 and later users.
Also, it updates the warning to avoid specifying the release
when the deprecated feature will be removed.

Fixes https://github.com/scylladb/scylladb/issues/22839

Closes scylladb/scylladb#22936

(cherry picked from commit d0a48c5661)

Closes scylladb/scylladb#23022
2025-03-07 12:53:42 +02:00
Anna Stuchlik
7a6bcb3a3f doc: remove references to Enterprise
This commit removes the redundant references to Enterprise,
which are no longer valid.

Fixes https://github.com/scylladb/scylladb/issues/22927

Closes scylladb/scylladb#22930

(cherry picked from commit a28bbc22bd)

Closes scylladb/scylladb#22963
2025-03-07 12:53:22 +02:00
Anna Stuchlik
8b2a382eb6 doc: add support for Ubuntu 24.04 in 2024.1
Fixes https://github.com/scylladb/scylladb/issues/22841

Refs https://github.com/scylladb/scylla-enterprise/issues/4550

Closes scylladb/scylladb#22843

(cherry picked from commit 439463dbbf)

Closes scylladb/scylladb#23092
2025-03-07 12:51:13 +02:00
Dusan Malusev
cdd51d8b7a docs: add instruction for installing cassandra-stress
Signed-off-by: Dusan Malusev <dusan.malusev@scylladb.com>

Closes scylladb/scylladb#21723

(cherry picked from commit 4e6ea232d2)

Closes scylladb/scylladb#22947
2025-03-07 11:48:46 +02:00
Anna Stuchlik
88a8d140b3 doc: add information about tablets limitation to the CQL page
This commit adds a link to the Limitations section on the Tablets page
to the CQL page, in the tablets option section.
This is actually the place where the user will need the information:
when creating a keyspace.

In addition, I've reorganized the section for better readability
(otherwise, the section about limitations was easy to miss)
and moved the section up on the page.

Note that I've moved the updated content from the `_common` folder
(which I deleted) to the .rst page - we no longer split OSS and Enterprise,
so there's no need to keep using the `scylladb_include_flag` directive
to include OSS- and Ent-specific content.

Fixes https://github.com/scylladb/scylladb/issues/22892

Fixes https://github.com/scylladb/scylladb/issues/22940

Closes scylladb/scylladb#22939

(cherry picked from commit 0999fad279)

Closes scylladb/scylladb#23091
2025-03-07 11:48:07 +02:00
Aleksandra Martyniuk
1957dac2b4 test: add new cases to tablet_repair tests
Add tests for tablet repair with host and dc filters that select
one or no replica.

(cherry picked from commit c7c6d820d7)
2025-03-05 10:59:00 +01:00
Aleksandra Martyniuk
1091ef89e1 test: extract repiar check to function
(cherry picked from commit c40eaa0577)
2025-03-05 10:59:00 +01:00
Aleksandra Martyniuk
b081e07ffa locator: add round-robin selection of filtered replicas
(cherry picked from commit 2b538d228c)
2025-03-05 10:58:59 +01:00
Aleksandra Martyniuk
1f102ca2f7 locator: add tablet_task_info::selected_by_filters
Extract dcs and hosts filters check to a method.

(cherry picked from commit fe4e99d7b3)
2025-03-05 10:36:51 +01:00
Aleksandra Martyniuk
8a98f0d5b6 service: finish repair successfully if no matching replica found
If hosts and/or dcs filters are specified for tablet repair and
no replica matches these filters, an exception is thrown. The repair
fails and tablet repair scheduler reschedules it forever.

Such a repair should actually succeed (as all specified replicas were
repaired) and the repair request should be removed.

Treat the repair as successful if the filters were specified and
selected no replica.

(cherry picked from commit 9bce40d917)
2025-03-05 10:36:50 +01:00
Botond Dénes
01485b2158 reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
It is possible that the permit handed to register_inactive_read() is
already aborted (currently only possible if the permit timed out).
If the permit also happens to be waiting for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already hold an exception, and this will trigger an
assert. Add a separate check for whether the permit is already aborted.
If so, treat it as immediate eviction: close the reader and
clean up.
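Conceptually (a toy model only, not the actual semaphore code; the class and function names are illustrative):

```python
class Permit:
    def __init__(self, aborted=False):
        self.aborted = aborted  # e.g. set when the permit timed out

class Reader:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

def register_inactive_read(permit, reader):
    # An already-aborted permit may already hold an exception in its
    # promise, so failing its waiters again would assert. Treat it as
    # immediate eviction: close the reader and register nothing.
    if permit.aborted:
        reader.close()
        return None
    return (permit, reader)  # otherwise track it as an inactive read
```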

Fixes: scylladb/scylladb#22919
(cherry picked from commit 7ba29ec46c)
2025-03-04 18:46:55 +00:00
Botond Dénes
953c7cd71a test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.

(cherry picked from commit 4d8eb02b8d)
2025-03-04 18:46:55 +00:00
Anna Stuchlik
cdae92065b doc: add the 2025.1 upgrade guides and reorganize the upgrade section
This commit adds the upgrade guides relevant in version 2025.1:
- From 6.2 to 2025.1
- From 2024.x to 2025.1

It also removes the upgrade guides that are not relevant in 2025.1 source available:
- Open Source upgrade guides
- From Open Source to Enterprise upgrade guides
- Links to the Enterprise upgrade guides

Also, as part of this PR, the remaining relevant content has been moved to
the new About Upgrade page.

WHAT NEEDS TO BE REVIEWED
- Review the instructions in the 6.2-to-2025.1 guide
- Review the instructions in the 2024.x-to-2025.1 guide
- Verify that there are no references to Open Source and Enterprise.

The scope of this PR does not have to include metrics - the info can be added
in a follow-up PR.

Fixes https://github.com/scylladb/scylladb/issues/22208
Fixes https://github.com/scylladb/scylladb/issues/22209
Fixes https://github.com/scylladb/scylladb/issues/23072
Fixes https://github.com/scylladb/scylladb/issues/22346

Closes scylladb/scylladb#22352

(cherry picked from commit 850aec58e0)

Closes scylladb/scylladb#23106
2025-03-04 08:15:08 +02:00
Jenkins Promoter
4813c48d64 Update pgo profiles - aarch64 2025-03-01 04:23:19 +02:00
Jenkins Promoter
b623b108c3 Update pgo profiles - x86_64 2025-03-01 04:05:24 +02:00
Aleksandra Martyniuk
7fdc7bdc4b test: add test to check dcs and hosts repair filter
(cherry picked from commit e499f7c971)
2025-02-27 12:14:47 +01:00
Aleksandra Martyniuk
c2e926850d test: add repair dc selection to test_tablet_metadata_persistence
(cherry picked from commit 1c8a41e2dd)
2025-02-27 12:14:47 +01:00
Asias He
6d5b029812 repair: Introduce Host and DC filter support
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support hosts or DCs selection. It should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds the hosts or DCs
selection support.

Fixes #22417

(cherry picked from commit 5545289bfa)
2025-02-27 12:14:44 +01:00
Aleksandra Martyniuk
ffeb55cf77 docs: locator: update the docs and formatter of tablet_task_info
(cherry picked from commit 4c75701756)
2025-02-26 23:49:50 +00:00
Jenkins Promoter
37aa7c216c Update ScyllaDB version to: 2025.1.0-rc4 2025-02-25 21:33:18 +02:00
Gleb Natapov
0b0e9f0c32 treewide: include build_mode.hh for SCYLLA_BUILD_MODE_RELEASE where it is missing
Fixes: #22914

Closes scylladb/scylladb#22915

(cherry picked from commit 914c9f1711)

Closes scylladb/scylladb#22962
2025-02-25 18:12:54 +03:00
Evgeniy Naydanov
871fabd60a test.py: test_random_failures: improve handling of hung node
In some cases the paused/unpaused node can hang even after the 30s
timeout. This makes the test flaky. Change the condition to always check
the coordinator's log when there is a hung node.

Add `stop_after_streaming` to the list of error injections which can
cause a node to hang.

Also add a wait for a new coordinator election in cluster events
which cause such elections.

Closes scylladb/scylladb#22825

(cherry picked from commit 99be9ac8d8)

Closes scylladb/scylladb#23007
2025-02-25 14:31:51 +03:00
Abhi
67b7ea12a2 storage_service: Remove the variable _manage_topology_change_kind_from_group0
This commit removes the variable _manage_topology_change_kind_from_group0,
which was previously used as a workaround for correctly handling the
topology_change_kind variable; it was brittle and had some bugs. Earlier
commits made the modifications needed to handle the topology_change_kind
variable after the removal of _manage_topology_change_kind_from_group0.

(cherry picked from commit d7884cf651)
2025-02-20 21:21:31 +00:00
Abhi
d74bb95f54 storage_service: fix indentation after the previous commit
(cherry picked from commit 623e01344b)
2025-02-20 21:21:31 +00:00
Abhinav Jha
98977e9465 raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
Previously, the topology_change_kind variable was handled using the
_manage_topology_change_kind_from_group0 variable. This approach was
brittle and had some bugs (e.g. in the restart case, it led to a time gap
between group0 server start and topology_change_kind being managed via group0).

After the removal of _manage_topology_change_kind_from_group0, careful
management of the topology_change_kind variable was needed to maintain a
correct topology_change_kind in all scenarios. So this PR also performs a
refactoring to populate all init data into system tables even before group0
creation (via the raft_initialize_discovery_leader function). Because
raft_initialize_discovery_leader now happens before group0 creation, we
write mutations directly to system tables instead of issuing a group0
command. Hence, after group0 creation, the node can read the correct
values from system tables, and correct values are maintained throughout.

Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.

By moving the raft_initialize_discovery_leader logic to before the group0
server starts, rather than running it as a group0 command after server
start, we also get rid of the potential problem of the init group0 command
not being the first command on the server, ensuring the full integrity the
programmer expects.

Fixes: scylladb/scylladb#21114
(cherry picked from commit e491950c47)
2025-02-20 21:21:31 +00:00
Abhi
e84376c9dc service/raft: Refactor mutation writing helper functions.
We use these changes in the following commit.

(cherry picked from commit 4748125a48)
2025-02-20 21:21:31 +00:00
Gleb Natapov
79556be7a7 topology_coordinator: demote barrier_and_drain rpc failure to warning
The failure may happen during normal operation as well (for instance if
leader changes).

Fixes: scylladb/scylladb#22364
(cherry picked from commit fe45ea505b)
2025-02-19 08:59:53 +00:00
Gleb Natapov
fe0740ff56 topology_coordinator: read peers table only once during topology state application
During topology state application peers table may be updated with the
new ip->id mapping. The update is not atomic: it adds new mapping and
then removes the old one. If we call get_host_id_to_ip_map while this is
happening it may trigger an internal error there. This is a regression
since ef929c5def. Before that commit the
code read the peers table only once before starting the update loop.
This patch restores the behaviour.

Fixes: scylladb/scylladb#22578
(cherry picked from commit 1da7d6bf02)
2025-02-19 08:59:53 +00:00
Pavel Emelyanov
aa5cb15166 Merge 'Alternator: implement UpdateTable operation to add or delete GSI' from Nadav Har'El
In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages explain, this required changing how Alternator stores materialized view keys - instead of insisting that these keys must be real columns (which is **not** the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes #11567.

We also fix the IndexStatus and Backfilling attributes returned by DescribeTable - as DynamoDB API users use this API to discover when a newly added GSI completed its "backfilling" (what we call "view building") stage. Fixes #11471.

This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. **It should be backported to 2025.1**, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.

Closes scylladb/scylladb#21989

* github.com:scylladb/scylladb:
  alternator: fix view build on oversized GSI key attribute
  mv: clean up do_delete_old_entry
  test/alternator: unflake test for IndexStatus
  test/alternator: work around unrelated bug causing test flakiness
  docs/alternator: adding a GSI is no longer an unimplemented feature
  test/alternator: remove xfail from all tests for issue 11567
  alternator: overhaul implementation of GSIs and support UpdateTable
  mv: support regular_column_transformation key columns in view
  alternator: add new materialized-view computed column for item in map
  build: in cmake build, schema needs alternator
  build: build tests with Alternator
  alternator: add function serialized_value_if_type()
  mv: introduce regular_column_transformation, a new type of computed column
  alternator: add IndexStatus/Backfilling in DescribeTable
  alternator: add "LimitExceededException" error type
  docs/alternator: document two more unimplemented Alternator features

(cherry picked from commit 529ff3efa5)

Closes scylladb/scylladb#22826
2025-02-18 19:05:21 +02:00
Jenkins Promoter
13d79ba990 Update ScyllaDB version to: 2025.1.0-rc3 2025-02-18 15:06:57 +02:00
Nadav Har'El
35b410326b test/topology_custom: fix very slow test test_localnodes_broadcast_rpc_address
The test
topology_custom/test_alternator::test_localnodes_broadcast_rpc_address
sets up nodes with a silly "broadcast rpc address" and checks that
Alternator's "/localnodes" request returns it correctly.

The problem is that although we don't use CQL in this test, the test
framework does open a CQL connection when the test starts, and closes
it when it ends. It turns out that when we set a silly "broadcast RPC
address", the driver tends to try to connect to it when shutting down,
I'm not even sure why. But the choice of 1.2.3.4 as the silly address
is unfortunate, because this IP address is actually routable - and
the driver hangs until it times out (in practice, a bit over two
minutes). This trivial patch changes 1.2.3.4 to 127.0.0.0 - an equally
silly address, but one to which connections fail immediately.

Before this patch, the test often takes more than 2 minutes to finish
on my laptop, after this patch, it always finishes in 4-5 seconds.

Fixes #22744

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22746

(cherry picked from commit f89235517d)

Closes scylladb/scylladb#22875
2025-02-18 10:33:21 +02:00
Botond Dénes
12a3fcceae Merge '[Backport 2025.1] sstable_loader: fix cross-shard resource cleanup in download_task_impl ' from Scylladb[bot]
This PR addresses two related issues in our task system:

1. Prepares for asynchronous resource cleanup by converting release_resources() to a coroutine. This refactoring enables future improvements in how we handle resource cleanup.

2. Fixes a cross-shard resource cleanup issue in the SSTable loader where destruction of per-shard progress elements could trigger "shared_ptr accessed on non-owner cpu" errors in multi-shard environments. The fix uses coroutines to ensure resources are released on their owner shards.

Fixes #22759

---

this change addresses a regression introduced by d815d7013c, which is contained by 2025.1 and master branches. so it should be backported to 2025.1 branch.

- (cherry picked from commit 4c1f1baab4)

- (cherry picked from commit b448fea260)

Parent PR: #22791

Closes scylladb/scylladb#22871

* github.com:scylladb/scylladb:
  sstable_loader: fix cross-shard resource cleanup in download_task_impl
  tasks: make release_resources() a coroutine
2025-02-18 10:32:48 +02:00
Gleb Natapov
040c59674a api: initialize token metadata API after starting the gossiper
The token metadata API now depends on the gossiper to do IP to host ID
mappings, so initialize it after the gossiper is initialized and
de-initialize it before the gossiper is stopped.

Fixes: scylladb/scylladb#22743

Closes scylladb/scylladb#22760

(cherry picked from commit d288d79d78)

Closes scylladb/scylladb#22854
2025-02-18 10:32:24 +02:00
Asias He
b50a6657e8 repair: Add await_completion option for tablet_repair api
Set it to true to wait for the repair to complete, or to false to skip
waiting for the repair to complete. When the option is not provided, it
defaults to false.

This is useful for a management tool that wants the API to be async.

Fixes #22418

Closes scylladb/scylladb#22436

(cherry picked from commit fb318d0c81)

Closes scylladb/scylladb#22851
2025-02-18 10:31:53 +02:00
Botond Dénes
93479ffcf9 Merge '[Backport 2025.1] raft/group0_state_machine: load current RPC compression dict on startup' from Michał Chojnowski
We are supposed to be loading the most recent RPC compression dictionary on startup, but we forgot to port the relevant piece of logic during the source-available port. This causes a restarted node not to use the dictionary for RPC compression until the next dictionary update.

Fix that.

Fixes #22738

This is more of a bugfix than an improvement, so it should be backported to 2025.1.

* (cherry picked from commit [dd82b40](dd82b40186))

* (cherry picked from commit [8fb2ea6](8fb2ea61ba))

Additionally cherry picked https://github.com/scylladb/scylladb/pull/22836 to fix the timeout.

Parent PR: #22739

Closes scylladb/scylladb#22837

* github.com:scylladb/scylladb:
  test_rpc_compression.py: fix an overly-short timeout
  test_rpc_compression.py: test the dictionaries are loaded on startup
  raft/group0_state_machine: load current RPC compression dict on startup
2025-02-18 10:31:23 +02:00
Botond Dénes
38bd74b2d4 tools/scylla-nodetool: netstats: don't assume both senders and receivers
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.

Fixes: #22770

Closes scylladb/scylladb#22771

(cherry picked from commit 87e8e00de6)

Closes scylladb/scylladb#22874
2025-02-17 14:34:36 +02:00
Takuya ASADA
6ee1779578 dist: fix upgrade error from 2024.1
We need to allow replacing nodetool from scylla-enterprise-tools < 2024.2,
just like we did for scylla-tools < 5.5.
This is required to make packages able to upgrade from 2024.1.

Fixes #22820

Closes scylladb/scylladb#22821

(cherry picked from commit b5e306047f)

Closes scylladb/scylladb#22867
2025-02-16 14:47:48 +02:00
Kefu Chai
9fe2301647 sstable_loader: fix cross-shard resource cleanup in download_task_impl
Previously, download_task_impl's destructor would destroy per-shard progress
elements on whatever shard the task was destroyed on. In multi-shard
environments, this caused "shared_ptr accessed on non-owner cpu" errors when
attempting to free memory allocated on a different shard.

Fix by:
- Convert progress_per_shard into a sharded service
- Stop the service on owner shards during cleanup using coroutines
- Add operator+= to stream_progress to leverage seastar's built-in adder
  instead of a custom adder struct

Alternative approaches considered:

1. Using foreign_ptr: Rejected as it would require interface changes
   that complicate stream delegation. foreign_ptr manages the underlying
   pointee with another smart pointer but does not expose the smart
   pointer instance in its APIs, making it impossible to use
   shared_ptr<stream_progress> in the interface.
2. Using vector<stream_progress>: Rejected for similar interface
   compatibility reasons.

This solution maintains the existing interfaces while ensuring proper
cross-shard cleanup.

Fixes scylladb/scylladb#22759
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit b448fea260)
2025-02-15 22:46:43 +00:00
Kefu Chai
6b27459de3 tasks: make release_resources() a coroutine
Convert tasks::task_manager::task::impl::release_resources() to a coroutine
to prepare for upcoming changes that will implement asynchronous resource
release.

This is a preparatory refactoring that enables future coroutine-based
implementation of resource cleanup logic.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 4c1f1baab4)
2025-02-15 22:46:43 +00:00
Jenkins Promoter
48130ca2e9 Update pgo profiles - aarch64 2025-02-15 04:20:15 +02:00
Jenkins Promoter
5054087f0b Update pgo profiles - x86_64 2025-02-15 04:05:06 +02:00
Botond Dénes
889fb9c18b Update tools/java submodule
* tools/java 807e991d...6dfe728a (1):
  > dist: support smooth upgrade from enterprise to source available

Fixes: scylladb/scylladb#22820
2025-02-14 11:14:07 +02:00
Botond Dénes
c627aff5f7 Merge '[Backport 2025.1] reader_concurrency_semaphore: set_notify_handler(): disable timeout ' from Scylladb[bot]
`set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. The latter was not disabling the pre-existing timeout of the permit (if any), which would lead to premature eviction of the cache entry if the timeout was shorter than the TTL (which is typical).
Disable the timeout before setting the TTL to prevent premature eviction.

Fixes: https://github.com/scylladb/scylladb/issues/22629

Backport required to all active releases, they are all affected.

- (cherry picked from commit a3ae0c7cee)

- (cherry picked from commit 9174f27cc8)

Parent PR: #22701

Closes scylladb/scylladb#22752

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: set_notify_handler(): disable timeout
  reader_permit: mark check_abort() as const
2025-02-13 15:24:54 +02:00
Michał Chojnowski
ffca4a9f85 test_rpc_compression.py: fix an overly-short timeout
The timeout of 10 seconds is too small for CI.
I didn't mean to make it so short, it was an accident.

Fix that by changing the timeout to 10 minutes.
2025-02-13 10:03:13 +01:00
Michał Chojnowski
2c0ffdce31 pgo: disable tablets for training with secondary index, lwt and counters
As of right now, materialized views (and consequently secondary
indexes), lwt and counters are unsupported or experimental with tablets.
Since tablets are enabled by default, training cases using those
features are currently broken.

The right thing to do here is to disable tablets in those cases.

Fixes https://github.com/scylladb/scylladb/issues/22638

Closes scylladb/scylladb#22661

(cherry picked from commit bea434f417)

Closes scylladb/scylladb#22808
2025-02-13 09:42:09 +02:00
Botond Dénes
ff7e93ddd5 db/config: reader_concurrency_semaphore_cpu_concurrency: bump default to 2
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound read to run concurrently will just
result in time-sharing, with both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.

Fixes: #22450

Closes scylladb/scylladb#22679

(cherry picked from commit 3d12451d1f)

Closes scylladb/scylladb#22722
2025-02-13 09:40:25 +02:00
Botond Dénes
1998733228 service: query_pager: fix last-position for filtering queries
This applies to short pages, cut short because of a tombstone prefix.
When page results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that the last partition and row position will be those of the
last row the filter saw, which will not match the last position on the
replica when the replica cut the page due to tombstones.
When fetching the next page, this means that the entire tombstone suffix
of the last page will be re-fetched. Worse still, the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work shows up as elevated tail latencies.
Fix by always taking the last position from the raw query results.

Fixes: #22620

Closes scylladb/scylladb#22622

(cherry picked from commit 7ce932ce01)

Closes scylladb/scylladb#22719
2025-02-13 09:40:05 +02:00
Botond Dénes
e79ee2ddb0 reader_concurrency_semaphore: foreach_permit(): include _inactive_reads
So inactive reads show up in semaphore diagnostics dumps (currently the
only non-test user of this method).

Fixes: #22574

Closes scylladb/scylladb#22575

(cherry picked from commit e1b1a2068a)

Closes scylladb/scylladb#22611
2025-02-13 09:39:39 +02:00
Aleksandra Martyniuk
4c39943b3f replica: mark registry entry as synch after the table is added
When a replica gets a write request it performs get_schema_for_write,
which waits until the schema is synced. However, database::add_column_family
marks a schema as synced before the table is added. Hence, the write may
see the schema as synced, but hit no_such_column_family as the table
hasn't been added yet.

Mark the schema as synced after the table is added to database::_tables_metadata.

Fixes: #22347.

Closes scylladb/scylladb#22348

(cherry picked from commit 328818a50f)

Closes scylladb/scylladb#22604
2025-02-13 09:39:13 +02:00
Calle Wilund
17c86f8b57 encryption: Fix encrypted components mask check in describe
Fixes #22401

In the fix for scylladb/scylla-enterprise#892, the extraction and check of the sstable component encryption mask was copied
into a subroutine for description purposes, but a very important 1 << <value> shift was somehow
left on the floor.

Without this shift, the check for whether a given component is actually encrypted can be wholly
broken for some components.

Closes scylladb/scylladb#22398

(cherry picked from commit 7db14420b7)

Closes scylladb/scylladb#22599
2025-02-13 09:38:41 +02:00
Botond Dénes
d05b3897a2 Merge '[Backport 2025.1] api: task_manager: do not unregister finish task when its status is queried' from Scylladb[bot]
Currently, when the status of a task is queried and the task is already finished,
it gets unregistered. Getting the status shouldn't be a one-time operation.

Stop removing the task after its status is queried. Adjust tests not to rely
on this behavior. Add task_manager/drain API and nodetool tasks drain
command to remove finished tasks in the module.

Fixes: https://github.com/scylladb/scylladb/issues/21388.

It's a fix to task_manager API, should be backported to all branches

- (cherry picked from commit e37d1bcb98)

- (cherry picked from commit 18cc79176a)

Parent PR: #22310

Closes scylladb/scylladb#22598

* github.com:scylladb/scylladb:
  api: task_manager: do not unregister tasks on get_status
  api: task_manager: add /task_manager/drain
2025-02-13 09:38:12 +02:00
Botond Dénes
9116fc635e Merge '[Backport 2025.1] split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Scylladb[bot]
`tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups.

The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that
`tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but
`tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations that should be mutually exclusive with compaction group creation, e.g. DROP TABLE/DROP KEYSPACE.

Fixes #22431

This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1

- (cherry picked from commit 24e8d2a55c)

- (cherry picked from commit 8bff7786a8)

Parent PR: #22330

Closes scylladb/scylladb#22560

* github.com:scylladb/scylladb:
  test: add reproducer and test for fix to split ready CG creation
  table: run set_split_mode() on all storage groups during all_storage_groups_split()
2025-02-13 09:36:23 +02:00
Raphael S. Carvalho
5f74b5fdff test: Use linux-aio backend again on seastar-based tests
Since mid December, tests started failing with ENOMEM while
submitting I/O requests.

Logs of failed tests show IO uring was used as backend, but we
never deliberately switched to IO uring. Investigation pointed
to it happening accidentaly in commit 1bac6b75dc,
which turned on IO uring for allowing native tool in production,
and picked linux-aio backend explicitly when initializing Scylla.
But it missed that seastar-based tests would pick the default
backend, which is io_uring once enabled.

There's a reason we never made io_uring the default: it's not stable
enough. It turns out we made the right choice back then, and io_uring
apparently continues to be unstable, causing flakiness in the tests.

Let's undo that accidental change in tests by explicitly
picking the linux-aio backend for seastar-based tests.
This should hopefully bring back stability.

Refs #21968.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22695

(cherry picked from commit ce65164315)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22800
2025-02-12 20:50:51 +02:00
Michał Chojnowski
a746fd2bb8 test_rpc_compression.py: test the dictionaries are loaded on startup
Reproduces scylladb/scylladb#22738

(cherry picked from commit 8fb2ea61ba)
2025-02-11 15:52:34 +00:00
Michał Chojnowski
89a5889bed raft/group0_state_machine: load current RPC compression dict on startup
We are supposed to be loading the most recent RPC compression dictionary
on startup, but we forgot to port the relevant piece of logic during
the source-available port.

(cherry picked from commit dd82b40186)
2025-02-11 15:52:33 +00:00
Michael Litvak
8d1f6df818 test/test_view_build_status: fix flaky asserts
In a few test cases of test_view_build_status we create a view, wait for
it and then query the view_build_status table and expect it to have all
rows for each node and view.

But it may fail because it could happen that the wait_for_view query and
the following queries are done on different nodes, and some of the nodes
didn't apply all the table updates yet, so they have missing rows.

To fix it, we change the assert to work in an eventual-consistency
sense, retrying until the number of rows is as expected.

Fixes scylladb/scylladb#22644

Closes scylladb/scylladb#22654

(cherry picked from commit c098e9a327)

Closes scylladb/scylladb#22780
2025-02-11 10:21:54 +01:00
Avi Kivity
75320c9a13 Update tools/cqlsh submodule (driver update, upgradability)
* tools/cqlsh 52c6130...02ec7c5 (18):
  > chore(deps): update dependency scylla-driver to v3.28.2
  > dist: support smooth upgrade from enterprise to source available
  > github action: fix downloading of artifacts
  > chore(deps): update docker/setup-buildx-action action to v3
  > chore(deps): update docker/login-action action to v3
  > chore(deps): update docker/build-push-action action to v6
  > chore(deps): update docker/setup-qemu-action action to v3
  > chore(deps): update peter-evans/dockerhub-description action to v4
  > upload actions: update the usage for multiple artifacts
  > chore(deps): update actions/download-artifact action to v4.1.8
  > chore(deps): update dependency scylla-driver to v3.28.0
  > chore(deps): update pypa/cibuildwheel action to v2.22.0
  > chore(deps): update actions/checkout action to v4
  > chore(deps): update python docker tag to v3.13
  > chore(deps): update actions/upload-artifact action to v4
  > github actions: update it to work
  > add option to output driver debug
  > Add renovate.json (#107)

Fixes: https://github.com/scylladb/scylladb/issues/22420
2025-02-09 18:07:55 +02:00
Yaron Kaikov
359af0ae9c dist: support smooth upgrade from enterprise to source available
When upgrading, for example from `2024.1` to `2025.1`, the package name is
not identical, causing the upgrade command to fail:
```
Command: 'sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade scylla -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"'
Exit code: 100
Stdout:
Selecting previously unselected package scylla.
Preparing to unpack .../6-scylla_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb ...
Unpacking scylla (2025.1.0~dev-0.20250118.1ef2d9d07692-1) ...
Errors were encountered while processing:
/tmp/apt-dpkg-install-JbOMav/0-scylla-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/1-scylla-python3_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/2-scylla-server_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/3-scylla-kernel-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/4-scylla-node-exporter_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
/tmp/apt-dpkg-install-JbOMav/5-scylla-cqlsh_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb
Stderr:
E: Sub-process /usr/bin/dpkg returned an error code (1)
```

Adding `Obsoletes` (for rpm) and `Replaces` (for deb)

Fixes: https://github.com/scylladb/scylladb/issues/22420

Closes scylladb/scylladb#22457

(cherry picked from commit 93f53f4eb8)

Closes scylladb/scylladb#22753
2025-02-09 18:06:52 +02:00
Avi Kivity
7f350558c2 Update tools/python3 (smooth upgrade from enterprise)
* tools/python3 8415caf...91c9531 (1):
  > dist: support smooth upgrade from enterprise to source available

Ref #22420
2025-02-09 14:22:33 +02:00
Botond Dénes
fa9b1800b6 reader_concurrency_semaphore: set_notify_handler(): disable timeout
set_notify_handler() is called after a querier was inserted into the
querier cache. It has two purposes: set a callback for eviction and set
a TTL for the cache entry. This latter was not disabling the
pre-existing timeout of the permit (if any) and this would lead to
premature eviction of the cache entry if the timeout was shorter than
TTL (which is typical).
Disable the timeout before setting the TTL to prevent premature
eviction.

Fixes: scylladb/scylladb#22629
(cherry picked from commit 9174f27cc8)
2025-02-09 00:32:38 +00:00
Botond Dénes
c25d447b9c reader_permit: mark check_abort() as const
All it does is read one field, making it const makes using it easier.

(cherry picked from commit a3ae0c7cee)
2025-02-09 00:32:38 +00:00
Ferenc Szili
cf147d8f85 truncate: create session during request handling
Currently, the session ID under which the truncate for tablets request is
running is created during the request creation and queuing. This is a problem
because this could overwrite the session ID of any ongoing operation on
system.topology#session

This change moves the creation of the session ID for truncate from the request
creation to the request handling.

Fixes #22613

Closes scylladb/scylladb#22615

(cherry picked from commit a59618e83d)

Closes scylladb/scylladb#22705
2025-02-06 10:09:00 +02:00
Botond Dénes
319626e941 reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask the caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.

Fixes: #22588

Closes scylladb/scylladb#22624

(cherry picked from commit f2d5819645)

Closes scylladb/scylladb#22704
2025-02-06 10:08:19 +02:00
Aleksandra Martyniuk
cca2d974b6 service: use read barrier in tablet_virtual_task::contains
Currently, when the tablet repair is started, info regarding
the operation is kept in the system.tablets table. The new tablet states
are reflected in memory after load_topology_state is called.
Before that, the data in the table and the memory aren't consistent.

To check the supported operations, tablet_virtual_task uses in-memory
tablet_metadata. Hence, it may not see the operation, even though
its info is already kept in system.tablets table.

Run read barrier in tablet_virtual_task::contains to ensure it will
see the latest data. Add a test to check it.

Fixes: #21975.

Closes scylladb/scylladb#21995

(cherry picked from commit 610a761ca2)

Closes scylladb/scylladb#22694
2025-02-06 10:07:51 +02:00
Aleksandra Martyniuk
43f2e5f86b nodetool: tasks: print empty string for start_time/end_time if unspecified
If start_time/end_time is unspecified for a task, the task_manager API
returns the epoch, and nodetool prints that value in the task status.

Fix the nodetool tasks commands to print an empty string for
start_time/end_time if it isn't specified.

Modify nodetool tasks status docs to show empty end_time.

Fixes: #22373.

Closes scylladb/scylladb#22370

(cherry picked from commit 477ad98b72)

Closes scylladb/scylladb#22601
2025-02-06 10:05:07 +02:00
Takuya ASADA
ad81d49923 dist: Support FIPS mode
- To make Scylla able to run on a FIPS-compliant system, add .hmac files for
  the crypto libraries in the relocatable/rpm/deb packages.
- Currently we just write the hmac value into *.hmac files, but there is a
  newer .hmac file format that looks something like this:

  ```
  [global]
  format-version = 1
  [lib.xxx.so.yy]
  path = /lib64/libxxx.so.yy
  hmac = <hmac>
  ```
  It seems GnuTLS rejects the FIPS self-test on .libgnutls.so.30.hmac when
  the file format is the older one.
  Since we need an absolute path in the "path" directive, we need to generate
  .libgnutls.so.30.hmac in the older format in create-relocatable-script.py.

Fixes scylladb/scylladb#22573

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes scylladb/scylladb#22384

(cherry picked from commit fb4c7dc3d8)

Closes scylladb/scylladb#22587
2025-02-06 10:01:12 +02:00
Wojciech Mitros
138c68d80e mv: forbid views with tablets by default
Materialized views with tablets are not stable yet, but we want
them available as an experimental feature, mainly for testing.

The feature was added in scylladb/scylladb#21833,
but currently it has no effect. All tests have been updated to use the
feature, so we should finally make it work.
This patch prevents users from creating materialized views in keyspaces
using tablets when the VIEWS_WITH_TABLETS feature is not enabled - such
requests will now get rejected.

Fixes scylladb/scylladb#21832

Closes scylladb/scylladb#22217

(cherry picked from commit 677f9962cf)

Closes scylladb/scylladb#22659
2025-02-04 08:06:23 +01:00
Avi Kivity
e0fb727f18 Update seastar submodule (hwloc failure on some AWS instances)
* seastar 1822136684...a350b5d70e (1):
  > resource: fallback to sysconf when failed to detect memory size from hwloc

Fixes #22382.
2025-02-03 22:47:39 +02:00
Jenkins Promoter
440833ae59 Update ScyllaDB version to: 2025.1.0-rc2 2025-02-03 13:23:18 +02:00
Michael Litvak
246635c426 test/test_view_build_status: fix wrong assert in test
The test expects and asserts that after wait_for_view is completed we
read the view_build_status table and get a row for each node and view.
But this is wrong because wait_for_view may have read the table on one
node, and then we query the table on a different node that didn't insert
all the rows yet, so the assert could fail.

To fix it we change the test to retry and check that eventually all
expected rows are found and then eventually removed on the same host.

Fixes scylladb/scylladb#22547

Closes scylladb/scylladb#22585

(cherry picked from commit 44c06ddfbb)

Closes scylladb/scylladb#22608
2025-02-03 09:24:17 +01:00
Michael Litvak
58eda6670f view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F up to the maximum
token, then go back to the minimum token and consume all partitions
until F, at which point we detect that we passed F and complete
building the view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493

(cherry picked from commit 6d34125eb7)

Closes scylladb/scylladb#22607
2025-02-02 22:29:52 +02:00
Jenkins Promoter
28b8896680 Update pgo profiles - aarch64 2025-02-01 04:30:11 +02:00
Jenkins Promoter
e9cae4be17 Update pgo profiles - x86_64 2025-02-01 04:05:22 +02:00
Avi Kivity
daf1c96ad3 seastar: point submodule at scylla-seastar.git
This allows backporting commits to seastar.
2025-01-31 19:47:30 +02:00
Botond Dénes
1a1893078a Merge '[Backport 2025.1] encrypted_file_impl: Check for reads on or past actual file length in transform' from Scylladb[bot]
Fixes #22236

If reading a file and not stopping on block bounds returned by `size()`, we could allow reading from (_file_size + <1-15>) (if crossing block boundary) and try to decrypt this buffer (last one).

Simplest example:
Actual data size: 4095
Physical file size: 4095 + key block size (typically 16)
Read from 4096: -> 15 bytes (padding) -> transform returns `_file_size` - `read offset` -> wraparound -> a much larger number than we expected (not to mention the data in question is junk/zero).

Check on last block in `transform` would wrap around size due to us being >= file size (l).
Just do an early bounds check and return zero if we're past the actual data limit.

- (cherry picked from commit e96cc52668)

- (cherry picked from commit 2fb95e4e2f)

Parent PR: #22395

Closes scylladb/scylladb#22583

* github.com:scylladb/scylladb:
  encrypted_file_test: Test reads beyond decrypted file length
  encrypted_file_impl: Check for reads on or past actual file length in transform
2025-01-31 11:38:50 +02:00
Aleksandra Martyniuk
8cc5566a3c api: task_manager: do not unregister tasks on get_status
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister the queried task if it
has already finished.

The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.

(cherry picked from commit 18cc79176a)
2025-01-31 08:21:03 +00:00
Aleksandra Martyniuk
1f52ced2ff api: task_manager: add /task_manager/drain
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a way to drop a task, so that they can
work only with the tasks for operations that were invoked by these
tests.

Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.

(cherry picked from commit e37d1bcb98)
2025-01-31 08:21:03 +00:00
Avi Kivity
d7e3ab2226 Merge '[Backport 2025.1] truncate: trigger truncate logic from a transition state instead of global topology request' from Ferenc Szili
This is a manual backport of #22452

Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in topology_coordinator::handle_global_request() while topology::tstate remains empty. This creates problems because topology::is_busy() uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing.

This change introduces a new topology transition topology::transition_state::truncate_table and moves the truncate logic to a new method topology_coordinator::handle_truncate_table(). This method is now called as a handler of the truncate_table transition state instead of a handler of the truncate_table global topology request.

Fixes #22552

Closes scylladb/scylladb#22557

* github.com:scylladb/scylladb:
  truncate: trigger truncate logic from transition state instead of global request handler
  truncate: add truncate_table transition state
2025-01-30 22:49:17 +02:00
Anna Stuchlik
cf589222a0 doc: update the Web Installer docs to remove OSS
Fixes https://github.com/scylladb/scylladb/issues/22292

Closes scylladb/scylladb#22433

(cherry picked from commit 2a6445343c)

Closes scylladb/scylladb#22581
2025-01-30 13:04:16 +02:00
Anna Stuchlik
156800a3dd doc: add SStable support in 2025.1
This commit adds the information about SStable version support in 2025.1
by replacing "2022.2" with "2022.2 and above".

In addition, this commit removes information about versions that are
no longer supported.

Fixes https://github.com/scylladb/scylladb/issues/22485

Closes scylladb/scylladb#22486

(cherry picked from commit caf598b118)

Closes scylladb/scylladb#22580
2025-01-30 13:03:47 +02:00
Nikos Dragazis
d1e8b02260 encrypted_file_test: Test reads beyond decrypted file length
Add a test to reproduce a bug in the read DMA API of
`encrypted_file_impl` (the file implementation for Encryption-at-Rest).

The test creates an encrypted file that contains padding, and then
attempts to read from an offset within the padding area. Although this
offset is invalid on the decrypted file, the `encrypted_file_impl` makes
no checks and proceeds with the decryption of padding data, which
eventually leads to bogus results.

Refs #22236.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8f936b2cbc)
(cherry picked from commit 2fb95e4e2f)
2025-01-30 09:17:31 +00:00
Calle Wilund
a51888694e encrypted_file_impl: Check for reads on or past actual file length in transform
Fixes #22236

If reading a file and not stopping on block bounds returned by `size()`, we could
allow reading from (_file_size+1-15) (block boundary) and try to decrypt this
buffer (last one).
Check on last block in `transform` would wrap around size due to us being >=
file size (l).

Simplest example:
Actual data size: 4095
Physical file size: 4095 + key block size (typically 16)
Read from 4096: -> 15 bytes (padding) -> transform returns _file_size - read offset
-> wraparound -> a much larger number than we expected
(not to mention the data in question is junk/zero).

Just do an early bounds check and return zero if we're past the actual data limit.

v2:
* Moved check to a min expression instead
* Added lengthy comment
* Added unit test

v3:
* Fixed read_dma_bulk handling of short, unaligned read
* Added test for unaligned read

v4:
* Added another unaligned test case

(cherry picked from commit e96cc52668)
2025-01-30 09:17:31 +00:00
Botond Dénes
68f134ee23 Merge '[Backport 2025.1] Do not update topology on address change' from Scylladb[bot]
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated. The series factors out peers table update code from
sync_raft_topology_nodes() and calls it on topology and ip address
updates. As a side effect it fixes #22293 since now topology loading
does not require IP do be present, so the assert that is triggered in
this bug is removed.

Fixes: scylladb/scylladb#22293

- (cherry picked from commit ef929c5def)

- (cherry picked from commit fbfef6b28a)

Parent PR: #22519

Closes scylladb/scylladb#22543

* github.com:scylladb/scylladb:
  topology coordinator: do not update topology on address change
  topology coordinator: split out the peer table update functionality from raft state application
2025-01-30 11:14:19 +02:00
Jenkins Promoter
b623c237bc Update ScyllaDB version to: 2025.1.0-rc1 2025-01-30 01:25:18 +02:00
Calle Wilund
8379d545c5 docs: Remove configuration_encryptor
Fixes #21993

Removes configuration_encryptor mention from docs.
The tool itself (Java) is not included in the main-branch Java tools,
so there is nothing to remove there; only the docs mention it.

Closes scylladb/scylladb#22427

(cherry picked from commit bae5b44b97)

Closes scylladb/scylladb#22556
2025-01-29 20:17:36 +02:00
Michael Litvak
58d13d0daf cdc: fix handling of new generation during raft upgrade
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.

Another issue is that when we fail with a "fatal" exception and do not
retry the read, the cdc metadata is left in an inconsistent state that
causes further attempts to insert this CDC generation to fail.

What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.

To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.

Fixes scylladb/scylladb#21227

Closes scylladb/scylladb#22093

(cherry picked from commit 4f5550d7f2)

Closes scylladb/scylladb#22546
2025-01-29 20:06:18 +02:00
Anna Stuchlik
4def507b1b doc: add OS support for 2025.1 and reorganize the page
This commit adds the OS support information for version 2025.1.
In addition, the OS support page is reorganized so that:
- The content is moved from the include page _common/os-support-info.rst
  to the regular os-support.rst page. The include page was necessary
  to document different support for OSS and Enterprise versions, so
  we don't need it anymore.
- I skipped the entries for versions that won't be supported when 2025.1
  is released: 6.1 and 2023.1.
- I moved the definition of "supported" to the end of the page for better
  readability.
- I've renamed the index entry to "OS Support" to be shorter on the left menu.

Fixes https://github.com/scylladb/scylladb/issues/22474

Closes scylladb/scylladb#22476

(cherry picked from commit 61c822715c)

Closes scylladb/scylladb#22538
2025-01-29 19:48:32 +02:00
Anna Stuchlik
69ad9350cc doc: remove Enterprise labels and directives
This PR removes the now-redundant Enterprise labels and directives
from the ScyllaDB documentation.

Fixes https://github.com/scylladb/scylladb/issues/22432

Closes scylladb/scylladb#22434

(cherry picked from commit b2a718547f)

Closes scylladb/scylladb#22539
2025-01-29 19:48:11 +02:00
Anna Stuchlik
29e5f5f54d doc: enable the FIPS note in the ScyllaDB docs
This commit removes the information about FIPS out of the '.. only:: enterprise' directive.
As a result, the information will now show in the doc in the ScyllaDB repo
(previously, the directive included the note in the Enterprise docs only).

Refs https://github.com/scylladb/scylla-enterprise/issues/5020

Closes scylladb/scylladb#22374

(cherry picked from commit 1d5ef3dddb)

Closes scylladb/scylladb#22550
2025-01-29 19:47:37 +02:00
Avi Kivity
379b3fa46c Merge '[Backport 2025.1] repair: handle no_such_keyspace in repair preparation phase' from null
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

Fixes: #22073.

Requires backport to 6.1 and 6.2 as they contain the bug

- (cherry picked from commit bfb1704afa)

- (cherry picked from commit 54e7f2819c)

Parent PR: #22473

Closes scylladb/scylladb#22542

* github.com:scylladb/scylladb:
  test: add test to check if repair handles no_such_keyspace
  repair: handle keyspace dropped
2025-01-29 14:09:23 +02:00
Ferenc Szili
fe869fd902 test: add reproducer and test for fix to split ready CG creation
This adds a reproducer for #22431

In cases where a tablet storage group manager had more than one storage
group, it was possible to create compaction groups outside the group0
guard, which could create problems with operations that should be
mutually exclusive with compaction group creation.

(cherry picked from commit 8bff7786a8)
2025-01-29 10:10:28 +00:00
Ferenc Szili
dc55a566fa table: run set_split_mode() on all storage groups during all_storage_groups_split()
tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode()
for each of its storage groups to create split ready compaction groups. It does
this by iterating through storage groups using std::ranges::all_of() which is
not guaranteed to iterate through the entire range, and will stop iterating on
the first occurrence of the predicate (set_split_mode()) returning false.
set_split_mode() creates the split compaction groups and returns false if the
storage group's main compaction group or merging groups are not empty. This
means that in cases where the tablet storage group manager has non-empty
storage groups, we could have a situation where split compaction groups are not
created for all storage groups.

The missing split compaction groups are later created in
tablet_storage_group_manager::split_all_storage_groups() which also calls
set_split_mode(), and that is the reason why split completes successfully. The
problem is that tablet_storage_group_manager::all_storage_groups_split() runs
under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups()
does not. This can cause problems with operations that should be
mutually exclusive with compaction group creation, e.g. DROP TABLE/DROP KEYSPACE.

(cherry picked from commit 24e8d2a55c)
2025-01-29 10:10:28 +00:00
Ferenc Szili
3bb8039359 truncate: trigger truncate logic from transition state instead of global
request handler

Before this change, the logic of truncate for tablets was triggered from
topology_coordinator::handle_global_request(). This was done without
using a topology transition state which remained empty throughout the
truncate handler's execution.

This change moves the truncate logic to a new method
topology_coordinator::handle_truncate_table(). This method is now called
as a handler of the truncate_table topology transition state instead of
a handler of the truncate_table global topology request.
2025-01-29 10:48:34 +01:00
Ferenc Szili
9f3838e614 truncate: add truncate_table transition state
Truncate table for tablets is implemented as a global topology operation.
However, it does not have a transition state associated with it, and
performs the truncate logic in handle_global_request() while
topology::tstate remains empty. This creates problems because
topology::is_busy() uses transition_state to determine if the topology
state machine is busy, and will return false even though a truncate
operation is ongoing.

This change adds a new transition state: truncate_table
2025-01-29 10:47:15 +01:00
Gleb Natapov
366212f997 topology coordinator: do not update topology on address change
Since now topology does not contain ip addresses there is no need to
create topology on an ip address change. Only peers table has to be
updated, so call a function that updates only the peers table.

(cherry picked from commit fbfef6b28a)
2025-01-28 21:51:11 +00:00
Gleb Natapov
c0637aff81 topology coordinator: split out the peer table update functionality from raft state application
Raft topology state application does two things: re-creates token metadata
and updates the peers table if needed. The code for both tasks is
intermixed now. The patch separates them into separate functions. This
will be needed in the next patch.

(cherry picked from commit ef929c5def)
2025-01-28 21:51:11 +00:00
Aleksandra Martyniuk
dcf436eb84 test: add test to check if repair handles no_such_keyspace
(cherry picked from commit 54e7f2819c)
2025-01-28 21:50:35 +00:00
Aleksandra Martyniuk
8e754e9d41 repair: handle keyspace dropped
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

(cherry picked from commit bfb1704afa)
2025-01-28 21:50:35 +00:00
Yaron Kaikov
f407799f25 Update ScyllaDB version to: 2025.1.0-rc0 2025-01-27 11:29:45 +02:00
348 changed files with 10491 additions and 4471 deletions

View File

@@ -0,0 +1,27 @@
name: Mark PR as Ready When Conflicts Label is Removed
on:
pull_request_target:
types:
- unlabeled
env:
DEFAULT_BRANCH: 'master'
jobs:
mark-ready:
if: github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 1
- name: Mark pull request as ready for review
run: gh pr ready "${{ github.event.pull_request.number }}"
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2025.1.0-dev
VERSION=2025.1.2
if test -f version
then

View File

@@ -24,7 +24,7 @@ static constexpr uint64_t KB = 1024ULL;
static constexpr uint64_t RCU_BLOCK_SIZE_LENGTH = 4*KB;
static constexpr uint64_t WCU_BLOCK_SIZE_LENGTH = 1*KB;
static bool should_add_capacity(const rjson::value& request) {
bool consumed_capacity_counter::should_add_capacity(const rjson::value& request) {
const rjson::value* return_consumed = rjson::find(request, "ReturnConsumedCapacity");
if (!return_consumed) {
return false;
@@ -62,9 +62,12 @@ static uint64_t calculate_half_units(uint64_t unit_block_size, uint64_t total_by
rcu_consumed_capacity_counter::rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum) :
consumed_capacity_counter(should_add_capacity(request)),_is_quorum(is_quorum) {
}
uint64_t rcu_consumed_capacity_counter::get_half_units(uint64_t total_bytes, bool is_quorum) noexcept {
return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, total_bytes, is_quorum);
}
uint64_t rcu_consumed_capacity_counter::get_half_units() const noexcept {
return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, _total_bytes, _is_quorum);
return get_half_units(_total_bytes, _is_quorum);
}
uint64_t wcu_consumed_capacity_counter::get_half_units() const noexcept {

View File

@@ -42,15 +42,18 @@ public:
*/
virtual uint64_t get_half_units() const noexcept = 0;
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_reponse = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {
virtual uint64_t get_half_units() const noexcept;
bool _is_quorum = false;
public:
rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum);
rcu_consumed_capacity_counter(): consumed_capacity_counter(false), _is_quorum(false){}
virtual uint64_t get_half_units() const noexcept;
static uint64_t get_half_units(uint64_t total_bytes, bool is_quorum) noexcept;
};
class wcu_consumed_capacity_counter : public consumed_capacity_counter {

View File

@@ -88,6 +88,9 @@ public:
static api_error table_not_found(std::string msg) {
return api_error("TableNotFoundException", std::move(msg));
}
static api_error limit_exceeded(std::string msg) {
return api_error("LimitExceededException", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), http::reply::status_type::internal_server_error);
}

View File

@@ -7,6 +7,7 @@
*/
#include <fmt/ranges.h>
#include <seastar/core/on_internal_error.hh>
#include "alternator/executor.hh"
#include "alternator/consumed_capacity.hh"
#include "auth/permission.hh"
@@ -55,6 +56,9 @@
#include "utils/error_injection.hh"
#include "db/schema_tables.hh"
#include "utils/rjson.hh"
#include "alternator/extract_from_attrs.hh"
#include "types/types.hh"
#include "db/system_keyspace.hh"
using namespace std::chrono_literals;
@@ -215,7 +219,7 @@ static void validate_table_name(const std::string& name) {
// instead of each component individually as DynamoDB does.
// The view_name() function assumes the table_name has already been validated
// but validates the legality of index_name and the combination of both.
static std::string view_name(const std::string& table_name, std::string_view index_name, const std::string& delim = ":") {
static std::string view_name(std::string_view table_name, std::string_view index_name, const std::string& delim = ":") {
if (index_name.length() < 3) {
throw api_error::validation("IndexName must be at least 3 characters long");
}
@@ -223,7 +227,7 @@ static std::string view_name(const std::string& table_name, std::string_view ind
throw api_error::validation(
fmt::format("IndexName '{}' must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", index_name));
}
std::string ret = table_name + delim + std::string(index_name);
std::string ret = std::string(table_name) + delim + std::string(index_name);
if (ret.length() > max_table_name_length) {
throw api_error::validation(
fmt::format("The total length of TableName ('{}') and IndexName ('{}') cannot exceed {} characters",
@@ -232,7 +236,7 @@ static std::string view_name(const std::string& table_name, std::string_view ind
return ret;
}
static std::string lsi_name(const std::string& table_name, std::string_view index_name) {
static std::string lsi_name(std::string_view table_name, std::string_view index_name) {
return view_name(table_name, index_name, "!:");
}
@@ -469,7 +473,90 @@ static rjson::value generate_arn_for_index(const schema& schema, std::string_vie
schema.ks_name(), schema.cf_name(), index_name));
}
static rjson::value fill_table_description(schema_ptr schema, table_status tbl_status, service::storage_proxy const& proxy)
// The following function checks if a given view has finished building.
// We need this for describe_table() to know if a view is still backfilling,
// or active.
//
// Currently we don't have in view_ptr the knowledge whether a view finished
// building long ago - so checking this involves a somewhat inefficient, but
// still node-local, process:
// We need a table that can accurately tell that all nodes have finished
// building this view. system.built_views is not good enough because it only
// knows the view building status in the current node. In recent versions,
// after PR #19745, we have a local table system.view_build_status_v2 with
// global information, replacing the old system_distributed.view_build_status.
// In theory, there can be a period during upgrading an old cluster when this
// table is not yet available. However, since the IndexStatus is a new feature
// too, it is acceptable that it doesn't yet work in the middle of the update.
static future<bool> is_view_built(
view_ptr view,
service::storage_proxy& proxy,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit) {
auto schema = proxy.data_dictionary().find_table(
"system", db::system_keyspace::VIEW_BUILD_STATUS_V2).schema();
// The table system.view_build_status_v2 has "keyspace_name" and
// "view_name" as the partition key, and each clustering row has
// "host_id" as clustering key and a string "status". We need to
// read a single partition:
partition_key pk = partition_key::from_exploded(*schema,
{utf8_type->decompose(view->ks_name()),
utf8_type->decompose(view->cf_name())});
dht::partition_range_vector partition_ranges{
dht::partition_range(dht::decorate_key(*schema, pk))};
auto selection = cql3::selection::selection::wildcard(schema); // only for get_query_options()!
auto partition_slice = query::partition_slice(
{query::clustering_range::make_open_ended_both_sides()},
{}, // static columns
{schema->get_column_definition("status")->id}, // regular columns
selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(
schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice),
query::tombstone_limit(proxy.get_tombstone_limit()));
service::storage_proxy::coordinator_query_result qr =
co_await proxy.query(
schema, std::move(command), std::move(partition_ranges),
db::consistency_level::LOCAL_ONE,
service::storage_proxy::coordinator_query_options(
executor::default_timeout(), std::move(permit), client_state, trace_state));
query::result_set rs = query::result_set::from_raw_result(
schema, partition_slice, *qr.query_result);
std::unordered_map<locator::host_id, sstring> statuses;
for (auto&& r : rs.rows()) {
auto host_id = r.get<utils::UUID>("host_id");
auto status = r.get<sstring>("status");
if (host_id && status) {
statuses.emplace(locator::host_id(*host_id), *status);
}
}
// A view is considered "built" if all nodes reported SUCCESS in having
// built this view. Note that we need this "SUCCESS" for all nodes in the
// cluster - even those that are temporarily down (their success is known
// by this node, even if they are down). Conversely, we don't care what is
// the recorded status for any node which is no longer in the cluster - it
// is possible we forgot to erase the status of nodes that left the
// cluster, but here we just ignore them and look at the nodes actually
// in the topology.
bool all_built = true;
auto token_metadata = proxy.get_token_metadata_ptr();
token_metadata->get_topology().for_each_node(
[&] (const locator::node& node) {
// Note: we could skip nodes in DCs which have no replication of
// this view. However, in practice even those nodes would run
// the view building (and just see empty content) so we don't
// need to bother with this skipping.
auto it = statuses.find(node.host_id());
if (it == statuses.end() || it->second != "SUCCESS") {
all_built = false;
}
});
co_return all_built;
}
static future<rjson::value> fill_table_description(schema_ptr schema, table_status tbl_status, service::storage_proxy& proxy, service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit)
{
rjson::value table_description = rjson::empty_object();
auto tags_ptr = db::get_tags_of_table(schema);
@@ -548,7 +635,22 @@ static rjson::value fill_table_description(schema_ptr schema, table_status tbl_s
// FIXME: we have to get ProjectionType from the schema when it is added
rjson::add(view_entry, "Projection", std::move(projection));
// Local secondary indexes are marked by an extra '!' sign occurring before the ':' delimiter
rjson::value& index_array = (delim_it > 1 && cf_name[delim_it-1] == '!') ? lsi_array : gsi_array;
bool is_lsi = (delim_it > 1 && cf_name[delim_it-1] == '!');
// Add IndexStatus and Backfilling flags, but only for GSIs -
// LSIs can only be created with the table itself and do not
// have a status. Alternator schema operations are synchronous
// so only two combinations of these flags are possible: ACTIVE
// (for a built view) or CREATING+Backfilling (if view building
// is in progress).
if (!is_lsi) {
if (co_await is_view_built(vptr, proxy, client_state, trace_state, permit)) {
rjson::add(view_entry, "IndexStatus", "ACTIVE");
} else {
rjson::add(view_entry, "IndexStatus", "CREATING");
rjson::add(view_entry, "Backfilling", rjson::value(true));
}
}
rjson::value& index_array = is_lsi ? lsi_array : gsi_array;
rjson::push_back(index_array, std::move(view_entry));
}
if (!lsi_array.Empty()) {
@@ -572,7 +674,7 @@ static rjson::value fill_table_description(schema_ptr schema, table_status tbl_s
executor::supplement_table_stream_info(table_description, *schema, proxy);
// FIXME: still missing some response fields (issue #5026)
return table_description;
co_return table_description;
}
bool is_alternator_keyspace(const sstring& ks_name) {
@@ -591,11 +693,11 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
rjson::value table_description = fill_table_description(schema, table_status::active, _proxy);
rjson::value table_description = co_await fill_table_description(schema, table_status::active, _proxy, client_state, trace_state, permit);
rjson::value response = rjson::empty_object();
rjson::add(response, "Table", std::move(table_description));
elogger.trace("returning {}", response);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(response)));
co_return make_jsonable(std::move(response));
}
// Check CQL's Role-Based Access Control (RBAC) permission_to_check (MODIFY,
@@ -656,7 +758,7 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
auto& p = _proxy.container();
schema_ptr schema = get_table(_proxy, request);
rjson::value table_description = fill_table_description(schema, table_status::deleting, _proxy);
rjson::value table_description = co_await fill_table_description(schema, table_status::deleting, _proxy, client_state, trace_state, permit);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::DROP);
co_await _mm.container().invoke_on(0, [&, cs = client_state.move_to_other_shard()] (service::migration_manager& mm) -> future<> {
// FIXME: the following needs to be in a loop. If mm.announce() below
@@ -704,7 +806,7 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
co_return make_jsonable(std::move(response));
}
static data_type parse_key_type(const std::string& type) {
static data_type parse_key_type(std::string_view type) {
// Note that keys are only allowed to be string, blob or number (S/B/N).
// The other types: boolean and various lists or sets - are not allowed.
if (type.length() == 1) {
@@ -719,7 +821,7 @@ static data_type parse_key_type(const std::string& type) {
}
static void add_column(schema_builder& builder, const std::string& name, const rjson::value& attribute_definitions, column_kind kind) {
static void add_column(schema_builder& builder, const std::string& name, const rjson::value& attribute_definitions, column_kind kind, bool computed_column=false) {
// FIXME: Currently, the column name ATTRS_COLUMN_NAME is not allowed
// because we use it for our untyped attribute map, and we can't have a
// second column with the same name. We should fix this, by renaming
@@ -731,7 +833,16 @@ static void add_column(schema_builder& builder, const std::string& name, const r
const rjson::value& attribute_info = *it;
if (attribute_info["AttributeName"].GetString() == name) {
auto type = attribute_info["AttributeType"].GetString();
builder.with_column(to_bytes(name), parse_key_type(type), kind);
data_type dt = parse_key_type(type);
if (computed_column) {
// Computed column for GSI (doesn't choose a real column as-is
// but rather extracts a single value from the ":attrs" map)
alternator_type at = type_info_from_string(type).atype;
builder.with_computed_column(to_bytes(name), dt, kind,
std::make_unique<extract_from_attrs_column_computation>(to_bytes(name), at));
} else {
builder.with_column(to_bytes(name), dt, kind);
}
return;
}
}
@@ -1072,6 +1183,87 @@ static std::unordered_set<std::string> validate_attribute_definitions(const rjso
return seen_attribute_names;
}
// The following "extract_from_attrs_column_computation" implementation is
// what allows Alternator GSIs to use in a materialized view's key a member
// from the ":attrs" map instead of a real column in the schema:
const bytes extract_from_attrs_column_computation::MAP_NAME = executor::ATTRS_COLUMN_NAME;
column_computation_ptr extract_from_attrs_column_computation::clone() const {
return std::make_unique<extract_from_attrs_column_computation>(*this);
}
// Serialize the *definition* of this column computation into a JSON
// string with a unique "type" string - TYPE_NAME - which then causes
// column_computation::deserialize() to create an object from this class.
bytes extract_from_attrs_column_computation::serialize() const {
rjson::value ret = rjson::empty_object();
rjson::add(ret, "type", TYPE_NAME);
rjson::add(ret, "attr_name", rjson::from_string(to_string_view(_attr_name)));
rjson::add(ret, "desired_type", represent_type(_desired_type).ident);
return to_bytes(rjson::print(ret));
}
// Construct an extract_from_attrs_column_computation object based on the
// saved output of serialize(). Calls on_internal_error() if the string
// doesn't match the expected output format of serialize(). "type" is not
// checked - we assume the caller (column_computation::deserialize()) won't
// call this constructor if "type" doesn't match.
extract_from_attrs_column_computation::extract_from_attrs_column_computation(const rjson::value &v) {
const rjson::value* attr_name = rjson::find(v, "attr_name");
if (attr_name->IsString()) {
_attr_name = bytes(to_bytes_view(rjson::to_string_view(*attr_name)));
const rjson::value* desired_type = rjson::find(v, "desired_type");
if (desired_type->IsString()) {
_desired_type = type_info_from_string(rjson::to_string_view(*desired_type)).atype;
switch (_desired_type) {
case alternator_type::S:
case alternator_type::B:
case alternator_type::N:
// We're done
return;
default:
// Fall through to on_internal_error below.
break;
}
}
}
on_internal_error(elogger, format("Improperly formatted alternator::extract_from_attrs_column_computation computed column definition: {}", v));
}
regular_column_transformation::result extract_from_attrs_column_computation::compute_value(
const schema& schema,
const partition_key& key,
const db::view::clustering_or_static_row& row) const
{
const column_definition* attrs_col = schema.get_column_definition(MAP_NAME);
if (!attrs_col || !attrs_col->is_regular() || !attrs_col->is_multi_cell()) {
on_internal_error(elogger, "extract_from_attrs_column_computation::compute_value() on a table without an attrs map");
}
// Look for the desired attribute _attr_name in the attrs_col map in row:
const atomic_cell_or_collection* attrs = row.cells().find_cell(attrs_col->id);
if (!attrs) {
return regular_column_transformation::result();
}
collection_mutation_view cmv = attrs->as_collection_mutation();
return cmv.with_deserialized(*attrs_col->type, [this] (const collection_mutation_view_description& cmvd) {
for (auto&& [key, cell] : cmvd.cells) {
if (key == _attr_name) {
return regular_column_transformation::result(cell,
std::bind(serialized_value_if_type, std::placeholders::_1, _desired_type));
}
}
return regular_column_transformation::result();
});
}
// extract_from_attrs_column_computation needs the whole row to compute its
// value; it can't use just the partition key.
bytes extract_from_attrs_column_computation::compute_value(const schema&, const partition_key&) const {
on_internal_error(elogger, "extract_from_attrs_column_computation::compute_value called without row");
}
static future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper, bool enforce_authorization) {
SCYLLA_ASSERT(this_shard_id() == 0);
@@ -1110,67 +1302,15 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
schema_ptr partial_schema = builder.build();
// Parse GlobalSecondaryIndexes parameters before creating the base
// table, so if we have a parse errors we can fail without creating
// Parse Local/GlobalSecondaryIndexes parameters before creating the
// base table, so if we have a parse error we can fail without creating
// any table.
const rjson::value* gsi = rjson::find(request, "GlobalSecondaryIndexes");
std::vector<schema_builder> view_builders;
std::unordered_set<std::string> index_names;
if (gsi) {
if (!gsi->IsArray()) {
co_return api_error::validation("GlobalSecondaryIndexes must be an array.");
}
for (const rjson::value& g : gsi->GetArray()) {
const rjson::value* index_name_v = rjson::find(g, "IndexName");
if (!index_name_v || !index_name_v->IsString()) {
co_return api_error::validation("GlobalSecondaryIndexes IndexName must be a string.");
}
std::string_view index_name = rjson::to_string_view(*index_name_v);
auto [it, added] = index_names.emplace(index_name);
if (!added) {
co_return api_error::validation(fmt::format("Duplicate IndexName '{}', ", index_name));
}
std::string vname(view_name(table_name, index_name));
elogger.trace("Adding GSI {}", index_name);
// FIXME: read and handle "Projection" parameter. This will
// require the MV code to copy just parts of the attrs map.
schema_builder view_builder(keyspace_name, vname);
auto [view_hash_key, view_range_key] = parse_key_schema(g);
if (partial_schema->get_column_definition(to_bytes(view_hash_key)) == nullptr) {
// A column that exists in a global secondary index is upgraded from being a map entry
// to having a regular column definition in the base schema
add_column(builder, view_hash_key, attribute_definitions, column_kind::regular_column);
}
add_column(view_builder, view_hash_key, attribute_definitions, column_kind::partition_key);
unused_attribute_definitions.erase(view_hash_key);
if (!view_range_key.empty()) {
if (partial_schema->get_column_definition(to_bytes(view_range_key)) == nullptr) {
// A column that exists in a global secondary index is upgraded from being a map entry
// to having a regular column definition in the base schema
if (partial_schema->get_column_definition(to_bytes(view_hash_key)) == nullptr) {
// FIXME: this is alternator limitation only, because Scylla's materialized views
// we use underneath do not allow more than 1 base regular column to be part of the MV key
elogger.warn("Only 1 regular column from the base table should be used in the GSI key in order to ensure correct liveness management without assumptions");
}
add_column(builder, view_range_key, attribute_definitions, column_kind::regular_column);
}
add_column(view_builder, view_range_key, attribute_definitions, column_kind::clustering_key);
unused_attribute_definitions.erase(view_range_key);
}
// Base key columns which aren't part of the index's key need to
// be added to the view nonetheless, as (additional) clustering
// key(s).
if (hash_key != view_hash_key && hash_key != view_range_key) {
add_column(view_builder, hash_key, attribute_definitions, column_kind::clustering_key);
}
if (!range_key.empty() && range_key != view_hash_key && range_key != view_range_key) {
add_column(view_builder, range_key, attribute_definitions, column_kind::clustering_key);
}
// GSIs have no tags:
view_builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>());
view_builders.emplace_back(std::move(view_builder));
}
}
// Remember the attributes used for LSI keys. Since LSI must be created
// with the table, we make these attributes real schema columns, and need
// to remember this below if the same attributes are used as GSI keys.
std::unordered_set<std::string> lsi_range_keys;
const rjson::value* lsi = rjson::find(request, "LocalSecondaryIndexes");
if (lsi) {
@@ -1228,9 +1368,68 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
std::map<sstring, sstring> tags_map = {{db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY, "true"}};
view_builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(tags_map));
view_builders.emplace_back(std::move(view_builder));
lsi_range_keys.emplace(view_range_key);
}
}
const rjson::value* gsi = rjson::find(request, "GlobalSecondaryIndexes");
if (gsi) {
if (!gsi->IsArray()) {
co_return api_error::validation("GlobalSecondaryIndexes must be an array.");
}
for (const rjson::value& g : gsi->GetArray()) {
const rjson::value* index_name_v = rjson::find(g, "IndexName");
if (!index_name_v || !index_name_v->IsString()) {
co_return api_error::validation("GlobalSecondaryIndexes IndexName must be a string.");
}
std::string_view index_name = rjson::to_string_view(*index_name_v);
auto [it, added] = index_names.emplace(index_name);
if (!added) {
co_return api_error::validation(fmt::format("Duplicate IndexName '{}', ", index_name));
}
std::string vname(view_name(table_name, index_name));
elogger.trace("Adding GSI {}", index_name);
// FIXME: read and handle "Projection" parameter. This will
// require the MV code to copy just parts of the attrs map.
schema_builder view_builder(keyspace_name, vname);
auto [view_hash_key, view_range_key] = parse_key_schema(g);
// If an attribute is already a real column in the base table
// (i.e., a key attribute) or we already made it a real column
// as an LSI key above, we can use it directly as a view key.
// Otherwise, we need to add it as a "computed column", which
// extracts and deserializes the attribute from the ":attrs" map.
bool view_hash_key_real_column =
partial_schema->get_column_definition(to_bytes(view_hash_key)) ||
lsi_range_keys.contains(view_hash_key);
add_column(view_builder, view_hash_key, attribute_definitions, column_kind::partition_key, !view_hash_key_real_column);
unused_attribute_definitions.erase(view_hash_key);
if (!view_range_key.empty()) {
bool view_range_key_real_column =
partial_schema->get_column_definition(to_bytes(view_range_key)) ||
lsi_range_keys.contains(view_range_key);
add_column(view_builder, view_range_key, attribute_definitions, column_kind::clustering_key, !view_range_key_real_column);
if (!partial_schema->get_column_definition(to_bytes(view_range_key)) &&
!partial_schema->get_column_definition(to_bytes(view_hash_key))) {
// FIXME: This warning should go away. See issue #6714
elogger.warn("Only 1 regular column from the base table should be used in the GSI key in order to ensure correct liveness management without assumptions");
}
unused_attribute_definitions.erase(view_range_key);
}
// Base key columns which aren't part of the index's key need to
// be added to the view nonetheless, as (additional) clustering
// key(s).
if (hash_key != view_hash_key && hash_key != view_range_key) {
add_column(view_builder, hash_key, attribute_definitions, column_kind::clustering_key);
}
if (!range_key.empty() && range_key != view_hash_key && range_key != view_range_key) {
add_column(view_builder, range_key, attribute_definitions, column_kind::clustering_key);
}
// GSIs have no tags:
view_builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>());
view_builders.emplace_back(std::move(view_builder));
}
}
if (!unused_attribute_definitions.empty()) {
co_return api_error::validation(fmt::format(
"AttributeDefinitions defines spurious attributes not used by any KeySchema: {}",
@@ -1371,12 +1570,37 @@ future<executor::request_return_type> executor::create_table(client_state& clien
});
}
// When UpdateTable adds a GSI, the type of its key columns must be specified
// in AttributeDefinitions. If one of these key columns is *already* a key
// column of the base table or any of its prior GSIs or LSIs, the type
// given in AttributeDefinitions must match the type of the existing key -
// otherwise Alternator will not know which type to enforce in new writes.
// This function checks for such conflicts. It assumes that the structure of
// the given attribute_definitions was already validated (with
// validate_attribute_definitions()).
// This function should be called multiple times - once for the base schema
// and once for each of its views (existing GSIs and LSIs on this table).
static void check_attribute_definitions_conflicts(const rjson::value& attribute_definitions, const schema& schema) {
for (auto& def : schema.primary_key_columns()) {
std::string def_type = type_to_string(def.type);
for (auto it = attribute_definitions.Begin(); it != attribute_definitions.End(); ++it) {
const rjson::value& attribute_info = *it;
if (attribute_info["AttributeName"].GetString() == def.name_as_text()) {
auto type = attribute_info["AttributeType"].GetString();
if (type != def_type) {
throw api_error::validation(fmt::format("AttributeDefinitions redefined {} to {}, already a key attribute of type {} in this table", def.name_as_text(), type, def_type));
}
break;
}
}
}
}
future<executor::request_return_type> executor::update_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
_stats.api_operations.update_table++;
elogger.trace("Updating table {}", request);
static const std::vector<sstring> unsupported = {
"GlobalSecondaryIndexUpdates",
"ProvisionedThroughput",
"ReplicaUpdates",
"SSESpecification",
@@ -1388,11 +1612,14 @@ future<executor::request_return_type> executor::update_table(client_state& clien
}
}
bool empty_request = true;
if (rjson::find(request, "BillingMode")) {
empty_request = false;
verify_billing_mode(request);
}
co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state)), enforce_authorization = bool(_enforce_authorization), client_state_other_shard = client_state.move_to_other_shard()]
co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state)), enforce_authorization = bool(_enforce_authorization), client_state_other_shard = client_state.move_to_other_shard(), empty_request]
(service::migration_manager& mm) mutable -> future<executor::request_return_type> {
// FIXME: the following needs to be in a loop. If mm.announce() below
// fails, we need to retry the whole thing.
@@ -1412,6 +1639,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
rjson::value* stream_specification = rjson::find(request, "StreamSpecification");
if (stream_specification && stream_specification->IsObject()) {
empty_request = false;
add_stream_options(*stream_specification, builder, p.local());
// Alternator Streams doesn't yet work when the table uses tablets (#16317)
auto stream_enabled = rjson::find(*stream_specification, "StreamEnabled");
@@ -1423,8 +1651,162 @@ future<executor::request_return_type> executor::update_table(client_state& clien
}
auto schema = builder.build();
std::vector<view_ptr> new_views;
std::vector<std::string> dropped_views;
rjson::value* gsi_updates = rjson::find(request, "GlobalSecondaryIndexUpdates");
if (gsi_updates) {
if (!gsi_updates->IsArray()) {
co_return api_error::validation("GlobalSecondaryIndexUpdates must be an array");
}
if (gsi_updates->Size() > 1) {
// Although UpdateTable takes an array of operations and could
// support multiple Create and/or Delete operations in one
// command, DynamoDB doesn't actually allow this, and throws
// a LimitExceededException if this is attempted.
co_return api_error::limit_exceeded("GlobalSecondaryIndexUpdates only allows one index creation or deletion");
}
if (gsi_updates->Size() == 1) {
empty_request = false;
if (!(*gsi_updates)[0].IsObject() || (*gsi_updates)[0].MemberCount() != 1) {
co_return api_error::validation("GlobalSecondaryIndexUpdates array must contain one object with a Create, Delete or Update operation");
}
auto it = (*gsi_updates)[0].MemberBegin();
const std::string_view op = rjson::to_string_view(it->name);
if (!it->value.IsObject()) {
co_return api_error::validation("GlobalSecondaryIndexUpdates entries must be objects");
}
const rjson::value* index_name_v = rjson::find(it->value, "IndexName");
if (!index_name_v || !index_name_v->IsString()) {
co_return api_error::validation("GlobalSecondaryIndexUpdates operation must have IndexName");
}
std::string_view index_name = rjson::to_string_view(*index_name_v);
std::string_view table_name = schema->cf_name();
std::string_view keyspace_name = schema->ks_name();
std::string vname(view_name(table_name, index_name));
if (op == "Create") {
const rjson::value* attribute_definitions = rjson::find(request, "AttributeDefinitions");
if (!attribute_definitions) {
co_return api_error::validation("GlobalSecondaryIndexUpdates Create needs AttributeDefinitions");
}
std::unordered_set<std::string> unused_attribute_definitions =
validate_attribute_definitions(*attribute_definitions);
check_attribute_definitions_conflicts(*attribute_definitions, *schema);
for (auto& view : p.local().data_dictionary().find_column_family(tab).views()) {
check_attribute_definitions_conflicts(*attribute_definitions, *view);
}
if (p.local().data_dictionary().has_schema(keyspace_name, vname)) {
// Surprisingly, DynamoDB uses validation error here, not resource_in_use
co_return api_error::validation(fmt::format(
"GSI {} already exists in table {}", index_name, table_name));
}
if (p.local().data_dictionary().has_schema(keyspace_name, lsi_name(table_name, index_name))) {
co_return api_error::validation(fmt::format(
"LSI {} already exists in table {}, can't use same name for GSI", index_name, table_name));
}
elogger.trace("Adding GSI {}", index_name);
// FIXME: read and handle "Projection" parameter. This will
// require the MV code to copy just parts of the attrs map.
schema_builder view_builder(keyspace_name, vname);
auto [view_hash_key, view_range_key] = parse_key_schema(it->value);
// If an attribute is already a real column in the base
// table (i.e., a key attribute in the base table or LSI),
// we can use it directly as a view key. Otherwise, we
// need to add it as a "computed column", which extracts
// and deserializes the attribute from the ":attrs" map.
bool view_hash_key_real_column =
schema->get_column_definition(to_bytes(view_hash_key));
add_column(view_builder, view_hash_key, *attribute_definitions, column_kind::partition_key, !view_hash_key_real_column);
unused_attribute_definitions.erase(view_hash_key);
if (!view_range_key.empty()) {
bool view_range_key_real_column =
schema->get_column_definition(to_bytes(view_range_key));
add_column(view_builder, view_range_key, *attribute_definitions, column_kind::clustering_key, !view_range_key_real_column);
if (!schema->get_column_definition(to_bytes(view_range_key)) &&
!schema->get_column_definition(to_bytes(view_hash_key))) {
// FIXME: This warning should go away. See issue #6714
elogger.warn("Only 1 regular column from the base table should be used in the GSI key in order to ensure correct liveness management without assumptions");
}
unused_attribute_definitions.erase(view_range_key);
}
// Surprisingly, although DynamoDB checks for unused
// AttributeDefinitions in CreateTable, it does not
// check it in UpdateTable. We decided to check anyway.
if (!unused_attribute_definitions.empty()) {
co_return api_error::validation(fmt::format(
"AttributeDefinitions defines spurious attributes not used by any KeySchema: {}",
unused_attribute_definitions));
}
// Base key columns which aren't part of the index's key need to
// be added to the view nonetheless, as (additional) clustering
// key(s).
for (auto& def : schema->primary_key_columns()) {
if (def.name_as_text() != view_hash_key && def.name_as_text() != view_range_key) {
view_builder.with_column(def.name(), def.type, column_kind::clustering_key);
}
}
// GSIs have no tags:
view_builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>());
// Note below we don't need to add virtual columns, as all
// base columns were copied to view. TODO: reconsider the need
// for virtual columns when we support Projection.
for (const column_definition& regular_cdef : schema->regular_columns()) {
if (!view_builder.has_column(*cql3::to_identifier(regular_cdef))) {
view_builder.with_column(regular_cdef.name(), regular_cdef.type, column_kind::regular_column);
}
}
const bool include_all_columns = true;
view_builder.with_view_info(*schema, include_all_columns, ""/*where clause*/);
new_views.emplace_back(view_builder.build());
} else if (op == "Delete") {
elogger.trace("Deleting GSI {}", index_name);
if (!p.local().data_dictionary().has_schema(keyspace_name, vname)) {
co_return api_error::resource_not_found(fmt::format("No GSI {} in table {}", index_name, table_name));
}
dropped_views.emplace_back(vname);
} else if (op == "Update") {
co_return api_error::validation("GlobalSecondaryIndexUpdates Update not yet supported");
} else {
co_return api_error::validation(fmt::format("GlobalSecondaryIndexUpdates supports a Create, Delete or Update operation, saw '{}'", op));
}
}
}
if (empty_request) {
co_return api_error::validation("UpdateTable requires one of GlobalSecondaryIndexUpdates, StreamSpecification or BillingMode to be specified");
}
co_await verify_permission(enforce_authorization, client_state_other_shard.get(), schema, auth::permission::ALTER);
auto m = co_await service::prepare_column_family_update_announcement(p.local(), schema, std::vector<view_ptr>(), group0_guard.write_timestamp());
for (view_ptr view : new_views) {
auto m2 = co_await service::prepare_new_view_announcement(p.local(), view, group0_guard.write_timestamp());
std::move(m2.begin(), m2.end(), std::back_inserter(m));
}
for (const std::string& view_name : dropped_views) {
auto m2 = co_await service::prepare_view_drop_announcement(p.local(), schema->ks_name(), view_name, group0_guard.write_timestamp());
std::move(m2.begin(), m2.end(), std::back_inserter(m));
}
// If a role is allowed to create a GSI, we should give it permissions
// to read the GSI it just created. This is known as "auto-grant".
// Also, when we delete a GSI we should revoke any permissions set on
// it - so if it's ever created again the old permissions wouldn't be
// remembered for the new GSI. This is known as "auto-revoke".
if (client_state_other_shard.get().user() && (!new_views.empty() || !dropped_views.empty())) {
service::group0_batch mc(std::move(group0_guard));
mc.add_mutations(std::move(m));
for (view_ptr view : new_views) {
auto resource = auth::make_data_resource(view->ks_name(), view->cf_name());
co_await auth::grant_applicable_permissions(
*client_state_other_shard.get().get_auth_service(), *client_state_other_shard.get().user(), resource, mc);
}
for (const auto& view_name : dropped_views) {
auto resource = auth::make_data_resource(schema->ks_name(), view_name);
co_await auth::revoke_all(*client_state_other_shard.get().get_auth_service(), resource, mc);
}
std::tie(m, group0_guard) = co_await std::move(mc).extract();
}
co_await mm.announce(std::move(m), std::move(group0_guard), format("alternator-executor: update {} table", tab->cf_name()));
@@ -1546,7 +1928,7 @@ public:
struct delete_item {};
struct put_item {};
put_or_delete_item(const rjson::value& key, schema_ptr schema, delete_item);
put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item);
put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item, std::unordered_map<bytes, std::string> key_attributes);
// put_or_delete_item doesn't keep a reference to schema (so it can be
// moved between shards for LWT) so it needs to be given again to build():
mutation build(schema_ptr schema, api::timestamp_type ts) const;
@@ -1578,7 +1960,75 @@ static inline const column_definition* find_attribute(const schema& schema, cons
return cdef;
}
put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item)
// Get a list of all attributes that serve as key attributes for any of the
// GSIs or LSIs of this table, and the declared type for each (can be only
// "S", "B", or "N"). The implementation below will also list the base table's
// key columns (they are the views' clustering keys).
std::unordered_map<bytes, std::string> si_key_attributes(data_dictionary::table t) {
std::unordered_map<bytes, std::string> ret;
for (const view_ptr& v : t.views()) {
for (const column_definition& cdef : v->partition_key_columns()) {
ret[cdef.name()] = type_to_string(cdef.type);
}
for (const column_definition& cdef : v->clustering_key_columns()) {
ret[cdef.name()] = type_to_string(cdef.type);
}
}
return ret;
}
// When an attribute is a key (hash or sort) of one of the GSIs on a table,
// DynamoDB refuses an update to that attribute with an unsuitable value.
// Unsuitable values are:
// 1. An empty string (those are normally allowed as values, but not allowed
// as keys, including GSI keys).
// 2. A value with a type different than that declared for the GSI key.
// Normally non-key attributes can take values of any type (DynamoDB is
// schema-less), but as soon as an attribute is used as a GSI key, it
// must be set only to the specific type declared for that key.
// (Note that a missing value for a GSI key attribute is fine - the update
// will happen on the base table, but won't reach the view table. In this
// case, this function simply won't be called for this attribute.)
//
// This function checks if the given attribute update is an update to some
// GSI's key, and if the value is unsuitable, an api_error::validation is
// thrown. The checking here is similar to the checking done in
// get_key_from_typed_value() for the base table's key columns.
//
// validate_value_if_gsi_key() should only be called after validate_value()
// already validated that the value itself has a valid form.
static inline void validate_value_if_gsi_key(
std::unordered_map<bytes, std::string> key_attributes,
const bytes& attribute,
const rjson::value& value) {
if (key_attributes.empty()) {
return;
}
auto it = key_attributes.find(attribute);
if (it == key_attributes.end()) {
// Given attribute is not a key column with a fixed type, so no
// more validation to do.
return;
}
const std::string& expected_type = it->second;
// We assume that validate_value() was previously called on this value,
// so value is known to be of the proper format (an object with one
// member, whose key and value are strings)
std::string_view value_type = rjson::to_string_view(value.MemberBegin()->name);
if (expected_type != value_type) {
throw api_error::validation(fmt::format(
"Type mismatch: expected type {} for GSI key attribute {}, got type {}",
expected_type, to_string_view(attribute), value_type));
}
std::string_view value_content = rjson::to_string_view(value.MemberBegin()->value);
if (value_content.empty()) {
throw api_error::validation(fmt::format(
"GSI key attribute {} cannot be set to an empty string", to_string_view(attribute)));
}
}
put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item, std::unordered_map<bytes, std::string> key_attributes)
: _pk(pk_from_json(item, schema)), _ck(ck_from_json(item, schema)) {
_cells = std::vector<cell>();
_cells->reserve(item.MemberCount());
@@ -1588,6 +2038,9 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
const column_definition* cdef = find_attribute(*schema, column_name);
_length_in_bytes += column_name.size();
if (!cdef) {
// This attribute may be a key column of one of the GSI, in which
// case there are some limitations on the value
validate_value_if_gsi_key(key_attributes, column_name, it->value);
bytes value = serialize_item(it->value);
if (value.size()) {
// ScyllaDB uses one extra byte compared to DynamoDB for the bytes length
@@ -1595,7 +2048,7 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
}
_cells->push_back({std::move(column_name), serialize_item(it->value)});
} else if (!cdef->is_primary_key()) {
// Fixed-type regular column can be used for GSI key
// Fixed-type regular column can be used for LSI key
bytes value = get_key_from_typed_value(it->value, *cdef);
_cells->push_back({std::move(column_name),
value});
@@ -1954,7 +2407,8 @@ public:
parsed::condition_expression _condition_expression;
put_item_operation(service::storage_proxy& proxy, rjson::value&& request)
: rmw_operation(proxy, std::move(request))
, _mutation_builder(rjson::get(_request, "Item"), schema(), put_or_delete_item::put_item{}) {
, _mutation_builder(rjson::get(_request, "Item"), schema(), put_or_delete_item::put_item{},
si_key_attributes(proxy.data_dictionary().find_table(schema()->ks_name(), schema()->cf_name()))) {
_pk = _mutation_builder.pk();
_ck = _mutation_builder.ck();
if (_returnvalues != returnvalues::NONE && _returnvalues != returnvalues::ALL_OLD) {
@@ -2315,7 +2769,8 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
const rjson::value& put_request = r->value;
const rjson::value& item = put_request["Item"];
mutation_builders.emplace_back(schema, put_or_delete_item(
item, schema, put_or_delete_item::put_item{}));
item, schema, put_or_delete_item::put_item{},
si_key_attributes(_proxy.data_dictionary().find_table(schema->ks_name(), schema->cf_name()))));
auto mut_key = std::make_pair(mutation_builders.back().second.pk(), mutation_builders.back().second.ck());
if (used_keys.contains(mut_key)) {
co_return api_error::validation("Provided list of item keys contains duplicates");
@@ -2751,14 +3206,17 @@ future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schem
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get) {
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units) {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
std::vector<rjson::value> ret;
for (auto& result_row : result_set->rows()) {
rjson::value item = rjson::empty_object();
describe_single_item(*selection, result_row, *attrs_to_get, item);
rcu_consumed_capacity_counter consumed_capacity;
describe_single_item(*selection, result_row, *attrs_to_get, item, &consumed_capacity._total_bytes);
rcu_half_units += consumed_capacity.get_half_units();
ret.push_back(std::move(item));
co_await coroutine::maybe_yield();
}
@@ -2859,6 +3317,10 @@ public:
// them by top-level attribute, and detects forbidden overlaps/conflicts.
attribute_path_map<parsed::update_expression::action> _update_expression;
// Saved list of GSI keys in the table being updated, used for
// validate_value_if_gsi_key()
std::unordered_map<bytes, std::string> _key_attributes;
parsed::condition_expression _condition_expression;
update_item_operation(service::storage_proxy& proxy, rjson::value&& request);
@@ -2950,6 +3412,9 @@ update_item_operation::update_item_operation(service::storage_proxy& proxy, rjso
if (expression_attribute_values) {
_consumed_capacity._total_bytes += estimate_value_size(*expression_attribute_values);
}
_key_attributes = si_key_attributes(proxy.data_dictionary().find_table(
_schema->ks_name(), _schema->cf_name()));
}
// These are the cases where update_item_operation::apply() needs to use
@@ -3247,6 +3712,9 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
bytes column_value = get_key_from_typed_value(json_value, *cdef);
row.cells().apply(*cdef, atomic_cell::make_live(*cdef->type, ts, column_value));
} else {
// This attribute may be a key column of one of the GSIs, in which
// case there are some limitations on the value.
validate_value_if_gsi_key(_key_attributes, column_name, json_value);
attrs_collector.put(std::move(column_name), serialize_item(json_value), ts);
}
};
@@ -3649,6 +4117,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
// listing all the requests aimed at a single table. For efficiency, inside
// each table_requests we further group together all reads going to the
// same partition, so we can later send them together.
bool should_add_rcu = rcu_consumed_capacity_counter::should_add_capacity(request);
struct table_requests {
schema_ptr schema;
db::consistency_level cl;
@@ -3675,6 +4144,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
}
};
std::vector<table_requests> requests;
std::vector<std::vector<uint64_t>> responses_sizes;
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
table_requests rs(get_table_from_batch_request(_proxy, it));
@@ -3701,7 +4171,11 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
// If we got here, all "requests" are valid, so let's start the
// requests for the different partitions all in parallel.
std::vector<future<std::vector<rjson::value>>> response_futures;
responses_sizes.resize(requests.size());
size_t responses_sizes_pos = 0;
for (const auto& rs : requests) {
responses_sizes[responses_sizes_pos].resize(rs.requests.size());
size_t pos = 0;
for (const auto &r : rs.requests) {
auto& pk = r.first;
auto& cks = r.second;
@@ -3724,12 +4198,14 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
command->allow_limit = db::allow_per_partition_rate_limit::yes;
future<std::vector<rjson::value>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get] (service::storage_proxy::coordinator_query_result qr) mutable {
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, &response_size = responses_sizes[responses_sizes_pos][pos]] (service::storage_proxy::coordinator_query_result qr) mutable {
utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get));
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), response_size);
});
pos++;
response_futures.push_back(std::move(f));
}
responses_sizes_pos++;
}
// Wait for all requests to complete, and then return the response.
@@ -3741,10 +4217,14 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::value response = rjson::empty_object();
rjson::add(response, "Responses", rjson::empty_object());
rjson::add(response, "UnprocessedKeys", rjson::empty_object());
size_t rcu_half_units;
auto fut_it = response_futures.begin();
responses_sizes_pos = 0;
rjson::value consumed_capacity = rjson::empty_array();
for (const auto& rs : requests) {
auto table = table_name(*rs.schema);
std::string table = table_name(*rs.schema);
size_t pos = 0;
rcu_half_units = 0;
for (const auto &r : rs.requests) {
auto& pk = r.first;
auto& cks = r.second;
@@ -3759,6 +4239,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
for (rjson::value& json : results) {
rjson::push_back(response["Responses"][table], std::move(json));
}
rcu_half_units += rcu_consumed_capacity_counter::get_half_units(responses_sizes[responses_sizes_pos][pos], rs.cl == db::consistency_level::LOCAL_QUORUM);
} catch(...) {
eptr = std::current_exception();
// This read of potentially several rows in one partition,
@@ -3782,7 +4263,20 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::push_back(response["UnprocessedKeys"][table]["Keys"], std::move(*ck.second));
}
}
pos++;
}
_stats.rcu_total += rcu_half_units;
if (should_add_rcu) {
rjson::value entry = rjson::empty_object();
rjson::add(entry, "TableName", table);
rjson::add(entry, "CapacityUnits", rcu_half_units*0.5);
rjson::push_back(consumed_capacity, std::move(entry));
}
responses_sizes_pos++;
}
if (should_add_rcu) {
rjson::add(response, "ConsumedCapacity", std::move(consumed_capacity));
}
elogger.trace("Unprocessed keys: {}", response["UnprocessedKeys"]);
if (!some_succeeded && eptr) {

View File

@@ -241,7 +241,8 @@ public:
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,

View File

@@ -0,0 +1,73 @@
/*
* Copyright 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <string>
#include <string_view>
#include "utils/rjson.hh"
#include "serialization.hh"
#include "column_computation.hh"
#include "db/view/regular_column_transformation.hh"
namespace alternator {
// An implementation of a "column_computation" which extracts a specific
// non-key attribute from the big map (":attrs") of all non-key attributes,
// and deserializes it if it has the desired type. GSI will use this computed
// column as a materialized-view key when the view key attribute isn't a
// full-fledged CQL column but rather stored in ":attrs".
class extract_from_attrs_column_computation : public regular_column_transformation {
// The name of the CQL column holding the attribute map. It is a
// constant defined in executor.cc (as ":attrs"), so doesn't need
// to be specified when constructing the column computation.
static const bytes MAP_NAME;
// The top-level attribute name to extract from the ":attrs" map.
bytes _attr_name;
// The type we expect for the value stored in the attribute. If the type
// matches the expected type, it is decoded from the serialized format
// (which we store in the map's values) into the raw CQL type value that we use
// for keys, and returned by compute_value(). Only the types "S" (string),
// "B" (bytes) and "N" (number) are allowed as keys in DynamoDB, and
// therefore in desired_type.
alternator_type _desired_type;
public:
virtual column_computation_ptr clone() const override;
// TYPE_NAME is a unique string that distinguishes this class from other
// column_computation subclasses. column_computation::deserialize() will
// construct an object of this subclass if it sees a "type" TYPE_NAME.
static inline const std::string TYPE_NAME = "alternator_extract_from_attrs";
// Serialize the *definition* of this column computation into a JSON
// string with a unique "type" string - TYPE_NAME - which then causes
// column_computation::deserialize() to create an object from this class.
virtual bytes serialize() const override;
// Construct this object based on the previous output of serialize().
// Calls on_internal_error() if the string doesn't match the output format
// of serialize(). "type" is not checked because column_computation::deserialize()
// won't call this constructor if "type" doesn't match.
extract_from_attrs_column_computation(const rjson::value &v);
extract_from_attrs_column_computation(bytes_view attr_name, alternator_type desired_type)
: _attr_name(attr_name), _desired_type(desired_type)
{}
// Implement regular_column_transformation's compute_value() that
// accepts the full row:
result compute_value(const schema& schema, const partition_key& key,
const db::view::clustering_or_static_row& row) const override;
// But do not implement column_computation's compute_value() that
// accepts only a partition key - that's not enough so our implementation
// of this function does on_internal_error().
bytes compute_value(const schema& schema, const partition_key& key) const override;
// This computed column does depend on a non-primary key column, so
// its result may change in the update and we need to compute it
// before and after the update.
virtual bool depends_on_non_primary_key_column() const override {
return true;
}
};
} // namespace alternator

View File

@@ -245,6 +245,27 @@ rjson::value deserialize_item(bytes_view bv) {
return deserialized;
}
// This function takes a bytes_view created earlier by serialize_item(), and
// if it has the type "expected_type", the function returns the value as a
// raw Scylla type. If the type doesn't match, returns an unset optional.
// This function only supports the key types S (string), B (bytes) and N
// (number) - serialize_item() serializes those types as a single-byte type
// followed by the serialized raw Scylla type, so all this function needs to
// do is to remove the first byte. This makes this function much more
// efficient than deserialize_item() above because it avoids transformation
// to/from JSON.
std::optional<bytes> serialized_value_if_type(bytes_view bv, alternator_type expected_type) {
if (bv.empty() || alternator_type(bv[0]) != expected_type) {
return std::nullopt;
}
// Currently, serialize_item() for the types in alternator_type (notably S, B
// and N) produces nothing more than Scylla's raw format for these types
// preceded by a type byte. So we just need to skip that byte and we are
// left with exactly what we need to return.
bv.remove_prefix(1);
return bytes(bv);
}
std::string type_to_string(data_type type) {
static thread_local std::unordered_map<data_type, std::string> types = {
{utf8_type, "S"},

View File

@@ -43,6 +43,7 @@ type_representation represent_type(alternator_type atype);
bytes serialize_item(const rjson::value& item);
rjson::value deserialize_item(bytes_view bv);
std::optional<bytes> serialized_value_if_type(bytes_view bv, alternator_type expected_type);
std::string type_to_string(data_type type);

View File

@@ -808,6 +808,9 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
if (limit > 1000) {
throw api_error::validation("Limit must be less than or equal to 1000");
}
auto db = _proxy.data_dictionary();
schema_ptr schema, base;

View File

@@ -2836,7 +2836,7 @@
"nickname":"repair_tablet",
"method":"POST",
"summary":"Repair a tablet",
"type":"void",
"type":"tablet_repair_result",
"produces":[
"application/json"
],
@@ -2864,6 +2864,30 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"hosts_filter",
"description":"Repair replicas listed in the comma-separated host_id list.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"dcs_filter",
"description":"Repair replicas listed in the comma-separated DC list",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"await_completion",
"description":"Set true to wait for the repair to complete. Set false to skip waiting for the repair to complete. When the option is not provided, it defaults to false.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -3287,6 +3311,15 @@
}
}
}
},
"tablet_repair_result":{
"id":"tablet_repair_result",
"description":"Tablet repair result",
"properties":{
"tablet_task_id":{
"type":"string"
}
}
}
}
}

View File

@@ -253,6 +253,30 @@
]
}
]
},
{
"path":"/task_manager/drain/{module}",
"operations":[
{
"method":"POST",
"summary":"Drain finished local tasks",
"type":"void",
"nickname":"drain_tasks",
"produces":[
"application/json"
],
"parameters":[
{
"name":"module",
"description":"The module to drain",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{

View File

@@ -6,6 +6,8 @@
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "build_mode.hh"
#ifndef SCYLLA_BUILD_MODE_RELEASE
#include <seastar/core/coroutine.hh>

View File

@@ -1543,6 +1543,11 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
}
auto ks = req->get_query_param("ks");
auto table = req->get_query_param("table");
bool await_completion = false;
auto await = req->get_query_param("await_completion");
if (!await.empty()) {
await_completion = validate_bool(await);
}
validate_table(ctx, ks, table);
auto table_id = ctx.db.local().find_column_family(ks, table).schema()->id();
std::variant<utils::chunked_vector<dht::token>, service::storage_service::all_tokens_tag> tokens_variant;
@@ -1551,8 +1556,22 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
} else {
tokens_variant = tokens;
}
auto hosts = req->get_query_param("hosts_filter");
auto dcs = req->get_query_param("dcs_filter");
auto res = co_await ss.local().add_repair_tablet_request(table_id, tokens_variant);
std::unordered_set<locator::host_id> hosts_filter;
if (!hosts.empty()) {
std::string delim = ",";
hosts_filter = std::ranges::views::split(hosts, delim) | std::views::transform([](auto&& h) {
try {
return locator::host_id(utils::UUID(std::string_view{h}));
} catch (...) {
throw httpd::bad_param_exception(fmt::format("Wrong host_id format {}", h));
}
}) | std::ranges::to<std::unordered_set>();
}
auto dcs_filter = locator::tablet_task_info::deserialize_repair_dcs_filter(dcs);
auto res = co_await ss.local().add_repair_tablet_request(table_id, tokens_variant, hosts_filter, dcs_filter, await_completion);
co_return json::json_return_type(res);
});

View File

@@ -232,6 +232,32 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
uint32_t user_ttl = cfg.user_task_ttl_seconds();
co_return json::json_return_type(user_ttl);
});
tm::drain_tasks.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await tm.invoke_on_all([&req] (tasks::task_manager& tm) -> future<> {
tasks::task_manager::module_ptr module;
try {
module = tm.find_module(req->get_path_param("module"));
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
const auto& local_tasks = module->get_local_tasks();
std::vector<tasks::task_id> ids;
ids.reserve(local_tasks.size());
std::transform(begin(local_tasks), end(local_tasks), std::back_inserter(ids), [] (const auto& task) {
return task.second->is_complete() ? task.first : tasks::task_id::create_null_id();
});
for (auto&& id : ids) {
if (id) {
module->unregister_task(id);
}
co_await maybe_yield();
}
});
co_return json_void();
});
}
void unset_task_manager(http_context& ctx, routes& r) {
@@ -243,6 +269,7 @@ void unset_task_manager(http_context& ctx, routes& r) {
tm::get_task_status_recursively.unset(r);
tm::get_and_update_ttl.unset(r);
tm::get_ttl.unset(r);
tm::drain_tasks.unset(r);
}
}

View File

@@ -6,6 +6,9 @@
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "build_mode.hh"
#ifndef SCYLLA_BUILD_MODE_RELEASE
#include <seastar/core/coroutine.hh>

View File

@@ -33,20 +33,6 @@ namespace audit {
namespace {
future<> syslog_send_helper(net::datagram_channel& sender,
const socket_address& address,
const sstring& msg) {
return sender.send(address, net::packet{msg.data(), msg.size()}).handle_exception([address](auto&& exception_ptr) {
auto error_msg = seastar::format(
"Syslog audit backend failed (sending a message to {} resulted in {}).",
address,
exception_ptr
);
logger.error("{}", error_msg);
throw audit_exception(std::move(error_msg));
});
}
static auto syslog_address_helper(const db::config& cfg)
{
return cfg.audit_unix_socket_path.is_set()
@@ -56,9 +42,26 @@ static auto syslog_address_helper(const db::config& cfg)
}
future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {
try {
auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));
co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});
}
catch (const std::exception& e) {
auto error_msg = seastar::format(
"Syslog audit backend failed (sending a message to {} resulted in {}).",
_syslog_address,
e
);
logger.error("{}", error_msg);
throw audit_exception(std::move(error_msg));
}
}
audit_syslog_storage_helper::audit_syslog_storage_helper(cql3::query_processor& qp, service::migration_manager&) :
_syslog_address(syslog_address_helper(qp.db().get_config())),
_sender(make_unbound_datagram_channel(AF_UNIX)) {
_sender(make_unbound_datagram_channel(AF_UNIX)),
_semaphore(1) {
}
audit_syslog_storage_helper::~audit_syslog_storage_helper() {
@@ -73,10 +76,10 @@ audit_syslog_storage_helper::~audit_syslog_storage_helper() {
*/
future<> audit_syslog_storage_helper::start(const db::config& cfg) {
if (this_shard_id() != 0) {
return make_ready_future();
co_return;
}
return syslog_send_helper(_sender, _syslog_address, "Initializing syslog audit backend.");
co_await syslog_send_helper("Initializing syslog audit backend.");
}
future<> audit_syslog_storage_helper::stop() {
@@ -106,7 +109,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
audit_info->table(),
username);
return syslog_send_helper(_sender, _syslog_address, msg);
co_await syslog_send_helper(msg);
}
future<> audit_syslog_storage_helper::write_login(const sstring& username,
@@ -125,7 +128,7 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,
username,
(error ? "true" : "false"));
co_await syslog_send_helper(_sender, _syslog_address, msg.c_str());
co_await syslog_send_helper(msg.c_str());
}
using registry = class_registrator<storage_helper, audit_syslog_storage_helper, cql3::query_processor&, service::migration_manager&>;

View File

@@ -24,6 +24,9 @@ namespace audit {
class audit_syslog_storage_helper : public storage_helper {
socket_address _syslog_address;
net::datagram_channel _sender;
seastar::semaphore _semaphore;
future<> syslog_send_helper(const sstring& msg);
public:
explicit audit_syslog_storage_helper(cql3::query_processor&, service::migration_manager&);
virtual ~audit_syslog_storage_helper();

View File

@@ -123,6 +123,9 @@ class cache_mutation_reader final : public mutation_reader::impl {
gc_clock::time_point _read_time;
gc_clock::time_point _gc_before;
api::timestamp_type _max_purgeable_timestamp = api::missing_timestamp;
api::timestamp_type _max_purgeable_timestamp_shadowable = api::missing_timestamp;
future<> do_fill_buffer();
future<> ensure_underlying();
void copy_from_cache_to_buffer();
@@ -207,6 +210,11 @@ class cache_mutation_reader final : public mutation_reader::impl {
return gc_clock::time_point::min();
}
bool can_gc(tombstone t, is_shadowable is) const {
const auto max_purgeable = is ? _max_purgeable_timestamp_shadowable : _max_purgeable_timestamp;
return t.timestamp < max_purgeable;
}
public:
cache_mutation_reader(schema_ptr s,
dht::decorated_key dk,
@@ -228,8 +236,19 @@ public:
, _read_time(get_read_time())
, _gc_before(get_gc_before(*_schema, dk, _read_time))
{
clogger.trace("csm {}: table={}.{}, reversed={}, snap={}", fmt::ptr(this), _schema->ks_name(), _schema->cf_name(), _read_context.is_reversed(),
fmt::ptr(&*_snp));
_max_purgeable_timestamp = ctx.get_max_purgeable(dk, is_shadowable::no);
_max_purgeable_timestamp_shadowable = ctx.get_max_purgeable(dk, is_shadowable::yes);
clogger.trace("csm {}: table={}.{}, dk={}, gc-before={}, max-purgeable-regular={}, max-purgeable-shadowable={}, reversed={}, snap={}",
fmt::ptr(this),
_schema->ks_name(),
_schema->cf_name(),
dk,
_gc_before,
_max_purgeable_timestamp,
_max_purgeable_timestamp_shadowable,
_read_context.is_reversed(),
fmt::ptr(&*_snp));
push_mutation_fragment(*_schema, _permit, partition_start(std::move(dk), _snp->partition_tombstone()));
}
cache_mutation_reader(schema_ptr s,
@@ -787,12 +806,12 @@ void cache_mutation_reader::copy_from_cache_to_buffer() {
t.apply(range_tomb);
auto row_tomb_expired = [&](row_tombstone tomb) {
return (tomb && tomb.max_deletion_time() < _gc_before);
return (tomb && tomb.max_deletion_time() < _gc_before && can_gc(tomb.tomb(), tomb.is_shadowable()));
};
auto is_row_dead = [&](const deletable_row& row) {
auto& m = row.marker();
return (!m.is_missing() && m.is_dead(_read_time) && m.deletion_time() < _gc_before);
return (!m.is_missing() && m.is_dead(_read_time) && m.deletion_time() < _gc_before && can_gc(tombstone(m.timestamp(), m.deletion_time()), is_shadowable::no));
};
if (row_tomb_expired(t) || is_row_dead(row)) {
@@ -800,9 +819,11 @@ void cache_mutation_reader::copy_from_cache_to_buffer() {
_read_context.cache()._tracker.on_row_compacted();
auto mutation_can_gc = can_gc_fn([this] (tombstone t, is_shadowable is) { return can_gc(t, is); });
with_allocator(_snp->region().allocator(), [&] {
deletable_row row_copy(row_schema, row);
row_copy.compact_and_expire(row_schema, t.tomb(), _read_time, always_gc, _gc_before, nullptr);
row_copy.compact_and_expire(row_schema, t.tomb(), _read_time, mutation_can_gc, _gc_before, nullptr);
std::swap(row, row_copy);
});
remove_row = row.empty();

View File

@@ -1112,7 +1112,9 @@ future<bool> generation_service::legacy_do_handle_cdc_generation(cdc::generation
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await retrieve_generation_data(gen_id, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(fmt::format(
// This may happen during raft upgrade when a node gossips about a generation that
// was propagated through raft and we didn't apply it yet.
throw generation_handling_nonfatal_exception(fmt::format(
"Could not find CDC generation {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
gen_id, db_clock::now()));

View File

@@ -186,7 +186,7 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
auto ts = to_ts(tp);
auto emplaced = _gens.emplace(to_ts(tp), std::nullopt).second;
auto [it, emplaced] = _gens.emplace(to_ts(tp), std::nullopt);
if (_last_stream_timestamp != api::missing_timestamp) {
auto last_correct_gen = gen_used_at(_last_stream_timestamp);
@@ -201,5 +201,5 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
}
return emplaced;
return !it->second;
}

View File

@@ -15,6 +15,7 @@
#include "sstables/sstables_manager.hh"
#include <memory>
#include <fmt/ranges.h>
#include <seastar/core/future.hh>
#include <seastar/core/metrics.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/switch_to.hh>
@@ -503,7 +504,7 @@ public:
virtual ~sstables_task_executor() = default;
virtual void release_resources() noexcept override;
virtual future<> release_resources() noexcept override;
virtual future<tasks::task_manager::task::progress> get_progress() const override {
return compaction_task_impl::get_progress(_compaction_data, _progress_monitor);
@@ -788,9 +789,10 @@ compaction::compaction_state::~compaction_state() {
compaction_done.broken();
}
void sstables_task_executor::release_resources() noexcept {
future<> sstables_task_executor::release_resources() noexcept {
_cm._stats.pending_tasks -= _sstables.size() - (_state == state::pending);
_sstables = {};
return make_ready_future();
}
future<compaction_manager::compaction_stats_opt> compaction_task_executor::run_compaction() noexcept {
@@ -1565,10 +1567,10 @@ public:
, _can_purge(can_purge)
{}
virtual void release_resources() noexcept override {
virtual future<> release_resources() noexcept override {
_compacting.release_all();
_owned_ranges_ptr = nullptr;
sstables_task_executor::release_resources();
co_await sstables_task_executor::release_resources();
}
protected:
@@ -1846,11 +1848,12 @@ public:
virtual ~cleanup_sstables_compaction_task_executor() = default;
virtual void release_resources() noexcept override {
virtual future<> release_resources() noexcept override {
_cm._stats.pending_tasks -= _pending_cleanup_jobs.size();
_pending_cleanup_jobs = {};
_compacting.release_all();
_owned_ranges_ptr = nullptr;
return make_ready_future();
}
virtual future<tasks::task_manager::task::progress> get_progress() const override {

View File

@@ -677,7 +677,9 @@ maintenance_socket: ignore
# Guardrail to enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE.
# enable_create_table_with_compact_storage: false
# Enable tablets for new keyspaces.
# Control tablets for new keyspaces.
# Can be set to: disabled|enabled|enforced
#
# When enabled, newly created keyspaces will have tablets enabled by default.
# That can be explicitly disabled in the CREATE KEYSPACE query
# by using the `tablets = {'enabled': false}` replication option.
@@ -686,6 +688,15 @@ maintenance_socket: ignore
# unless tablets are explicitly enabled in the CREATE KEYSPACE query
# by using the `tablets = {'enabled': true}` replication option.
#
# When set to `enforced`, newly created keyspaces will always have tablets enabled by default.
# This prevents explicitly disabling tablets in the CREATE KEYSPACE query
# using the `tablets = {'enabled': false}` replication option.
# It also mandates a replication strategy supporting tablets, like
# NetworkTopologyStrategy
#
# Note that creating keyspaces with tablets enabled or disabled is irreversible.
# The `tablets` option cannot be changed using `ALTER KEYSPACE`.
enable_tablets: true
tablets_mode_for_new_keyspaces: enabled
# Enforce RF-rack-valid keyspaces.
rf_rack_valid_keyspaces: false

View File

@@ -813,6 +813,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/rjson.cc',
'utils/human_readable.cc',
'utils/histogram_metrics_helper.cc',
'utils/io-wrappers.cc',
'utils/on_internal_error.cc',
'utils/pretty_printers.cc',
'utils/stream_compressor.cc',
@@ -1099,7 +1100,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/lister.cc',
'repair/repair.cc',
'repair/row_level.cc',
'repair/table_check.cc',
'streaming/table_check.cc',
'exceptions/exceptions.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
@@ -1564,7 +1565,7 @@ deps['test/boost/linearizing_input_stream_test'] = [
"test/boost/linearizing_input_stream_test.cc",
"test/lib/log.cc",
]
deps['test/boost/expr_test'] = ['test/boost/expr_test.cc', 'test/lib/expr_test_utils.cc'] + scylla_core
deps['test/boost/expr_test'] = ['test/boost/expr_test.cc', 'test/lib/expr_test_utils.cc'] + scylla_core + alternator
deps['test/boost/rate_limiter_test'] = ['test/boost/rate_limiter_test.cc', 'db/rate_limiter.cc']
deps['test/boost/exceptions_optimized_test'] = ['test/boost/exceptions_optimized_test.cc', 'utils/exceptions.cc']
deps['test/boost/exceptions_fallback_test'] = ['test/boost/exceptions_fallback_test.cc', 'utils/exceptions.cc']
@@ -1581,8 +1582,8 @@ deps['test/raft/many_test'] = ['test/raft/many_test.cc', 'test/raft/replication.
deps['test/raft/fsm_test'] = ['test/raft/fsm_test.cc', 'test/raft/helpers.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/etcd_test'] = ['test/raft/etcd_test.cc', 'test/raft/helpers.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/raft_sys_table_storage_test'] = ['test/raft/raft_sys_table_storage_test.cc'] + \
scylla_core + scylla_tests_generic_dependencies
deps['test/boost/address_map_test'] = ['test/boost/address_map_test.cc'] + scylla_core
scylla_core + alternator + scylla_tests_generic_dependencies
deps['test/boost/address_map_test'] = ['test/boost/address_map_test.cc'] + scylla_core + alternator
deps['test/raft/discovery_test'] = ['test/raft/discovery_test.cc',
'test/raft/helpers.cc',
'test/lib/log.cc',

View File

@@ -709,17 +709,23 @@ batchStatement returns [std::unique_ptr<cql3::statements::raw::batch_statement>
: K_BEGIN
( K_UNLOGGED { type = btype::UNLOGGED; } | K_COUNTER { type = btype::COUNTER; } )?
K_BATCH ( usingClause[attrs] )?
( s=batchStatementObjective ';'? { statements.push_back(std::move(s)); } )*
( s=batchStatementObjective ';'?
{
auto&& stmt = *$s.statement;
stmt->add_raw(sstring{$s.text});
statements.push_back(std::move(stmt));
} )*
K_APPLY K_BATCH
{
$expr = std::make_unique<cql3::statements::raw::batch_statement>(type, std::move(attrs), std::move(statements));
}
;
batchStatementObjective returns [std::unique_ptr<cql3::statements::raw::modification_statement> statement]
: i=insertStatement { $statement = std::move(i); }
| u=updateStatement { $statement = std::move(u); }
| d=deleteStatement { $statement = std::move(d); }
batchStatementObjective returns [::lw_shared_ptr<std::unique_ptr<cql3::statements::raw::modification_statement>> statement]
@init { using original_ret_type = std::unique_ptr<cql3::statements::raw::modification_statement>; }
: i=insertStatement { $statement = make_lw_shared<original_ret_type>(std::move(i)); }
| u=updateStatement { $statement = make_lw_shared<original_ret_type>(std::move(u)); }
| d=deleteStatement { $statement = make_lw_shared<original_ret_type>(std::move(d)); }
;
dropAggregateStatement returns [std::unique_ptr<cql3::statements::drop_aggregate_statement> expr]

View File

@@ -13,6 +13,7 @@
#include <seastar/core/on_internal_error.hh>
#include <stdexcept>
#include "alter_keyspace_statement.hh"
#include "locator/tablets.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
@@ -25,6 +26,9 @@
#include "create_keyspace_statement.hh"
#include "gms/feature_service.hh"
#include "replica/database.hh"
#include "db/config.hh"
using namespace std::string_literals;
static logging::logger mylogger("alter_keyspace");
@@ -193,9 +197,9 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
event::schema_change::target_type target_type = event::schema_change::target_type::KEYSPACE;
auto ks = qp.db().find_keyspace(_name);
auto ks_md = ks.metadata();
const auto& tm = *qp.proxy().get_token_metadata_ptr();
const auto tmptr = qp.proxy().get_token_metadata_ptr();
const auto& feat = qp.proxy().features();
auto ks_md_update = _attrs->as_ks_metadata_update(ks_md, tm, feat);
auto ks_md_update = _attrs->as_ks_metadata_update(ks_md, *tmptr, feat);
std::vector<mutation> muts;
std::vector<sstring> warnings;
bool include_tablet_options = _attrs->get_map(_attrs->KW_TABLETS).has_value();
@@ -206,6 +210,25 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
auto ts = mc.write_timestamp();
auto global_request_id = mc.new_group0_state_id();
// #22688 - filter out any dc*:0 entries - consider these
// null and void (removed). Migration planning will treat it
// as dc*=0 still.
std::erase_if(ks_options, [](const auto& i) {
static const std::string replication_prefix = ks_prop_defs::KW_REPLICATION + ":"s;
// Flattened map; replication entries start with "replication:".
// The only valid options are replication_factor, class, and per-DC RFs. We want to
// filter out any dcN=0 entries.
auto& [key, val] = i;
if (key.starts_with(replication_prefix) && val == "0") {
std::string_view v(key);
v.remove_prefix(replication_prefix.size());
return v != ks_prop_defs::REPLICATION_FACTOR_KEY
&& v != ks_prop_defs::REPLICATION_STRATEGY_CLASS_KEY
;
}
return false;
});
// we only want to run the tablets path if there are actually any tablets changes, not only schema changes
// TODO: the current `if (changes_tablets(qp))` is insufficient: someone may set the same RFs as before,
// and we'll unnecessarily trigger the processing path for ALTER tablets KS,
@@ -246,6 +269,36 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
muts.insert(muts.begin(), schema_mutations.begin(), schema_mutations.end());
}
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to perform a schema change that
// would lead to an RF-rack-invalid keyspace. Verify that this change does not.
// For more context, see: scylladb/scylladb#23071.
if (qp.db().get_config().rf_rack_valid_keyspaces()) {
auto rs = locator::abstract_replication_strategy::create_replication_strategy(
ks_md_update->strategy_name(),
locator::replication_strategy_params(ks_md_update->strategy_options(), ks_md_update->initial_tablets()));
try {
// There are two things to note here:
// 1. We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
// 2. The replication strategy we use here does NOT represent the actual state
// we will arrive at after applying the schema change. For instance, if the user
// did not specify the RF for some of the DCs, it's equal to 0 in the replication
// strategy we pass to this function, while in reality that means that the RF
// will NOT change. That is not a problem:
// - RF=0 is valid for all DCs, so it won't trigger an exception on its own,
// - the keyspace must've been RF-rack-valid before this change. We check that
// condition for all keyspaces at startup.
// The second point is not strictly true because currently topological changes can
// disturb it (see scylladb/scylladb#23345), but we ignore that.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
}
}
auto ret = ::make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,
target_type,

View File

@@ -87,6 +87,9 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
"Secondary indexes are not supported on COMPACT STORAGE tables that have clustering columns");
}
if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
throw exceptions::invalid_request_exception(format("Secondary indexes are not supported on base tables with tablets (keyspace '{}')", keyspace()));
}
validate_for_local_index(*schema);
std::vector<::shared_ptr<index_target>> targets;

View File

@@ -11,6 +11,8 @@
#include <seastar/core/coroutine.hh>
#include "cql3/statements/create_keyspace_statement.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "exceptions/exceptions.hh"
#include "locator/tablets.hh"
#include "prepared_statement.hh"
#include "data_dictionary/data_dictionary.hh"
#include "data_dictionary/keyspace_metadata.hh"
@@ -90,14 +92,14 @@ void create_keyspace_statement::validate(query_processor& qp, const service::cli
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>> create_keyspace_statement::prepare_schema_mutations(query_processor& qp, const query_options&, api::timestamp_type ts) const {
using namespace cql_transport;
const auto& tm = *qp.proxy().get_token_metadata_ptr();
const auto tmptr = qp.proxy().get_token_metadata_ptr();
const auto& feat = qp.proxy().features();
const auto& cfg = qp.db().get_config();
std::vector<mutation> m;
std::vector<sstring> warnings;
try {
auto ksm = _attrs->as_ks_metadata(_name, tm, feat, cfg);
auto ksm = _attrs->as_ks_metadata(_name, *tmptr, feat, cfg);
m = service::prepare_new_keyspace_announcement(qp.db().real_database(), ksm, ts);
// If the new keyspace uses tablets, as long as there are features
// which aren't supported by tablets we want to warn the user that
@@ -116,6 +118,21 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector
"without tablets by adding AND TABLETS = {'enabled': false} "
"to the CREATE KEYSPACE statement.");
}
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to create an RF-rack-invalid keyspace.
// Verify that it's RF-rack-valid.
// For more context, see: scylladb/scylladb#23071.
if (cfg.rf_rack_valid_keyspaces()) {
try {
// We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
}
}
} catch (const exceptions::already_exists_exception& e) {
if (!_if_not_exists) {
co_return coroutine::exception(std::current_exception());
@@ -217,9 +234,6 @@ std::vector<sstring> check_against_restricted_replication_strategies(
// We ignore errors (non-number, negative number, etc.) here,
// these are checked and reported elsewhere.
for (auto opt : attrs.get_replication_options()) {
if (opt.first == sstring("initial_tablets")) {
continue;
}
try {
auto rf = std::stol(opt.second);
if (rf > 0) {

View File

@@ -140,6 +140,9 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
schema_ptr schema = validation::validate_column_family(db, _base_name.get_keyspace(), _base_name.get_column_family());
if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
throw exceptions::invalid_request_exception(format("Materialized views are not supported on base tables with tablets"));
}
if (schema->is_counter()) {
throw exceptions::invalid_request_exception(format("Materialized views are not supported on counter tables"));
}

View File

@@ -70,6 +70,16 @@ static std::map<sstring, sstring> prepare_options(
}
}
// #22688 / #20039 - check for illegal, empty options (after above expand)
// moved to here. We want to be able to remove DCs once rf=0,
// in which case, the options actually serialized in result mutations
// will in extreme cases in fact be empty -> cannot do this check in
// verify_options. We only want to apply this constraint on the input
// provided by the user
if (options.empty() && !tm.get_topology().get_datacenters().empty()) {
throw exceptions::configuration_exception("Configuration for at least one datacenter must be present");
}
return options;
}
@@ -140,7 +150,7 @@ data_dictionary::storage_options ks_prop_defs::get_storage_options() const {
return opts;
}
std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned> default_value) const {
std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned> default_value, bool enforce_tablets) const {
auto tablets_options = get_map(KW_TABLETS);
if (!tablets_options) {
return default_value;
@@ -155,6 +165,9 @@ std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned
if (enabled == "true") {
// nothing
} else if (enabled == "false") {
if (enforce_tablets) {
throw exceptions::configuration_exception("Cannot disable tablets for keyspace since tablets are enforced using the `tablets_mode_for_new_keyspaces: enforced` config option.");
}
return std::nullopt;
} else {
throw exceptions::configuration_exception(sstring("Tablets enabled value must be true or false; found: ") + enabled);
@@ -189,8 +202,10 @@ bool ks_prop_defs::get_durable_writes() const {
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(sstring ks_name, const locator::token_metadata& tm, const gms::feature_service& feat, const db::config& cfg) {
auto sc = get_replication_strategy_class().value();
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0 for N.T.S. only
auto enable_tablets = feat.tablets && cfg.enable_tablets();
auto initial_tablets = get_initial_tablets(enable_tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy" ? std::optional<unsigned>(0) : std::nullopt);
auto enable_tablets = feat.tablets && cfg.enable_tablets_by_default();
std::optional<unsigned> default_initial_tablets = enable_tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy"
? std::optional<unsigned>(0) : std::nullopt;
auto initial_tablets = get_initial_tablets(default_initial_tablets, cfg.enforce_tablets());
auto options = prepare_options(sc, tm, get_replication_options());
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());

View File
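The `get_initial_tablets()` change above adds an `enforce_tablets` flag: a missing tablets map yields the caller-provided default, `'enabled': 'false'` disables tablets unless they are enforced, and anything else is rejected. A sketch of that decision logic, with the `"initial"` count key an assumption for illustration:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Sketch of the get_initial_tablets() logic: nullopt means "no tablets",
// 0 means "tablets enabled, count chosen automatically".
std::optional<unsigned> initial_tablets(
        const std::optional<std::map<std::string, std::string>>& tablets_options,
        std::optional<unsigned> default_value,
        bool enforce_tablets) {
    if (!tablets_options) {
        return default_value;
    }
    unsigned count = 0;  // enabled with no explicit count => automatic (0)
    for (const auto& [key, val] : *tablets_options) {
        if (key == "enabled") {
            if (val == "false") {
                if (enforce_tablets) {
                    throw std::invalid_argument(
                        "Cannot disable tablets: tablets are enforced");
                }
                return std::nullopt;
            } else if (val != "true") {
                throw std::invalid_argument(
                    "Tablets enabled value must be true or false; found: " + val);
            }
        } else if (key == "initial") {  // assumed key name, for illustration
            count = static_cast<unsigned>(std::stoul(val));
        }
    }
    return count;
}
```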

@@ -60,7 +60,7 @@ public:
void validate();
std::map<sstring, sstring> get_replication_options() const;
std::optional<sstring> get_replication_strategy_class() const;
std::optional<unsigned> get_initial_tablets(std::optional<unsigned> default_value) const;
std::optional<unsigned> get_initial_tablets(std::optional<unsigned> default_value, bool enforce_tablets = false) const;
data_dictionary::storage_options get_storage_options() const;
bool get_durable_writes() const;
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata(sstring ks_name, const locator::token_metadata&, const gms::feature_service&, const db::config&);

View File

@@ -238,6 +238,13 @@ const config_type& config_type_for<enum_option<db::tri_mode_restriction_t>>() {
return ct;
}
template <>
const config_type& config_type_for<enum_option<db::tablets_mode_t>>() {
static config_type ct(
"tablets mode", printable_to_json<enum_option<db::tablets_mode_t>>);
return ct;
}
template <>
const config_type& config_type_for<db::config::hinted_handoff_enabled_type>() {
static config_type ct("hinted handoff enabled", hinted_handoff_enabled_to_json);
@@ -372,6 +379,23 @@ public:
}
};
template <>
class convert<enum_option<db::tablets_mode_t>> {
public:
static bool decode(const Node& node, enum_option<db::tablets_mode_t>& rhs) {
std::string name;
if (!convert<std::string>::decode(node, name)) {
return false;
}
try {
std::istringstream(name) >> rhs;
} catch (boost::program_options::invalid_option_value&) {
return false;
}
return true;
}
};
template<>
struct convert<db::config::error_injection_at_startup> {
static bool decode(const Node& node, db::config::error_injection_at_startup& rhs) {
@@ -536,6 +560,9 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"The directory where the schema commit log is stored. This is a special commitlog instance used for schema and system tables. For optimal write performance, it is recommended the commit log be on a separate disk partition (ideally, a separate physical device) from the data file directories.")
, data_file_directories(this, "data_file_directories", "datadir", value_status::Used, { },
"The directory location where table data (SSTables) is stored.")
, data_file_capacity(this, "data_file_capacity", liveness::LiveUpdate, value_status::Used, 0,
"Total capacity in bytes for storing data files. Used by tablet load balancer to compute storage utilization."
" If not set, will use file system's capacity.")
, hints_directory(this, "hints_directory", value_status::Used, "",
"The directory where hints files are stored if hinted handoff is enabled.")
, view_hints_directory(this, "view_hints_directory", value_status::Used, "",
@@ -1201,7 +1228,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Start serializing reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, reader_concurrency_semaphore_kill_limit_multiplier(this, "reader_concurrency_semaphore_kill_limit_multiplier", liveness::LiveUpdate, value_status::Used, 4,
"Start killing reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, reader_concurrency_semaphore_cpu_concurrency(this, "reader_concurrency_semaphore_cpu_concurrency", liveness::LiveUpdate, value_status::Used, 1,
, reader_concurrency_semaphore_cpu_concurrency(this, "reader_concurrency_semaphore_cpu_concurrency", liveness::LiveUpdate, value_status::Used, 2,
"Admit new reads while there are less than this number of requests that need CPU.")
, view_update_reader_concurrency_semaphore_serialize_limit_multiplier(this, "view_update_reader_concurrency_semaphore_serialize_limit_multiplier", liveness::LiveUpdate, value_status::Used, 2,
"Start serializing view update reads after their collective memory consumption goes above $normal_limit * $multiplier.")
@@ -1354,7 +1381,11 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, error_injections_at_startup(this, "error_injections_at_startup", error_injection_value_status, {}, "List of error injections that should be enabled on startup.")
, topology_barrier_stall_detector_threshold_seconds(this, "topology_barrier_stall_detector_threshold_seconds", value_status::Used, 2, "Report sites blocking topology barrier if it takes longer than this.")
, enable_tablets(this, "enable_tablets", value_status::Used, false, "Enable tablets for newly created keyspaces.")
, enable_tablets(this, "enable_tablets", value_status::Used, false, "Enable tablets for newly created keyspaces. (deprecated)")
, tablets_mode_for_new_keyspaces(this, "tablets_mode_for_new_keyspaces", value_status::Used, tablets_mode_t::mode::unset, "Control tablets for new keyspaces. Can be set to the following values:\n"
"\tdisabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option\n"
"\tenabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option\n"
"\tenforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option")
, view_flow_control_delay_limit_in_ms(this, "view_flow_control_delay_limit_in_ms", liveness::LiveUpdate, value_status::Used, 1000,
"The maximal amount of time that materialized-view update flow control may delay responses "
"to try to slow down the client and prevent buildup of unfinished view updates. "
@@ -1364,6 +1395,9 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, disk_space_monitor_high_polling_interval_in_seconds(this, "disk_space_monitor_high_polling_interval_in_seconds", value_status::Used, 1, "Disk-space polling interval at or above polling threshold")
, disk_space_monitor_polling_interval_threshold(this, "disk_space_monitor_polling_interval_threshold", value_status::Used, 0.9, "Disk-space polling threshold. Polling interval is increased when disk utilization is greater than or equal to this threshold")
, enable_create_table_with_compact_storage(this, "enable_create_table_with_compact_storage", liveness::LiveUpdate, value_status::Used, false, "Enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE. This feature will eventually be removed in a future version.")
, rf_rack_valid_keyspaces(this, "rf_rack_valid_keyspaces", liveness::MustRestart, value_status::Used, false,
"Enforce RF-rack-valid keyspaces. Additionally, if there are existing RF-rack-invalid "
"keyspaces, attempting to start a node with this option ON will fail.")
, default_log_level(this, "default_log_level", value_status::Used, seastar::log_level::info, "Default log level for log messages")
, logger_log_level(this, "logger_log_level", value_status::Used, {}, "Map of logger name to log level. Valid log levels are 'error', 'warn', 'info', 'debug' and 'trace'")
, log_to_stdout(this, "log_to_stdout", value_status::Used, true, "Send log output to stdout")
@@ -1579,6 +1613,16 @@ std::unordered_map<sstring, db::tri_mode_restriction_t::mode> db::tri_mode_restr
{"warn", db::tri_mode_restriction_t::mode::WARN}};
}
std::unordered_map<sstring, db::tablets_mode_t::mode> db::tablets_mode_t::map() {
return {{"disabled", db::tablets_mode_t::mode::disabled},
{"0", db::tablets_mode_t::mode::disabled},
{"enabled", db::tablets_mode_t::mode::enabled},
{"1", db::tablets_mode_t::mode::enabled},
{"enforced", db::tablets_mode_t::mode::enforced},
{"2", db::tablets_mode_t::mode::enforced}
};
}
template struct utils::config_file::named_value<seastar::log_level>;
namespace utils {

View File
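The `db::tablets_mode_t::map()` added above makes each mode reachable by name or by a numeric alias, so YAML like `tablets_mode_for_new_keyspaces: 1` and `...: enabled` decode identically. A sketch of that lookup, with a failed parse reported like `convert<>::decode` returning false:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

enum class tablets_mode { unset = -1, disabled = 0, enabled = 1, enforced = 2 };

// Mirrors db::tablets_mode_t::map(): both spelled-out names and their
// numeric aliases resolve to the same mode.
std::optional<tablets_mode> parse_tablets_mode(const std::string& s) {
    static const std::unordered_map<std::string, tablets_mode> names{
        {"disabled", tablets_mode::disabled}, {"0", tablets_mode::disabled},
        {"enabled",  tablets_mode::enabled},  {"1", tablets_mode::enabled},
        {"enforced", tablets_mode::enforced}, {"2", tablets_mode::enforced},
    };
    auto it = names.find(s);
    if (it == names.end()) {
        return std::nullopt;  // analogous to decode() returning false
    }
    return it->second;
}
```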

@@ -130,6 +130,20 @@ struct replication_strategy_restriction_t {
constexpr unsigned default_murmur3_partitioner_ignore_msb_bits = 12;
struct tablets_mode_t {
// The `unset` mode is used internally for backward compatibility
// with the legacy `enable_tablets` option.
// It is defined as -1 as existing test code associates the value
// 0 with `false` and 1 with `true` when read from system.config.
enum class mode : int8_t {
unset = -1,
disabled = 0,
enabled = 1,
enforced = 2
};
static std::unordered_map<sstring, mode> map(); // for enum_option<>
};
class config final : public utils::config_file {
public:
config();
@@ -183,6 +197,7 @@ public:
named_value<sstring> commitlog_directory;
named_value<sstring> schema_commitlog_directory;
named_value<string_list> data_file_directories;
named_value<uint64_t> data_file_capacity;
named_value<sstring> hints_directory;
named_value<sstring> view_hints_directory;
named_value<sstring> saved_caches_directory;
@@ -527,6 +542,23 @@ public:
named_value<std::vector<error_injection_at_startup>> error_injections_at_startup;
named_value<double> topology_barrier_stall_detector_threshold_seconds;
named_value<bool> enable_tablets;
named_value<enum_option<tablets_mode_t>> tablets_mode_for_new_keyspaces;
bool enable_tablets_by_default() const noexcept {
switch (tablets_mode_for_new_keyspaces()) {
case tablets_mode_t::mode::unset:
return enable_tablets();
case tablets_mode_t::mode::disabled:
return false;
case tablets_mode_t::mode::enabled:
case tablets_mode_t::mode::enforced:
return true;
}
}
bool enforce_tablets() const noexcept {
return tablets_mode_for_new_keyspaces() == tablets_mode_t::mode::enforced;
}
named_value<uint32_t> view_flow_control_delay_limit_in_ms;
named_value<int> disk_space_monitor_normal_polling_interval_in_seconds;
@@ -535,6 +567,8 @@ public:
named_value<bool> enable_create_table_with_compact_storage;
named_value<bool> rf_rack_valid_keyspaces;
static const sstring default_tls_priority;
private:
template<typename T>

View File
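The config.hh hunk above resolves the new tri-state option against the deprecated `enable_tablets` boolean: `unset` falls back to the legacy flag, and only `enforced` makes tablets mandatory. The same resolution as standalone functions:

```cpp
#include <cassert>

enum class tablets_mode { unset = -1, disabled = 0, enabled = 1, enforced = 2 };

// As in db::config::enable_tablets_by_default(): `unset` defers to the
// deprecated enable_tablets flag for backward compatibility.
bool enable_tablets_by_default(tablets_mode mode, bool legacy_enable_tablets) {
    switch (mode) {
    case tablets_mode::unset:    return legacy_enable_tablets;
    case tablets_mode::disabled: return false;
    case tablets_mode::enabled:
    case tablets_mode::enforced: return true;
    }
    return false;  // unreachable for valid enum values
}

// As in db::config::enforce_tablets(): only `enforced` forbids opting out.
bool enforce_tablets(tablets_mode mode) {
    return mode == tablets_mode::enforced;
}
```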

@@ -146,6 +146,10 @@ future<> hint_endpoint_manager::stop(drain should_drain) noexcept {
});
}
void hint_endpoint_manager::cancel_draining() noexcept {
_sender.cancel_draining();
}
hint_endpoint_manager::hint_endpoint_manager(const endpoint_id& key, fs::path hint_directory, manager& shard_manager)
: _key(key)
, _shard_manager(shard_manager)

View File

@@ -102,6 +102,8 @@ public:
/// \return Ready future when all operations are complete
future<> stop(drain should_drain = drain::no) noexcept;
void cancel_draining() noexcept;
/// \brief Start the timer.
void start();
@@ -144,6 +146,10 @@ public:
return _state.contains(state::stopped);
}
bool canceled_draining() const noexcept {
return _sender.canceled_draining();
}
/// \brief Returns replay position of the most recently written hint.
///
/// If there weren't any hints written during this endpoint manager's lifetime, a zero replay_position is returned.

View File

@@ -10,6 +10,7 @@
#include "db/hints/internal/hint_sender.hh"
// Seastar features.
#include <chrono>
#include <exception>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
@@ -192,6 +193,14 @@ future<> hint_sender::stop(drain should_drain) noexcept {
});
}
void hint_sender::cancel_draining() {
manager_logger.info("Draining of {} has been marked as canceled", _ep_key);
if (_state.contains(state::draining)) {
_state.remove(state::draining);
}
_state.set(state::canceled_draining);
}
void hint_sender::add_segment(sstring seg_name) {
_segments_to_replay.emplace_back(std::move(seg_name));
}
@@ -449,6 +458,8 @@ bool hint_sender::send_one_file(const sstring& fname) {
gc_clock::duration secs_since_file_mod = std::chrono::seconds(last_mod.tv_sec);
lw_shared_ptr<send_one_file_ctx> ctx_ptr = make_lw_shared<send_one_file_ctx>(_last_schema_ver_to_column_mapping);
struct canceled_draining_exception {};
try {
commitlog::read_log_file(fname, manager::FILENAME_PREFIX, [this, secs_since_file_mod, &fname, ctx_ptr] (commitlog::buffer_and_replay_position buf_rp) -> future<> {
auto& buf = buf_rp.buffer;
@@ -461,6 +472,12 @@ bool hint_sender::send_one_file(const sstring& fname) {
co_return;
}
if (canceled_draining()) {
manager_logger.debug("[{}] Exiting reading from commitlog because of canceled draining", _ep_key);
// We need to throw an exception here to cancel reading the segment.
throw canceled_draining_exception{};
}
// Break early if stop() was called or the destination node went down.
if (!can_send()) {
ctx_ptr->segment_replay_failed = true;
@@ -491,6 +508,8 @@ bool hint_sender::send_one_file(const sstring& fname) {
manager_logger.error("{}: {}. Dropping...", fname, ex.what());
ctx_ptr->segment_replay_failed = false;
++this->shard_stats().corrupted_files;
} catch (const canceled_draining_exception&) {
manager_logger.debug("[{}] Loop in send_one_file finishes due to canceled draining", _ep_key);
} catch (...) {
manager_logger.trace("sending of {} failed: {}", fname, std::current_exception());
ctx_ptr->segment_replay_failed = true;
@@ -499,6 +518,12 @@ bool hint_sender::send_one_file(const sstring& fname) {
// wait till all background hints sending is complete
ctx_ptr->file_send_gate.close().get();
// If draining was canceled, we can't say anything about the segment's state,
// so return immediately. We return false here because of that reason too.
if (canceled_draining()) {
return false;
}
// If we are draining ignore failures and drop the segment even if we failed to send it.
if (draining() && ctx_ptr->segment_replay_failed) {
manager_logger.trace("send_one_file(): we are draining so we are going to delete the segment anyway");
@@ -556,6 +581,10 @@ void hint_sender::send_hints_maybe() noexcept {
try {
while (true) {
if (canceled_draining()) {
manager_logger.debug("[{}] Exiting loop in send_hints_maybe because of canceled draining", _ep_key);
break;
}
const sstring* seg_name = name_of_current_segment();
if (!seg_name || !replay_allowed() || !can_send()) {
break;

View File
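The `send_one_file` hunk above aborts `commitlog::read_log_file()` mid-stream by throwing a local sentinel type (`canceled_draining_exception`) from the per-buffer callback and catching it outside the read. A sketch of that cancellation pattern, with a plain callback-driven reader standing in for the commitlog API:

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct canceled_draining_exception {};  // local sentinel, as in the hunk

// Stand-in for commitlog::read_log_file(): invokes a callback per entry.
void read_segment(const std::vector<int>& entries,
                  const std::function<void(int)>& on_entry) {
    for (int e : entries) {
        on_entry(e);  // may throw to cancel the remainder of the read
    }
}

// Returns the number of entries processed before cancellation kicked in.
int drain_until_canceled(const std::vector<int>& entries,
                         const std::function<bool()>& canceled) {
    int processed = 0;
    try {
        read_segment(entries, [&](int) {
            if (canceled()) {
                throw canceled_draining_exception{};  // abort the read loop
            }
            ++processed;
        });
    } catch (const canceled_draining_exception&) {
        // Expected: cancellation is an early exit, not an error.
    }
    return processed;
}
```

The sentinel is caught before the generic `catch (...)` path would run, so cancellation is never misreported as a replay failure — the same reason the hunk adds its dedicated catch clause.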

@@ -66,12 +66,14 @@ class hint_sender {
stopping, // stop() was called
ep_state_left_the_ring, // destination Node is not a part of the ring anymore - usually means that it has been decommissioned
draining, // try to send everything out and ignore errors
canceled_draining, // draining was started, but it got canceled
};
using state_set = enum_set<super_enum<state,
state::stopping,
state::ep_state_left_the_ring,
state::draining>>;
state::draining,
state::canceled_draining>>;
struct send_one_file_ctx {
send_one_file_ctx(std::unordered_map<table_schema_version, column_mapping>& last_schema_ver_to_column_mapping)
@@ -140,6 +142,12 @@ public:
/// \param should_drain if is drain::yes - drain all pending hints
future<> stop(drain should_drain) noexcept;
void cancel_draining();
bool canceled_draining() const noexcept {
return _state.contains(state::canceled_draining);
}
/// \brief Add a new segment ready for sending.
void add_segment(sstring seg_name);

View File
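The hint_sender state above lives in a seastar `enum_set`; the new `cancel_draining()` clears `draining` if set and raises `canceled_draining`. A minimal sketch of those flag operations, approximating `enum_set` with a plain bitmask:

```cpp
#include <cassert>
#include <cstdint>

enum class state : uint8_t {
    stopping = 1 << 0,
    ep_state_left_the_ring = 1 << 1,
    draining = 1 << 2,
    canceled_draining = 1 << 3,
};

// Bitmask approximation of seastar::enum_set's set/remove/contains.
struct state_set {
    uint8_t bits = 0;
    void set(state s)            { bits |= static_cast<uint8_t>(s); }
    void remove(state s)         { bits &= ~static_cast<uint8_t>(s); }
    bool contains(state s) const { return bits & static_cast<uint8_t>(s); }
};

// cancel_draining() in miniature: stop draining, remember it was canceled.
void cancel_draining(state_set& st) {
    if (st.contains(state::draining)) {
        st.remove(state::draining);
    }
    st.set(state::canceled_draining);
}
```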

@@ -220,11 +220,24 @@ future<> manager::stop() {
set_stopping();
return _migrating_done.finally([this] {
const auto& node = *_proxy.get_token_metadata_ptr()->get_topology().this_node();
const bool leaving = node.is_leaving() || node.left();
return _migrating_done.finally([this, leaving] {
// We want to stop the manager as soon as possible if the node is not leaving the cluster.
// Because of that, we need to cancel all ongoing drains (since that can take quite a bit of time),
// but we also need to ensure that no new drains will be started in the meantime.
if (!leaving) {
for (auto& [_, ep_man] : _ep_managers) {
ep_man.cancel_draining();
}
}
return _draining_eps_gate.close();
// At this point, all endpoint managers that were being previously drained have been deleted from the map.
// In other words, the next lambda is safe to run, i.e. we won't call `hint_endpoint_manager::stop()` twice.
}).finally([this] {
return parallel_for_each(_ep_managers | std::views::values, [] (hint_endpoint_manager& ep_man) {
return ep_man.stop();
return ep_man.stop(drain::no);
}).finally([this] {
_ep_managers.clear();
_hint_directory_manager.clear();
@@ -667,7 +680,7 @@ future<> manager::drain_for(endpoint_id host_id, gms::inet_address ip) noexcept
co_return;
}
manager_logger.trace("on_leave_cluster: {} is removed/decommissioned", host_id);
manager_logger.trace("Draining starts for {}", host_id);
const auto holder = seastar::gate::holder{_draining_eps_gate};
// As long as we hold on to this lock, no migration of hinted handoff to host IDs
@@ -677,9 +690,24 @@ future<> manager::drain_for(endpoint_id host_id, gms::inet_address ip) noexcept
// After an endpoint has been drained, we remove its directory with all of its contents.
auto drain_ep_manager = [] (hint_endpoint_manager& ep_man) -> future<> {
return ep_man.stop(drain::yes).finally([&] {
return ep_man.with_file_update_mutex([&ep_man] {
return remove_file(ep_man.hints_dir().native());
// Prevent a drain if the endpoint manager was marked to cancel it.
if (ep_man.canceled_draining()) {
return make_ready_future();
}
return ep_man.stop(drain::yes).finally([&ep_man] {
// If draining was canceled, we can't remove the hint directory yet
// because there might still be some hints that we should send.
// We'll do that when the node starts again.
// Note that canceling draining can ONLY occur when the node is simply stopping.
// That cannot happen when decommissioning the node.
if (ep_man.canceled_draining()) {
return make_ready_future();
}
return ep_man.with_file_update_mutex([&ep_man] -> future<> {
return remove_file(ep_man.hints_dir().native()).then([&ep_man] {
manager_logger.debug("Removed hint directory for {}", ep_man.end_point_key());
});
});
});
};
@@ -986,4 +1014,18 @@ future<> manager::perform_migration() {
manager_logger.info("Migration of hinted handoff to host ID has finished successfully");
}
// Technical note: This function obviously doesn't need to be a coroutine. However, it's better to impose
// this constraint early on with possible future refactors in mind. It should be easier
// to modify the function this way.
future<> manager::drain_left_nodes() {
for (const auto& [host_id, ep_man] : _ep_managers) {
if (!_proxy.get_token_metadata_ptr()->is_normal_token_owner(host_id)) {
// It's safe to discard this future. It's awaited in `manager::stop()`.
(void) drain_for(host_id, {});
}
}
co_return;
}
} // namespace db::hints

View File
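The `manager::stop()` change above is about ordering: when the node is not leaving the cluster, mark every endpoint manager's drain as canceled *before* waiting on the drain gate, so the wait stays short, and only stop the managers afterwards. A synchronous sketch of that ordering (the log vector stands in for observable side effects; `_draining_eps_gate.close()` is reduced to a log entry):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct endpoint_manager {
    bool drain_canceled = false;
    bool stopped = false;
    void cancel_draining() { drain_canceled = true; }
    void stop() { stopped = true; }  // stop(drain::no) in the real code
};

// Cancel-then-wait-then-stop, as in manager::stop().
std::vector<std::string> stop_manager(std::vector<endpoint_manager>& eps, bool leaving) {
    std::vector<std::string> log;
    if (!leaving) {
        for (auto& ep : eps) {
            ep.cancel_draining();     // must precede the wait below
        }
        log.push_back("canceled");
    }
    log.push_back("waited_for_drains");  // _draining_eps_gate.close()
    for (auto& ep : eps) {
        ep.stop();
    }
    log.push_back("stopped");
    return log;
}
```

When the node is leaving, cancellation is skipped and the in-flight drains are allowed to finish, matching the `leaving` check in the hunk.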

@@ -382,6 +382,12 @@ private:
/// ALL requested sync points will be canceled, i.e. an exception will be issued
/// in the corresponding futures.
future<> perform_migration();
public:
/// Performs draining for all nodes that have already left the cluster.
/// This should only be called when the hint endpoint managers have been initialized
/// and the hint manager has started.
future<> drain_left_nodes();
};
} // namespace db::hints

View File

@@ -239,6 +239,15 @@ future<> resource_manager::stop() noexcept {
});
}
future<> resource_manager::drain_hints_for_left_nodes() {
for (manager& m : _shard_managers) {
// It's safe to discard the future here. It's awaited in `manager::stop()`.
(void) m.drain_left_nodes();
}
co_return;
}
future<> resource_manager::register_manager(manager& m) {
return with_semaphore(_operation_lock, 1, [this, &m] () {
return with_semaphore(_space_watchdog.update_lock(), 1, [this, &m] {

View File

@@ -188,6 +188,8 @@ public:
/// \brief Allows replaying hints for managers which are registered now or will be in the future.
void allow_replaying() noexcept;
future<> drain_hints_for_left_nodes();
/// \brief Registers the hints::manager in resource_manager, and starts it, if resource_manager is already running.
///
/// The hints::managers can be added either before or after resource_manager starts.

View File

@@ -23,6 +23,7 @@
#include "gms/feature_service.hh"
#include "system_keyspace_view_types.hh"
#include "schema/schema_builder.hh"
#include "timestamp.hh"
#include "utils/assert.hh"
#include "utils/hashers.hh"
#include "utils/log.hh"
@@ -2931,9 +2932,8 @@ future<std::optional<mutation>> system_keyspace::get_service_levels_version_muta
return get_scylla_local_mutation(_db, SERVICE_LEVELS_VERSION_KEY);
}
future<mutation> system_keyspace::make_service_levels_version_mutation(int8_t version, const service::group0_guard& guard) {
future<mutation> system_keyspace::make_service_levels_version_mutation(int8_t version, api::timestamp_type timestamp) {
static sstring query = format("INSERT INTO {}.{} (key, value) VALUES (?, ?);", db::system_keyspace::NAME, db::system_keyspace::SCYLLA_LOCAL);
auto timestamp = guard.write_timestamp();
auto muts = co_await _qp.get_mutations_internal(query, internal_system_query_state(), timestamp, {SERVICE_LEVELS_VERSION_KEY, format("{}", version)});
if (muts.size() != 1) {

View File

@@ -654,7 +654,7 @@ public:
public:
future<std::optional<int8_t>> get_service_levels_version();
future<mutation> make_service_levels_version_mutation(int8_t version, const service::group0_guard& guard);
future<mutation> make_service_levels_version_mutation(int8_t version, api::timestamp_type timestamp);
future<std::optional<mutation>> get_service_levels_version_mutation();
// Publishes a new compression dictionary to `dicts`,

View File

@@ -0,0 +1,127 @@
/*
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "column_computation.hh"
#include "mutation/atomic_cell.hh"
#include "timestamp.hh"
#include <type_traits>
class row_marker;
// In a basic column_computation defined in column_computation.hh, the
// compute_value() method is only based on the partition key, and it must
// return a value. That API has very limited applications - basically the
// only thing we can implement with it is token_column_computation which
// we used to create the token column in secondary indexes.
// The regular_column_transformation base class here is more powerful, but
// still is not a completely general computation: Its compute_value() virtual
// method can transform the value read from a single cell of a regular column
// into a new cell stored in a structure regular_column_transformation::result.
//
// In more detail, the assumptions of regular_column_transformation are:
// 1. compute_value() computes the value based on a *single* column in a
// row passed to compute_value().
// This assumption means that the value or deletion of the value always
// has a single known timestamp (and the value can't be half-missing)
// and single TTL information. That would not have been possible if we
// allowed the computation to depend on multiple columns.
// 2. compute_value() computes the value based on a *regular* column in the
// base table. This means that an update can modify this value (unlike a
// base-table key column that can't change in an update), so the view
// update code needs to compute the value before and after the update,
// and potentially delete and create view rows.
// 3. compute_value() returns a regular_column_transformation::result which
//    includes
// a value and its liveness information (timestamp and ttl/expiry) or
// is missing a value.
class regular_column_transformation : public column_computation {
public:
struct result {
// We can use "bytes" instead of "managed_bytes" here because we know
// that a column_computation is only used for generating a key value,
// and that is limited to 64K. This limitation is enforced below -
// we never linearize a cell's value if its size is more than 64K.
std::optional<bytes> _value;
// _ttl and _expiry are only defined if _value is set.
// The default values below are used when the source cell does not
// expire, and are the same values that row_marker uses for a non-
// expiring marker. This is useful when creating a row_marker from
// get_ttl() and get_expiry().
gc_clock::duration _ttl { 0 };
gc_clock::time_point _expiry { gc_clock::duration(0) };
// _ts may be set even if _value is missing, to remember the
// timestamp of a tombstone. Note that the current view-update code
// that uses this class doesn't use _ts when _value is missing.
api::timestamp_type _ts = api::missing_timestamp;
api::timestamp_type get_ts() const {
return _ts;
}
bool has_value() const {
return _value.has_value();
}
// Should only be called if has_value() is true:
const bytes& get_value() const {
return *_value;
}
gc_clock::duration get_ttl() const {
return _ttl;
}
gc_clock::time_point get_expiry() const {
return _expiry;
}
// A missing computation result
result() { }
// Construct a computation result by copying a given atomic_cell -
// including its value, timestamp, and ttl - or deletion timestamp.
// The second parameter is an optional transformation function f -
// taking a bytes and returning an optional<bytes> - that transforms
// the value of the cell but keeps its other liveness information.
// If f returns nullopt, the view row will be deleted.
template<typename Func=std::identity>
requires std::invocable<Func, bytes> && std::convertible_to<std::invoke_result_t<Func, bytes>, std::optional<bytes>>
result(atomic_cell_view cell, Func f = {}) {
_ts = cell.timestamp();
if (cell.is_live()) {
// If the cell is larger than what a key can hold (64KB),
// return a missing value. This lets us skip this item during
// view building and avoid hanging the view build as described
// in #8627. But it doesn't prevent later inserting such an item
// into the base table, nor does it implement front-end specific
// limits (such as Alternator's 1K or 2K limits - see #10347).
// Those stricter limits should be validated in the base-table
// write code, not here - deep inside the view update code.
// Note also we assume that f() doesn't grow the value further.
if (cell.value().size() >= 65536) {
return;
}
_value = f(to_bytes(cell.value()));
if (_value) {
if (cell.is_live_and_has_ttl()) {
_ttl = cell.ttl();
_expiry = cell.expiry();
}
}
}
}
};
virtual ~regular_column_transformation() = default;
virtual result compute_value(
const schema& schema,
const partition_key& key,
const db::view::clustering_or_static_row& row) const = 0;
};


@@ -36,6 +36,7 @@
#include "db/view/view_builder.hh"
#include "db/view/view_updating_consumer.hh"
#include "db/view/view_update_generator.hh"
#include "db/view/regular_column_transformation.hh"
#include "db/system_keyspace_view_types.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
@@ -506,79 +507,6 @@ size_t view_updates::op_count() const {
return _op_count;
}
row_marker view_updates::compute_row_marker(const clustering_or_static_row& base_row) const {
/*
* We need to compute both the timestamp and expiration for view rows.
*
* Below there are several distinct cases depending on how many new key
* columns the view has - i.e., how many of the view's key columns were
* regular columns in the base. base_regular_columns_in_view_pk.size():
*
* Zero new key columns:
* The view row's key is composed only of base key columns, and those
* cannot be changed in an update, so the view row remains alive as
* long as the base row is alive. We need to return the same row
* marker as the base for the view - to keep an empty view row alive
* for as long as an empty base row exists.
* Note that in this case, if there are *unselected* base columns, we
* may need to keep an empty view row alive even without a row marker
* because the base row (which has additional columns) is still alive.
* For that we have the "virtual columns" feature: In the zero new
* key columns case, we put unselected columns in the view as empty
* columns, to keep the view row alive.
*
* One new key column:
* In this case, there is a regular base column that is part of the
* view key. This regular column can be added or deleted in an update,
* or its expiration be set, and those can cause the view row -
* including its row marker - to need to appear or disappear as well.
* So the liveness of the cell of this one column determines the liveness
* of the view row and the row marker that we return.
*
* Two or more new key columns:
* This case is explicitly NOT supported in CQL - one cannot create a
* view with more than one base-regular column in its key. In general
* picking one liveness (timestamp and expiration) is not possible
* if there are multiple regular base columns in the view key, as
* those can have different liveness.
* However, we do allow this case for Alternator - we need to allow
* the case of two (but not more) because the DynamoDB API allows
* creating a GSI whose two key columns (hash and range key) were
* regular columns.
* We can support this case in Alternator because it doesn't use
* expiration (the "TTL" it does support is different), and doesn't
* support user-defined timestamps. But, the two columns can still
* have different timestamps - this happens if an update modifies
* just one of them. In this case the timestamp of the view update
* (and that of the row marker we return) is the later of these two
* updated columns.
*/
const auto& col_ids = base_row.is_clustering_row()
? _base_info->base_regular_columns_in_view_pk()
: _base_info->base_static_columns_in_view_pk();
if (!col_ids.empty()) {
auto& def = _base->column_at(base_row.column_kind(), col_ids[0]);
// Note: multi-cell columns can't be part of the primary key.
auto cell = base_row.cells().cell_at(col_ids[0]).as_atomic_cell(def);
auto ts = cell.timestamp();
if (col_ids.size() > 1){
// As explained above, this case only happens in Alternator,
// and we may need to pick a higher ts:
auto& second_def = _base->column_at(base_row.column_kind(), col_ids[1]);
auto second_cell = base_row.cells().cell_at(col_ids[1]).as_atomic_cell(second_def);
auto second_ts = second_cell.timestamp();
ts = std::max(ts, second_ts);
// Alternator isn't supposed to have TTL or more than two col_ids!
if (col_ids.size() != 2 || cell.is_live_and_has_ttl() || second_cell.is_live_and_has_ttl()) [[unlikely]] {
utils::on_internal_error(format("Unexpected col_ids length {} or has TTL", col_ids.size()));
}
}
return cell.is_live_and_has_ttl() ? row_marker(ts, cell.ttl(), cell.expiry()) : row_marker(ts);
}
return base_row.marker();
}
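The timestamp rule in compute_row_marker() can be isolated into a tiny sketch (a hypothetical helper, not Scylla code): with one new key column the row marker takes that cell's timestamp; with exactly two (the Alternator GSI case) it takes the later of the two, since an update may have touched only one of them; any other count is an internal error.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Pick the row-marker timestamp from the timestamps of the cells of the
// base-regular columns that became view key columns.
int64_t marker_timestamp(const std::vector<int64_t>& cell_timestamps) {
    if (cell_timestamps.empty() || cell_timestamps.size() > 2) {
        // Mirrors utils::on_internal_error() in the code above.
        throw std::logic_error("unexpected number of new key columns");
    }
    int64_t ts = cell_timestamps.front();
    if (cell_timestamps.size() == 2) {
        // Alternator-only: an update may modify just one of the two
        // columns, so the view update uses the later timestamp.
        ts = std::max(ts, cell_timestamps[1]);
    }
    return ts;
}
```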
namespace {
// The following struct is identical to view_key_with_action, except the key
// is stored as a managed_bytes_view instead of bytes.
@@ -654,8 +582,8 @@ public:
return {_update.key()->get_component(_base, base_col->position())};
default:
if (base_col->kind != _update.column_kind()) {
on_internal_error(vlogger, format("Tried to get a {} column from a {} row update, which is impossible",
to_sstring(base_col->kind), _update.is_clustering_row() ? "clustering" : "static"));
on_internal_error(vlogger, format("Tried to get a {} column {} from a {} row update, which is impossible",
to_sstring(base_col->kind), base_col->name_as_text(), _update.is_clustering_row() ? "clustering" : "static"));
}
auto& c = _update.cells().cell_at(base_col->id);
auto value_view = base_col->is_atomic() ? c.as_atomic_cell(cdef).value() : c.as_collection_mutation().data;
@@ -676,6 +604,22 @@ private:
return handle_collection_column_computation(collection_computation);
}
// TODO: we already calculated this computation in updatable_view_key_cols,
// so perhaps we should pass it here and not re-compute it. But this will
// mean computed columns will only work for view key columns (currently
// we assume that anyway)
if (auto* c = dynamic_cast<const regular_column_transformation*>(&computation)) {
regular_column_transformation::result after =
c->compute_value(_base, _base_key, _update);
if (after.has_value()) {
return {managed_bytes_view(linearized_values.emplace_back(after.get_value()))};
}
// We only get to this function when we know the _update row
// exists and call it to read its key columns, so we don't expect
// to see a missing value for any of those columns
on_internal_error(vlogger, fmt::format("unexpected call to handle_computed_column {} missing in update", cdef.name_as_text()));
}
auto computed_value = computation.compute_value(_base, _base_key);
return {managed_bytes_view(linearized_values.emplace_back(std::move(computed_value)))};
}
@@ -727,7 +671,6 @@ view_updates::get_view_rows(const partition_key& base_key, const clustering_or_s
if (partition.partition_tombstone() && partition.partition_tombstone() == row_delete_tomb.tomb()) {
return;
}
ret.push_back({&partition.clustered_row(*_view, std::move(ckey)), action});
};
@@ -934,13 +877,12 @@ static void add_cells_to_view(const schema& base, const schema& view, column_kin
* Creates a view entry corresponding to the provided base row.
* This method checks that the base row does match the view filter before applying anything.
*/
void view_updates::create_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, gc_clock::time_point now) {
void view_updates::create_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, gc_clock::time_point now, row_marker update_marker) {
if (!matches_view_filter(db, *_base, _view_info, base_key, update, now)) {
return;
}
auto view_rows = get_view_rows(base_key, update, std::nullopt, {});
auto update_marker = compute_row_marker(update);
const auto kind = update.column_kind();
for (const auto& [r, action]: view_rows) {
if (auto rm = std::get_if<row_marker>(&action)) {
@@ -958,48 +900,28 @@ void view_updates::create_entry(data_dictionary::database db, const partition_ke
* Deletes the view entry corresponding to the provided base row.
* This method checks that the base row does match the view filter before bothering.
*/
void view_updates::delete_old_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now) {
void view_updates::delete_old_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now, api::timestamp_type deletion_ts) {
// Before deleting an old entry, make sure it was matching the view filter
// (otherwise there is nothing to delete)
if (matches_view_filter(db, *_base, _view_info, base_key, existing, now)) {
do_delete_old_entry(base_key, existing, update, now);
do_delete_old_entry(base_key, existing, update, now, deletion_ts);
}
}
void view_updates::do_delete_old_entry(const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now) {
void view_updates::do_delete_old_entry(const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now, api::timestamp_type deletion_ts) {
auto view_rows = get_view_rows(base_key, existing, std::nullopt, update.tomb());
const auto kind = existing.column_kind();
for (const auto& [r, action] : view_rows) {
const auto& col_ids = existing.is_clustering_row()
? _base_info->base_regular_columns_in_view_pk()
: _base_info->base_static_columns_in_view_pk();
if (_view_info.has_computed_column_depending_on_base_non_primary_key()) {
if (auto ts_tag = std::get_if<view_key_and_action::shadowable_tombstone_tag>(&action)) {
r->apply(ts_tag->into_shadowable_tombstone(now));
}
} else if (!col_ids.empty()) {
// We delete the old row using a shadowable row tombstone, making sure that
// the tombstone deletes everything in the row (or it might still show up).
// Note: multi-cell columns can't be part of the primary key.
auto& def = _base->column_at(kind, col_ids[0]);
auto cell = existing.cells().cell_at(col_ids[0]).as_atomic_cell(def);
auto ts = cell.timestamp();
if (col_ids.size() > 1) {
// This is the Alternator-only support for two regular base
// columns that become view key columns. See explanation in
// view_updates::compute_row_marker().
auto& second_def = _base->column_at(kind, col_ids[1]);
auto second_cell = existing.cells().cell_at(col_ids[1]).as_atomic_cell(second_def);
auto second_ts = second_cell.timestamp();
ts = std::max(ts, second_ts);
// Alternator isn't supposed to have more than two col_ids!
if (col_ids.size() != 2) [[unlikely]] {
utils::on_internal_error(format("Unexpected col_ids length {}", col_ids.size()));
}
}
if (cell.is_live()) {
r->apply(shadowable_tombstone(ts, now));
}
if (!col_ids.empty() || _view_info.has_computed_column_depending_on_base_non_primary_key()) {
// The view key could have been modified because it contains or
// depends on a non-primary-key. The fact that this function was
// called instead of update_entry() means the caller knows it
// wants to delete the old row (with the given deletion_ts) and
// will create a different one. So let's honor this.
r->apply(shadowable_tombstone(deletion_ts, now));
} else {
// "update" caused the base row to have been deleted, and !col_id
// means view row is the same - so it needs to be deleted as well
@@ -1100,15 +1022,15 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
* This method checks that the base row (before and after) matches the view filter before
* applying anything.
*/
void view_updates::update_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, const clustering_or_static_row& existing, gc_clock::time_point now) {
void view_updates::update_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, const clustering_or_static_row& existing, gc_clock::time_point now, row_marker update_marker) {
// While we know update and existing correspond to the same view entry,
// they may not match the view filter.
if (!matches_view_filter(db, *_base, _view_info, base_key, existing, now)) {
create_entry(db, base_key, update, now);
create_entry(db, base_key, update, now, update_marker);
return;
}
if (!matches_view_filter(db, *_base, _view_info, base_key, update, now)) {
do_delete_old_entry(base_key, existing, update, now);
do_delete_old_entry(base_key, existing, update, now, update_marker.timestamp());
return;
}
@@ -1117,7 +1039,7 @@ void view_updates::update_entry(data_dictionary::database db, const partition_ke
}
auto view_rows = get_view_rows(base_key, update, std::nullopt, {});
auto update_marker = compute_row_marker(update);
const auto kind = update.column_kind();
for (const auto& [r, action] : view_rows) {
if (auto rm = std::get_if<row_marker>(&action)) {
@@ -1133,6 +1055,8 @@ void view_updates::update_entry(data_dictionary::database db, const partition_ke
_op_count += view_rows.size();
}
// Note: despite the general-sounding name of this function, it is used
// just for the case of collection indexing.
void view_updates::update_entry_for_computed_column(
const partition_key& base_key,
const clustering_or_static_row& update,
@@ -1155,30 +1079,72 @@ void view_updates::update_entry_for_computed_column(
}
}
// view_updates::generate_update() is the main function for taking an update
// to a base table row - consisting of existing and updated versions of row -
// and creating from it zero or more updates to a given materialized view.
// These view updates may consist of updating an existing view row, deleting
// an old view row, and/or creating a new view row.
// There are several distinct cases depending on how many of the view's key
// columns are "new key columns", i.e., were regular key columns in the base
// or are a computed column based on a regular column (these computed columns
// are used by, for example, Alternator's GSI):
//
// Zero new key columns:
// The view row's key is composed only of base key columns, and those can't
// be changed in an update, so the view row remains alive as long as the
// base row is alive. The row marker for the view needs to be set to the
// same row marker in the base - to keep an empty view row alive for as long
// as an empty base row exists.
// Note that in this case, if there are *unselected* base columns, we may
// need to keep an empty view row alive even without a row marker because
// the base row (which has additional columns) is still alive. For that we
// have the "virtual columns" feature: In the zero new key columns case, we
// put unselected columns in the view as empty columns, to keep the view
// row alive.
//
// One new key column:
// In this case, there is a regular base column that is part of the view
// key. This regular column can be added or deleted in an update, or its
// expiration be set, and those can cause the view row - including its row
// marker - to need to appear or disappear as well. So the liveness of the
// cell of this one column determines the liveness of the view row and the row
// marker that we set for it.
//
// Two or more new key columns:
// This case is explicitly NOT supported in CQL - one cannot create a view
// with more than one base-regular column in its key. In general picking
// one liveness (timestamp and expiration) is not possible if there are
// multiple regular base columns in the view key, as those can have different
// liveness.
// However, we do allow this case for Alternator - we need to allow the case
// of two (but not more) because the DynamoDB API allows creating a GSI
// whose two key columns (hash and range key) were regular columns. We can
// support this case in Alternator because it doesn't use expiration (the
// "TTL" it does support is different), and doesn't support user-defined
// timestamps. But, the two columns can still have different timestamps -
// this happens if an update modifies just one of them. In this case the
// timestamp of the view update (and that of the row marker) is the later
// of these two updated columns.
void view_updates::generate_update(
data_dictionary::database db,
const partition_key& base_key,
const clustering_or_static_row& update,
const std::optional<clustering_or_static_row>& existing,
gc_clock::time_point now) {
// Note that the base PK columns in update and existing are the same, since we're intrinsically dealing
// with the same base row. So we have to check 3 things:
// 1) that the clustering key doesn't have a null, which can happen for compact tables. If that's the case,
// there is no corresponding entries.
// 2) if there is a column not part of the base PK in the view PK, whether it is changed by the update.
// 3) whether the update actually matches the view SELECT filter
// FIXME: The following if() is old code which may be related to COMPACT
// STORAGE. If this is a real case, refer to a test that demonstrates it.
// If it's not a real case, remove this if().
if (update.is_clustering_row()) {
if (!update.key()->is_full(*_base)) {
return;
}
}
if (_view_info.has_computed_column_depending_on_base_non_primary_key()) {
return update_entry_for_computed_column(base_key, update, existing, now);
}
if (!_base_info->has_base_non_pk_columns_in_view_pk) {
// If the view key depends on any regular column in the base, the update
// may change the view key and may require deleting an old view row and
// inserting a new row. The other case, which we'll handle here first,
// is easier and requires just modifying one view row.
if (!_base_info->has_base_non_pk_columns_in_view_pk &&
!_view_info.has_computed_column_depending_on_base_non_primary_key()) {
if (update.is_static_row()) {
// TODO: support static rows in views with pk only including columns from base pk
return;
@@ -1186,85 +1152,186 @@ void view_updates::generate_update(
// The view key is necessarily the same pre and post update.
if (existing && existing->is_live(*_base)) {
if (update.is_live(*_base)) {
update_entry(db, base_key, update, *existing, now);
update_entry(db, base_key, update, *existing, now, update.marker());
} else {
delete_old_entry(db, base_key, *existing, update, now);
delete_old_entry(db, base_key, *existing, update, now, api::missing_timestamp);
}
} else if (update.is_live(*_base)) {
create_entry(db, base_key, update, now);
create_entry(db, base_key, update, now, update.marker());
}
return;
}
const auto& col_ids = update.is_clustering_row()
? _base_info->base_regular_columns_in_view_pk()
: _base_info->base_static_columns_in_view_pk();
// The view has a non-primary-key column from the base table as its primary key.
// That means it's either a regular or static column. If we are currently
// processing an update which does not correspond to the column's kind,
// just stop here.
if (col_ids.empty()) {
// Find the view key columns that may be changed by an update.
// This case is interesting because a change to the view key means that
// we may need to delete an old view row and/or create a new view row.
// The columns we look for are view key columns that are neither base key
// columns nor computed columns based just on key columns. In other words,
// we look here for columns which were regular columns or static columns
// in the base table, or computed columns based on regular columns.
struct updatable_view_key_col {
column_id view_col_id;
regular_column_transformation::result before;
regular_column_transformation::result after;
};
std::vector<updatable_view_key_col> updatable_view_key_cols;
for (const column_definition& view_col : _view->primary_key_columns()) {
if (view_col.is_computed()) {
const column_computation& computation = view_col.get_computation();
if (computation.depends_on_non_primary_key_column()) {
// Column is a computed column that does not depend just on
// the base key, so it may change in the update.
if (auto* c = dynamic_cast<const regular_column_transformation*>(&computation)) {
updatable_view_key_cols.emplace_back(view_col.id,
existing ? c->compute_value(*_base, base_key, *existing) : regular_column_transformation::result(),
c->compute_value(*_base, base_key, update));
} else {
// The only other column_computation we have which has
// depends_on_non_primary_key_column is
// collection_column_computation, and we have a special
// function to handle that case:
return update_entry_for_computed_column(base_key, update, existing, now);
}
}
} else {
const column_definition* base_col = _base->get_column_definition(view_col.name());
if (!base_col) {
on_internal_error(vlogger, fmt::format("Column {} in view {}.{} was not found in the base table {}.{}",
view_col.name(), _view->ks_name(), _view->cf_name(), _base->ks_name(), _base->cf_name()));
}
// If the view key column was also a base primary key column, then
// it can't possibly change in this update. But if the column was
// not a primary key column - i.e., a regular or static column -
// the update might have changed it and we need to list it in
// updatable_view_key_cols.
// We check base_col->kind == update.column_kind() instead of just
// !base_col->is_primary_key() because when update is a static row
// we know it can't possibly update a regular column (and vice
// versa).
if (base_col->kind == update.column_kind()) {
// This is view key, so we know it is atomic
std::optional<atomic_cell_view> after;
auto afterp = update.cells().find_cell(base_col->id);
if (afterp) {
after = afterp->as_atomic_cell(*base_col);
}
std::optional<atomic_cell_view> before;
if (existing) {
auto beforep = existing->cells().find_cell(base_col->id);
if (beforep) {
before = beforep->as_atomic_cell(*base_col);
}
}
updatable_view_key_cols.emplace_back(view_col.id,
before ? regular_column_transformation::result(*before) : regular_column_transformation::result(),
after ? regular_column_transformation::result(*after) : regular_column_transformation::result());
}
}
}
// If we reached here, the view has a non-primary-key column from the base
// table as its primary key. That means it's either a regular or static
// column. If we are currently processing an update which does not
// correspond to the column's kind, updatable_view_key_cols will be empty
// and we can just stop here.
if (updatable_view_key_cols.empty()) {
return;
}
const auto kind = update.column_kind();
// If one of the key columns is missing, set has_new_row = false
// meaning that after the update there will be no view row.
// If one of the key columns is missing in the existing value,
// set has_old_row = false meaning we don't have an old row to
// delete.
// Use updatable_view_key_cols - the before and after values of the
// view key columns that may have changed - to determine if the update
// changes an existing view row, deletes an old row or creates a new row.
bool has_old_row = true;
bool has_new_row = true;
bool same_row = true;
for (auto col_id : col_ids) {
auto* after = update.cells().find_cell(col_id);
auto& cdef = _base->column_at(kind, col_id);
if (existing) {
auto* before = existing->cells().find_cell(col_id);
// Note that this cell is necessarily atomic, because col_ids are
// view key columns, and keys must be atomic.
if (before && before->as_atomic_cell(cdef).is_live()) {
if (after && after->as_atomic_cell(cdef).is_live()) {
// We need to compare just the values of the keys, not
// metadata like the timestamp. This is because below,
// if the old and new view row have the same key, we need
// to be sure to reach the update_entry() case.
auto cmp = compare_unsigned(before->as_atomic_cell(cdef).value(), after->as_atomic_cell(cdef).value());
if (cmp != 0) {
same_row = false;
}
bool same_row = true; // undefined if either has_old_row or has_new_row is false
for (const auto& u : updatable_view_key_cols) {
if (u.before.has_value()) {
if (u.after.has_value()) {
if (compare_unsigned(u.before.get_value(), u.after.get_value()) != 0) {
same_row = false;
}
} else {
has_old_row = false;
has_new_row = false;
}
} else {
has_old_row = false;
}
if (!after || !after->as_atomic_cell(cdef).is_live()) {
has_new_row = false;
if (!u.after.has_value()) {
has_new_row = false;
}
}
}
// If has_new_row, calculate a row marker for this view row - i.e., a
// timestamp and ttl - based on those of the updatable view key column
// (or, in an Alternator-only extension, more than one).
row_marker new_row_rm; // only set if has_new_row
if (has_new_row) {
// Note:
// 1. By reaching here we know that updatable_view_key_cols has at
// least one member (in CQL, it's always one, in Alternator it
// may be two).
// 2. Because has_new_row, we know all elements in that array have
// after.has_value() true, so we can use after.get_ts() et al.
api::timestamp_type new_row_ts = updatable_view_key_cols[0].after.get_ts();
// This is the Alternator-only support for *two* regular base columns
// that become view key columns. The timestamp we use is the *maximum*
// of the two key columns, as explained in pull-request #17172.
if (updatable_view_key_cols.size() > 1) {
auto second_ts = updatable_view_key_cols[1].after.get_ts();
new_row_ts = std::max(new_row_ts, second_ts);
// Alternator isn't supposed to have more than two updatable view key columns!
if (updatable_view_key_cols.size() != 2) [[unlikely]] {
utils::on_internal_error(format("Unexpected updatable_view_key_col length {}", updatable_view_key_cols.size()));
}
}
// We assume that either updatable_view_key_cols has just one column
// (the only situation allowed in CQL) or, if there is more than one,
// they have the same expiry information (in Alternator, there is
// never a CQL TTL set).
new_row_rm = row_marker(new_row_ts, updatable_view_key_cols[0].after.get_ttl(), updatable_view_key_cols[0].after.get_expiry());
}
if (has_old_row) {
// As explained in #19977, when there is one updatable_view_key_cols
// (the only case allowed in CQL) the deletion timestamp is before's
// timestamp. As explained in #17119, if there are two of them (only
// possible in Alternator), we take the maximum.
// Note:
// 1. By reaching here we know that updatable_view_key_cols has at
// least one member (in CQL, it's always one, in Alternator it
// may be two).
// 2. Because has_old_row, we know all elements in that array have
// before.has_value() true, so we can use before.get_ts().
auto old_row_ts = updatable_view_key_cols[0].before.get_ts();
if (updatable_view_key_cols.size() > 1) {
// This is the Alternator-only support for two regular base
// columns that become view key columns. See explanation in
// view_updates::compute_row_marker().
auto second_ts = updatable_view_key_cols[1].before.get_ts();
old_row_ts = std::max(old_row_ts, second_ts);
// Alternator isn't supposed to have more than two updatable view key columns!
if (updatable_view_key_cols.size() != 2) [[unlikely]] {
utils::on_internal_error(format("Unexpected updatable_view_key_col length {}", updatable_view_key_cols.size()));
}
}
if (has_new_row) {
if (same_row) {
update_entry(db, base_key, update, *existing, now);
update_entry(db, base_key, update, *existing, now, new_row_rm);
} else {
// This code doesn't work if the old and new view row have the
// same key, because if they do we get both data and tombstone
// for the same timestamp (now) and the tombstone wins. This
// is why we need the "same_row" case above - it's not just a
// performance optimization.
delete_old_entry(db, base_key, *existing, update, now);
create_entry(db, base_key, update, now);
// The following code doesn't work if the old and new view row
// have the same key, because if they do we can get both data
// and tombstone for the same timestamp and the tombstone
// wins. This is why we need the "same_row" case above - it's
// not just a performance optimization.
delete_old_entry(db, base_key, *existing, update, now, old_row_ts);
create_entry(db, base_key, update, now, new_row_rm);
}
} else {
delete_old_entry(db, base_key, *existing, update, now);
delete_old_entry(db, base_key, *existing, update, now, old_row_ts);
}
} else if (has_new_row) {
create_entry(db, base_key, update, now);
create_entry(db, base_key, update, now, new_row_rm);
}
}
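The has_old_row/has_new_row/same_row decision at the heart of generate_update() can be sketched in isolation. The types below are hypothetical simplifications (plain strings instead of cells, an enum instead of the actual update_entry/delete_old_entry/create_entry calls):

```cpp
#include <optional>
#include <string>
#include <vector>

// What generate_update() would do with the view row, expressed as an enum.
enum class view_action { none, update, create, del, del_and_create };

// Before/after value of one updatable view key column; nullopt = missing.
struct key_col_change {
    std::optional<std::string> before;
    std::optional<std::string> after;
};

view_action decide(const std::vector<key_col_change>& cols) {
    bool has_old_row = true, has_new_row = true, same_row = true;
    for (const auto& c : cols) {
        if (c.before) {
            // Compare only the key values, not metadata like timestamps,
            // so that an unchanged key reaches the "update" case.
            if (c.after && *c.before != *c.after) {
                same_row = false;
            }
        } else {
            has_old_row = false;
        }
        if (!c.after) {
            has_new_row = false;
        }
    }
    if (has_old_row) {
        if (has_new_row) {
            // If the key is unchanged we must update in place: deleting
            // and recreating with the same key would let the tombstone
            // win over the data, as the comment above explains.
            return same_row ? view_action::update : view_action::del_and_create;
        }
        return view_action::del;
    }
    return has_new_row ? view_action::create : view_action::none;
}
```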
bool view_updates::is_partition_key_permutation_of_base_partition_key() const {
@@ -2995,6 +3062,12 @@ public:
_step.build_status.pop_back();
}
}
// before going back to the minimum token, advance current_key to the end
// and check for built views in that range.
_step.current_key = {_step.prange.end().value_or(dht::ring_position::max()).value().token(), partition_key::make_empty()};
check_for_built_views();
_step.current_key = {dht::minimum_token(), partition_key::make_empty()};
for (auto&& vs : _step.build_status) {
vs.next_token = dht::minimum_token();


@@ -240,10 +240,10 @@ private:
};
std::vector<view_row_entry> get_view_rows(const partition_key& base_key, const clustering_or_static_row& update, const std::optional<clustering_or_static_row>& existing, row_tombstone update_tomb);
bool can_skip_view_updates(const clustering_or_static_row& update, const clustering_or_static_row& existing) const;
-void create_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, gc_clock::time_point now);
-void delete_old_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now);
-void do_delete_old_entry(const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now);
-void update_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, const clustering_or_static_row& existing, gc_clock::time_point now);
+void create_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, gc_clock::time_point now, row_marker update_marker);
+void delete_old_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now, api::timestamp_type deletion_ts);
+void do_delete_old_entry(const partition_key& base_key, const clustering_or_static_row& existing, const clustering_or_static_row& update, gc_clock::time_point now, api::timestamp_type deletion_ts);
+void update_entry(data_dictionary::database db, const partition_key& base_key, const clustering_or_static_row& update, const clustering_or_static_row& existing, gc_clock::time_point now, row_marker update_marker);
 void update_entry_for_computed_column(const partition_key& base_key, const clustering_or_static_row& update, const std::optional<clustering_or_static_row>& existing, gc_clock::time_point now);
};


@@ -12,15 +12,16 @@ Architecture: any
Description: Scylla database main configuration file
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.
-Replaces: %{product}-server (<< 1.1)
+Replaces: %{product}-server (<< 1.1), scylla-enterprise-conf (<< 2025.1.0~)
 Conflicts: %{product}-server (<< 1.1)
+Breaks: scylla-enterprise-conf (<< 2025.1.0~)
Package: %{product}-server
Architecture: any
Depends: ${misc:Depends}, %{product}-conf (= ${binary:Version}), %{product}-python3 (= ${binary:Version})
-Replaces: %{product}-tools (<<5.5)
-Breaks: %{product}-tools (<<5.5)
-Description: Scylla database server binaries
+Replaces: %{product}-tools (<<5.5), scylla-enterprise-tools (<< 2024.2.0~), scylla-enterprise-server (<< 2025.1.0~)
+Breaks: %{product}-tools (<<5.5), scylla-enterprise-tools (<< 2024.2.0~), scylla-enterprise-server (<< 2025.1.0~)
 Description: Scylla database server binaries
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.
@@ -29,6 +30,8 @@ Section: debug
Priority: extra
Architecture: any
Depends: %{product}-server (= ${binary:Version}), ${misc:Depends}
Replaces: scylla-enterprise-server-dbg (<< 2025.1.0~)
Breaks: scylla-enterprise-server-dbg (<< 2025.1.0~)
Description: debugging symbols for %{product}-server
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.
@@ -37,13 +40,17 @@ Description: debugging symbols for %{product}-server
Package: %{product}-kernel-conf
Architecture: any
Depends: procps
Replaces: scylla-enterprise-kernel-conf (<< 2025.1.0~)
Breaks: scylla-enterprise-kernel-conf (<< 2025.1.0~)
Description: Scylla kernel tuning configuration
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.
Package: %{product}-node-exporter
Architecture: any
Replaces: scylla-enterprise-node-exporter (<< 2025.1.0~)
Conflicts: prometheus-node-exporter
Breaks: scylla-enterprise-node-exporter (<< 2025.1.0~)
Description: Prometheus exporter for machine metrics
Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
@@ -54,6 +61,49 @@ Depends: %{product}-server (= ${binary:Version})
, %{product}-kernel-conf (= ${binary:Version})
, %{product}-node-exporter (= ${binary:Version})
, %{product}-cqlsh (= ${binary:Version})
Replaces: scylla-enterprise (<< 2025.1.0~)
Breaks: scylla-enterprise (<< 2025.1.0~)
Description: Scylla database metapackage
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.
Package: scylla-enterprise-conf
Depends: %{product}-conf (= ${binary:Version})
Architecture: all
Priority: optional
Section: oldlibs
Description: transitional package
This is a transitional package. It can safely be removed.
Package: scylla-enterprise-server
Depends: %{product}-server (= ${binary:Version})
Architecture: all
Priority: optional
Section: oldlibs
Description: transitional package
This is a transitional package. It can safely be removed.
Package: scylla-enterprise
Depends: %{product} (= ${binary:Version})
Architecture: all
Priority: optional
Section: oldlibs
Description: transitional package
This is a transitional package. It can safely be removed.
Package: scylla-enterprise-kernel-conf
Depends: %{product}-kernel-conf (= ${binary:Version})
Architecture: all
Priority: optional
Section: oldlibs
Description: transitional package
This is a transitional package. It can safely be removed.
Package: scylla-enterprise-node-exporter
Depends: %{product}-node-exporter (= ${binary:Version})
Architecture: all
Priority: optional
Section: oldlibs
Description: transitional package
This is a transitional package. It can safely be removed.


@@ -11,6 +11,8 @@ endif
product := $(subst -server,,$(DEB_SOURCE))
libreloc_list := $(shell find scylla/libreloc/ -maxdepth 1 -type f -not -name .*.hmac -and -not -name gnutls.config -printf '-X%f ')
libexec_list := $(shell find scylla/libexec/ -maxdepth 1 -type f -not -name scylla -and -not -name iotune -printf '-X%f ')
override_dh_auto_configure:
override_dh_auto_build:
@@ -38,7 +40,7 @@ endif
override_dh_strip:
# The binaries (ethtool...patchelf) don't pass dh_strip after going through patchelf. Since they are
# already stripped, nothing is lost if we exclude them, so that's what we do.
-	dh_strip -Xlibprotobuf.so.15 -Xld.so -Xethtool -Xgawk -Xgzip -Xhwloc-calc -Xhwloc-distrib -Xifconfig -Xlscpu -Xnetstat -Xpatchelf --dbg-package=$(product)-server-dbg
+	dh_strip $(libreloc_list) $(libexec_list) --dbg-package=$(product)-server-dbg
find $(CURDIR)/debian/$(product)-server-dbg/usr/lib/debug/.build-id/ -name "*.debug" -exec objcopy --decompress-debug-sections {} \;
override_dh_makeshlibs:


@@ -21,6 +21,7 @@ opt/scylladb/scyllatop/*
opt/scylladb/scripts/libexec/*
opt/scylladb/bin/*
opt/scylladb/libreloc/*
opt/scylladb/libreloc/.*.hmac
opt/scylladb/libexec/*
usr/lib/scylla/*
var/lib/scylla/data


@@ -13,7 +13,8 @@ Requires: %{product}-python3 = %{version}-%{release}
Requires: %{product}-kernel-conf = %{version}-%{release}
Requires: %{product}-node-exporter = %{version}-%{release}
Requires: %{product}-cqlsh = %{version}-%{release}
Obsoletes: scylla-server < 1.1
Provides: scylla-enterprise = %{version}-%{release}
Obsoletes: scylla-enterprise < 2025.1.0
%global _debugsource_template %{nil}
%global _debuginfo_subpackages %{nil}
@@ -73,6 +74,10 @@ Requires: %{product}-python3 = %{version}-%{release}
AutoReqProv: no
Provides: %{product}-tools:%{_bindir}/nodetool
Provides: %{product}-tools:%{_sysconfigdir}/bash_completion.d/nodetool-completion
Provides: scylla-enterprise-tools:%{_bindir}/nodetool
Provides: scylla-enterprise-tools:%{_sysconfigdir}/bash_completion.d/nodetool-completion
Provides: scylla-enterprise-server = %{version}-%{release}
Obsoletes: scylla-enterprise-server < 2025.1.0
%description server
This package contains ScyllaDB server.
@@ -132,6 +137,7 @@ ln -sfT /etc/scylla /var/lib/scylla/conf
/opt/scylladb/scyllatop/*
/opt/scylladb/bin/*
/opt/scylladb/libreloc/*
/opt/scylladb/libreloc/.*.hmac
/opt/scylladb/libexec/*
%{_prefix}/lib/scylla/*
%attr(0755,scylla,scylla) %dir %{_sharedstatedir}/scylla/
@@ -156,6 +162,8 @@ ln -sfT /etc/scylla /var/lib/scylla/conf
Group: Applications/Databases
Summary: Scylla configuration package
Obsoletes: scylla-server < 1.1
Provides: scylla-enterprise-conf = %{version}-%{release}
Obsoletes: scylla-enterprise-conf < 2025.1.0
%description conf
This package contains the main scylla configuration file.
@@ -176,6 +184,8 @@ Summary: Scylla configuration package for the Linux kernel
Requires: kmod
# tuned overwrites our sysctl settings
Obsoletes: tuned >= 2.11.0
Provides: scylla-enterprise-kernel-conf = %{version}-%{release}
Obsoletes: scylla-enterprise-kernel-conf < 2025.1.0
%description kernel-conf
This package contains Linux kernel configuration changes for the Scylla database. Install this package
@@ -212,6 +222,8 @@ Group: Applications/Databases
Summary: Prometheus exporter for machine metrics
License: ASL 2.0
URL: https://github.com/prometheus/node_exporter
Provides: scylla-enterprise-node-exporter = %{version}-%{release}
Obsoletes: scylla-enterprise-node-exporter < 2025.1.0
%description node-exporter
Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.


@@ -80,9 +80,12 @@ class SwaggerProcessor():
 def custom_pathto(app, docname, typ=None, anchor=None):
     current_doc = app.env.docname
     current_version = os.environ.get('SPHINX_MULTIVERSION_NAME', '')
+    flag = os.environ.get('FLAG', 'manual')
     if current_version:
-        return "/" + current_version + "/" + docname
+        prefix = "/manual/" if flag == 'manual' else "/"
+        return f"{prefix}{current_version}/{docname}"
     return relative_uri(app.builder.get_target_uri(current_doc), docname) + (('#' + anchor) if anchor else '')
def setup(app):


@@ -187,8 +187,8 @@ ATTACH SERVICE_LEVEL oltp TO bob;
Note that `alternator_enforce_authorization` has to be enabled in Scylla configuration.
See [Authorization](##Authorization) section to learn more about roles and authorization.
-See <https://enterprise.docs.scylladb.com/stable/using-scylla/workload-prioritization.html>
-to read about **Workload Prioritization** in detail.
+See [Workload Prioritization](../features/workload-prioritization)
+to read about Workload Prioritization in detail.
## Metrics
@@ -272,12 +272,6 @@ behave the same in Alternator. However, there are a few features which we have
not implemented yet. Unimplemented features return an error when used, so
they should be easy to detect. Here is a list of these unimplemented features:
-* Currently in Alternator, a GSI (Global Secondary Index) can only be added
-  to a table at table creation time. DynamoDB allows adding a GSI (but not an
-  LSI) to an existing table using an UpdateTable operation, and similarly it
-  allows removing a GSI from a table.
-  <https://github.com/scylladb/scylla/issues/11567>
* GSI (Global Secondary Index) and LSI (Local Secondary Index) may be
configured to project only a subset of the base-table attributes to the
index. This option is not yet respected by Alternator - all attributes
@@ -319,7 +313,7 @@ they should be easy to detect. Here is a list of these unimplemented features:
RestoreTableToPointInTime
* DynamoDB's encryption-at-rest settings are not supported. The Encryption-
-  at-rest feature is available in Scylla Enterprise, but needs to be
+  at-rest feature is available in ScyllaDB, but needs to be
enabled and configured separately, not through the DynamoDB API.
* No support for throughput accounting or capping. As mentioned above, the
@@ -378,3 +372,14 @@ they should be easy to detect. Here is a list of these unimplemented features:
that can be used to forbid table deletion. This table option was added to
DynamoDB in March 2023.
<https://github.com/scylladb/scylla/issues/14482>
* Alternator does not support the table option WarmThroughput that can be
used to check or guarantee that the database has "warmed" to handle a
particular throughput. This table option was added to DynamoDB in
November 2024.
<https://github.com/scylladb/scylladb/issues/21853>
* Alternator does not support the table option MultiRegionConsistency
that can be used to achieve consistent reads on global (multi-region) tables.
This table option was added as a preview to DynamoDB in December 2024.
<https://github.com/scylladb/scylladb/issues/21852>


@@ -144,3 +144,46 @@ If a certain data center or rack has no functional nodes, or doesn't even
exist, an empty list (`[]`) is returned by the `/localnodes` request.
A client should be prepared to consider expanding the node search to an
entire data center, or other data centers, in that case.
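A client-side fallback along these lines can be sketched as follows (a minimal illustration only; `pick_nodes` and its input are hypothetical helpers, not part of the Alternator API):

```python
def pick_nodes(candidates):
    """Given candidate node lists from successive /localnodes queries, ordered
    from narrowest scope (e.g. this rack) to widest (e.g. all data centers),
    return the first non-empty list, or [] if every scope came back empty."""
    for nodes in candidates:
        if nodes:
            return nodes
    return []

# Example: the local rack returned [], so the client widens to the whole DC.
pick_nodes([[], ["10.0.0.5", "10.0.0.6"]])
```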
## Tablets
"Tablets" are ScyllaDB's new approach to replicating data across a cluster.
It replaces the older approach which was named "vnodes". Compared to vnodes,
tablets are smaller pieces of tables that are easier to move between nodes,
and allow for faster growing or shrinking of the cluster when needed.
In this version, tablet support is incomplete and not all of the features
which Alternator needs are supported with tablets. So currently, new
Alternator tables default to using vnodes - not tablets.
However, if you do want to create an Alternator table which uses tablets,
you can do this by specifying the `experimental:initial_tablets` tag in
the CreateTable operation. The value of this tag can be:
* Any valid integer as the value of this tag enables tablets.
Typically the number "0" is used - which tells ScyllaDB to pick a reasonable
number of initial tablets. But any other number can be used, and this
number overrides the default choice of initial number of tablets.
* Any non-integer value - e.g., the string "none" - creates the table
without tablets - i.e., using vnodes.
The `experimental:initial_tablets` tag only has any effect while creating
a new table with CreateTable - changing it later has no effect.
Because the tablets support is incomplete, when tablets are enabled for an
Alternator table, the following features will not work for this table:
* The table must have one of the write isolation modes which does not
  use LWT, because LWT is not supported with tablets. The allowed write
  isolation modes are `forbid_rmw` or `unsafe_rmw`.
Setting the isolation mode to `always_use_lwt` will succeed, but the writes
themselves will fail with an InternalServerError. At that point you can
still change the write isolation mode of the table to a supported mode.
See <https://github.com/scylladb/scylladb/issues/18068>.
* Enabling TTL with UpdateTimeToLive doesn't work (results in an error).
See <https://github.com/scylladb/scylla/issues/16567>.
* Enabling Streams with CreateTable or UpdateTable doesn't work
(results in an error).
See <https://github.com/scylladb/scylla/issues/16317>.
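The `experimental:initial_tablets` tag described above is passed in the `Tags` of the CreateTable request. A minimal sketch of building such a request (the helper, the one-column schema, and the billing mode are illustrative choices, not prescribed by Alternator; the tag key is the one documented above):

```python
def create_table_request(name, initial_tablets=None):
    """Build DynamoDB CreateTable parameters (e.g. for boto3's
    client.create_table(**params)). Passing an integer for initial_tablets
    opts the new Alternator table into tablets; 0 lets ScyllaDB pick the
    initial tablet count. None omits the tag, so the table uses vnodes."""
    params = {
        "TableName": name,
        "KeySchema": [{"AttributeName": "p", "KeyType": "HASH"}],
        "AttributeDefinitions": [{"AttributeName": "p", "AttributeType": "S"}],
        "BillingMode": "PAY_PER_REQUEST",
    }
    if initial_tablets is not None:
        params["Tags"] = [{"Key": "experimental:initial_tablets",
                           "Value": str(initial_tablets)}]
    return params
```

Remember that the tag only takes effect at CreateTable time; setting it later does nothing.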


@@ -70,8 +70,6 @@ Set the parameters for :ref:`Leveled Compaction <leveled-compaction-strategy-lcs
Incremental Compaction Strategy (ICS)
=====================================
.. versionadded:: 2019.1.4 Scylla Enterprise
ICS principles of operation are similar to those of STCS, replacing the increasingly larger SSTables in each tier with increasingly longer SSTable runs, modeled after LCS runs but using a larger fragment size of 1 GB by default.
Compaction is triggered when there are two or more runs of roughly the same size. These runs are incrementally compacted with each other, producing a new SSTable run, while incrementally releasing space as soon as each SSTable in the input run is processed and compacted. This method eliminates the high temporary space amplification problem of STCS by limiting the overhead to twice the (constant) fragment size, per shard.


@@ -12,6 +12,7 @@ ScyllaDB Architecture
SSTable <sstable/index/>
Compaction Strategies <compaction/compaction-strategies>
Raft Consensus Algorithm in ScyllaDB </architecture/raft>
Zero-token Nodes </architecture/zero-token-nodes>
* :doc:`Data Distribution with Tablets </architecture/tablets/>` - Tablets in ScyllaDB
@@ -22,5 +23,6 @@ ScyllaDB Architecture
* :doc:`SSTable </architecture/sstable/index/>` - ScyllaDB SSTable 2.0 and 3.0 Format Information
* :doc:`Compaction Strategies </architecture/compaction/compaction-strategies>` - High-level analysis of different compaction strategies
* :doc:`Raft Consensus Algorithm in ScyllaDB </architecture/raft>` - Overview of how Raft is implemented in ScyllaDB.
* :doc:`Zero-token Nodes </architecture/zero-token-nodes>` - Nodes that do not replicate any data.
Learn more about these topics in the `ScyllaDB University: Architecture lesson <https://university.scylladb.com/courses/scylla-essentials-overview/lessons/architecture/>`_.


@@ -15,7 +15,7 @@ SSTable Version Support
- ScyllaDB Enterprise Version
- ScyllaDB Open Source Version
* - 3.x ('me')
-- 2022.2
+- 2022.2 and above
- 5.1 and above
* - 3.x ('md')
- 2021.1


@@ -9,11 +9,7 @@ ScyllaDB SSTable Format
.. include:: _common/sstable_what_is.rst
-* In ScyllaDB 6.0 and above, *me* format is enabled by default.
-* In ScyllaDB Enterprise 2021.1, ScyllaDB 4.3 and above, *md* format is enabled by default.
-* In ScyllaDB 3.1 and above, *mc* format is enabled by default.
+In ScyllaDB 6.0 and above, *me* format is enabled by default.
For more information on each of the SSTable formats, see below:


@@ -12,17 +12,7 @@ ScyllaDB SSTable - 3.x
.. include:: ../_common/sstable_what_is.rst
-* In ScyllaDB 6.0 and above, the ``me`` format is mandatory, and ``md`` format is used only when upgrading from an existing cluster using ``md``. The ``sstable_format`` parameter is ignored if it is set to ``md``.
-* In ScyllaDB 5.1 and above, the ``me`` format is enabled by default.
-* In ScyllaDB 4.3 to 5.0, the ``md`` format is enabled by default.
-* In ScyllaDB 3.1 to 4.2, the ``mc`` format is enabled by default.
-* In ScyllaDB 3.0, the ``mc`` format is disabled by default. You can enable it by adding the ``enable_sstables_mc_format`` parameter set to ``true`` in the ``scylla.yaml`` file. For example:
-.. code-block:: shell
-   enable_sstables_mc_format: true
-.. REMOVE IN FUTURE VERSIONS - Remove the note above in version 5.2.
+In ScyllaDB 6.0 and above, the ``me`` format is mandatory, and ``md`` format is used only when upgrading from an existing cluster using ``md``. The ``sstable_format`` parameter is ignored if it is set to ``md``.
Additional Information
-------------------------


@@ -75,15 +75,7 @@ to a new node.
File-based Streaming
========================
-:label-tip:`ScyllaDB Enterprise`
-File-based streaming is a ScyllaDB Enterprise-only feature that optimizes
-tablet migration.
-In ScyllaDB Open Source, migrating tablets is performed by streaming mutation
-fragments, which involves deserializing SSTable files into mutation fragments
-and re-serializing them back into SSTables on the other node.
-In ScyllaDB Enterprise, migrating tablets is performed by streaming entire
+Migrating tablets is performed by streaming entire
SStables, which does not require (de)serializing or processing mutation fragments.
As a result, less data is streamed over the network, and less CPU is consumed,
especially for data models that contain small cells.
@@ -98,15 +90,15 @@ Enabling Tablets
ScyllaDB now uses tablets by default for data distribution.
Enabling tablets by default when creating new keyspaces is
-controlled by the :confval:`enable_tablets` option. However, tablets only work if
+controlled by the :confval:`tablets_mode_for_new_keyspaces` option. However, tablets only work if
supported on all nodes within the cluster.
 When creating a new keyspace with tablets enabled by default, you can still opt-out
-on a per-keyspace basis. The recommended ``NetworkTopologyStrategy`` for keyspaces
-remains *required* even if tablets are disabled.
+on a per-keyspace basis using ``CREATE KEYSPACE <ks> WITH tablets = {'enabled': false}``,
+unless the :confval:`tablets_mode_for_new_keyspaces` option is set to ``enforced``.
-You can create a keyspace with tablets
-disabled with the ``tablets = {'enabled': false}`` option:
+Note: The recommended ``NetworkTopologyStrategy`` for keyspaces
+remains *required* even if tablets are disabled.
.. code:: cql
@@ -143,9 +135,17 @@ You can create a keyspace with tablets enabled with the ``tablets = {'enabled':
the keyspace schema with ``tablets = { 'enabled': false }`` or
``tablets = { 'enabled': true }``.
.. _tablets-limitations:
Limitations and Unsupported Features
--------------------------------------
.. warning::
If a keyspace has tablets enabled, it must remain :term:`RF-rack-valid <RF-rack-valid keyspace>`
throughout its lifetime. Failing to keep that invariant satisfied may result in data inconsistencies,
performance problems, or other issues.
The following ScyllaDB features are not supported if a keyspace has tablets
enabled:
@@ -157,6 +157,15 @@ enabled:
If you plan to use any of the above features, CREATE your keyspace
:ref:`with tablets disabled <tablets-enable-tablets>`.
The following ScyllaDB features are disabled by default when used with a keyspace
that has tablets enabled:
* Materialized Views (MV)
* Secondary indexes (SI, as it depends on MV)
To enable MV and SI for tablet keyspaces, use the `--experimental-features=views-with-tablets`
configuration option. See :ref:`Views with tablets <admin-views-with-tablets>` for details.
Resharding in keyspaces with tablets enabled has the following limitations:
* ScyllaDB does not support reducing the number of shards after node restart.


@@ -0,0 +1,28 @@
=========================
Zero-token Nodes
=========================
By default, all nodes in a cluster own a set of token ranges and are used to
replicate data. In certain circumstances, you may choose to add a node that
doesn't own any token. Such nodes are referred to as zero-token nodes. They
do not have a copy of the data but only participate in Raft quorum voting.
To configure a zero-token node, set the ``join_ring`` parameter to ``false``.
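For example, in the node's ``scylla.yaml`` (a minimal fragment; all other settings omitted):

.. code-block:: yaml

   # Make this node a zero-token (data-less) member of the cluster.
   join_ring: false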
You can use zero-token nodes in multi-DC deployments to reduce the risk of
losing a quorum of nodes.
See :doc:`Preventing Quorum Loss in Symmetrical Multi-DC Clusters </operating-scylla/procedures/cluster-management/arbiter-dc>` for details.
Note that:
* Zero-token nodes are ignored by drivers, so there is no need to change
the load balancing policy on the clients after adding zero-token nodes
to the cluster.
* Zero-token nodes never store replicated data, so running ``nodetool rebuild``,
  ``nodetool repair``, and ``nodetool cleanup`` can be skipped, as these
  operations do not affect zero-token nodes.
* Racks consisting solely of zero-token nodes are not taken into consideration
when deciding whether a keyspace is :term:`RF-rack-valid <RF-rack-valid keyspace>`.
However, an RF-rack-valid keyspace must have the replication factor equal to 0
in an :doc:`arbiter DC </operating-scylla/procedures/cluster-management/arbiter-dc>`.
Otherwise, it is RF-rack-invalid.


@@ -1,3 +0,0 @@
-By default, a keyspace is created with tablets enabled. The ``tablets`` option
-is used to opt out a keyspace from tablets-based distribution; see :ref:`Enabling Tablets <tablets-enable-tablets>`
-for details.


@@ -170,8 +170,6 @@ LCS options
Incremental Compaction Strategy (ICS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. versionadded:: 2019.1.4 Scylla Enterprise
When using ICS, SSTable runs are put in different buckets depending on their size.
When an SSTable run is bucketed, the average size of the runs in the bucket is compared to the new run, as well as the ``bucket_high`` and ``bucket_low`` levels.


@@ -203,18 +203,6 @@ An example that excludes a datacenter while using ``replication_factor``::
DESCRIBE KEYSPACE excalibur
CREATE KEYSPACE excalibur WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
-.. only:: opensource
-Keyspace storage options :label-caution:`Experimental`
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-By default, SStables of a keyspace are stored locally.
-As an alternative, you can configure your keyspace to be stored
-on Amazon S3 or another S3-compatible object store.
-See :ref:`Keyspace storage options <admin-keyspace-storage-options>` for details.
.. _tablets:
The ``tablets`` property
@@ -232,7 +220,15 @@ sub-option type description
``'initial'`` int The number of tablets to start with
===================================== ====== =============================================
-.. scylladb_include_flag:: tablets-default.rst
+By default, a keyspace is created with tablets enabled. You can use the ``tablets`` option
+to opt out a keyspace from tablets-based distribution.
+You may want to opt out if you plan to use features that are not supported for keyspaces
+with tablets enabled. Keyspaces using tablets must also remain :term:`RF-rack-valid <RF-rack-valid keyspace>`
+throughout their lifetime. See :ref:`Limitations and Unsupported Features <tablets-limitations>`
+for details.
**The ``initial`` sub-option (deprecated)**
A good rule of thumb to calculate initial tablets is to divide the expected total storage used
by tables in this keyspace by (``replication_factor`` * 5GB). For example, if you expect a 30TB
@@ -253,6 +249,14 @@ An example that creates a keyspace with 2048 tablets per table::
See :doc:`Data Distribution with Tablets </architecture/tablets>` for more information about tablets.
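The rule of thumb for the ``initial`` sub-option above can be checked with a few lines (a sketch; the 5 GiB target size follows the text, while rounding up to a power of two is an illustrative assumption, not documented ScyllaDB behavior):

```python
import math

def initial_tablets(expected_storage_bytes, replication_factor,
                    target_tablet_size=5 * 2**30):
    """Rule of thumb from the text: expected total storage divided by
    (replication_factor * 5 GB), here rounded up to the next power of two
    (an assumption for illustration)."""
    raw = expected_storage_bytes / (replication_factor * target_tablet_size)
    return 2 ** max(0, math.ceil(math.log2(raw)))

# 30 TiB of data at RF=3 -> 30 TiB / (3 * 5 GiB) = 2048 initial tablets.
initial_tablets(30 * 2**40, 3)
```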
Keyspace storage options :label-caution:`Experimental`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, SStables of a keyspace are stored locally.
As an alternative, you can configure your keyspace to be stored
on Amazon S3 or another S3-compatible object store.
See :ref:`Keyspace storage options <admin-keyspace-storage-options>` for details.
.. _use-statement:
USE
@@ -285,8 +289,8 @@ For instance::
The supported options are the same as :ref:`creating a keyspace <create-keyspace-statement>`.
-ALTER KEYSPACE with Tablets :label-caution:`Experimental`
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ALTER KEYSPACE with Tablets
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Modifying a keyspace with tablets enabled is possible and doesn't require any special CQL syntax. However, there are some limitations:
@@ -295,6 +299,7 @@ Modifying a keyspace with tablets enabled is possible and doesn't require any sp
- If there's any other ongoing global topology operation, executing the ``ALTER`` statement will fail (with an explicit and specific error) and needs to be repeated.
- The ``ALTER`` statement may take longer than the regular query timeout, and even if it times out, it will continue to execute in the background.
- The replication strategy cannot be modified, as keyspaces with tablets only support ``NetworkTopologyStrategy``.
- The ``ALTER`` statement will fail if it would make the keyspace :term:`RF-rack-invalid <RF-rack-valid keyspace>`.
.. _drop-keyspace-statement:


@@ -225,7 +225,9 @@ CREATE TYPE system.tablet_task_info (
tablet_task_id uuid,
request_time timestamp,
sched_nr bigint,
-    sched_time timestamp
+    sched_time timestamp,
+    repair_hosts_filter text,
+    repair_dcs_filter text,
)
~~~
@@ -255,6 +257,8 @@ Only tables which use tablet-based replication strategy have an entry here.
* `request_time` - The time the request is created.
* `sched_nr` - Number of times the request has been scheduled by the repair scheduler.
* `sched_time` - The time the request has been scheduled by the repair scheduler.
* `repair_hosts_filter` - Repair replicas listed in the comma-separated host_id list.
* `repair_dcs_filter` - Repair replicas listed in the comma-separated DC list.
`repair_scheduler_config` contains configuration for the repair scheduler. It contains the following values:
* `auto_repair_enabled` - When set to true, auto repair is enabled. Disabled by default.


@@ -64,18 +64,20 @@ Briefly:
- `/task_manager/list_module_tasks/{module}` -
lists (by default non-internal) tasks in the module;
- `/task_manager/task_status/{task_id}` -
-  gets the task's status, unregisters the task if it's finished;
+  gets the task's status;
- `/task_manager/abort_task/{task_id}` -
aborts the task if it's abortable;
- `/task_manager/wait_task/{task_id}` -
waits for the task and gets its status;
- `/task_manager/task_status_recursive/{task_id}` -
gets statuses of the task and all its descendants in BFS
-  order, unregisters the task;
+  order;
- `/task_manager/ttl` -
gets or sets new ttl.
- `/task_manager/user_ttl` -
gets or sets new user ttl.
- `/task_manager/drain/{module}` -
unregisters all finished local tasks in the module.
# Virtual tasks


@@ -124,6 +124,9 @@ In addition to specific node states, the entire topology can also be in a tra
it from group 0. We also use this state to rollback a failed bootstrap or decommission.
- `rollback_to_normal` - the decommission or removenode operation failed. Rollback the operation by
moving the node we tried to decommission/remove back to the normal state.
- `lock` - the topology stays in this state until externally changed (to null state), preventing topology
requests from starting. Intended to be used in tests which want to prevent internally-triggered topology
operations during the test.
When a node bootstraps, we create new tokens for it and a new CDC generation
and enter the `commit_cdc_generation` state. Once the generation is committed,
@@ -239,6 +242,8 @@ globally driven by the topology change coordinator and serialized per-tablet. Tr
- rebuild - new tablet replica is rebuilt from existing ones, possibly dropping old replica afterwards (on node removal or replace)
- rebuild_v2 - same as rebuild, but repairs a tablet and streams data from one replica, instead of streaming data from all replicas
- repair - tablet replicas are repaired
Each tablet has its own state machine for keeping state of transition stored in group0 which is part of the tablet state. It involves
@@ -329,6 +334,32 @@ stateDiagram-v2
```
The above state transition state machine is the same for those tablet transition kinds: migration, intranode_migration, rebuild.
In rebuild_v2 transition kind streaming stage is followed by the rebuild_repair stage:
```mermaid
stateDiagram-v2
state if_state <<choice>>
[*] --> allow_write_both_read_old
allow_write_both_read_old --> write_both_read_old
write_both_read_old --> rebuild_repair
rebuild_repair --> streaming
streaming --> write_both_read_new
write_both_read_new --> use_new
use_new --> cleanup
cleanup --> end_migration
end_migration --> [*]
allow_write_both_read_old --> cleanup_target: error
write_both_read_old --> cleanup_target: error
rebuild_repair --> cleanup_target: error
streaming --> cleanup_target: error
write_both_read_new --> if_state: error
if_state --> use_new: more new replicas
if_state --> cleanup_target: more old replicas
cleanup_target --> revert_migration
revert_migration --> [*]
```
The repair tablet transition kind is different. It moves only through the repair and end_repair stages, because no token ownership changes.
The behavioral difference between "migration" and "intranode_migration" transitions is in the way "streaming" stage


@@ -193,6 +193,8 @@ ScyllaDB comes with its own version of the Apache Cassandra client tools, in the
We recommend uninstalling Apache Cassandra before installing :code:`scylla-tools`.
.. TODO Update the example below then a patch release for 2025.1 is available
.. _faq-pinning:
Can I install or upgrade to a patch release other than latest on Debian or Ubuntu?


@@ -18,7 +18,7 @@ For example, consider the following two workloads:
- Slow queries
- In essence - Latency agnostic
-Using Service Level CQL commands, database administrators (working on Scylla Enterprise) can set different workload prioritization levels (levels of service) for each workload without sacrificing latency or throughput.
+Using Service Level CQL commands, database administrators (working on ScyllaDB) can set different workload prioritization levels (levels of service) for each workload without sacrificing latency or throughput.
By assigning each service level to the different roles within your organization, DBAs ensure that each role_ receives the level of service the role requires.
.. _`role` : /operating-scylla/security/rbac_usecase/
@@ -425,7 +425,7 @@ In order for workload prioritization to take effect, application users need to b
Limits
======
-Scylla Enterprise is limited to 8 service levels, including the default one; this means you can create up to 7 service levels.
+ScyllaDB is limited to 8 service levels, including the default one; this means you can create up to 7 service levels.
Additional References


@@ -1,21 +0,0 @@
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
+----------------------------+--------------------+-------+---------------+
| Linux Distributions |Ubuntu | Debian|Rocky / CentOS |
| | | |/ RHEL |
+----------------------------+------+------+------+-------+-------+-------+
| ScyllaDB Version / Version |20.04 |22.04 |24.04 | 11 | 8 | 9 |
+============================+======+======+======+=======+=======+=======+
| 6.2 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
| 6.1 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
* The recommended OS for ScyllaDB Open Source is Ubuntu 22.04.
* All releases are available as a Docker container and EC2 AMI, GCP, and Azure images.
Supported Architecture
-----------------------------
ScyllaDB Open Source supports x86_64 for all versions and AArch64 starting from ScyllaDB 4.6 and nightly builds.
In particular, aarch64 support includes AWS EC2 Graviton.


@@ -110,7 +110,7 @@ Google Compute Engine (GCE)
-----------------------------------
Pick a zone where Haswell CPUs are found. According to Google, local SSDs offer less than 1 ms of latency and up to 680,000 read IOPS and 360,000 write IOPS.
Image with NVMe disk interface is recommended, CentOS 7 for ScyllaDB Enterprise 2020.1 and older, and Ubuntu 20 for 2021.1 and later.
Image with NVMe disk interface is recommended.
(`More info <https://cloud.google.com/compute/docs/disks/local-ssd>`_)
Recommended instances types are `n1-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n1_machines>`_ and `n2-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n2_machines>`_


@@ -4,7 +4,7 @@ ScyllaDB Web Installer for Linux
ScyllaDB Web Installer is a platform-agnostic installation script you can run with ``curl`` to install ScyllaDB on Linux.
See `ScyllaDB Download Center <https://www.scylladb.com/download/#core>`_ for information on manually installing ScyllaDB with platform-specific installation packages.
See :doc:`Install ScyllaDB Linux Packages </getting-started/install-scylla/install-on-linux/>` for information on manually installing ScyllaDB with platform-specific installation packages.
Prerequisites
--------------
@@ -20,44 +20,50 @@ To install ScyllaDB with Web Installer, run:
curl -sSf get.scylladb.com/server | sudo bash
By default, running the script installs the latest official version of ScyllaDB Open Source. You can use the following
options to install a different version or ScyllaDB Enterprise:
.. list-table::
:widths: 20 25 55
:header-rows: 1
* - Option
- Acceptable values
- Description
* - ``--scylla-product``
- ``scylla`` | ``scylla-enterprise``
- Specifies the ScyllaDB product to install: Open Source (``scylla``) or Enterprise (``scylla-enterprise``). The default is ``scylla``.
* - ``--scylla-version``
- ``<version number>``
- Specifies the ScyllaDB version to install. You can specify the major release (``x.y``) to install the latest patch for that version or a specific patch release (``x.y.z``). The default is the latest official version.
By default, running the script installs the latest official version of ScyllaDB.
You can run the command with the ``-h`` or ``--help`` flag to print information about the script.
Examples
===========
Installing a Non-default Version
---------------------------------------
Installing ScyllaDB Open Source 6.0.1:
You can install a version other than the default.
Versions 2025.1 and Later
==============================
Run the command with the ``--scylla-version`` option to specify the version
you want to install.
**Example**
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 2025.1.1
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 6.0.1
Installing the latest patch release for ScyllaDB Open Source 6.0:
Versions Earlier than 2025.1
================================
To install a supported version of *ScyllaDB Enterprise*, run the command with:
* ``--scylla-product scylla-enterprise`` to specify that you want to install
ScyllaDB Enterprise.
* ``--scylla-version`` to specify the version you want to install.
For example:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2024.1
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 6.0
To install a supported version of *ScyllaDB Open Source*, run the command with
the ``--scylla-version`` option to specify the version you want to install.
Installing ScyllaDB Enterprise 2024.1:
For example:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2024.1
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 6.2.1
.. include:: /getting-started/_common/setup-after-install.rst


@@ -1,13 +1,38 @@
OS Support by Linux Distributions and Version
==============================================
The following matrix shows which Linux distributions, containers, and images are supported with which versions of ScyllaDB.
The following matrix shows which Linux distributions, containers, and images
are :ref:`supported <os-support-definition>` with which versions of ScyllaDB.
Where *supported* in this scope means:
+-------------------------------+--------------------------+-------+------------------+---------------+
| Linux Distributions |Ubuntu | Debian| Rocky / CentOS / | Amazon Linux |
| | | | RHEL | |
+-------------------------------+------+------+------------+-------+-------+----------+---------------+
| ScyllaDB Version / OS Version |20.04 |22.04 |24.04 | 11 | 8 | 9 | 2023 |
+===============================+======+======+============+=======+=======+==========+===============+
| Enterprise 2025.1 | |v| | |v| | |v| | |v| | |v| | |v| | |v| |
+-------------------------------+------+------+------------+-------+-------+----------+---------------+
| Enterprise 2024.2 | |v| | |v| | |v| | |v| | |v| | |v| | |v| |
+-------------------------------+------+------+------------+-------+-------+----------+---------------+
| Enterprise 2024.1 | |v| | |v| | |v| ``*`` | |v| | |v| | |v| | |x| |
+-------------------------------+------+------+------------+-------+-------+----------+---------------+
| Open Source 6.2 | |v| | |v| | |v| | |v| | |v| | |v| | |v| |
+-------------------------------+------+------+------------+-------+-------+----------+---------------+
``*`` 2024.1.9 and later
All releases are available as a Docker container, EC2 AMI, GCP, and Azure images.
.. _os-support-definition:
*Supported* means that:
- A binary installation package is available to `download <https://www.scylladb.com/download/>`_.
- The download and install procedures are tested as part of ScyllaDB release process for each version.
- An automated install is included from :doc:`ScyllaDB Web Installer for Linux tool </getting-started/installation-common/scylla-web-installer>` (for latest versions)
- The download and install procedures are tested as part of the ScyllaDB release process for each version.
- An automated install is included from :doc:`ScyllaDB Web Installer for Linux tool </getting-started/installation-common/scylla-web-installer>` (for the latest versions).
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_
on other x86_64 or aarch64 platforms, without any guarantees.
.. scylladb_include_flag:: os-support-info.rst


@@ -8,7 +8,7 @@ ScyllaDB Requirements
:hidden:
system-requirements
os-support
OS Support <os-support>
Cloud Instance Recommendations <cloud-instance-recommendations>
scylla-in-a-shared-environment


@@ -2,19 +2,6 @@
ScyllaDB Seed Nodes
===================
**Topic: ScyllaDB Seed Nodes Overview**
**Learn: What a seed node is, and how they should be used in a ScyllaDB Cluster**
**Audience: ScyllaDB Administrators**
What is the Function of a Seed Node in ScyllaDB?
------------------------------------------------
.. note::
Seed nodes function was changed in ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1; if you are running an older version, see :ref:`Older Version Of ScyllaDB <seeds-older-versions>`.
A ScyllaDB seed node is a node specified with the ``seeds`` configuration parameter in ``scylla.yaml``. It is used by a new node joining the cluster as its first contact point.
It allows nodes to discover the cluster ring topology on startup (when joining the cluster). This means that whenever a node joins the cluster, it needs to learn the cluster ring topology, that is:
@@ -22,27 +9,8 @@ It allows nodes to discover the cluster ring topology on startup (when joining t
- Which token ranges are available
- Which nodes will own which tokens when a new node joins the cluster
**Once the nodes have joined the cluster, seed node has no function.**
**Once the nodes have joined the cluster, the seed node has no function.**
The first node in a new cluster needs to be a seed node.
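For example, the seed list is set in ``scylla.yaml`` via the seed provider; a minimal excerpt might look like the following (the address is a placeholder for your first node):

```yaml
# scylla.yaml (excerpt); 198.51.100.1 is a placeholder address
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "198.51.100.1"
```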
.. _seeds-older-versions:
Older Version Of ScyllaDB
-------------------------
In ScyllaDB releases older than ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1, seed node has one more function: it assists with :doc:`gossip </kb/gossip>` convergence.
Gossiping with other nodes ensures that any update to the cluster is propagated across the cluster. This includes detecting and alerting whenever a node goes down, comes back, or is removed from the cluster.
This function was removed, as described in `Seedless NoSQL: Getting Rid of Seed Nodes in ScyllaDB <https://www.scylladb.com/2020/09/22/seedless-nosql-getting-rid-of-seed-nodes-in-scylla/>`_.
If you run an older ScyllaDB release, we recommend upgrading to version 4.3 (ScyllaDB Open Source) or 2021.1 (ScyllaDB Enterprise) or later. If you choose to run an older version, it is good practice to follow these guidelines:
* The first node in a new cluster needs to be a seed node.
* Ensure that all nodes in the cluster have the same seed nodes listed in each node's scylla.yaml.
* To maintain resiliency of the cluster, it is recommended to have more than one seed node in the cluster.
* If you have more than one seed in a DC with multiple racks (or availability zones), make sure to put your seeds in different racks.
* You must have at least one node that is not a seed node. You cannot create a cluster where all nodes are seed nodes.
* You should have more than one seed node.
The first node in a new cluster must be a seed node. In typical scenarios,
there's no need to configure more than one seed node.


@@ -8,7 +8,6 @@
* :doc:`cassandra-stress </operating-scylla/admin-tools/cassandra-stress/>` A tool for benchmarking and load testing ScyllaDB and Cassandra clusters.
* :doc:`SSTabledump </operating-scylla/admin-tools/sstabledump>`
* :doc:`SSTableMetadata </operating-scylla/admin-tools/sstablemetadata>`
* configuration_encryptor - :doc:`encrypt at rest </operating-scylla/security/encryption-at-rest>` sensitive ScyllaDB configuration entries using a system key.
* scylla local-file-key-generator - Generate a local file (system) key for :doc:`encryption at rest </operating-scylla/security/encryption-at-rest>`, with the provided length, Key algorithm, Algorithm block mode and Algorithm padding method.
* `scyllatop <https://www.scylladb.com/2016/03/22/scyllatop/>`_ - A terminal-based, top-like tool for ScyllaDB collectd/Prometheus metrics.
* :doc:`scylla_dev_mode_setup</getting-started/installation-common/dev-mod>` - run ScyllaDB in Developer Mode.


@@ -3,275 +3,6 @@ Cassandra Stress
The cassandra-stress tool is used for benchmarking and load testing both ScyllaDB and Cassandra clusters. The cassandra-stress tool also supports testing arbitrary CQL tables and queries to allow users to benchmark their data model.
This documentation focuses on user mode as this allows the testing of your actual schema.
Usage
-----
There are several operation types:
* write-only, read-only, and mixed workloads of standard data
* write-only and read-only workloads for counter columns
* user configured workloads, running custom queries on custom schemas
* The syntax is cassandra-stress <command> [options]. If you want more information on a given command or options, just run cassandra-stress help
Commands:
read: Multiple concurrent reads - the cluster must first be populated by a write test.
write: Multiple concurrent writes against the cluster.
mixed: Interleaving of any basic commands, with configurable ratio and distribution - the cluster must first be populated by a write test.
counter_write: Multiple concurrent updates of counters.
counter_read: Multiple concurrent reads of counters. The cluster must first be populated by a counter_write test.
user: Interleaving of user provided queries, with configurable ratio and distribution.
help: Print help for a command or option.
print: Inspect the output of a distribution definition.
legacy: Legacy support mode.
Primary Options:
-pop: Population distribution and intra-partition visit order.
-insert: Insert specific options relating to various methods for batching and splitting partition updates.
-col: Column details such as size and count distribution, data generator, names, comparator and if super columns should be used.
-rate: Thread count, rate limit or automatic mode (default is auto).
-mode: CQL with options.
-errors: How to handle errors when encountered during stress.
-sample: Specify the number of samples to collect for measuring latency.
-schema: Replication settings, compression, compaction, etc.
-node: Nodes to connect to.
-log: Where to log progress to, and the interval at which to do it.
-transport: Custom transport factories.
-port: The port to connect to cassandra nodes on.
-sendto: Specify a stress server to send this command to.
-graph: Graph recorded metrics.
-tokenrange: Token range settings.
User mode
---------
User mode allows you to stress your own schemas. This can save time in the long run rather than building an application and then realising your schema doesn't scale.
Profile
.......
User mode requires a profile defined in YAML. Multiple YAML files may be specified in which case operations in the ops argument are referenced as specname.opname.
An identifier for the profile:
.. code-block:: yaml
specname: staff_activities
The keyspace for the test:
.. code-block:: yaml
keyspace: staff
CQL for the keyspace. Optional if the keyspace already exists:
.. code-block:: yaml
keyspace_definition: |
CREATE KEYSPACE staff WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
The table to be stressed:
.. code-block:: yaml
table: staff_activities
CQL for the table. Optional if the table already exists:
.. code-block:: yaml
table_definition: |
CREATE TABLE staff_activities (
name text,
when timeuuid,
what text,
PRIMARY KEY(name, when, what)
)
Optional meta information on the generated columns in the above table. The min and max only apply to text and blob types. The distribution field represents the total unique population distribution of that column across rows:
.. code-block:: yaml
columnspec:
- name: name
size: uniform(5..10) # The names of the staff members are between 5-10 characters
population: uniform(1..10) # 10 possible staff members to pick from
- name: when
cluster: uniform(20..500) # Staff members do between 20 and 500 events
- name: what
size: normal(10..100,50)
Supported types are:
An exponential distribution over the range [min..max]:
.. code-block:: cql
EXP(min..max)
An extreme value (Weibull) distribution over the range [min..max]:
.. code-block:: cql
EXTREME(min..max,shape)
A gaussian/normal distribution, where mean=(min+max)/2, and stdev is (mean-min)/stdvrng:
.. code-block:: cql
GAUSSIAN(min..max,stdvrng)
A gaussian/normal distribution, with explicitly defined mean and stdev:
.. code-block:: cql
GAUSSIAN(min..max,mean,stdev)
A uniform distribution over the range [min, max]:
.. code-block:: cql
UNIFORM(min..max)
A fixed distribution, always returning the same value:
.. code-block:: cql
FIXED(val)
If preceded by ~, the distribution is inverted
Defaults for all columns are size: uniform(4..8), population: uniform(1..100B), cluster: fixed(1)
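The semantics of a few of these distribution specs can be illustrated with a short sketch. This is not cassandra-stress's actual implementation; the function and its clamping behavior are illustrative assumptions:

```python
import random

def sample(spec_name, lo, hi, stdvrng=5):
    """Illustrative sampling for a few cassandra-stress-style specs.

    UNIFORM(lo..hi)  -> uniform over [lo, hi]
    GAUSSIAN(lo..hi) -> normal with mean=(lo+hi)/2, stdev=(mean-lo)/stdvrng
    FIXED(lo)        -> always lo
    """
    if spec_name == "UNIFORM":
        return random.randint(lo, hi)
    if spec_name == "GAUSSIAN":
        mean = (lo + hi) / 2
        stdev = (mean - lo) / stdvrng
        # Clamp to the declared range; a raw normal draw can fall outside it.
        return min(hi, max(lo, round(random.gauss(mean, stdev))))
    if spec_name == "FIXED":
        return lo
    raise ValueError(f"unknown spec: {spec_name}")

# e.g. size: uniform(5..10) from the columnspec above
print(sample("UNIFORM", 5, 10))
```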
Insert distributions:
.. code-block:: yaml
insert:
# How many partitions to insert per batch
partitions: fixed(1)
# How many rows to update per partition
select: fixed(1)/500
# UNLOGGED or LOGGED batch for insert
batchtype: UNLOGGED
Currently all inserts are done inside batches.
Read statements to use during the test:
.. code-block:: yaml
queries:
events:
cql: select * from staff_activities where name = ?
fields: samerow
latest_event:
cql: select * from staff_activities where name = ? LIMIT 1
fields: samerow
Running a user mode test:
.. code-block:: console
cassandra-stress user profile=./example.yaml duration=1m "ops(insert=1,latest_event=1,events=1)" truncate=once
This will create the schema then run tests for 1 minute with an equal number of inserts, latest_event queries and events queries. Additionally the table will be truncated once before the test.
The full example can be found in the example yaml file.
Running a user mode test with multiple yaml files:
.. code-block:: console
cassandra-stress user profile=./example.yaml,./example2.yaml duration=1m "ops(ex1.insert=1,ex1.latest_event=1,ex2.insert=2)" truncate=once
This will run operations as specified in both the example.yaml and example2.yaml files. example.yaml and example2.yaml can reference the same table
although care must be taken that the table definition is identical (data generation specs can be different).
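Putting the pieces above together, a minimal complete profile might look like the following sketch (illustrative only; the keyspace, table, and query names are the ones used in the snippets above):

```yaml
# example.yaml - minimal cassandra-stress user-mode profile
specname: staff_activities
keyspace: staff
keyspace_definition: |
  CREATE KEYSPACE staff WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
table: staff_activities
table_definition: |
  CREATE TABLE staff_activities (
      name text,
      when timeuuid,
      what text,
      PRIMARY KEY(name, when, what)
  )
columnspec:
  - name: name
    size: uniform(5..10)
    population: uniform(1..10)
  - name: when
    cluster: uniform(20..500)
  - name: what
    size: normal(10..100,50)
insert:
  partitions: fixed(1)
  select: fixed(1)/500
  batchtype: UNLOGGED
queries:
  events:
    cql: select * from staff_activities where name = ?
    fields: samerow
```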
.. Lightweight transaction support
.. ...............................
.. cassandra-stress supports lightweight transactions. In this it will first read current data from Cassandra and then uses read value(s) to fulfill lightweight transaction condition(s).
.. Lightweight transaction update query:
.. .. code-block:: cql
.. queries:
.. regularupdate:
.. cql: update blogposts set author = ? where domain = ? and published_date = ?
.. fields: samerow
.. updatewithlwt:
.. cql: update blogposts set author = ? where domain = ? and published_date = ? IF body = ? AND url = ?
.. fields: samerow
Graphing
........
Graphs can be generated for each run of stress.
.. image:: example-stress-graph.png
To create a new graph:
.. code-block:: console
cassandra-stress user profile=./stress-example.yaml "ops(insert=1,latest_event=1,events=1)" -graph file=graph.html title="Awesome graph"
To add a new run to an existing graph point to an existing file and add a revision name:
.. code-block:: console
cassandra-stress user profile=./stress-example.yaml duration=1m "ops(insert=1,latest_event=1,events=1)" -graph file=graph.html title="Awesome graph" revision="Second run"
FAQ
...
How do you use NetworkTopologyStrategy for the keyspace?
Use the schema option making sure to either escape the parentheses or enclose them in quotes:
.. code-block:: console
cassandra-stress write -schema "replication(strategy=NetworkTopologyStrategy,datacenter1=3)"
How do you use SSL?
Use the transport option:
.. code-block:: console
cassandra-stress "write n=100k cl=ONE no-warmup" -transport "truststore=$HOME/jks/truststore.jks truststore-password=cassandra"
Cassandra Stress is no longer part of ScyllaDB and is no longer distributed alongside it. It has its own separate repository and release cycle. More information about it can be found on `GitHub <https://github.com/scylladb/cassandra-stress>`_ or on `DockerHub <https://hub.docker.com/r/scylladb/cassandra-stress>`_.
.. include:: /rst_include/apache-copyrights.rst


@@ -5,7 +5,7 @@ Bulk loads SSTables from a directory to a ScyllaDB cluster via the **CQL API**.
.. warning::
SSTableLoader is deprecated since ScyllaDB 6.2 and will be removed in the next release.
SSTableLoader is deprecated and will be removed in a future release.
Please consider switching to :doc:`nodetool refresh --load-and-stream </operating-scylla/nodetool-commands/refresh>`.
.. note::


@@ -74,13 +74,13 @@ API calls
- *keyspace* - if set, tasks are filtered to contain only the ones working on this keyspace;
- *table* - if set, tasks are filtered to contain only the ones working on this table;
* ``/task_manager/task_status/{task_id}`` - gets the task's status, unregisters the task if it's finished;
* ``/task_manager/task_status/{task_id}`` - gets the task's status;
* ``/task_manager/abort_task/{task_id}`` - aborts the task if it's abortable, otherwise 403 status code is returned;
* ``/task_manager/wait_task/{task_id}`` - waits for the task and gets its status (does not unregister the tasks); query params:
* ``/task_manager/wait_task/{task_id}`` - waits for the task and gets its status; query params:
- *timeout* - timeout in seconds; if set - 408 status code is returned if waiting times out;
* ``/task_manager/task_status_recursive/{task_id}`` - gets statuses of the task and all its descendants in BFS order, unregisters the root task;
* ``/task_manager/task_status_recursive/{task_id}`` - gets statuses of the task and all its descendants in BFS order;
* ``/task_manager/ttl`` - gets or sets new ttl; query params (if setting):
- *ttl* - new ttl value.
@@ -89,6 +89,8 @@ API calls
- *user_ttl* - new user ttl value.
* ``/task_manager/drain/{module}`` - unregisters all finished local tasks in the module.
Cluster tasks are not unregistered from task manager with API calls.
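The endpoints listed above can be composed into request URLs; a small sketch follows (the base address assumes the REST API's default ``localhost:10000``, and the helper function itself is hypothetical):

```python
def task_url(op, task_id=None, base="http://localhost:10000", **params):
    """Build a task_manager REST URL for the API calls listed above."""
    path = f"{base}/task_manager/{op}"
    if task_id is not None:
        path += f"/{task_id}"
    if params:
        query = "&".join(f"{k}={v}" for k, v in params.items())
        path += f"?{query}"
    return path

# Wait up to 60 seconds for a task; the server returns 408 on timeout.
print(task_url("wait_task", "5adbc93a", timeout=60))
```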
Tasks API


@@ -257,8 +257,6 @@ ScyllaDB uses experimental flags to expose non-production-ready features safely.
In recent ScyllaDB versions, these features are controlled by the ``experimental_features`` list in scylla.yaml, allowing one to choose which experimental features to enable.
Use ``scylla --help`` to get the list of experimental features.
ScyllaDB Enterprise and ScyllaDB Cloud do not officially support experimental Features.
.. _admin-keyspace-storage-options:
Keyspace storage options
@@ -286,6 +284,24 @@ Before creating keyspaces with object storage, you also need to
:ref:`configure <object-storage-configuration>` the object storage
credentials and endpoint.
.. _admin-views-with-tablets:
Views with tablets
------------------
By default, Materialized Views (MV) and Secondary Indexes (SI)
are disabled in keyspaces that use tablets.
Support for MV and SI with tablets is experimental and must be explicitly
enabled in the ``scylla.yaml`` configuration file by specifying
the ``views-with-tablets`` option:
.. code-block:: yaml
experimental_features:
- views-with-tablets
Monitoring
==========
ScyllaDB exposes interfaces for online monitoring, as described below.


@@ -0,0 +1,14 @@
Nodetool cluster
================
.. toctree::
:hidden:
repair <repair>
**cluster** - Nodetool supercommand for running cluster operations.
Supported cluster suboperations
-------------------------------
* :doc:`repair </operating-scylla/nodetool-commands/cluster/repair>` :code:`<keyspace>` :code:`<table>` - Repair one or more tablet tables.


@@ -0,0 +1,73 @@
Nodetool cluster repair
=======================
**cluster repair** - A process that runs in the background and synchronizes the data between nodes. It only repairs keyspaces with tablets enabled (default).
To repair keyspaces with tablets disabled (vnodes-based), see :doc:`nodetool repair </operating-scylla/nodetool-commands/repair/>`.
Running ``cluster repair`` on a **single node** synchronizes all data on all nodes in the cluster.
To synchronize all data in clusters that have both tablets-based and vnodes-based keyspaces, run :doc:`nodetool repair -pr </operating-scylla/nodetool-commands/repair/>` on **all**
of the nodes in the cluster, and :doc:`nodetool cluster repair </operating-scylla/nodetool-commands/cluster/repair/>` on **any** of the nodes in the cluster.
To check whether a keyspace has tablets enabled, use:
.. code-block:: cql
DESCRIBE KEYSPACE `keyspace_name`
The ScyllaDB node ensures mutual exclusivity between tablet repair and maintenance operations (add/remove/decommission/replace/rebuild).
The ``nodetool cluster repair`` command supports the following options:
- ``-dc`` ``--in-dc`` syncs data between all nodes in a list of Data Centers (DCs).
.. warning:: This command leaves part of the data subset (all remaining DCs) out of sync.
For example:
::
nodetool cluster repair -dc US_DC
nodetool cluster repair --in-dc US_DC,EU_DC
- ``-hosts`` ``--in-hosts`` syncs the data only between a list of nodes, using host ID.
.. warning:: This command leaves part of the data subset (on nodes that are *not* listed) out of sync.
For example:
::
nodetool cluster repair -hosts cdc295d7-c076-4b07-af69-1385fefdb40b,2dbdf288-9e73-11ea-bb37-0242ac130002
nodetool cluster repair --in-hosts cdc295d7-c076-4b07-af69-1385fefdb40b,2dbdf288-9e73-11ea-bb37-0242ac130002,3a5993f8-9e73-11ea-bb37-0242ac130002
- ``--tablet-tokens`` selects which tablets to repair. When the listed token belongs to a tablet, the whole tablet that owns the token will be repaired. By default, all tablets are repaired.
.. warning:: This command leaves part of the data subset (tokens that are *not* listed) out of sync.
For example:
::
nodetool cluster repair --tablet-tokens 1,10474535988
- ``keyspace`` executes a repair on a specific keyspace. The default is all keyspaces.
For example:
::
nodetool cluster repair <my_keyspace>
- ``table`` executes a repair on a specific table or a list of space-delimited table names. The default is all tables.
For example:
::
nodetool cluster repair <my_keyspace> <my_table>
See also `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_.


@@ -1,13 +1,23 @@
Nodetool repair
===============
**Repair** - a process that runs in the background and synchronizes the data between nodes.
**Repair** - A process that runs in the background and synchronizes the data between nodes. It only repairs keyspaces with tablets disabled (vnode-based).
To repair keyspaces with tablets, see :doc:`nodetool cluster repair </operating-scylla/nodetool-commands/cluster/repair/>`.
When running ``nodetool repair`` on a **single node**, it acts as the **repair master**. Only the data contained in the master node and its replicas will be repaired.
Typically, this subset of data is replicated on many nodes in the cluster, often all of them, and the repair process syncs between all the replicas until the master data subset is in sync.
To repair **all** of the data in the cluster, you need to run a repair on **all** of the nodes in the cluster, or let `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_ do it for you.
To synchronize all data in clusters that have both tablets- and vnodes-based keyspaces, run :doc:`nodetool repair -pr </operating-scylla/nodetool-commands/repair/>` on **all**
of the nodes in the cluster, and :doc:`nodetool cluster repair </operating-scylla/nodetool-commands/cluster/repair/>` on **any** of the nodes in the cluster.
To check whether a keyspace has tablets enabled, use:
.. code-block:: cql
DESCRIBE KEYSPACE `keyspace_name`
.. note:: Run the :doc:`nodetool repair </operating-scylla/nodetool-commands/repair/>` command regularly. If you delete data frequently, it should be more often than the value of ``gc_grace_seconds`` (by default: 10 days), for example, every week. Use the **nodetool repair -pr** on each node in the cluster, sequentially.
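Running the repair on each node in turn can be scripted; the sketch below only prints the command it would run, so you can review it first (the host names are placeholders, not real addresses):

```shell
# Review-before-run sketch: print the repair command for each node in turn
# (sequentially, as recommended). Replace the placeholder hosts with your
# node addresses, and drop the 'echo' to actually execute over ssh.
for host in node1.example.com node2.example.com node3.example.com; do
    echo "ssh ${host} nodetool repair -pr"
done
```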


@@ -0,0 +1,21 @@
Nodetool tasks drain
====================
**tasks drain** - Unregisters all finished local tasks from the module.
If a module is not specified, finished tasks in all modules are unregistered.
Syntax
-------
.. code-block:: console
nodetool tasks drain [--module <module>]
Options
-------
* ``--module`` - if set, only the specified module is drained.
For example:
.. code-block:: shell
> nodetool tasks drain --module repair


@@ -5,6 +5,7 @@ Nodetool tasks
:hidden:
abort <abort>
drain <drain>
user-ttl <user-ttl>
list <list>
modules <modules>
@@ -23,15 +24,12 @@ Task Status Retention
* When a task completes, its status is temporarily stored on the executing node
* Status information is retained for up to :confval:`task_ttl_in_seconds` seconds
* The status information of a completed task is automatically removed after being queried with ``tasks status`` or ``tasks tree``
* ``tasks wait`` returns the status, but it does not remove the task information of the queried task
.. note:: Multiple status queries using ``tasks status`` and ``tasks tree`` for the same completed task will only receive a response for the first query, since the status is removed after being retrieved.
Supported tasks suboperations
-----------------------------
* :doc:`abort </operating-scylla/nodetool-commands/tasks/abort>` - Aborts the task.
* :doc:`drain </operating-scylla/nodetool-commands/tasks/drain>` - Unregisters all finished local tasks.
* :doc:`user-ttl </operating-scylla/nodetool-commands/tasks/user-ttl>` - Gets or sets user_task_ttl value.
* :doc:`list </operating-scylla/nodetool-commands/tasks/list>` - Lists tasks in the module.
* :doc:`modules </operating-scylla/nodetool-commands/tasks/modules>` - Lists supported modules.


@@ -1,6 +1,6 @@
Nodetool tasks status
=========================
**tasks status** - Gets the status of a task manager task. If the task was finished it is unregistered.
**tasks status** - Gets the status of a task manager task.
Syntax
-------
@@ -23,10 +23,10 @@ Example output
type: repair
kind: node
scope: keyspace
state: done
state: running
is_abortable: true
start_time: 2024-07-29T15:48:55Z
end_time: 2024-07-29T15:48:55Z
end_time:
error:
parent_id: none
sequence_number: 5


@@ -1,7 +1,7 @@
Nodetool tasks tree
=======================
**tasks tree** - Gets the statuses of a task manager task and all its descendants.
The statuses are listed in BFS order. If the task was finished it is unregistered.
The statuses are listed in BFS order.
If task_id isn't specified, trees of all non-internal tasks are printed
(internal tasks are the ones that have a parent or cover an operation that


@@ -13,6 +13,7 @@ Nodetool
nodetool-commands/checkandrepaircdcstreams
nodetool-commands/cleanup
nodetool-commands/clearsnapshot
nodetool-commands/cluster/index
nodetool-commands/compactionhistory
nodetool-commands/compactionstats
nodetool-commands/compact
@@ -85,6 +86,7 @@ Operations that are not listed below are currently not available.
* :doc:`checkandrepaircdcstreams </operating-scylla/nodetool-commands/checkandrepaircdcstreams/>` - Checks and fixes CDC streams.
* :doc:`cleanup </operating-scylla/nodetool-commands/cleanup/>` - Triggers the immediate cleanup of keys no longer belonging to a node.
* :doc:`clearsnapshot </operating-scylla/nodetool-commands/clearsnapshot/>` - This command removes snapshots.
* :doc:`cluster <nodetool-commands/cluster/index>` - Run a cluster operation.
* :doc:`compactionhistory </operating-scylla/nodetool-commands/compactionhistory/>` - Provides the history of compactions.
* :doc:`compactionstats </operating-scylla/nodetool-commands/compactionstats/>`- Print statistics on compactions.
* :doc:`compact </operating-scylla/nodetool-commands/compact/>`- Force a (major) compaction on one or more column families.
@@ -115,7 +117,7 @@ Operations that are not listed below are currently not available.
* :doc:`rebuild </operating-scylla/nodetool-commands/rebuild/>` :code:`[<src-dc-name>]`- Rebuild data by streaming from other nodes
* :doc:`refresh </operating-scylla/nodetool-commands/refresh/>`- Load newly placed SSTables to the system without restart
* :doc:`removenode </operating-scylla/nodetool-commands/removenode/>`- Remove node with the provided ID
* :doc:`repair <nodetool-commands/repair/>` :code:`<keyspace>` :code:`<table>` - Repair one or more tables
* :doc:`repair <nodetool-commands/repair/>` :code:`<keyspace>` :code:`<table>` - Repair one or more vnode tables.
* :doc:`restore </operating-scylla/nodetool-commands/restore/>` - Load SSTables from a designated bucket in object store into a specified keyspace or table
* :doc:`resetlocalschema </operating-scylla/nodetool-commands/resetlocalschema/>` - Reset the node's local schema.
* :doc:`ring <nodetool-commands/ring/>` - Display the token ring information.

View File

@@ -7,8 +7,8 @@ Even though ScyllaDB is a fault-tolerant system, it is recommended to regularly
* Backup is a per-node procedure. Make sure to back up each node in your
cluster. For cluster-wide backup and restore, see `ScyllaDB Manager <https://manager.docs.scylladb.com/stable/restore/>`_.
* Backup works the same for non-encrypted and encrypted SStables. You can use
`Encryption at Rest <https://enterprise.docs.scylladb.com/stable/operating-scylla/security/encryption-at-rest.html>`_
available in ScyllaDB Enterprise without affecting the backup procedure.
:doc:`Encryption at Rest </operating-scylla/security/encryption-at-rest>`
without affecting the backup procedure.
You can choose one of the following:

View File

@@ -77,7 +77,7 @@ Procedure
.. note::
ScyllaDB Open Source 3.0 and later and ScyllaDB Enterprise 2019.1 and later support :doc:`Materialized View(MV) </features/materialized-views>` and :doc:`Secondary Index(SI) </features/secondary-indexes>`.
ScyllaDB supports :doc:`Materialized View(MV) </features/materialized-views>` and :doc:`Secondary Index(SI) </features/secondary-indexes>`.
When migrating data from Apache Cassandra with MV or SI, you can either:

View File

@@ -1,10 +1,13 @@
.. Note::
Make sure to use the same ScyllaDB **patch release** on the new/replaced node, to match the rest of the cluster. It is not recommended to add a new node with a different release to the cluster.
For example, use the following for installing ScyllaDB patch release (use your deployed version)
For example, use the following for installing ScyllaDB patch release (use your deployed version):
.. code::
sudo yum install scylla-2025.1.0
* ScyllaDB Enterprise - ``sudo yum install scylla-enterprise-2018.1.9``
* ScyllaDB open source - ``sudo yum install scylla-3.0.3``

View File

@@ -202,6 +202,7 @@ Add New DC
#. If you are using ScyllaDB Monitoring, update the `monitoring stack <https://monitoring.docs.scylladb.com/stable/install/monitoring_stack.html#configure-scylla-nodes-from-files>`_ to monitor it. If you are using ScyllaDB Manager, make sure you install the `Manager Agent <https://manager.docs.scylladb.com/stable/install-scylla-manager-agent.html>`_ and Manager can access the new DC.
.. _add-dc-to-existing-dc-not-connect-clients:
Configure the Client not to Connect to the New DC
-------------------------------------------------

View File

@@ -0,0 +1,57 @@
=========================================================
Preventing Quorum Loss in Symmetrical Multi-DC Clusters
=========================================================
ScyllaDB requires at least a quorum (majority) of nodes in a cluster to be up
and communicate with each other. A cluster that loses a quorum can handle reads
and writes of user data, but cluster management operations, such as schema and
topology updates, are impossible.
In clusters that are symmetrical, i.e., have two datacenters (DCs) with the
same number of nodes, losing a quorum may occur if one of the DCs becomes unavailable.
For example, if one DC fails in a 2-DC cluster where each DC has three nodes,
only three out of six nodes are available, and the quorum is lost.
Adding another DC would mitigate the risk of losing a quorum, but it comes
with network and storage costs. To prevent the quorum loss with minimum costs,
you can configure an arbiter (tie-breaker) DC.
An arbiter DC is a datacenter with a :doc:`zero-token node </architecture/zero-token-nodes>`
-- a node that doesn't replicate any data but is only used for Raft quorum
voting. An arbiter DC maintains the cluster quorum if one of the other DCs
fails, while it doesn't incur extra network and storage costs as it has no
user data.
Adding an Arbiter DC
-----------------------
To set up an arbiter DC, follow the procedure to
:doc:`add a new datacenter to an existing cluster </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
When editing the *scylla.yaml* file, set the ``join_ring`` parameter to
``false`` following these guidelines:
* Set ``join_ring=false`` before you start the node(s). If you set that
parameter on a node that has already been bootstrapped and owns a token
range, the node startup will fail. In such a case, you'll need to
:doc:`decommission </operating-scylla/procedures/cluster-management/decommissioning-data-center>`
the node, :doc:`wipe it clean </operating-scylla/procedures/cluster-management/clear-data>`,
and add it back to the arbiter DC properly following
the :doc:`procedure </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
* As a rule, one node is sufficient for an arbiter to serve as a tie-breaker.
In case you add more than one node to the arbiter DC, ensure that you set
``join_ring=false`` on all the nodes in that DC.
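As a sketch, the relevant *scylla.yaml* fragment on each arbiter-DC node could look like this (all other settings omitted; it must be in place before the node's first startup):

```yaml
# scylla.yaml fragment (sketch) for a node in the arbiter DC.
# join_ring: false makes the node a zero-token node: it participates in
# Raft quorum voting but owns no token ranges and stores no user data.
join_ring: false
```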
Follow-up steps:
^^^^^^^^^^^^^^^^^^^
* An arbiter DC has a replication factor of 0 (RF=0) for all keyspaces. You
need to ``ALTER`` the keyspaces to update their RF.
* Since zero-token nodes are ignored by drivers, you can skip
:ref:`configuring the client not to connect to the new DC <add-dc-to-existing-dc-not-connect-clients>`.
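For instance, updating a keyspace's replication factor after adding the arbiter DC might look like the following (the keyspace and DC names are illustrative; the arbiter DC keeps RF=0):

```cql
-- 'dc1' and 'dc2' are data DCs; 'arbiter-dc' is the zero-token DC.
ALTER KEYSPACE my_keyspace WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'dc1' : 3, 'dc2' : 3, 'arbiter-dc' : 0};
```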
References
----------------
* :doc:`Zero-token Nodes </architecture/zero-token-nodes>`
* :doc:`Raft Consensus Algorithm in ScyllaDB </architecture/raft>`
* :doc:`Handling Node Failures </troubleshooting/handling-node-failures>`
* :doc:`Adding a New Data Center Into an Existing ScyllaDB Cluster </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`

View File

@@ -209,6 +209,17 @@ In this example, we will show how to install a nine nodes cluster.
UN 54.187.142.201 109.54 KB 256 ? d99967d6-987c-4a54-829d-86d1b921470f RACK1
UN 54.187.168.20 109.54 KB 256 ? 2329c2e0-64e1-41dc-8202-74403a40f851 RACK1
See also:
--------------------------
Preventing Quorum Loss
--------------------------
If your cluster is symmetrical, i.e., it has an even number of datacenters
with the same number of nodes, consider adding an arbiter DC to mitigate
the risk of losing a quorum at a minimum cost.
See :doc:`Preventing Quorum Loss in Symmetrical Multi-DC Clusters </operating-scylla/procedures/cluster-management/arbiter-dc>`
for details.
------------
See also:
------------
:doc:`Create a ScyllaDB Cluster - Single Data Center (DC) </operating-scylla/procedures/cluster-management/create-cluster>`

View File

@@ -55,7 +55,7 @@ Procedure
cqlsh> DESCRIBE <KEYSPACE_NAME>
cqlsh> CREATE KEYSPACE <KEYSPACE_NAME> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', '<DC_NAME1>' : 3, '<DC_NAME2>' : 3, '<DC_NAME3>' : 3};
cqlsh> ALTER KEYSPACE <KEYSPACE_NAME> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', '<DC_NAME1>' : 3, '<DC_NAME2>' : 3};
cqlsh> ALTER KEYSPACE <KEYSPACE_NAME> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', '<DC_NAME1>' : 3, '<DC_NAME2>' : 3, '<DC_NAME3>' : 0};
For example:
@@ -71,7 +71,7 @@ Procedure
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'EUROPE-DC' : 3};
cqlsh> ALTER KEYSPACE nba WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 0, 'EUROPE-DC' : 3};
#. Run :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` on every node in the data center that is to be removed.
Refer to :doc:`Remove a Node from a ScyllaDB Cluster - Down Scale </operating-scylla/procedures/cluster-management/remove-node>` for further information.

View File

@@ -26,6 +26,8 @@ Cluster Management Procedures
Safely Restart Your Cluster <safe-start>
Handling Membership Change Failures <handling-membership-change-failures>
repair-based-node-operation
Prevent Quorum Loss in Symmetrical Multi-DC Clusters <arbiter-dc>
.. panel-box::
:title: Cluster and DC Creation
@@ -84,6 +86,8 @@ Cluster Management Procedures
* :doc:`Repair Based Node Operations (RBNO) </operating-scylla/procedures/cluster-management/repair-based-node-operation>`
* :doc:`Preventing Quorum Loss in Symmetrical Multi-DC Clusters <arbiter-dc>`
.. panel-box::
:title: Topology Changes
:id: "getting-started"

View File

@@ -5,9 +5,10 @@ Remove a Seed Node from Seed List
This procedure describes how to remove a seed node from the seed list.
.. note::
The seed concept in gossip has been removed. A seed node
is only used by a new node during startup to learn about the cluster topology. As a result, you only need to configure one
seed node in a node's ``scylla.yaml`` file.
A seed node is only used by a new node during startup to learn about the cluster topology.
This means it is sufficient to configure one seed node in a node's ``scylla.yaml`` file.
The first node in a new cluster must be a seed node.
Prerequisites

View File

@@ -3,13 +3,28 @@
Replacing a Dead Seed Node
===========================
.. note::
The seed concept in gossip has been removed.
A seed node is only used by a new node during startup to learn about the cluster topology. As a result, there's no need
to replace the node configured with the ``seeds`` parameter in the ``scylla.yaml`` file.
In ScyllaDB, it is not possible to bootstrap a seed node. The following steps describe how to replace a dead seed node.
.. note::
A seed node is only used by a new node during startup to learn about
the cluster topology.
Once the nodes have joined the cluster, the seed node has no function.
In typical scenarios, there's no need to replace the node
configured with the ``seeds`` parameter in the ``scylla.yaml`` file.
* The first node in a new cluster must be a seed node.
* It is sufficient to configure one seed node in a node's ``scylla.yaml`` file.
You may choose to configure two or three seed nodes if your cluster is large.
* It's not recommended to define all the nodes in the cluster as seed nodes.
* If you update the IP address of a seed node or remove it from the cluster,
you should update configuration files on all the remaining nodes to keep the
configuration consistent.
Once a node has joined the cluster and has all the peer information saved
locally in the ``system.peers`` system table, seed nodes are no longer used
for discovery, but they are still contacted on each restart. To avoid
configuration errors, and to ensure a node can reach the cluster if a seed
IP address changes, keep the seed configuration valid.
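A minimal seed configuration in *scylla.yaml* could look like this sketch (the IP address is illustrative):

```yaml
# scylla.yaml fragment (sketch): one live node's address is enough
# for a new node to learn the cluster topology at startup.
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1"
```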
Prerequisites
-------------
@@ -35,4 +50,4 @@ Procedure
Use ``nodetool status`` to verify that restarted nodes are online before restarting more nodes. If too many nodes are offline, the cluster may suffer temporary service degradation or outage.
#. Replace the dead node using the :doc:`dead node replacement procedure </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
Your cluster should have more than one seed node, but not all the nodes in the cluster should be defined as seed nodes.

View File

@@ -5,7 +5,7 @@ Advanced Internode (RPC) Compression
Internode (RPC) compression controls whether traffic between nodes is
compressed. If enabled, it reduces network bandwidth usage.
To further reduce network traffic, you can configure ScyllaDB Enterprise to use
To further reduce network traffic, you can configure ScyllaDB to use
ZSTD-based compression and shared dictionary compression. You can enable one or
both of these features to limit network throughput and reduce network transfer costs.
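The baseline behavior is controlled in *scylla.yaml*; a minimal sketch (other settings omitted):

```yaml
# scylla.yaml fragment (sketch).
# internode_compression accepts: all (compress all internode traffic),
# dc (compress only cross-DC traffic), or none.
internode_compression: all
```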

Some files were not shown because too many files have changed in this diff.