Commit Graph

4155 Commits

Author SHA1 Message Date
Pavel Emelyanov
3710eadb93 Merge '[Backport 2025.1] db/hints: Improve logs' from Scylladb[bot]
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

Fixes scylladb/scylladb#25466

Backport:
There is no risk in backporting these changes. They only have
impact on the logs. On the other hand, they might prove helpful
when debugging an issue in hinted handoff.

- (cherry picked from commit 2327d4dfa3)

- (cherry picked from commit d7bc9edc6c)

- (cherry picked from commit 6f1fb7cfb5)

Parent PR: #25470

Closes scylladb/scylladb#25536

* github.com:scylladb/scylladb:
  db/hints: Add new logs
  db/hints: Adjust log levels
  db/hints: Improve logs
2025-08-25 10:26:42 +03:00
Sergey Zolotukhin
0cb64577fd generic_server: Two-step connection shutdown.
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed, or a timeout happened the connections are fully closed.

Fixes scylladb/scylladb#24481

(cherry picked from commit ea311be12b)
2025-08-20 10:30:09 +02:00
Dawid Mędrek
dc15d64c50 db/hints: Add new logs
We're adding new logs in just a few places that may however prove
important when debugging issues in hinted handoff in the future.

(cherry picked from commit 6f1fb7cfb5)
2025-08-18 15:59:42 +02:00
Dawid Mędrek
b1ecfe6ce4 db/hints: Adjust log levels
Some of the logs could be clogging Scylla's logs, so we demote their
level to a lower one.

On the other hand, some of the logs would most likely not do that,
and they could be useful when debugging -- we promote them to debug
level.

(cherry picked from commit d7bc9edc6c)
2025-08-18 15:59:42 +02:00
Dawid Mędrek
ebd4355255 db/hints: Improve logs
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

(cherry picked from commit 2327d4dfa3)
2025-08-18 15:59:39 +02:00
Dawid Mędrek
69307eaf2d db/commitlog: Extend error messages for corrupted data
We're providing additional information in error messages when throwing
an exception related to data corruption: when a segment is truncated
and when it's content is invalid. That might prove helpful when debugging.

Closes scylladb/scylladb#25190

(cherry picked from commit 408b45fa7e)

Closes scylladb/scylladb#25459
2025-08-13 09:22:50 +03:00
Botond Dénes
f453b5bfa3 Merge '[Backport 2025.1] sstables: Fix quadratic space complexity in partitioned_sstable_set' from Scylladb[bot]
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.

Fixes #23634.

We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.

- (cherry picked from commit 494ed6b887)

- (cherry picked from commit 59dad2121f)

- (cherry picked from commit 21d1e78457)

- (cherry picked from commit c77f710a0c)

- (cherry picked from commit d5bee4c814)

Parent PR: #23806

Closes scylladb/scylladb#24012

* github.com:scylladb/scylladb:
  test: Verify partitioned set store split and unsplit correctly
  sstables: Fix quadratic space complexity in partitioned_sstable_set
  compaction: Wire table_state into make_sstable_set()
  compaction: Introduce token_range() to table_state
  dht: Add overlap_ratio() for token range
2025-08-06 09:56:43 +03:00
Asias He
478d02ce83 storage_service: Use utils::chunked_vector to avoid big allocation
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

Closes scylladb/scylladb#24561

(cherry picked from commit c5a136c3b5)
2025-07-16 15:39:51 +08:00
Lakshmi Narayanan Sreethar
46dfe09e64 db/corrupt_data_handler: guard stop() against null _fragment_semaphore
The `system_table_corrupt_data_handler::_fragment_semaphore` member is
initialized only when the `system_keyspace` sharded service is
initialized by `main`. If the server shuts down before that due to an
unrelated reason, `_fragment_semaphore` remains default-initialized to
`nullptr`. When the shutdown process later attempts to call `stop()` on
`system_table_corrupt_data_handler`, it tries to call `stop()` on
`_fragment_semaphore`, leading to a segfault.

Fix this by checking if `_fragment_semaphore` is null before invoking
`stop()` on it.

Although `corrupt_data_handler` was backported to 2025.1, this issue
does not occur in 2025.2 and master. The recent versions include #23113,
which changes how the system keyspace is stopped and PR #24492, which
originally introduced `corrupt_data_handler`, builds on that change to
ensure `_fragment_semaphore` is stopped only if it has been created.

Fixes #24920

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24931
2025-07-14 12:06:29 +03:00
Raphael S. Carvalho
63bdbebdef sstables: Fix quadratic space complexity in partitioned_sstable_set
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many
small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span an wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.

Fixes #23634.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c77f710a0c)
2025-07-11 10:05:30 -03:00
Ferenc Szili
3e3147c03a logging: Add row count to large partition warning message
When writing large partitions, that is: partitions with size or row count
above a configurable threshold, ScyllaDB outputs a warning to the log:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

This warning contains the information about the size of the partition,
but it does not contain the number of rows written. This can lead to
confusion because in cases where the warning was written because of the
row count being larger than the threshold, but the partition size is below
the threshold, the warning will only contain the partition size in bytes,
leading the user to believe the warning was output because of the
partition size, when in reality it was the row count that triggered the
warning. See #20125

This change adds a size_desc argument to cql_table_large_data_handler::try_record(),
which will contain the description of the size of the object written.
This method is used to output warnings for large partitions, row counts,
row sizes and cell sizes. This change does not modify the warning message
for row and cell sizes, only for partition size and row count.

The warning for large partitions and row counts will now look like this:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

Closes scylladb/scylladb#22010

(cherry picked from commit 96267960f8)

Closes scylladb/scylladb#24681
2025-07-10 16:20:23 +02:00
Michael Litvak
6d45cb3d5c test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.

(cherry picked from commit a9b476e057)
2025-07-08 12:32:26 +03:00
Michael Litvak
0e95704df1 batchlog_manager: set timeout on writes
Set a timeout on writes of replayed batches by the batchlog manager.

We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.

The timeout is set to be high enough to allow any reasonable write to
complete.

(cherry picked from commit 74a3fa9671)
2025-07-08 06:24:30 +00:00
Michael Litvak
9199c15813 batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.

(cherry picked from commit 7150632cf2)
2025-07-08 06:24:30 +00:00
Michael Litvak
d161a0bb35 batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.

(cherry picked from commit fc5ba4a1ea)
2025-07-08 06:24:29 +00:00
Botond Dénes
820b6a3a3f db: introduce corrupt_data_handler
Similar to large_data_handler, this interface allows sstable writers to
delegate the handling of corrupt data.
Two implementations are provided:
* system_table_corrupt_data_handler - saved corrupt data in
  system.corrupt_data, with a TTL=10days (non-configurable for now)
* nop_corrupt_data_handler - drops corrupt data

(cherry picked from commit 3e1c50e9a7)
2025-07-02 13:57:27 +03:00
Botond Dénes
16d039a04f db/system_keyspace: add apply_mutation()
Allow applying writes in the form of mutations directly to the keyspace.
Allows lower-level mutation API to build writes. Advantageous if writes
can contain large cells that would otherwise possibly cause large
allocation warnings if used via the internal CQL API.

(cherry picked from commit 0753643606)
2025-07-01 15:37:01 +00:00
Botond Dénes
85c3f12039 db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and possibly restored if possible.

(cherry picked from commit 92b5fe8983)
2025-07-01 15:37:01 +00:00
Nadav Har'El
6585a056cf alternator: fix schema "concurrent modification" errors
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.

In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had FIXME in all places. So all these operations,
when facing contention of concurrent schema modifications on different
nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.

This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.

The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduces the bug seperately for each of the
six operations, so if we forget to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.

The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827

(cherry picked from commit 3ce7e250cc)
2025-06-12 14:20:26 +03:00
Wojciech Mitros
ec41601929 mv: add a test for dropping an index while it's building
Dropping an index is a schema change of its base table and
a schema drop of the index's materialized view. This combination
of schema changes used to cause issues during view building, because
when a view schema was dropped, it wasn't getting updated with the
new version of the base schema, and while the view building was
in progress, we would update the base schema for the base table
mutation reader and try generating updates with a view schema that
wasn't compatible with the base schema, failing on an `on_internal_error`.

In this patch we add a test for this scenario. We create an index,
halt its view building process using an injection, and drop it.
If no errors are thrown, the test succeeds.

The test was failing before https://github.com/scylladb/scylladb/pull/23337
and is passing afterwards.

(cherry picked from commit bf7bba9634)
2025-05-27 21:42:56 +02:00
Wojciech Mitros
f1fd053572 base_info: remove the lw_shared_ptr variant
The base_dependent_view_info is no longer needed to be shared or
modified in the view_info, so we no longer need to keep it as
a shared pointer.

(cherry picked from commit d77f11d436)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
70b21012cd view_info: don't re-set base_info after construction
In the previous commits we made sure that the base info is not dependent
on the base schema version, and the info dependent on the base schema
version is calculated when it's needed. In this patch we remove the
unnecessary re-setting of the base_info.

The set_base_info method isn't removed completely, because it also has
a secondary function - zeroing the view_info fields other than base_info.
Because of this, in this patch we rename it accordingly and limit its
use to the updates caused by a base schema change.

(cherry picked from commit d7bd86591e)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
4f789bf4c7 base_info: remove base_info snapshot semantics
The base info in view schemas no longer changes on base schema
updates, so saving the base info with a view schema from a specific
point in time doesn't provide any additional benefits.
In this patch we remove the code using the base_and_view snapshots
as it's no longer useful.

(cherry picked from commit ea462efa3d)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
d922a415b5 base_info: remove base schema from the base_info
The base info now only contains values which are not reliant on the
base schema version. We remove the the base schema from the base info
to make it immutable regardless of base schema version, at the point
of this patch it's also not needed anywhere - the new base info can
replace the base schema in most places, and in the few (view_updates)
where we need it, we pull the most recent base schema version from
the database.

After this change, the base info no longer changes in a view schema
after creation, so we'll no longer get errors when we try generating
view updates with a base_info that's incompatible with a specific
base schema version.

Fixes #9059
Fixes #21292
Fixes #22410

(cherry picked from commit ad55935411)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
d9fe006a20 schema_registry: store base info instead of base schema for view entries
In the following patch we plan to remove the base schema from the base_info
to make the base_info immutable. To do that, we first prepare the schema
registry for the change; we need to be able to create view schemas from
frozen schemas there and frozen schemas have no information about the base
table. Unless we do this change, after base schemas are removed from the
base info, we'll no longer be able to load a view schema to the schema registry
without looking up the base schema in the database.

This change also required some updates to schema building:
* we add a method for unfreezing a view schema with base info instead of
a base schema
* we make it possible to use schema_builder with a base info instead of
a base schema
* we add a method for creating a view schema from mutations with a base info
instead of a base schema
* we add a view_info constructor withat base info instead of a base schema
* we update the naming in schema_registry to reflect the usage of base info
instead of base schema

(cherry picked from commit 05fce91945)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
4e2f1d0edb base_info: make members non-const
In the following patches we'll add the base info instead of the
base schema to various places (schema building, schema registry).
There, we'll sometimes need to update the base_info fields, which
we can't do with const members. There's also a place (global_schema_ptr)
where we won't be able to use the base_info_ptr (a shared pointer to the
base_info), so we can't just use the base_info_ptr everywhere instead.

In this patch we unmark these members as const.
In the following patches we'll remove the methods for changing the
base_info in the view schema, so it will remain effectively const.

(cherry picked from commit 6e539c2b4d)
2025-05-27 21:40:23 +02:00
Wojciech Mitros
da5ad10002 view_info: move the base info to a separate header
In the following commits the base_depenedent_view_info will be needed
in many more places. To avoid including the whole db/view/view.hh
or forward declaring (where possible) the base info, we move it to
a separate header which can be included anywhere at almost no cost.

(cherry picked from commit 32258d8f9a)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
6e40197c52 view_info: move computation of view pk columns not in base pk to view_updates
In preparation of making the base_info immutable, we want to get rid of
any base_dependent_view_info fields that can change when base schema
is updated.
The _base_regular_columns_in_view_pk and _base_static_columns_in_view_pk
base column_ids of corresponding base columns and they can change
(decrease) when an earlier column is dropped in the base table.
view_updates is the only location where these values are used and calculating
them is not expensive when comparing to the overall work done while performing
a view update - we iterate over all view primary key columns and look them up
in the base table.
With this in mind, we can just calculate them when creating a view_updates
object, instead of keeping them in the base_info. We do that in this patch.

(cherry picked from commit a3d2cd6b5e)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
4e0c8b7ab8 view_info: move base-dependent variables into base_info
The has_computed_column_depending_on_base_non_primary_key
and is_partition_key_permutation_of_base_partition_key variables
in the view_info depend on the base table so they should be in the
base_dependent_view_info instead of view_info.

(cherry picked from commit a33963daef)
2025-05-27 21:40:22 +02:00
Wojciech Mitros
7422796845 view_info: set base info on construction
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as well as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). The first step towards that is making
sure that all newly created schemas have the base info set.
We achieve that by requiring a base schema when constructing a view
schema. Unfortunately, this adds complexity each time we're making
a view schema - we need to get the base schema as well.
In most cases, the base schema is already available. The most
problematic scenario is when we create a schema from mutations:
- when parsing system tables we can get the schema from the
database, as regular tables are parsed before views
- when loading a view schema using the schema loader tool, we need
to load the base additionally to the view schema, effectively
doubling the work
- when pulling the schema from another node - in this case we can
only get the current version of the base schema from the local
database

Additionally, we need to consider the base schema version - when
we generate view updates the version of the base schema used for
reads should match the version of the base schema in view's base
info.
This is achieved by selecting the correct (old or new) schema in
`db::schema_tables::merge_tables_and_views` and using the stored
base schema in the schema_registry.

(cherry picked from commit 900687c818)
2025-05-27 21:40:22 +02:00
Benny Halevy
736f89b31a tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`.

Refs scylladb/scylla-enterprise#4355

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 62aeba759b)
2025-04-08 08:35:14 +03:00
Benny Halevy
a49e27ac8f db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
while will enable tablets by default for new keyspace but
without the posibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c62865df90)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-08 08:08:47 +03:00
Dawid Mędrek
c56e47f72f db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811

(cherry picked from commit 0a6137218a)

Closes scylladb/scylladb#23370
2025-04-03 09:09:05 +02:00
Tomasz Grabiec
51ee15f02d Merge '[Backport 2025.1] tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Fixes #23042

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config

Closes scylladb/scylladb#23443

* github.com:scylladb/scylladb:
  Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
2025-04-01 20:31:05 +02:00
Patryk Jędrzejczak
39c20144e5 Merge '[Backport 2025.1] raft topology: Add support for raft topology init to happen before group0 initialization' from Scylladb[bot]
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.

This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in bug free
anner.

Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also performs
a refactoring to populate all init data to system tables even before
group0 creation(via `raft_initialize_discovery_leader` function). Now
because `raft_initialize_discovery_leader` happens before the group 0
creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.

Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.

By moving `raft_initialize_discovery_leader` logic to happen before
starting group0 server, and not as group0 command post server start, we
also get rid of the potential problem of init group0 command not being
the 1st command on the server. Hence ensuring full integrity as expected
by programmer.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#21114

- (cherry picked from commit 4748125a48)

- (cherry picked from commit e491950c47)

- (cherry picked from commit 623e01344b)

- (cherry picked from commit d7884cf651)

Parent PR: #22484

Closes scylladb/scylladb#22966

* https://github.com/scylladb/scylladb:
  storage_service: Remove the variable _manage_topology_change_kind_from_group0
  storage_service: fix indentation after the previous commit
  raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
  service/raft: Refactor mutation writing helper functions.
2025-04-01 11:46:15 +02:00
Avi Kivity
cff90755d8 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats

(cherry picked from commit b1d9f80d85)
2025-03-25 23:16:35 +01:00
Dawid Mędrek
864528eb9b db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
preserving all keyspaces RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356

(cherry picked from commit 32879ec0d5)
2025-03-21 12:27:04 +00:00
Abhinav Jha
98977e9465 raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
In the current scenario, topology_change_kind variable, was been handled using
 _manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs(e.g. for restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0)

Post _manage_topology_change_kind_from_group0 removal, careful management of
topology_change_kind variable was needed for maintaining correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation(via
raft_initialize_discovery_leader function). Now because raft_initialize_discovery_leader
happens before the group 0 creation, we write mutations directly to system
tables instead of a group 0 command. Hence, post group0 creation, the node
can read the correct values from system tables and correct values are
maintained throughout.

Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.

By moving raft_initialize_discovery_leader logic to happen before starting
group0 server, and not as group0 command post server start, we also get rid
of the potential problem of init group0 command not being the 1st command on
the server. Hence ensuring full integrity as expected by programmer.

Fixes: scylladb/scylladb#21114
(cherry picked from commit e491950c47)
2025-02-20 21:21:31 +00:00
Pavel Emelyanov
aa5cb15166 Merge 'Alternator: implement UpdateTable operation to add or delete GSI' from Nadav Har'El
In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages will explained, this required changing how Alternator stores materialized view keys - instead of insisting that these key must be real columns (that is **not** the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes #11567.

We also fix the IndexStatus and Backfilling attributes returned by DescribeTable - as DynamoDB API users use this API to discover when a newly added GSI completed its "backfilling" (what we call "view building") stage. Fixes #11471.

This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. **It should be backported to 2025.1**, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.

Closes scylladb/scylladb#21989

* github.com:scylladb/scylladb:
  alternator: fix view build on oversized GSI key attribute
  mv: clean up do_delete_old_entry
  test/alternator: unflake test for IndexStatus
  test/alternator: work around unrelated bug causing test flakiness
  docs/alternator: adding a GSI is no longer an unimplemented feature
  test/alternator: remove xfail from all tests for issue 11567
  alternator: overhaul implementation of GSIs and support UpdateTable
  mv: support regular_column_transformation key columns in view
  alternator: add new materialized-view computed column for item in map
  build: in cmake build, schema needs alternator
  build: build tests with Alternator
  alternator: add function serialized_value_if_type()
  mv: introduce regular_column_transformation, a new type of computed column
  alternator: add IndexStatus/Backfilling in DescribeTable
  alternator: add "LimitExceededException" error type
  docs/alternator: document two more unimplemented Alternator features

(cherry picked from commit 529ff3efa5)

Closes scylladb/scylladb#22826
2025-02-18 19:05:21 +02:00
Botond Dénes
ff7e93ddd5 db/config: reader_concurrency_semaphore_cpu_concurrency: bump default to 2
This config item controls how many CPU-bound reads are allowed to run in
parallel. The effective concurrency of a single CPU core is 1, so
allowing more than one CPU-bound reads to run concurrently will just
result in time-sharing and both reads having higher latency.
However, restricting concurrency to 1 means that a CPU bound read that
takes a lot of time to complete can block other quick reads while it is
running. Increase this default setting to 2 as a compromise between not
over-using time-sharing, while not allowing such slow reads to block the
queue behind them.

Fixes: #22450

Closes scylladb/scylladb#22679

(cherry picked from commit 3d12451d1f)

Closes scylladb/scylladb#22722
2025-02-13 09:40:25 +02:00
Michael Litvak
58eda6670f view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493

(cherry picked from commit 6d34125eb7)

Closes scylladb/scylladb#22607
2025-02-02 22:29:52 +02:00
Asias He
4018dc7f0d Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22467
2025-01-26 12:51:59 +02:00
Avi Kivity
59d3a66d18 Revert "Introduce file stream for tablet"
This reverts commit 8208688178. It was
contributed from enterprise, but is too different from the original
for me to merge back.
2025-01-22 09:42:20 +02:00
Nadav Har'El
3e16b80014 Merge 'Reject create table with compact storage' from Benny Halevy
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.

Yet, there's is nothing in the code that prevents users
from creating such tables.

This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage`, set to
`false` by default, that require users to opt-in
in order to create new tables WITH COMPACT STORAGE.

Refs scylladb/scylladb#12263, scylladb/scylladb#16375

* Since this guardrail is an enhancement, no backport is needed

Closes scylladb/scylladb#16403

* github.com:scylladb/scylladb:
  docs: ddl: document the deprecation of compact tables
  test: enable_create_table_with_compact_storage for tests that need it
  config: add enable_create_table_with_compact_storage
2025-01-20 22:02:02 +02:00
Benny Halevy
88ae067ddb everywhere: add skeletal support for the in_memory_tables feature
Forward-ported from scylla-enterprise.
Note that the feature has been deprecated and the implementation
is provided only for backward compatibility with pre-existing
features and schema.

Tested manually after adding the following to feature_service:
```
    gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv };
```

Launched a single-node cluster running 2023.1.10
```
cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> create TABLE ks.test ( pk int PRIMARY KEY, val int ) WITH compaction = {'class': 'InMemoryCompactionStrategy'};
```

log:
```
Scylla version 2023.1.10-0.20241227.21cffccc1ccd with build-id bd65b8399cb13b713a87e57fe333cfcabfd50be7 starting ...
...
INFO  2024-12-27 19:45:16,563 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x600000f1b400[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName=ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.Int32Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,in_memory=false,version=5529c631-c47a-11ef-bd1d-4295734ce5a8,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:45:16,564 [shard 0] schema_tables - Creating ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440
```

Upgraded to this branch and started scylla.
Verified that ks.test was successfuly loaded:

log:
```
INFO  2024-12-27 19:48:58,115 [shard 0:main] init - Scylla version 6.3.0~dev-0.20241227.a64c6dfc153e with build-id f9496134a09cf2e55d3865b9e9ff499f672aa7da starting ...
...
WARN  2024-12-27 19:53:02,948 [shard 1:main] CompactionStrategy - InMemoryCompactionStrategy is no longer supported. Defaulting to NullCompactionStrategy.
...
INFO  2024-12-27 19:53:02,948 [shard 0:main] database - Keyspace ks: Reading CF test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 storage=/home/bhalevy/scylladb/data/ks/test-5529c630c47a11efbd1d4295734ce5a8
```

Then, tested:
```
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'InMemoryCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> alter TABLE ks.test with compaction = {'class': 'SizeTieredCompactionStrategy'};
cqlsh> describe KEYSPACE ks;

CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false};

CREATE TABLE ks.test (
    pk int,
    val int,
    PRIMARY KEY (pk)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE'
    AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};
```

log:
```
INFO  2024-12-27 19:56:40,465 [shard 0:stmt] migration_manager - Update table 'ks.test' From org.apache.cassandra.config.CFMetaData@0x60000362d800[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ec88d510-6aff-344a-914d-541d37081440,droppedColumns={},collections={},indices={}] To org.apache.cassandra.config.CFMetaData@0x60000336e000[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ecccf010-c47b-11ef-b52c-622f2f0e87c4,droppedColumns={},collections={},indices={}]
INFO  2024-12-27 19:56:40,466 [shard 0: gms] schema_tables - Altering ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ecccf010-c47b-11ef-b52c-622f2f0e87c4
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22068
2025-01-20 16:55:17 +02:00
Asias He
8208688178 Introduce file stream for tablet
File based stream is a new feature that optimizes tablet movement
significantly. It streams the entire SSTable files without deserializing
SSTable files into mutation fragments and re-serializing them back into
SSTables on receiving nodes. As a result, less data is streamed over the
network, and less CPU is consumed, especially for data models that
contain small cells.

The following patches are imported from the scylla enterprise:

*) Merge 'Introduce file stream for tablet' from Asias He

    This patch uses Seastar RPC stream interface to stream sstable files on
    network for tablet migration.

    It streams sstables instead of mutation fragments. The file based
    stream has multiple advantages over the mutation streaming.

    - No serialization or deserialization for mutation fragments
    - No need to read and process each mutation fragments
    - On wire data is more compact and smaller

    In the test below, a significant speed up is observed.

    Two nodes, 1 shard per node, 1 initial_tablets:

    - Start node 1
    - Insert 10M rows of data with c-s
    - Bootstrap node 2

    Node 1 will migration data to node2 with the file stream.

    Test results:

    1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s

    [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0]
	Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds

    2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s

    [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058]
	Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s
    [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds

    Test Summary:

    File stream v.s. Mutation stream improvements

    - Stream bandwidth = 836 / 128  (MB/s)  = 6.53X

    - Stream time = 23.60 / 1.08  (Seconds) = 21.85X

    - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X

    Closes scylladb/scylla-enterprise#3438

    * github.com:scylladb/scylla-enterprise:
      tests: Add file_stream_test
      streaming: Implement file stream for tablet

*) streaming: Use new take_storage_snapshot interface

    The new take_storage_snapshot returns a file object instead of a file
    name. This allows the file stream sender to read from the file even if
    the file is deleted by compaction.

    Closes scylladb/scylla-enterprise#3728

*) streaming: Protect unsupported file types for file stream

    Currently, we assume the file streamed over the stream_blob rpc verb is
    a sstable file. This patch rejects the unsupported file types on the
    receiver side. This allows us to stream more file types later using the
    current file stream infrastructure without worrying about old nodes
    processing the new file types in the wrong way.

    - The file_ops::noop is renamed to file_ops::stream_sstables to be
      explicit about the file types

    - A missing test_file_stream_error_injection is added to the idl

    Fixes: #3846
    Tests: test_unsupported_file_ops

    Closes scylladb/scylla-enterprise#3847

*) idl: Add service::session_id id to idl

    It will be used in the next patch.

    Refs #3907

*) streaming: Protect file stream with topology_guard

    Similar to "storage_service, tablets: Use session to guard tablet
    streaming", this patch protects file stream with topology_guard.

    Fixes #3907

*) streaming: Take service topology_guard under the try block

    Taking the service::topology_guard could throw. Currently, it throws
    outside the try block, so the rpc sink will not be closed, causing the
    following assertion:

    ```
    scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual
    seastar::rpc::sink_impl<netw::serializer,
    streaming::stream_blob_cmd_data>::~sink_impl() [Serializer =
    netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion
    `this->_con->get()->sink_closed()' failed.
    ```

    To fix, move more code including the topology_guard taking code to the
    try block.

    Fixes https://github.com/scylladb/scylla-enterprise/issues/4106

    Closes scylladb/scylla-enterprise#4110

*) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho

    We're not preserving the SSTable state across file based migration, so
    staging SSTables for example are being placed into main directory, and
    consequently, we're mixing staging and non-staging data, losing the
    ability to continue from where the old replica left off.
    It's expected that the view update backlog is transferred from old
    into new replica, as migration doesn't wait for leaving replica to
    complete view update work (which can take long). Elasticity is preferred.

    So this fix guarantees that the state of the SSTable will be preserved
    by propagating it in form of subdirectory (each subdirectory is
    statically mapped with a particular state).

    The staging sstables aren't being registered into view update generator
    yet, as that's supposed to be fixed in OSS (more details can be found
    at https://github.com/scylladb/scylladb/issues/19149).

    Fixes #4265.

    Closes scylladb/scylla-enterprise#4267

    * github.com:scylladb/scylla-enterprise:
      tablet: Preserve original SSTable state with file based tablet migration
      sstables: Add get method for sstable state

*) sstable: (Re-)add shareabled_components getter

*) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund

    Fixes #4246

    Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier.

    Ensures we transfer and pre-process scylla metadata for streamed
    file blobs first, then properly apply receiving nodes local config
    by using a source and sink layer exported from sstables, which
    handles things like ordering, metadata filtering (on source) as well
    as handling metadata and proper IO paths when writing data on
    receiver node (sink).

    This implementation maintains the statelessness of the current
    design, and the delegated sink side will re-read and re-write the
    metadata for each component processed. This is a little wasteful,
    but the meta is small, and it is less error prone than trying to do
    caching cross-shards etc. The transport is isolated from the
    knowledge.

    This is an alternative/complement to #4436 and #4472, fixing the
    underlying issue. Note that while the layers/API:s here allows easy
    fixing of other fundamental problems in the feature (such as
    destination location etc), these are not included in the PR, to keep
    it as close to the current behaviour as possible.

    Closes scylladb/scylla-enterprise#4646

    * github.com:scylladb/scylla-enterprise:
      raft_tests: Copy/add a topology test with encryption
      file streaming: Use sstable source/sink to transfer snapshots
      sstables: Add source and sink objects + producers for transfering a snapshot
      sstable::types: Add remove accessor for extension info in metadata

*) The change for error injection in merge commit 966ea5955dd8760:

    File streaming now has "stream_mutation_fragments" error injection points
    so test_table_dropped_during_streaming works with file streaming.

*) doc: document file-based streaming

    This commit adds a description of the file-based streaming feature to the documentation.

    It will be displayed in the docs using the scylladb_include_flag directive after
    https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0,
    and, in turn, branch-2024.2.

    Refs https://github.com/scylladb/scylla-enterprise/issues/4585
    Refs https://github.com/scylladb/scylla-enterprise/issues/4254

    Closes scylladb/scylla-enterprise#4587

*) doc: move File-based streaming to the Tablets source file-based-streaming

    This commit moves the description of file-based streaming from a common include file
    to the regular doc source file where tablets are described.

    Closes scylladb/scylla-enterprise#4652

*) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference

Closes scylladb/scylladb#22034
2025-01-20 16:43:21 +02:00
Benny Halevy
f3ab00e61c test: enable_create_table_with_compact_storage for tests that need it
Now enable_create_table_with_compact_storage can be set
to `false` by default in db/config.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-20 08:14:37 +02:00
Benny Halevy
0110eb0506 config: add enable_create_table_with_compact_storage
As discussed in
https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813,
compact storage tables are deprecated.

Yet, there's is nothing in the code that prevents users
from creating such tables.

This patch adds a live-updateable config option:
`enable_create_table_with_compact_storage` that require users
to opt-in in order to create new tables WITH COMPACT STORAGE.

The option is currently set to `true` by default in db/config
to reduce the churn to tests and to `false` in scylla.yaml,
for new clusters.

TODO: once regressions tests that use compact storage
are converted to enable the option, change the default in
db/config to false.

A unit test was added to test/cql-pytest that
checks that the respective cql query fails as expected
with the default option or when it is explicitly set to `false`,
and that the query succeeds when the option is set to `true`.

Note that `check_restricted_table_properties` already
returns an optional warning, but it is only logged
but not returned in the `prepared_statement`.
Fixing that is out of the scope of this patch.
See https://github.com/scylladb/scylladb/issues/20945

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-20 08:03:25 +02:00
Piotr Dulikowski
6aa962f5f4 Merge 'Add audit subsystem for database operations' from Paweł Zakrzewski
Introduces a comprehensive audit system to track database operations for security
and compliance purposes. This change includes:

Core Components:
- New audit subsystem for logging database operations
- Service level integration for proper resource management
- CQL statement tracking with operation categories
- Login process integration for tenant management

Key Features:
- Configurable audit logging (syslog/table)
- Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN)
- Selective auditing by keyspace/table
- Password sanitization in audit logs
- Service level shares support (1-1000) for workload prioritization
- Proper lifecycle management and cleanup

I ran the dtests for audit (manually enabled) and they pass.
The in-repo tests pass.

Notably, there should be no non-whitespace changes between this and scylla-enterprise

Fixes scylladb/scylla-enterprise#4999

Closes scylladb/scylladb#22147

* github.com:scylladb/scylladb:
  audit: Add shares support to service level management
  audit: Add service level support to CQL login process
  audit: Add support to CQL statements
  audit: Integrate audit subsystem into Scylla main process
  audit: Add documentation for the audit subsystem
  audit: Add the audit subsystem
2025-01-17 13:14:55 +01:00
Kamil Braun
89ee2a6834 Merge 'drop ip addresses from token metadata' from Gleb
Now that all topology related code uses host ids there is not point to
maintain ip to id (and back) mappings in the token metadata. After the
patch the mapping will be maintained in the gossiper only. The rest of
the system will use host ids and in rare cases where translation is
needed (mostly for UX compatibility reasons) the translation will be
done using gossiper.

Fixes: scylladb/scylla#21777

* 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits)
  hint manager: do not translate ip to id in case hint manager is stopped already
  locator: token_metadata: drop update_host_id() function that does nothing now
  locator: topology: drop indexing by ips
  repair: drop unneeded code
  storage_service: use host_id to look for a node in on_alive handler
  storage_proxy: translate ips to ids in forward array using gossiper
  locator: topology: remove unused functions
  storage_service: check for outdated ip in on_change notification in the peers table
  storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology
  topology coordinator: change connection dropping code to work on host ids
  cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query
  locator: drop unused function from tablet_effective_replication_map
  api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead
  locator: token_metadata: remove unused ip based functions
  locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs
  gossiper: drop get_unreachable_token_owners functions
  storage_service: use gossiper to map ip to id in node_ops operations
  storage_service: fix indentation after the last patch
  storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node
  token_metadata: drop no longer used functions
  ...
2025-01-17 11:00:52 +01:00