Commit Graph

4501 Commits

Author SHA1 Message Date
Ernest Zaslavsky
debc756794 treewide: Move transport related files to a transport directory As requested in #22112, moved the files and fixed other includes and build system.
Moved files:
- generic_server.hh
- generic_server.cc
- protocol_server.hh

Fixes: #22112

This is a cleanup, no need to backport

Closes scylladb/scylladb#25090
2025-09-29 11:46:06 +03:00
Nadav Har'El
1aef733d48 Merge 'Alternator/cache expressions' from Szymon Malewski
Before this patch, every expression in Alternator's requests was parsed from string to adequate structure.
This patch enables caching, where input expression strings are mapped to parsed template structures.
Every new valid (parsable) expression is added to the cache. The cache has limited (configurable) size - when it is reached, the least recently used entry is removed.
When requested expression is in the cache, the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values).
Caching is implemented for all expression types. The cache is per shard - shared for all operations, expression types, tables, users.
Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled).
Basic metrics (total count of hits and misses for each expression type and number of evicted enries) are implemented.
Cache features are tested in boost unit tests and overall expression caching is tested with Python tests - both mostly rely on metrics.

refs #5023
`perf-alternator` test shows improvement (median):
| test | throughput | instructions_per_op | cpu_cycles_per_op | allocs_per_op |
| ------ | ---------------- | ----------------------------- | --------------------------- | -------------------  |
| read | +6.0% | -8.5% | -7.0% | -4.9% |
| write | +13.4% | -17.6% | -14.7% | -7.4% |
| write(lwt) | +12.7% | -7.9% | -6.9% | -2.8% |
| write_rwm | +5.4% | -10.5% | -7.3% | -4.1% |

"read" had a ProjectionExpression with 10 column names, "write" had a UpdateExpression with 10 column names and "write_rmw" had both ConditionExpression and UpdateExpression.

This patch also includes minor refactoring of other expressions related tests (https://github.com/scylladb/scylladb/issues/22494) - use `test_table_ss` instead of `test_table`.
Fixes #25855.

This is new feature - no backporting.

Closes scylladb/scylladb#25176

* github.com:scylladb/scylladb:
  alternator: use expression caching
  alternator: adds expression cache implementation
  utils: extend lru_string_map
  utils: add lru_string_map
  alternator/expressions: error on parsing empty update expression
  alternator/expressions: fix single value condition expression parsing
  test/alternator: use `test_table_ss` instead of `test_table` in expressions related tests.
2025-09-29 11:36:31 +03:00
Michael Litvak
6bc41926e2 view_builder: reduce log level for expected aborts during view creation
When draining the view builder, we abort ongoing operations using the
view builder's abort source, which may cause them to fail with
abort_requested_exception or raft::request_aborted exceptions.

Since these failures are expected during shutdown, reduce the log level
in add_new_view from 'error' to 'debug' for these specific exceptions
while keeping 'error' level for unexpected failures.

Closes scylladb/scylladb#26297
2025-09-28 22:55:07 +03:00
Avi Kivity
5b6570be52 Merge 'db/config: Add SSTable compression options for user tables' from Nikos Dragazis
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.

This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.

Fixes #25195.

Closes scylladb/scylladb#26003

* github.com:scylladb/scylladb:
  test/cluster: Add tests for invalid SSTable compression options
  test/boost: Add tests for SSTable compression config options
  main: Validate SSTable compression options from config
  db/config: Add SSTable compression options for user tables
  db/config: Prepare compression_parameters for config system
  compressor: Validate presence of sstable_compression in parameters
  compressor: Add missing space in exception message
2025-09-28 20:23:23 +03:00
Szymon Malewski
6ce7843774 alternator: use expression caching
Before this patch, every expression in Alternator's requests was parsed from string to adequate structure.
This patch enables caching - all calls to parse an expression (all types) are proxied through the cache.
New expression is added to the cache, the least recently used entry (above cache size) is removed.
For existing entries the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values).
The cache is per shard - shared for all operations, expression types, tables, users.
Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled).
Added Python tests are based on metrics.
2025-09-28 04:27:44 +02:00
Nikos Dragazis
e1d9c83406 db/config: Add SSTable compression options for user tables
ScyllaDB offers the `compression` DDL property for configuring
compression per user table (compression algorithm and chunk size). If
not specified, the default compression algorithm is the LZ4Compressor
with a 4KiB chunk size (refer to the default constructor for
`compression_parameters`). The same default applies to system tables as
well.

Add a new configuration option to allow customizing the default for user
tables. Use the previously hardcoded default as the new option's default
value.

Note that the option has no effect on ALTER TABLE statements. An altered
table either inherits explicit compression options from the CQL
statement, or maintains its existing options.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-09-26 12:02:00 +03:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Nikos Dragazis
a7e46974d4 db/config: Prepare compression_parameters for config system
SSTable compression is currently configurable only per table, via the
`compression` property in CREATE/ALTER TABLE statements. This is
represented internally via the `compression_parameters` class. We plan
to offer the same options via the configuration as well, to make the
default compression method for user tables configurable.

This patch prepares the ground by making the `compression_parameters`
usable as a `config_file::named_value`, namely:

* Define an extraction operator (required by `boost::program_options`
  for parsing the options from command line).
* Define a formatter (required by `named_value::operator()`).
* Define a template specialization for `config_type_for` (required by
  `named_value` constructor).
* Define a yaml converter (required for parsing the options from
  scylla.yaml).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-09-24 14:51:39 +03:00
Ernest Zaslavsky
5ba5aec1f8 treewide: Move mutation related files to a mutation directory
As requested in #22104, moved the files and fixed other includes and build system.

Moved files:
 - combine.hh
 - collection_mutation.hh
 - collection_mutation.cc
 - converting_mutation_partition_applier.hh
 - converting_mutation_partition_applier.cc
 - counters.hh
 - counters.cc
 - timestamp.hh

Fixes: #22104

This is a cleanup, no need to backport

Closes scylladb/scylladb#25085
2025-09-24 13:23:38 +03:00
Piotr Dulikowski
482ddfb3b4 Merge 'mv: handle mismatched base/view replica count caused by RF change' from Wojciech Mitros
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492

Closes scylladb/scylladb#24396

* github.com:scylladb/scylladb:
  mv: handle mismatched base/view replica count caused by RF change
  mv: save the nodes used for pairing calculations for later reuse
  mv: move the decision about simple rack-aware pairing later
2025-09-23 08:10:08 +02:00
Dawid Mędrek
35f7d2aec6 db/batchlog: Drop batch if table has been dropped
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.

To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.

A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.

Fixes scylladb/scylladb#24806

Closes scylladb/scylladb#26057
2025-09-23 07:48:59 +02:00
Pavel Emelyanov
f6860d1de0 Merge 'mv: run view building worker fibers in streaming group' from Piotr Dulikowski
The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong.

Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.

No need to backport, view build coordinator is not a part of any release yet.

Closes scylladb/scylladb#26122

* github.com:scylladb/scylladb:
  mv: fix typo in start_backgroud_fibers
  mv: run view building worker fibers in streaming group
2025-09-22 15:28:38 +03:00
Wojciech Mitros
d9b8278178 mv: handle mismatched base/view replica count caused by RF change
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492
2025-09-22 12:50:16 +02:00
Wojciech Mitros
59c40a2edd mv: save the nodes used for pairing calculations for later reuse
In get_view_natural_endpoint() we start with the list if host_ids
from the effective replication maps, which we later translate to
locator::node to get the information about racks and datacenters.
We check all replicas, but we only store the ones relevant for
pairing, so for tablets, the ones in the same DC as the replica
sending the update.
In the next patch, we'll occasionally need to send cross-dc view
updates, so to avoid computing the nodes again, in this patch
we adjust the logic to prepare them in advance and save them so
that they can be later reused.
2025-09-22 12:45:24 +02:00
Wojciech Mitros
9d4449a492 mv: move the decision about simple rack-aware pairing later
We'll need to get the lists for the whole dc when fixing replica
count mismatches caused by RF changes, so let's first get these lists,
and only filter them later if we decide to use simple rack-aware pairing.
2025-09-22 12:45:24 +02:00
Avi Kivity
1258e7c165 Revert "Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski"
This reverts commit fe7e63f109, reversing
changes made to b5f3f2f4c5. It is causing
test.py failures around cqlpy.

Fixes #26163

Closes scylladb/scylladb#26174
2025-09-22 09:32:46 +03:00
Piotr Dulikowski
591a67c7e7 Merge 'view_builder: register view on all shards atomically' from Michael Litvak
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.

The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.

There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.

By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.

Fixes https://github.com/scylladb/scylladb/issues/22989

backport not needed - the issue is probably not common and there's a workaround

Closes scylladb/scylladb#25790

* github.com:scylladb/scylladb:
  test: mv: add a test for view build interrupt during registration
  view_builder: register view on all shards atomically
2025-09-22 08:03:44 +02:00
Karol Nowacki
eedf506be5 vector_store_client: Rename vector_store_uri to vector_store_primary_uri
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.

This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.

This change must be included before the next release.
Otherwise, users will be affected by a breaking change.

References: VECTOR-187

Closes scylladb/scylladb#26033
2025-09-21 16:33:10 +03:00
Michael Litvak
3dffb8e0dc test: mv: add a test for view build interrupt during registration
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
2025-09-21 10:39:30 +02:00
Michael Litvak
6043409c31 view_builder: register view on all shards atomically
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.

The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.

There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.

By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.

Fixes scylladb/scylladb#22989
2025-09-21 10:39:05 +02:00
Michał Chojnowski
9e70df83ab db: get rid of sstables-format-selector
Our sstable format selection logic is weird, and hard to follow.

If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
   doesn't do anything, but in the past it used to disable
   cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
   versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
   etc).
3. There is a cluster feature for each version:
   ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
   (Currently all sstable version features have been grandfathered,
   and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
   latest enabled sstable version. (Why? Isn't this directly derived
   from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
   sstable version to be used for new writes.
   This field is updated by `sstables_format_selector`
   based on cluster features and the `system.scylla_local` entry.

I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
   data. For example, range tombstones with an infinite bound are only
   supported by sstables since version "mc". So if a range tombstone
   with an infinite bound exists somewhere in the dataset,
   the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
   is enabled. (Otherwise new sstables might become unreadable if they
   are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.

So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.

The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.

This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
   After the patch we no longer update or read it.
   As far as I understand, the purpose of this entry is to prevent
   unwanted format downgrades (which is something cluster features
   are designed for) and it's updated if and only if relevant
   cluster features are updated. So there's no reason to have it,
   we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
   Without the `system.scylla_local` around, it's just a glorified
   feature listener.
3. The format selection logic is moved into `sstable_manager`.
   It already sees the `db::config` and the `gms::feature_service`.
   For the foreseeable future, the knowledge of enabled cluster features
   and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
   inhibit cluster features. Instead, it is intended to select the
   format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
   (which used to be set by `sstables_format_selector`) we write
   them with the "preferred" format, which is determined by
   `sstable_manager` based on the combination of enabled features
   and the current value of `sstable_format`.

Closes scylladb/scylladb#26092

[avi: Pavel found the reason for the scylla_local entry -
      it predates stable storage for cluster features]
2025-09-19 16:17:56 +03:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00
Avi Kivity
fe7e63f109 Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski
This patch series:
 - Increases the number of allowed scheduling groups to allow creation of `sl:driver`
 - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created
 - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader`
 - Modifies `topology_coordinator` to use  create `sl:driver` after upgrades.
 - Implements using `sl:driver` for new connections in `transport/server`
 - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`.
 - Adds tests to verify the new functionality
 - Modifies existing tests to let them pass after `sl:driver` is added
 - Modifies the documentation to contain new `sl:driver`

The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)):
 - Start ScyllaDB with one node
 - Create 1000 keyspaces, 1 table in each keyspace
 - Start `cassandra-stress` (`-rate threads=50  -mode native cql3`)
 - Run connection storm with 1000 session (100 python processes, 10 sessions each)

The maximum latency during connection storm dropped **from 224.94ms to 41.43ms** (those numbers are average from 20 test executions, were max latency was in [140ms, 361ms] before change and [31.4ms, 61.5ms] after).

The snippet of cassandra-stress output from the moment of connection storm:
Before:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        789206,   85887,   85887,   85887,     0.6,     0.3,     2.0,     2.0,     2.5,     5.0,    9.0,  0.09679,      0,      0,       0,       0,       0,       0
total,        909322,  120116,  120116,  120116,     0.4,     0.2,     1.9,     2.0,     2.1,     3.1,   10.0,  0.09053,      0,      0,       0,       0,       0,       0
total,        964392,   55070,   55070,   55070,     0.9,     0.4,     2.0,     4.5,     7.7,    18.9,   11.0,  0.09203,      0,      0,       0,       0,       0,       0
total,        975705,   11313,   11313,   11313,     4.4,     3.5,     6.5,    24.5,    82.7,    83.0,   12.0,  0.11713,      0,      0,       0,       0,       0,       0
total,        987548,   11843,   11843,   11843,     4.2,     3.5,     6.5,    33.7,    48.6,    51.5,   13.0,  0.13366,      0,      0,       0,       0,       0,       0
total,        995422,    7874,    7874,    7874,     6.3,     4.0,     7.7,    85.6,   112.9,   113.5,   14.0,  0.14753,      0,      0,       0,       0,       0,       0
total,       1007228,   11806,   11806,   11806,     4.3,     3.5,     6.5,    29.1,    43.8,    87.1,   15.0,  0.15598,      0,      0,       0,       0,       0,       0
total,       1012840,    5612,    5612,    5612,     8.2,     5.0,    11.5,   121.8,   166.6,   170.1,   16.0,  0.16535,      0,      0,       0,       0,       0,       0
total,       1016186,    3346,    3346,    3346,    13.4,     7.4,    20.1,   204.9,   207.6,   210.4,   17.0,  0.17405,      0,      0,       0,       0,       0,       0
total,       1025462,    9276,    9276,    9276,     6.3,     3.9,     9.6,    74.6,   206.8,   210.0,   18.0,  0.17800,      0,      0,       0,       0,       0,       0
total,       1035979,   10517,   10517,   10517,     4.8,     3.5,     6.7,    38.5,    82.6,    83.0,   19.0,  0.18120,      0,      0,       0,       0,       0,       0
total,       1047488,   11509,   11509,   11509,     4.3,     3.5,     6.0,    32.6,    72.3,    74.0,   20.0,  0.18334,      0,      0,       0,       0,       0,       0
total,       1077456,   29968,   29968,   29968,     1.7,     1.6,     2.9,     3.6,     7.0,     8.2,   21.0,  0.17943,      0,      0,       0,       0,       0,       0
total,       1105490,   28034,   28034,   28034,     1.8,     1.8,     3.5,     4.6,     5.3,    13.8,   22.0,  0.17609,      0,      0,       0,       0,       0,       0
total,       1132221,   26731,   26731,   26731,     1.9,     1.8,     3.8,     5.2,     8.4,    11.1,   23.0,  0.17314,      0,      0,       0,       0,       0,       0
total,       1162149,   29928,   29928,   29928,     1.7,     1.7,     3.0,     4.5,     8.0,     9.1,   24.0,  0.16950,      0,      0,       0,       0,       0,       0
...
```

After:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        822863,   94379,   94379,   94379,     0.5,     0.3,     2.0,     2.0,     2.1,     3.7,    9.0,  0.06669,      0,      0,       0,       0,       0,       0
total,        937337,  114474,  114474,  114474,     0.4,     0.2,     2.0,     2.0,     2.1,     3.4,   10.0,  0.06301,      0,      0,       0,       0,       0,       0
total,        986630,   49293,   49293,   49293,     1.0,     1.0,     2.0,     2.1,    17.9,    19.0,   11.0,  0.07318,      0,      0,       0,       0,       0,       0
total,       1026734,   40104,   40104,   40104,     1.2,     1.0,     2.0,     2.2,     6.3,     7.1,   12.0,  0.08410,      0,      0,       0,       0,       0,       0
total,       1066124,   39390,   39390,   39390,     1.3,     1.0,     2.0,     2.2,     2.6,     3.4,   13.0,  0.09108,      0,      0,       0,       0,       0,       0
total,       1103082,   36958,   36958,   36958,     1.3,     1.1,     2.1,     2.5,     3.1,     4.2,   14.0,  0.09643,      0,      0,       0,       0,       0,       0
total,       1141987,   38905,   38905,   38905,     1.3,     1.0,     2.0,     2.4,    11.4,    12.7,   15.0,  0.09894,      0,      0,       0,       0,       0,       0
total,       1180023,   38036,   38036,   38036,     1.3,     1.0,     2.0,     3.7,     5.6,     7.1,   16.0,  0.10070,      0,      0,       0,       0,       0,       0
total,       1216481,   36458,   36458,   36458,     1.4,     1.0,     2.1,     3.6,     4.7,     5.0,   17.0,  0.10210,      0,      0,       0,       0,       0,       0
total,       1256819,   40338,   40338,   40338,     1.2,     1.0,     2.0,     2.2,     3.5,     5.4,   18.0,  0.10173,      0,      0,       0,       0,       0,       0
total,       1295122,   38303,   38303,   38303,     1.3,     1.0,     2.0,     2.4,    21.0,    21.1,   19.0,  0.10136,      0,      0,       0,       0,       0,       0
total,       1334743,   39621,   39621,   39621,     1.3,     1.0,     2.0,     2.3,     3.3,     4.0,   20.0,  0.10055,      0,      0,       0,       0,       0,       0
total,       1375579,   40836,   40836,   40836,     1.2,     1.0,     2.0,     2.1,     3.4,     5.7,   21.0,  0.09927,      0,      0,       0,       0,       0,       0
total,       1415576,   39997,   39997,   39997,     1.2,     1.0,     2.0,     2.3,     3.2,     4.1,   22.0,  0.09807,      0,      0,       0,       0,       0,       0
total,       1449268,   33692,   33692,   33692,     1.5,     1.4,     2.5,     3.2,     4.2,     5.6,   23.0,  0.09800,      0,      0,       0,       0,       0,       0
total,       1471873,   22605,   22605,   22605,     2.2,     2.0,     4.8,     5.9,     7.0,     7.9,   24.0,  0.10015,      0,      0,       0,       0,       0,       0
...
```

Fixes: https://github.com/scylladb/scylladb/issues/24411

This is a new feature, so no backport needed.

Closes scylladb/scylladb#25412

* github.com:scylladb/scylladb:
  docs: workload-prioritization: add driver service level
  test: add test to verify use of `sl:driver`
  transport: use `sl:driver` to handle driver's control connections
  transport: whitespace only change in update_scheduling_group
  transport: call update_scheduling_group for non-auth connections
  generic_server: transport: start using `sl:driver` for new connections
  test: add test_desc_* for driver service level
  test: service_levels: add tests for sl:driver creation and removal
  test: add reload_raft_topology_state() to ScyllaRESTAPIClient
  service_level_controller: automatically create `sl:driver`
  service_level_controller: methods to create driver service level
  service_level_controller: handle special sl:driver in DESC output
  topology_coordinator: add service_level_controller reference
  system_keyspace: add service_level_driver_created
  test: add MAX_USER_SERVICE_LEVELS
2025-09-18 19:45:17 +03:00
Piotr Dulikowski
fb0e5784e4 mv: fix typo in start_backgroud_fibers
Letter "n" was missing in this name.
2025-09-18 15:50:16 +02:00
Piotr Dulikowski
5f55787e50 Merge 'CDC with tablets' from Michael Litvak
initial implementation to support CDC in tablets-enabled keyspaces.

The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.

In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets

Fixes https://github.com/scylladb/scylladb/issues/22576

backport not needed - new feature

update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136

Closes scylladb/scylladb#23795

* github.com:scylladb/scylladb:
  docs/dev: update CDC dev docs for tablets
  doc: update CDC docs for tablets
  test: cluster_events: enable add_cdc and drop_cdc
  test/cql: enable cql cdc tests to run with tablets
  test: test_cdc_with_alter: adjust for cdc with tablets
  test/cqlpy: adjust cdc tests for tablets
  test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
  cdc: enable cdc with tablets
  topology coordinator: change streams on tablet split/merge
  cdc: virtual tables for cdc with tablets
  cdc: generate_stream_diff helper function
  cdc: choose stream in tablets enabled keyspaces
  cdc: rename get_stream to get_vnode_stream
  cdc: load tablet streams metadata from tables
  cdc: helper functions for reading metadata from tables
  cdc: colocate cdc table with base
  cdc: remove streams when dropping CDC table
  cdc: create streams when allocating tablets
  migration_listener: add on_before_allocate_tablet_map notification
  cdc: notify when creating or dropping cdc table
  cdc: move cdc table creation to pre_create
  cdc: add internal tables for cdc with tablets
  cdc: add cdc_with_tablets feature flag
  cdc: add is_log_schema helper
2025-09-18 13:39:37 +02:00
Piotr Dulikowski
4ed045a15c Merge 'db/view/view_building_worker: wrap shared_sstable in foreign_ptr' from Michał Jadwiszczak
When a staging sstable is registered to view building worker, it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually processed).
Until now this was done using plain `sstables::shared_sstable` (= `lw_shared_ptr`) which is not safe to be moved between shards.

This patch fixes this by wrapping the pointer in `foreign_ptr` and obtains necessary informations (owner shard, last token) on the original shard (instead of on shard0).
Then all of those objects are put into freshly introduced structure `staging_sstable_task_info`, which can be safely moved between shards.

Fixes https://github.com/scylladb/scylladb/issues/25859

View building coordinator isn't present in any release yet, no backport needed.

Closes scylladb/scylladb#25832

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indent
  db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`
  db/view/view_building_worker: use table id in `register_staging_sstable_tasks()`
  db/view/view_building_worker: move helper functions higher
2025-09-18 10:24:27 +02:00
Piotr Dulikowski
b71af71ab5 Merge 'db/view/view_building_worker: change sharded<abort_source> to local abort_source' from Michał Jadwiszczak
Previously the sharded abort_sources was stopped at the end of `batch::do_work()`, which is working in parallel to view building worker main loop.
This leads to races because the worker may call `batch::abort()`, which access the abort_sources.

This patch solves this be changing `sharded<abort_source>` into `abort_source`.
Since now `batch::do_work()` is executed on tasks' shard, all abort source checks are also done on tasks' shard.
The only place where shard0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on tasks' shard exclusively.

Fixes https://github.com/scylladb/scylladb/issues/25805
Fixes https://github.com/scylladb/scylladb/issues/26045

View building coordinator hasn't been released yet, so no backport needed.

Closes scylladb/scylladb#26059

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indents
  db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source`
  db/view/view_building_worker: execute entire `batch::do_work` on tasks shard
  db/view/view_building_worker: store reference to sharded worker in batch
2025-09-18 10:11:20 +02:00
Andrzej Jackowski
dd9b4c64d2 system_keyspace: add service_level_driver_created
This commit extends sytem.scylla_local table with an additional
key/value pair that can be used later in this patch series to
keep an information that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.

A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in state machine shapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.

Refs: scylladb/scylladb#24411
2025-09-18 09:28:32 +02:00
Michał Jadwiszczak
1d8b41a51d db/view/view_building_worker: fix indents 2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
99db5a6c30 db/view/view_building_worker: change sharded<abort_source> to local abort_source
Previously the sharded abort_sources was stopped at the end of batch::do_work(),
which is working in parallel to view building worker main loop.
This leads to races because the worker may call batch::abort(),
which access the abort_sources.

This patch solves this be changing `sharded<abort_source>` into
`abort_source`.
Since now `batch::do_work()` is executed on tasks' shard,
all abort source checks are also done on tasks' shard.
The only place where shard0 uses the abort source is `batch::abort()`,
but this method now does `smp::submit_to(replica.shard, [request abort])`,
so the abort source is used on tasks' shard exclusively.

Fixes scylladb/scylladb#25805
Fixes scylladb/scylladb#26045
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
7b9db335c0 db/view/view_building_worker: execute entire batch::do_work on tasks shard
This change will allow us to get rid of problematic
`sharded<abort_source>` and use local `abort_source` instead.
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
2f65af8aa7 db/view/view_building_worker: store reference to sharded worker in batch
Change reference to view building worker in batch to sharded container.
In next commits, I'm going to execute `do_work()` exclusively on tasks
target shard and sharded reference will be more useful.
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
e4a0de53ea db/view/view_building_worker: fix indent 2025-09-18 02:57:36 +02:00
Michał Jadwiszczak
7dfb76f9a7 db/view/view_building_worker: wrap shared_sstable in foreign_ptr
When a staging sstable is registered to view building worker,
it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually
processed).
Until now this was done using plain `sstables::shared_sstable`
(= `lw_shared_ptr`) which is not safe to be moved between shards.

This patch fixes this by wrapping the pointer in `foreign_ptr` and
obtains necessary informations (owner shard, last token) on the original
shard (instead of on shard0).
Then all of those objects are put into freshly introduced structure
`staging_sstable_task_info`, which can be safely moved between shards.

Fixes scylladb/scylladb#25859
2025-09-18 02:57:36 +02:00
Michał Jadwiszczak
50678030c0 db/view/view_building_worker: use table id in register_staging_sstable_tasks()
There is no need to pass the pointer only to get id of the table.
2025-09-18 02:57:35 +02:00
Michał Jadwiszczak
b44c223d47 db/view/view_building_worker: move helper functions higher
So they can be used in
`view_building_worker::register_staging_sstable_tasks()`.
2025-09-18 02:57:35 +02:00
Ernest Zaslavsky
a1f18a8883 treewide: Move schema related files to a schema directory As requested in #22111, moved the files and fixed other includes and build system.
Moved files:
- frozen_schema.hh
- frozen_schema.cc
- schema_mutations.hh
- schema_mutations.cc
- column_computation.hh

Fixes: #22111

Closes scylladb/scylladb#25089
2025-09-17 17:31:05 +03:00
Botond Dénes
cc5153ef8c Merge 'db: cache: consider preempting after each partition' from Aleksandra Martyniuk
Currently, during cache invaldation we check if we need to preempt
only after the partition gets invaldaited. This may lead to stalls
if we have a chain of filtered out partitions.

Check for preemption even if the partition does not get invaldated.

Refs: https://github.com/scylladb/scylladb/issues/9136.

Optimization; no backport

Closes scylladb/scylladb#26053

* github.com:scylladb/scylladb:
  db: fix indentation
  db: cache: consider preempting after each partition
2025-09-17 17:26:29 +03:00
Michael Litvak
b8442c6087 cdc: virtual tables for cdc with tablets
Define two new virtual tables in system keyspace: cdc_timestamps and
cdc_streams. They expose the internal cdc metadata for tablets-enabled
keyspace to be consumed by users consuming the CDC log.

cdc_timestamps lists all timestamps for a table where a stream change
occured.

cdc_streams list additionally the current streams sets for each
table and timestamp, as well as difference - closed and opened streams -
from the previous stream set.
2025-09-17 14:47:12 +02:00
Michael Litvak
aa61f074b5 cdc: helper functions for reading metadata from tables
Define functions in system_keyspace that read from the internal cdc
tables and construct the data into internal cdc data structures.
2025-09-17 14:47:12 +02:00
Michael Litvak
b9ee28eaab cdc: add internal tables for cdc with tablets
Add new group0-based tables in system keyspace to be used for cdc with tablets:
* cdc_streams_state - describing "base" state of CDC streams for each
  table - an initial timestamp and a stream set.
* cdc_streams_history - describing following committed stream sets by
  diffs (opened / closed streams) from the previous set.
2025-09-17 14:47:11 +02:00
Piotr Dulikowski
6a90a1fd29 Merge 'db/view/view_building_worker: split batch's data preparation and execution' from Michał Jadwiszczak
The view building batch lives on shard0 but it might be doing
work on shard which owns the tablet replica.
Until now the batch data was accessed from multiple shards (shard0 and
where the batch was executed).

This patch fixes this by splitting tasks execution into:
- preparation which is always happening on shard0
- actual execution of the tasks on relevant shard, but all necessary
  data is copied to the shard and batch object isn't accessed.

Fixes https://github.com/scylladb/scylladb/issues/25804

View building coordinator hasn't been released yet, so no backport needed.

Closes scylladb/scylladb#26058

* github.com:scylladb/scylladb:
  db/view/view_building_worker: move try-catch outside `invoke_on()`
  db/view/view_building_worker: split batch's data preparation and execution
2025-09-17 14:17:25 +02:00
Michał Jadwiszczak
d98237b33c db/view/view_building_worker: move try-catch outside invoke_on()
It's just stylist change, to me doing `invoke_on()` in try-catch
block looks better than the other way.
2025-09-16 23:15:44 +02:00
Michał Jadwiszczak
9458ceff8f db/view/view_building_worker: split batch's data preparation and execution
The view building batch lives on shard0 but it might be doing
work on shard which owns the tablet replica.
Until now the batch data was accessed from multiple shards (shard0 and
where the batch was executed).

This patch fixes this by splitting tasks execution into:
- preparation which is always happening on shard0
- actual execution of the tasks on relevant shard, but all necessary
  data is copied to the shard and batch object isn't accessed.

Fixes scylladb/scylladb#25804
2025-09-16 23:13:36 +02:00
Ernest Zaslavsky
d624413ddd treewide: Move query related files to a new query directory
As requested in #22120, moved the files and fixed other includes and build system.

Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh

Fixes: #22120

This is a cleanup, no need to backport

Closes scylladb/scylladb#25105
2025-09-16 23:40:47 +03:00
Sergey Zolotukhin
2640b288c2 raft: disable caching for raft log.
This change disables caching for raft log table due to the following reasons:
* Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls.
* Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls.

Fixes scylladb/scylladb#26027

Closes scylladb/scylladb#26031
2025-09-16 23:40:47 +03:00
Aleksandra Martyniuk
17e9ec11d7 db: fix indentation 2025-09-16 14:49:54 +02:00
Aleksandra Martyniuk
0024339a71 db: cache: consider preempting after each partition
Currently, during cache invaldation we check if we need to preempt
only after the partition gets invaldaited. This may lead to stalls
if we have a chain of filtered out partitions.

Check for preemption even if the partition does not get invaldated.

Refs: #9136.
2025-09-16 14:45:28 +02:00
Aleksandra Martyniuk
75b772adfb db: optimize cache invalidation following repair/streaming
Currently, if a new sstable is created during repair/streaming,
we invalidate its whole	token range in cache. If the sstable
is sparse, we unnecessarily clear too much data.

Modify cache invalidation, so that only the partitions present
in the sstable are cleared.

To check whether a partition is present in the sstable, we use bloom
filters. Bloom filters may return false positives and show that
an sstable contains a partition, even though it does not. Due to that
we may invalidate a bit more than we need to, but the cache will be
in valid state.

An issue arises when we do not invalidate two consecutive partitions
that are continuous. The sstable may contain a token that falls
between these partitions, breaking the continuity. To check that, we
would need to scan sstable index. However, such a change would
noticeably complicate the invalidation, both performance and code.
In this change, sstable index reader isn't used. Instead, the continuity
flag is unset for all scanned partitions. This comes at a cost of
heavier reads, as we will need to verify continuity when reading more
than one partition from cache.

Fixes: https://github.com/scylladb/scylladb/issues/9136.

Closes scylladb/scylladb#25996
2025-09-14 19:48:14 +03:00
Radosław Cybulski
30306c3375 Remove const & from tags_extension constructor
`tags_extension` constructor unnecesarily takes `std::map` by const ref,
forcing a copy. This patch removes const ref for performance reasons.

Closes scylladb/scylladb#25977
2025-09-14 13:32:21 +03:00