Commit Graph

48709 Commits

Author SHA1 Message Date
Nadav Har'El
78c10af960 test/cqlpy: add reproducer for INSERT JSON .. IF NOT EXISTS bug
This patch adds an xfailing test reproducing a bug where when adding
an IF NOT EXISTS to a INSERT JSON statement, the IF NOT EXISTS is
ignored.

This bug has been known for 4 years (issue #8682) and even has a FIXME
referring to it in cql3/statements/update_statement.cc, but until now
we didn't have a reproducing test.

The tests in this patch also show that this bug is specific to
INSERT JSON - regular INSERT works correctly - and also that
Cassandra works correctly (and passes the test).

Refs #8682

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25244
2025-07-30 20:14:50 +03:00
Piotr Smaron
8d5249420b Update seastar submodule
* seastar 60b2e7da...7c32d290 (14):
  > posix: Replace static_assert with concept
  > tls: Push iovec with the help of put(vector<temporary_buffer>)
  > io_queue: Narrow down friendship with reactor
  > util: drop concepts.hh
  > reactor: Re-use posix::to_timespec() helper
  > Fix incorrect defaults for io queue iops/bandwidth
  > net: functions describing ssl connection
  > Add label values to the duplicate metrics exception
  > Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov
    test: Add unit test for cross-sched-groups wakeups
    test: Add unit test for fair CPU scheduling
    test: Add unit test for basic supergrops manipulations
    test: Add perf test for context switch latency
    scheduling: Add an internal method to get group's supergroup
    reactor: Add supergroup get_shares() API
    reactor: Add supergroup::set_shares() API
    reactor: Create scheduling groups in supergroups
    reactor: Supergroups destroying API
    reactor: Supergroups creating API
    reactor: Pass parent pointer to task_queue from caller
    reactor: Wakeup queue group on child activation
    reactor: Add pure virtual sched_entity::run_tasks() method
    reactor: Make task_queue_group be sched_entity too
    reactor: Split task_queue_group::run_some_tasks()
    reactor: Count and limit supergroup children
    reactor: Link sched entity to its parent
    reactor: Switch activate(task_queue*) to work on sched_entity
    reactor: Move set_shares() to sched_entity()
    reactor: Make account_runtime() work with sched_entity
    reactor: Make insert_activating_task_queue() work on sched_entity
    reactor: Make pop_active_task_queue() work on sched_entity
    reactor: Make insert_active_task_queue() work on sched_entity
    reactor: Move timings to sched_entity
    reactor: Move active bit to sched_entity
    reactor: Move shares to sched_entity
    reactor: Move vruntime to sched_entity
    reactor: Introduce sched_entity
    reactor: Rename _activating_task_queues -> _activating
    reactor: Remove local atq* variable
    reactor: Rename _active_task_queues -> _active
    reactor: Move account_runtime() to task_queue_group
    reactor: Move vruntime update from task_queue into _group
    reactor: Simplify task_queue_group::run_some_tasks()
    reactor: Move run_some_tasks() into task_queue_group
    reactor: Move insert_activating_task_queues() into task_queue_group
    reactor: Move pop_active_task_queue() into task_queue_group
    reactor: Move insert_active_task_queue() into task_queue_group
    reactor: Introduce and use task_queue_group::activate(task_queue)
    reactor: Introduce task_queue_group::active()
    reactor: Wrap scheduling fields into task_queue_group
    reactor: Simplify task_queue::activate()
    reactor: Rename task_queue::activate() -> wakeup()
    reactor: Make activate() method of class task_queue
    reactor: Make task_queue::run_tasks() return bool
    reactor: Simplify task_queue::run_tasks()
    reactor: Make run_tasks() method of class task_queue
  > Fix hang in io_queue for big write ioproperties numbers
  > split random io buffer size in 2 options
  > reactor: document run_in_background
  > Merge 'Add io_queue unit test for checking request rates' from Robert Bindar
    Add unit test for validating computed params in io_queue
    Move `disk_params` and `disk_config_params` to their own unit
    Add an overload for `disk_config_params::generate_config`

Closes scylladb/scylladb#25254
2025-07-30 16:44:18 +03:00
Patryk Jędrzejczak
5ce16488c9 Merge 'test/cqlpy: two small fixes for "--release" feature' from Nadav Har'El
This small series fixes two small bugs in the "--release" feature of test/cqlpy/run and test/alternator/run, which allows a developer to run signle-node functional tests against any past release of Scylla. The two patches fix:

1. Allow "run --release" to be used when Scylla has not even been built from source.
2. Fix a mistake in choosing the most recent release when only a ".0" and RC releases are available. This is currently the case for the 2025.2 branch, which is why I discovered the bug now.

Fixes #25223

This patch only affects developer's experience if using the test/cqlpy/run script manually (these scripts are not used by CI), so should not be backported.

Closes scylladb/scylladb#25227

* https://github.com/scylladb/scylladb:
  test/cqlpy: fix fetch_scylla.py for .0 releases
  test/cqlpy: fix "run --release" when Scylla hasn't been built
2025-07-30 15:13:26 +02:00
Aleksandra Martyniuk
99ff08ae78 streaming: close sink when exception is thrown
If an exception is thrown in result_handling_cont in streaming,
then the sink does not get closed. This leads to a node crash.

Close sink in exception handler.

Fixes: https://github.com/scylladb/scylladb/issues/25165.

Closes scylladb/scylladb#25238
2025-07-30 14:26:14 +03:00
Andrei Chekun
4c33ff791b build: add pytest-timeout to the toolchain
Adding this plugin allows using timeout for a test or timeout for the whole
session. This can be useful for Unit Test Custom task in the pipeline to avoid
running tests is batches, that will mess with the test names later in Jenkins.

Closes #25210

[avi: regenerate frozen toolchain with optimized clang from

  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]

Closes scylladb/scylladb#25243
2025-07-30 12:53:10 +03:00
Botond Dénes
2985c343ed Merge 'repair: Avoid too many fragments in a single repair_row_on_wire' from Asias He
When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message.

This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression.

This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream.

Tests are added to make sure the message split works.

Fixes #24808

Closes scylladb/scylladb#25002

* github.com:scylladb/scylladb:
  repair: Avoid too many fragments in a single repair_row_on_wire
  repair: Change partition_key_and_mutation_fragments to use chunked_vector
  utils: Allow chunked_vector::erase to work with non-default-constructible type
2025-07-29 17:45:57 +03:00
Patryk Jędrzejczak
8e43856ca7 Merge 'Pass more elaborated "reasons" to stop_ongoing_compactions()' from Pavel Emelyanov
When running compactions are aborted by the aforementioned helper, in logs there appear a line like
"Compaction for ks/cf was stopped due to: user-triggered operation". This message could've been better, since it may indicate several distinct reasons described with the same "user-triggered operation".

With this PR the message will help telling "truncate", "cleanup", "rewrite" and "split" from each other.

Closes scylladb/scylladb#25136

* https://github.com/scylladb/scylladb:
  compaction: Pass "reason" to perform_task_on_all_files()
  compaction: Pass "reason" to run_with_compaction_disabled()
  compaction: Pass "reason" to stop_and_disable_compaction()
2025-07-29 16:06:17 +02:00
Pavel Emelyanov
286fad4da6 api: Simplify table_info::name extraction with std::views::transform
Instead of using lambda, pass pointer to struct member. The result is
the same, but the code is nicer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25123
2025-07-29 15:56:58 +02:00
Nadav Har'El
22f845b128 docs/alternator: mention missing ShardFilter support
Add in docs/alternator/compatibility.md a mention of the ShardFilter
option which we don't support in Alternator Streams. This option was
only introduced to DynamoDB a week ago, so it's not surprising we
don't yet support it :-)

Refs #25160

Closes scylladb/scylladb#25161
2025-07-29 14:37:24 +03:00
Andrei Chekun
a6a3d119e8 docs: update documentation with new way of running C++ tests
Documentation had outdated information how to run C++ test.
Additionally, some information added about gathered test metrics.

Closes scylladb/scylladb#25180
2025-07-29 14:36:19 +03:00
Dawid Mędrek
408b45fa7e db/commitlog: Extend error messages for corrupted data
We're providing additional information in error messages when throwing
an exception related to data corruption: when a segment is truncated
and when it's content is invalid. That might prove helpful when debugging.

Closes scylladb/scylladb#25190
2025-07-29 14:35:14 +03:00
Anna Stuchlik
b67bb641bc doc: add OS support for ScyllaDB 2025.3
This commit adds the information about support for platforms in ScyllaDB version 2025.3.

Fixes https://github.com/scylladb/scylladb/issues/24698

Closes scylladb/scylladb#25220
2025-07-29 14:33:12 +03:00
Anna Stuchlik
8365219d40 doc: add the upgrade guide from 2025.2 to 2025.3
This PR adds the upgrade guide from version 2025.2 to 2025.3.
Also, it removes the upgrade guide existing for the previous version
that is irrelevant in 2025.2 (upgrade from 2025.1 to 2025.2).

Note that the new guide does not include the "Enable Consistent Topology Updates" page and note,
as users upgrading to 2025.3 have consistent topology updates already enabled.

Fixes https://github.com/scylladb/scylladb/issues/24696

Closes scylladb/scylladb#25219
2025-07-29 14:32:31 +03:00
Avi Kivity
11ee58090c commitlog: replace std::enable_if with a constraint
std::enable_if is obsolete and was replaced with concepts
and constraint.

Replace the std::is_fundamental_v enable_if constraint with
std::integral. The latter is more accurate - std::ntoh()
is not defined for floats, for example. In any case, we only
read integrals in commitlog.

Closes scylladb/scylladb#25226
2025-07-29 12:51:24 +02:00
Michał Chojnowski
6d27065f99 cql3/result_set: set GLOBAL_TABLES_SPEC in metadata if appropriate
Unless the client uses the SKIP_METADATA flag,
Scylla attaches some metadata to query results returned to the CQL
client.
In particular, it attaches the spec (keyspace name, table
name, name, type) of the returned columns.

By default, the keyspace name and table name is present in each column
spec. However, since they are almost always the same for every column
(I can't think of any case when they aren't the same;
it would make sense if Cassandra supported joins, but it doesn't)
that's a waste.

So, as an optimization, the CQL protocol has the GLOBAL_TABLES_SPEC flag.
The flag can be set if all columns belong to the same table,
and if is set, then the keyspace and table name are only written
in the first column spec, and skipped in other column specs.

Scylla sets this flag, if appropriate, in responses to a PREPARE requests.
But it never sets the flag in responses to queries.

But it could. And this patch causes it to do that.

Fixes #17788

Closes scylladb/scylladb#25205
2025-07-29 12:40:12 +03:00
Nadav Har'El
f6a3e6fbf0 sstables: don't depend on fmt 11.1 to build
A recent commit a0c29055e5 added
some trace printouts which print an std::reference_wrapper<>.
Apparently a formatter for this type was only added to fmt
in version 11.1.0, and it doesn't exist on earlier versions,
such as fmt 11.0.2 on Fedora 41.

Let's avoid requiring shiny-new versions of fmt. The workaround
is easy: just unwrap the reference_wrapper - print pr.get()
instead of just pr, and Scylla returns to building correctly on
Fedora 41.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25228
2025-07-29 11:32:06 +02:00
Patryk Jędrzejczak
3299ffba51 Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).

However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.

This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.

The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.

Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.

Refs #25115

Fixes #24625

Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.

Closes scylladb/scylladb#25151

* https://github.com/scylladb/scylladb:
  raft_group0: split shutdown into abort_and_drain and destroy
  Revert "main.cc: fix group0 shutdown order"
2025-07-29 10:39:00 +02:00
Asias He
e28c75aa79 repair: Avoid too many fragments in a single repair_row_on_wire
When repairing a partition with many rows, we can store many fragments
in a repair_row_on_wire object which is sent as a rpc stream message.

This could cause reactor stalls when the rpc stream compression is
turned on, because the compression compresses the whole message without
any split and compression.

This patch solves the problem at the higher level by reducing the
message size that is sent to the rpc stream.

Tests are added to make sure the message split works.

Fixes #24808
2025-07-29 13:43:53 +08:00
Asias He
266a518e4c repair: Change partition_key_and_mutation_fragments to use chunked_vector
With the change in "repair: Avoid too many fragments in a single
repair_row_on_wire", the

std::list<frozen_mutation_fragment> _mfs;

in partition_key_and_mutation_fragments will not contain large number of
fragments any more. Switch to use chunked_vector.
2025-07-29 13:43:17 +08:00
Asias He
4a4fbae8f7 utils: Allow chunked_vector::erase to work with non-default-constructible type
This is needed for chunked_vector<frozen_mutation_fragment> in repair.
2025-07-29 13:43:17 +08:00
Avi Kivity
d3cdb88fe7 tools: toolchain: dbuild: increase depth of nested podman configuration coverage
The initial support for nested containers (2d2a2ef277) worked on
my machine (tm) and even laptop, but does not work on fresh installs.
This is likely due to changes in where persistent configuration is
stored on the host between various podman versions; even though my
podman is fully updated, it uses configuration created long ago.

Make nested containers work on fresh installs by also configuring
/etc/containers/storage.conf. The important piece is to set graphroot
to the same location as the host.

Verified both on my machine and on a fresh install.

Closes scylladb/scylladb#25156
2025-07-29 08:23:41 +03:00
Botond Dénes
f3ed27bd9e Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov
Nowadays the way to configure an internal service is

1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value

The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.

For the reference: similar changes with other services: #23705 , #20174 , #19166

Closes scylladb/scylladb#25118

* github.com:scylladb/scylladb:
  gms,init: Move get_disabled_features_from_db_config() from gms
  code: Update callers generating feature service config
  gms: Make feature_config a simple struct
  gms: Split feature_config_from_db_config() into two
2025-07-29 08:17:49 +03:00
Anna Stuchlik
18b4d4a77c doc: add tablets support information to the Drivers table
This commit:

- Extends the Drivers support table with information on which driver supports tablets
  and since which version.
- Adds the driver support policy to the Drivers page.
- Reorganizes the Drivers page to accommodate the updates.

In addition:
- The CPP-over-Rust driver is added to the table.
- The information about Serverless (which we don't support) is removed
  and replaced with tablets to correctly describe the contents of the table.

Fixes https://github.com/scylladb/scylladb/issues/19471

Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69

Closes scylladb/scylladb#24635
2025-07-29 08:11:42 +03:00
Avi Kivity
f7324a44a2 compaction: demote normal compaction start/end log messages to debug level
Compaction is routine and the log messages pollute the log files,
hiding important information.

All the data is available via `nodetool compactionhistory`.

Reduce noise by demoting those log messages to debug level.

One test is adjusted to use debug level for compaction, since it
listens for those messages.

Closes scylladb/scylladb#24949
2025-07-29 08:02:22 +03:00
Nadav Har'El
e43828c10b test/cqlpy: fix fetch_scylla.py for .0 releases
The test/cqlpy/fetch_scylla.py script is used by test/cqlpy/run and
test/alternator/run to implement their "--release" option - which allows
you to run current tests against any official release of Scylla
downloaded from Scylla's S3 bucket.

When you ask to get release "2025.1", the idea is to fetch the latest
release available in the 2025.1 stream - currently it is 2025.1.5.
fetch_scylla.py does this by listing the available 2025.1 releases,
sorting them and fetching the last one.

We had a bug in the sort order - version 0 was sorted before version
0-rc1, which is incorrect (the version 2025.2.0 came after
2025.2.0~rc1).

For most releases this didn't cause any problem - 0~rc1 was sorted after
0, but 5 (for example) came after both, so 2025.1.5 got downloaded.
But when a release has **only** an rc and a .0 release, we incorrectly
used the rc instead of the .0.

This patch fixes the sort order by using the "/" character, which sorts
before "0", in rc version strings when sorting the release numbers.

Before this patch, we had this problem in "--release 2025.2" because
currently 2025.2 only has RC releases (rc0 and rc1) and a .0 release,
and we wrongly downloaded the rc1. After this patch, the .0 is chosen
as expected:

  $ test/cqlpy/run --release 2025.2
  Chosen download for ScyllaDB 2025.2: 2025.2.0

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-28 22:02:15 +03:00
Nadav Har'El
72358ee9f4 test/cqlpy: fix "run --release" when Scylla hasn't been built
The "--release" option of test/cqlpy/run can be used to run current
cqlpy tests against any official release of Scylla, which is
automatically downloaded from Scylla's S3 bucket. You should be
able to run tests like that even without having compiled Scylla
from source. But we had a bug, where test/cqlpy/run looked for
the built Scylla executable *before* parsing the "--release"
option, and this bug is fixed in this patch.

The Alternator version of the run script, test/alternator/run,
doesn't need to be fixed because it already did things in the
right order.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-07-28 21:42:02 +03:00
Dawid Mędrek
b41151ff1a test: Enable RF-rack-valid keyspaces in all Python suites
We're enabling the configuration option `rf_rack_valid_keyspaces`
in all Python test suites. All relevant tests have been adjusted
to work with it enabled.

That encompasses the following suites:

* alternator,
* broadcast_tables,
* cluster (already enabled in scylladb/scylladb@ee96f8dcfc),
* cql,
* cqlpy (already enabled in scylladb/scylladb@be0877ce69),
* nodetool,
* rest_api.

Two remaining suites that use tests written in Python, redis and scylla_gdb,
are not affected, at least not directly.

The redis suite requires creating an instance of Scylla manually, and the tests
don't do anything that could violate the restriction.

The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but
even then it reuses the `run` file from the cqlpy suite.

Fixes scylladb/scylladb#25126

Closes scylladb/scylladb#24617
2025-07-28 16:32:59 +02:00
Gleb Natapov
198cfc6fe7 migration manager: do not use group0 on non zero shard
Commit ddc3b6dcf5 added a check of group0 state in
get_schema_for_write(), but group0 client can only be used on shard 0,
and get_schema_for_write() can be called on any shard, so we cannot use
_group0_client there directly. Move assert where we use another group0
function already where it is guarantied to run on shard 0.

Closes scylladb/scylladb#25204
2025-07-28 14:10:01 +02:00
Nadav Har'El
b4fc3578fc Merge 'LWT: enable for tablet-based tables' from Petr Gusev
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.

Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
  * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
  * This breaks LWT correctness in certain scenarios.
  * Failing test cases demonstrating this:
      * test_lwt_state_is_preserved_on_tablet_migration
      * test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
  * Tablets may be placed on arbitrary shards by the tablet balancer.
  * Accessing Paxos state in system tables could require a shard jump, degrading performance.

We move Paxos state into a dedicated Paxos table, colocated with the base table:
  * Each base table gets its own Paxos state table.
  * This table is lazily created on the first LWT operation.
  * Its tablets are colocated with those of the base table, ensuring:
    * Co-migration during tablet movement
    * Co-rebuilding with the base table
    * Shard alignment for local access to Paxos state

Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].

This PR addresses two issues from the "Why doesn't it work for tablets" section  in [1]:
  * Tablet migration vs LWT correctness
  * Paxos table sharding

Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.

References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)

Backport: not needed because this is a new feature.

Closes scylladb/scylladb#24819

* github.com:scylladb/scylladb:
  create_keyspace: fix warning for tablets
  docs: fix lwt.rst
  docs: fix tablets.rst
  alternator: enable LWT
  random_failures: enable execute_lwt_transaction
  test_tablets_lwt: add test_paxos_state_table_permissions
  test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
  test_tablets_lwt: test timeout creating paxos state table
  test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
  test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
  test_tablets_lwt: migrate test_lwt_support_with_tablets
  test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
  test_tablets_lwt: add simple test for LWT
  check_internal_table_permissions: handle Paxos state tables
  client_state: extract check_internal_table_permissions
  paxos_store: handle base table removal
  database: get_base_table_for_tablet_colocation: handle paxos state table
  paxos_state: use node_local_only mode to access paxos state
  query_options: add node_local_only mode
  storage_proxy: handle node_local_only in query
  storage_proxy: handle node_local_only in mutate
  storage_proxy: introduce node_local_only flag
  abstract_replication_strategy: remove unused using
  storage_proxy: add coordinator_mutate_options
  storage_proxy: rename create_write_response_handler -> make_write_response_handler
  storage_proxy: simplify mutate_prepare
  paxos_state: lazily create paxos state table
  migration_manager: add timeout to start_group0_operation and announce
  paxos_store: use non-internal queries
  qp: make make_internal_options public
  paxos_store: conditional cf_id filter
  paxos_store: coroutinize
  feature_service: add LWT_WITH_TABLETS feature
  paxos_state: inline system_keyspace functions into paxos_store
  paxos_state: extract state access functions into paxos_store
2025-07-28 13:19:23 +03:00
Taras Veretilnyk
6b6622e07a docs: fix typo in command name enbleautocompaction -> enableautocompaction
Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'.

Fixes scylladb/scylladb#25172

Closes scylladb/scylladb#25175
2025-07-28 12:49:26 +03:00
Tomasz Grabiec
55116ee660 topology_coordinator: Trigger load stats refresh after replace
Otherwise, tablet rebuilt will be delayed for up to 60s, as the tablet
scheduler needs load stats for the new node (replacing) to make
decisisons.

Fixes #25163

Closes scylladb/scylladb#25181
2025-07-28 11:07:17 +02:00
Robert Bindar
d921a565de Add open-coredump script depndencies to install-dependencies.sh
Whilst the coredump script checks for prerequisites, the user
experience is not ideal because you either have to go in the
script and get the list of deps and install them or wait for
the script to complain about lacking dependencies one by one.
This commit completes the list of dependencies in the
install script (some of them were already there for Fedora),
so you already have them installed by the time you
get to run the coredump script.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

[avi:
 - remove trailing whitespace
 - regenerate frozen toolchain

Optimized clang binaries generated and stored in

  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]

Closes #22369

Closes scylladb/scylladb#25203
2025-07-28 06:45:01 +03:00
Avi Kivity
1930f3e67f Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.

However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.

For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)

Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.

Note: the patch is more complicated that it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays)? Simplification ideas are welcome.

Preparation for new functionality, no backporting needed.

Closes scylladb/scylladb#25093

* github.com:scylladb/scylladb:
  sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
  test/boost: add a test for inexact index lookups
  sstables/mx/reader: allow passing a custom index reader to the constructor
  sstables/index_reader: remove advance_to
  sstables/mx/reader: handle inexact lookups in `advance_context()`
  sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
  sstables/index_reader: make the return value of `get_partition_key` optional
  sstables/mx/reader: handle "backward jumps" in forward_to
  sstables/mx/reader: filter out partitions outside the queried range
  sstables/mx/reader: update _pr after `fast_forward_to`
2025-07-27 19:39:36 +03:00
Avi Kivity
8180cbcf48 Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy
As they are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.

Add clone() and clone_gently() methods to
allow explicit copies.

* minor optimization, no backport needed

Closes scylladb/scylladb#24978

* github.com:scylladb/scylladb:
  tablets: prevent accidental copy of tablets_map
  locator: tablets: get rid of synchronous mutate_tablet_map
2025-07-27 16:48:27 +03:00
Lakshmi Narayanan Sreethar
0c5fa8e154 locator/token_metadata.cc: use chunked_vector to store _sorted_tokens
The `token_metadata_impl` stores the sorted tokens in an `std::vector`.
With a large number of nodes, the size of this vector can grow quickly,
and updating it might lead to oversized allocations.

This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such
issues. It also updates all related code to use `chunked_vector` instead
of `std::vector`.

Fixes #24876

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#25027
2025-07-27 11:29:22 +03:00
Tomasz Grabiec
a1d7722c6d Merge 'api: repair_async: refuse repairing tablet keyspaces' from Aleksandra Martyniuk
A tablet repair started with /storage_service/repair_async/ API
bypasses tablet repair scheduler and repairs only the tablets
that are owned by the requested node. Due to that, to safely repair
the whole keyspace, we need to first disable tablet migrations
and then start repair on all nodes.

With the new API - /storage_service/tablets/repair -
tailored to tablet repair requirements, we do not need additional
preparation before repair. We may request it on one node in
a cluster only and, thanks to tablet repair scheduler,
a whole keyspace will be safely repaired.

Both nodetool and Scylla Manager have already started using
the new API to repair tablets.

Refuse repairing tablet keyspaces with /storage_service/repair_async -
403 Forbidden is returned. repair_async should still be used to repair
vnode keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/23008.

Breaking change; no backport.

Closes scylladb/scylladb#24678

* github.com:scylladb/scylladb:
  repair: remove unused code
  api: repair_async: forbid repairing tablet keyspaces
2025-07-27 09:25:42 +02:00
Piotr Dulikowski
44de563d38 Merge 'db/hints: Improve logging' from Dawid Mędrek
We improve logging in critical functions in hinted handoff
to capture more information about the behavior of the module.
That should help us in debugging sessions.

The logs should only be printed during more important events
and so they should not clog the log files.

Backport: not necessary.

Closes scylladb/scylladb#25031

* github.com:scylladb/scylladb:
  db/hints/manager.cc: Add logs for changing host filter
  db/hints: Increase log level in critical functions
2025-07-27 09:25:42 +02:00
Michael Litvak
3ff388cd94 storage service: drain view builder before group0
The view builder uses group0 operations to coordinate view building, so
we should drain the view builder before stopping group0.

Fixes scylladb/scylladb#25096

Closes scylladb/scylladb#25101
2025-07-27 09:25:42 +02:00
Pavel Emelyanov
403a72918d sstables/types.hh: Remove duplicate version.hh inclusion
The latter header in included two times, one is enough

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25109
2025-07-27 09:25:42 +02:00
Pavel Emelyanov
1b9eb4cb9f init.hh: Remove unused forward declarations
The init.hh contains some bits that only main.cc needs. Some of its
forward declarations are neede by neither the headers itself, nor the
main.cc that includes it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25110
2025-07-27 09:25:42 +02:00
Petr Gusev
8b8b7adbe5 raft_group0: split shutdown into abort_and_drain and destroy
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).

However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.

This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.

The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.

Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().

Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().

The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
2025-07-25 17:16:14 +02:00
Michał Chojnowski
b1da5f2d0f sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
After making the sstable reader more permissive,
we can weaken the abstract_index_reader interface.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
be1f54c6d2 test/boost: add a test for inexact index lookups 2025-07-25 11:00:18 +02:00
Michał Chojnowski
810eb93ff0 sstables/mx/reader: allow passing a custom index reader to the constructor
For tests.
Will be used for testing how the data reader reacts to various
combinations of inexact index lookup results.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
fe8ee34024 sstables/index_reader: remove advance_to
`advance_to` is unused now, so remove it.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
03bf6347e2 sstables/mx/reader: handle inexact lookups in advance_context()
`advance_context()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_context()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.

This makes `advance_to` unused, so we remove it.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
11792850dd sstables/mx/reader: handle inexact lookups in advance_to_next_partition()
`advance_to_next_partition()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_to_next_partition()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
141895f9eb sstables/index_reader: make the return value of get_partition_key optional
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
a0c29055e5 sstables/mx/reader: handle "backward jumps" in forward_to
A bunch of code assumes that the Data.db stream can only go forward.
But with BTI indexes, if we perform an advance_to, the index can point to a position
which the data reader has already passed, since the index is inexact.

The logic of the data reader ensures that it has stopped
within the last partition range, or just immediately
after it, after reading the next partition key and
noticing that it doesn't belong to the range.

But forward_to can only be used with increasing ranges.
The start of the next range must be greater or equal to the
end of the previous range.

This means that the exact start of the next partition range
must be no earlier than:
1. Before the partition key just read by the data reader,
if the data reader is positioned immediately after a partition key.
2. The start of the first partition after the current data reader
position, if the data reader isn't positioned immediately after a
partition key.

So, if the index returns a position smaller than the current data
reader position, then:
1. If the reader is immediately after a partition key,
we have to reuse this partition key (since we can't go back
in the stream to read it again), and keep reading from
the current position.
2. Otherwise we can safely walk the index to the first partition
that lies no earlier than the current position.
2025-07-25 10:49:58 +02:00
Michał Chojnowski
218b2dffff sstables/mx/reader: filter out partitions outside the queried range
The current index format is exact: it always returns the position of the
first partition in the queried partition range.

But we are about the add an index format where that doesn't have to be the case.
In BTI indexes, the lookup can be off by one partition sometimes. This patch prepares
the reader for that, by skipping the partitions which were read by the
data reader but don't belong to the queried range.

Note: as of this patch, only the "normal path" is ever used.
We add tests exercising these code paths later.

Also note that, as of this patch, actually stepping outside
the queried range would cause the reader to end up in a
state where the underlying parser is positioned right after
partition key immediately following the queried range.
If the reader was forwarded to that key in this state,
it would trip an assert, because the parser can't handle backward
jumps. We will add logic to handle this case in the next patch.
2025-07-25 10:49:57 +02:00