Commit Graph

8488 Commits

Author SHA1 Message Date
Calle Wilund
5cc3fc4f14 cluster/test_encryption: bring test from enterprise (and enable)
Fixes scylladb/scylla-enterprise#5262

Part of the source-available code migration from scylla-enterprise.git
to scylla.git.

Original comment: topology_custom: add test_file_streaming_respects_encryption

Reproducer for issue scylladb/scylla-enterprise#4246.

Closes scylladb/scylladb#23320
2025-03-20 10:07:16 +02:00
Botond Dénes
d06bc27979 Merge 'Don't export string filenames from sstable' from Pavel Emelyanov
There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print them into logs, sometimes are used to be put into exception messages. And there are places that use these strings as file names. Since now sstables can also be stored on S3, generic code shouldn't consider those strings as on disk file names.

Other than that, even when the methods are used to put component names into logs, in many cases these log messages come with debug or trace level, so generated strings are immediately dropped on the floor, but generating it is not extremely cheap. Code would benefit from using lazily-printed names.

This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit.

refs: #14122 (previous ugly attempt to achieve the same goal)

Closes scylladb/scylladb#23194

* github.com:scylladb/scylladb:
  sstable: Remove unused malformed_sstable_exctpion(string filename)
  sstables: Make filename() return component_name
  sstables: Make file_writer keep component_name on board
  sstables: Make get_filename() return component_name
  sstables: Make toc_filename() return component_name
  sstables: Make sstable::index_filename() return component_name
  sstables: Introduce struct component_name
  sstables: Remove unused sstable::component_filenames() method
  sstables: Do not print component filenames on load-and-stream wrap-up
  sstables: Explicitly format prefix in S3 object name making
  sstables: Don't include directory name in exception
  sstables: Use fmt::format instead of string concatenation
  sstables: Rename filename($component) calls to ${component}_filename()
  sstables: Rename local filename variable to component_name
2025-03-20 09:51:03 +02:00
Avi Kivity
a62ab824e6 schema: deprecate schema_extension
schema_extension allows making invisible changes to system_schema
that evade upgrade rollback tests. They appear in system_schema
as an encoded blob which reduces serviceability, as they cannot
be read.

Deprecate it and point users to adding explicit columns in scylla_tables.

We could probably make use of the data structure, after we teach it
to encode its payload into proper named and typed columns instead of
using IDL.

Closes scylladb/scylladb#23151
2025-03-19 20:36:16 +02:00
Nadav Har'El
317de64281 test/alternator: enable debugging output during Python crashes
For a long time now, we've been seeing (see #17564), once in a while,
Alternator tests crashing with the Python process getting killed on
SIGSEGV after the tests have already finished successfully and all
pytest had to do is exit. We have not been able to figure out where the
bug is. Unfortunately, we've never been able to reproduce this bug
locally - and only rarely we see it in CI runs, and when it happens
we don't any information on why it happend.

So the goal of this patch is to print more information that might
hopefully help us next time we see this problem in CI (this patch
does NOT fix the bug). This patch adds to test/alternator's conftest.py
a call to faulthandler.enable(). This traps SIGSEGV and prints a stack
trace (for each thread, if there are several) showing what Python was
trying to do while it is crashing. Hopefully we'll see in this output
some specific cleanup function belonging to boto3 or urllib or whatever,
and be able to figure out where the bug is and how to avoid it.

We could have added this faulthandler.enable() call to the top-level
conftest.py or to test.py, but since we only ever had this Python
crash in Alternator tests, I think it is more suitable that we limit
this desperate debugging attempt only to Alternator tests.

Refs #17564

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23340
2025-03-19 18:18:51 +03:00
Pavel Emelyanov
73187a2e19 Merge 'mutation/mutation_consumer_concepts: simplify consumer hierarchy' from Botond Dénes
The reader consumer concept hierarchy is a sprawling confusing jungle of deeply nested concepts. Looking at `FlattenedConsumer[V2]` -- the subject of this PR: this consumer is defined in terms of the `StreamedMutationConsumer[V2]` which in terms is defined in terms of the `FragmentConsumer[V2]`.
This amount of nesting makes it really hard to see what a concept actually comes down to: made even more difficult by the fact that the concepts are scattered across two header files.
In theory, this nesting allows for greater flexibility: some code can use a lower lever concept directly while it can also serve as the basis for the higher lever concepts. But the fact of the matter is that none of the lower level concepts are used directly, so we pay the price in hard-to-follow code for no benefit.

This PR cuts down the complexity by folding up the entire hierarchy into the top-level `FlattenedConsumer[V2]` and `FlatteneConsumerReturning[V2]` concepts.
Doing this immediately reveals just how similar the two major consumer concepts (`FlattenedConsumer[V2]` and `MutationFragmentConsumer[V2]`) supported by `mutation_reader` are. In a follow-up PR, we will attempt to unify the two.

Refactoring, no backport needed.

Closes scylladb/scylladb#23344

* github.com:scylladb/scylladb:
  mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2]
  mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2]
  test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
2025-03-19 15:43:00 +03:00
Pavel Emelyanov
f06cc32812 sstables: Make filename() return component_name
Similarly to toc_, index_ and data filenames, make the generic component
name getter return back not string, but a wrapper object. Most of
callers are log messages and exception generations. Other than that
there are tests, filesystem storage driver and few more places in
generic code who "know" that they work with real files, so make them use
explicit fmt::to_string().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
1ba91e28cb sstables: Make get_filename() return component_name
Similarly to previous patches -- mostly the result is used as log
argument. The remaining users include

- scylla sstable tool that dumps component names to json output
- API endpoint that returns component names to user
- tests

these are all good to explicitly convert component_names to strings.

There are few more places that expect strings instead of component name
objects. For now they also use fmt::to_string() explicitly, partially it
will be fixed later, mostly -- as future follow-ups.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
0cdeed858c sstables: Make toc_filename() return component_name
Most of the callers use the returned value as log message parameter,
some construct malformed_sstable_exception that was prepared by previous
patch.

The remaining callers explicitly use fmt::to_string(), these are

- pending deletion log creation
- filesystem storage code
- tests
- stream-blob code that re-loads sstable

All but the last one are OK to use string toc name, the last one is not
very correct in its usage of toc_filename string, but it needs more care
to be fixed properly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
dcc9167734 sstables: Rename filename($component) calls to ${component}_filename()
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the former.
It's shorter, nicer and makes further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Botond Dénes
8f0d0daf53 Merge 'repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk
Do not hold erm during repair of a tablet that is started with tablet
repair scheduler. This way two different tablets can be repaired
and migrated concurrently. The same tablet won't be migrated while
being repaired as it is provided by topology coordinator.

Use topology_guard to maintain safety.

Fixes: https://github.com/scylladb/scylladb/issues/22408.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#22842

* github.com:scylladb/scylladb:
  test: add test to check concurrent tablets migration and repair
  repair: do not hold erm for repair scheduled by scheduler
  repair: get total rf based on current erm
  repair: make shard_repair_task_impl::erm private
  repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
  repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
  repair: pass session_id to repair_writer_impl::create_writer
  repair: keep materialized topology guard in shard_repair_task_impl
  repair: pass session_id to repair_meta
2025-03-19 08:55:24 +02:00
Piotr Dulikowski
2ca1c0b6f9 Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak
This PR introduces the new Raft-based recovery procedure for group 0
majority loss.

The Raft-based recovery procedure works with tablets. The old
gossip-based recovery procedure does not because we have no code
for tablet migrations after the gossip-based topology changes.

The Raft-based procedure requires the Raft-based topology to be
enabled in the cluster. If the Raft-based topology is not enabled, the
gossip-based procedure must be used.

We will be able to get rid of the gossip-based procedure when we make
the Raft-based topology mandatory (we can do both in the same version,
2025.2 is the plan). Before we do it, we will have to keep both procedures
and explain when each of them should be used.

The idea behind the new procedure is to recreate group 0 without
touching the topology structures. Once we create a new group 0, we
can remove all dead nodes using the standard `removenode` and
`replace` operations.

For the procedure to be safe, we must ensure that each member of the
new group 0 moves to the same initial group 0 state. Also, the only safe
choice for the state is the latest persistent state available among the
live nodes.

The solution to the problem above is to ensure that the leader of the new
group 0 (called the recovery leader) is one of the nodes with the latest
state available. Other members will receive the snapshot from the
recovery leader when they join the new group 0 and move to its state.

Below is the shortened description of the new recovery procedure from
the perspective of the administrator. For the full description, refer to the
design document.
1. Find the set of live nodes.
2. Kill any live node that shouldn't be a member of the new group 0.
3. Ensure the full network connectivity between live nodes.
4. Rolling restart live nodes to ensure they are healthy and ready for
recovery.
5. Check if some data could have been lost. If yes, restore it from
backup after the recovery procedure.
6. Find the recovery leader (the node with the largest `group0_state_id`).
7. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the
recovery leader on each live node.
9. Rolling restart all live nodes, but the recovery leader must be
restarted first.
10. Remove all dead nodes using `removenode` or `replace`.
11. Unset `recovery_leader` on all nodes.
12. Delete data of the old group 0 from  `system.raft`,
`system.raft_snaphots`, and `system.raft_snapshot_config`.

In the future, we could automate some of these steps or even introduce
a tool that will do all (or most) of them by itself. For now, we are fine with
a procedure that is reliable and simple enough.

This PR makes using 2025.1 with tablets much safer. We want to
backport it to 2025.1. We will also want to backport a few follow-ups.

Fixes scylladb/scylladb#20657

Closes scylladb/scylladb#22286

* github.com:scylladb/scylladb:
  test: mark tests with the gossip-based recovery procedure
  test: add tests for the Raft-based recovery procedure
  test: topology: util: fix the tokens consistency check for left nodes
  test: topology: util: extend start_writes
  gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
  raft_group0: modify_raft_voter_status: do not add new members
  treewide: allow recreating group 0 in the Raft-based recovery procedure
2025-03-18 19:10:56 +01:00
Pavel Emelyanov
420b5bee20 test/s3: Increase boost/s3_test log levels
When something goes wrong, it's impossible to find anyting out without
s3 and http logs, so increase them for boost tests.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23245
2025-03-18 15:59:05 +02:00
Botond Dénes
969b07fdfd test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
The class actually implements the FlattenedConsumer, so fix the comment.
This eliminates the only reference to the StreamedMutationConsumer
concept.
2025-03-18 07:57:04 -04:00
Botond Dénes
2795d83b32 Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund
Fixes #23017

When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed.

Now, eventually, this should be released, once we do deletion again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure.

As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0
db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is
to simply give the segments to all CL shards, thus distributing the available space.

Closes scylladb/scylladb#23150

* github.com:scylladb/scylladb:
  main/commitlog: wait for file deletion and distribute recycled segments to shards
  commitlog: Serialize file deletion
2025-03-18 11:47:17 +02:00
Botond Dénes
afa305ffb4 Merge 'perf/perf_sstable: stop using at_exit() ' from Kefu Chai
`seastar::at_exit()` was marked deprecated recently. so let's use the recommended approach to perform cleanups.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#23253

* github.com:scylladb/scylladb:
  perf/perf_sstable: fix the indent
  perf/perf_sstable: stop using at_exit()
2025-03-17 15:30:10 +02:00
Andrei Chekun
d68e54c26d test.py: Remove reuse cluster in cluster tests
Pool is not aware of the cluster configuration, so it can return cluster
to the test that is not suitable for it. Removing reuse will remove such
possibility, so there will be less flaky tests.

Closes scylladb/scylladb#23277
2025-03-17 15:27:59 +02:00
Calle Wilund
4ed81e05bf commitlog: Serialize file deletion
Fixes #23017

When deleting segments while our footprint is over the limit,
mainly when recycling/deleting segments after replay (recover
boot) we can cause two deletion passes to be running at the same
time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are
considering many files for delete/recycle, we can, due to task
switch, end up considering segments ok to keep, in parallel,
even though one of them should be deleted. The end result
will be us keeping one more segment than should be allowed.
Now, eventually, this should be released, once we do deletion
again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might
cause some delay in processing cycles for recycle, but in
practice, this should never happen when we are in fact under
pressure.

Small unit test included.
2025-03-17 12:09:00 +00:00
Andrei Chekun
7423edb1f7 test.py: Increase verbosity of pytest
Currently, pytest truncates long objects in assertions.
This makes understanding the failure message difficult.
This will increase verbosity and pytest will stop truncating messages.

Closes scylladb/scylladb#23263
2025-03-17 12:51:41 +02:00
Aleksandra Martyniuk
20f9d7b6eb test: add test to check concurrent tablets migration and repair
Add a test to check whether a tablet can be migrated while another
tablet is repaired.
2025-03-17 10:37:03 +01:00
Andrei Chekun
a20d848c01 test.py: Refactor test/conftest.py
Move functions responsible for preparation of the environment to the util file.
This is extracted from https://github.com/scylladb/scylladb/pull/22894 to make it easier to work together.

Closes scylladb/scylladb#23221
2025-03-17 11:31:00 +02:00
Avi Kivity
4416b0c732 treewide: use angle brackets for including seastar headers
Seastar is an external library, so we use angle brackets to
include its interfaces.

Closes scylladb/scylladb#23301
2025-03-17 10:03:06 +02:00
Raphael S. Carvalho
e9944f0b7c service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at
only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack.
    This is similar to how we match replicas within the same host.
    It reduces number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack-overloaded in case
    RF!=#racks

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>

Closes scylladb/scylladb#23247
2025-03-16 22:45:00 +02:00
Pavel Emelyanov
604fdd86e9 test: Count mutation fragments verbosily in scoped restore test
Sometimes after scoped restore a key is not found in nodes' mutation
fragments. This patch makes the counting more verbose to get better
understanding of what's going on in case of test failure

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23296
2025-03-14 21:31:36 +02:00
Botond Dénes
39bcf99f8e Merge 'Apply hard limit to partition range vectors in secondary index queries' from Nikos Dragazis
Secondary index queries fetch partition keys from the index view and store them in an `std::vector`. The vector size is currently limited by the user's page size and the page memory limit (1MiB). These are not enough to prevent large contiguous allocations (which can lead to stalls).

This series introduces a hard limit to the vector size to ensure it does not exceed the allocator's preferred max contiguous allocation size (128KiB). With the size of each element being 120 bytes, this allows for 1092 partition keys. The limit was set to 1000. Any partitions above this limit are discarded.

Discarding partitions breaks the querier cache on the replicas, causing a performance regression, as can be seen from the following measurements:
```
* Cluster: 3 nodes (local Docker containers), 1 vCPU, 4GB memory, dev mode
* Schema:
  CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false};
  CREATE TABLE ks.t1 (pk1 int, pk2 int, ck int, value int, PRIMARY KEY ((pk1, pk2), ck));
  CREATE INDEX t1_pk2_idx ON ks.t1(pk2);
* Query: CONSISTENCY LOCAL_QUORUM; SELECT * FROM ks.t1 where pk2 = 1;

+------------+-------------------+-------------------+
|  Page Size |      Master       |   Vector Limit    |
+============+===================+===================+
|            |   Latency (sec)   |   Latency (sec)   |
+------------+-------------------+-------------------+
|     100    |  5.80 ± 0.13      |  5.64 ± 0.10      |
+------------+-------------------+-------------------+
|    1000    |  4.77 ± 0.07      |  4.62 ± 0.06      |
+------------+-------------------+-------------------+
|    2000    |  4.67 ± 0.07      |  5.13 ± 0.03      |
+------------+-------------------+-------------------+
|    5000    |  4.82 ± 0.09      |  6.25 ± 0.06      |
+------------+-------------------+-------------------+
|   10000    |  4.89 ± 0.36      |  7.52 ± 0.13      |
+------------+-------------------+-------------------+
|     -1     |  4.90 ± 0.67      |  4.79 ± 0.33      |
+------------+-------------------+-------------------+
```
We expect this to be fixed with adaptive paging in a future PR. Until then, users can avoid regressions by adjusting their page size.

Additionally, this series changes the `untyped_result_set` to store rows in a `chunked_vector` instead of an `std::vector`, similarly to the `result_set`. Secondary index queries use an `untyped_result_set` to store the raw result from the index view before processing. With 1MiB results, the `std::vector` would cause a large allocation of this magnitude.

Finally, a unit test is added to reproduce the bug.

Fixes #18536.

The PR fixes stalls of up to 100ms, but there is an easy workaround: adjust the page size. No need to backport.

Closes scylladb/scylladb#22682

* github.com:scylladb/scylladb:
  cql3: secondary index: Limit page size for single-row partitions
  cql3: secondary index: Limit the size of partition range vectors
  cql3: untyped_result_set: Store rows in chunked_vector
  test: Reproduce bug with large allocations from secondary index
2025-03-14 15:06:07 +02:00
Botond Dénes
83ea1877ab Merge 'scylla-sstable: add native S3 support' from Ernest Zaslavsky
scylla-sstable: Enable support for S3-stored sstables

Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532)

This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst

ref: https://github.com/scylladb/scylladb/issues/20532
fixes: https://github.com/scylladb/scylladb/issues/20535

No backport needed since the S3 functionality was never released

Closes scylladb/scylladb#22321

* github.com:scylladb/scylladb:
  tests: Add Tests for Scylla-SSTable S3 Functionality
  docs: Update Scylla Tools Documentation for S3 SSTable Support
  scylla-sstable: Enable Support for S3 SSTables
  s3: Implement S3 Fully Qualified Name Manipulation Functions
  object_storage: Refactor `object_storage.yaml` parsing logic
2025-03-14 15:05:52 +02:00
Patryk Jędrzejczak
ca5c223505 test: mark tests with the gossip-based recovery procedure
This patch makes it clear which Raft recovery procedure is used in
each test.

Tests with "This test uses the gossip-based recovery procedure." are
the tests that use the gossip-based topology. This tests should be
deleted once we make the Raft-based topology mandatory.

Tests with the new FIXME are the tests that use the Raft-based
topology. They should be changed to use the Raft-based recovery
procedure or removed if they don't test anything important with
the new procedure.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
4fd0e93154 test: add tests for the Raft-based recovery procedure 2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
4e055882c1 test: topology: util: fix the tokens consistency check for left nodes
When we remove a node in the Raft-based topology
(by remove/replace/decommission), we remove its
tokens from `system.topology`, but we do not
change `num_tokens`. Hence, the old check could
fail for left nodes.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
d0efc77d20 test: topology: util: extend start_writes
We extend `start_writes` to allow:
- providing `ks_name` from the test,
- restarting it (by starting it again with the same `ks_name`),
- running it in the presence of shutdowns.

We use these features in a new test in one of the following patches.
2025-03-14 13:53:05 +01:00
Nadav Har'El
de7c1d526a test/cqlpy: test DESC doesn't list an index as a view
Issue #6058 complained that "DESCRIBE TABLE" or "DESCRIBE KEYSPACE" list
a secondary index as materialized view (the view used to back the index
in Scylla's implementation of secondary indexes). This patch adds a test
to verify that this issue no longer exists in server-side describe - so we
can mark the issue as fixed.

While preparing this test, I noticed that Scylla and Cassandra behave
differently on whether DESC TABLE should list materialized views or not,
so this patch also includes a test for that as well - and I opened
issue #23014 on Scylla and CASSANDRA-20365 on Cassandra to further
discuss that new issue.

Fixes #6058
Refs #23014.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23015
2025-03-14 14:40:19 +03:00
Nadav Har'El
c0821842de alternator: document the state of tablet support in Alternator
In commit c24bc3b we decided that creating a new table in Alternator
will by default use vnodes - not tablets - because of all the missing
features in our tablets implementation that are important for
Alternator, namely - LWT, CDC and Alternator TTL.

We never documented this, or the fact that we support a tag
`experimental:initial_tablets` which allows to override this decision
and create an Alternator table using tablets. We also never documented
what exactly doesn't work when Alternator uses tablet.

This patch adds the missing documentation in docs/alternator/new-apis.md
(which is a good place for describing the `experimental:initial_tablets`
tag). The patch also adds a new test file, test_tablets.py, which
includes tests for all the statements made in the document regarding
how `experimental:initial_tablets` works and what works or doesn't
work when tablets are enabled.

Two existing tests - for TTL and Streams non-support with tablets -
are moved to the new test file.

When the tablets feature will finally be completed, both the document
and the tests will need to be modified (some of the tests should be
outright deleted). But it seems this will not happen for at least
several months, and that is too long to wait without accurate
documentation.

Fixes #21629

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22462
2025-03-14 14:03:15 +03:00
Pavel Emelyanov
2bb455ec75 Merge 'Main: stop system_keyspace' from Benny Halevy
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.

This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.

* Enhancement, no backport needed

Closes scylladb/scylladb#23113

* github.com:scylladb/scylladb:
  main: stop system keyspace
  system_keyspace: call shutdown from stop
  system_keyspace: shutdown: allow calling more than once
  database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
  utils: add class pluggable
2025-03-14 13:23:28 +03:00
Nadav Har'El
a72dde2ee6 test/cqlpy: add test for long table names
Scylla inherited a 48-character limit on the length of table (and
keyspace) names from Cassandra 3. It turns out that Cassandra 4 and
5 unintentionally dropped this limit (see history lesson in
CASSANDRA-20425), and now Cassandra accepts longer table names.
Some Cassandra users are using such longer names and disappointed
that Scylla doesn't allow them.

This patch includes tests for this feature. One test tries a
48-character table name - it passes on Scylla and all versions
of Cassandra. A second test tries a 100-character table name - this
one passes on Cassandra version 4 and above (but not on 3), and
fails on Scylla so marked "xfail". A third test tries a 500-character
table name. This one fails badly on Cassandra (see CASSANDRA-20389),
but passes on Scylla today. This test is important because we need to
be sure that it continues to pass on Scylla even after the Scylla is
fixed to allow the 100-character test.

Refs #4480 - an issue we already have about supporting longer names

Note on the test implementation:
Ideally, the test for a particular table-name length shouldn't just
create the table - it should also make sure we can write table to it
and flush it, i.e., that sstables can get written correctly. But in
practice, these complications are not needed, because in modern Scylla
it is the directory name which contains the table's name, and the
individual sstable files do not contain the table's name. Just creating
the table already creates the long directory name, so that is the part
that needs to be tested. If we created this directory successfully,
later creating the short-named sstables inside it can't fail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23229
2025-03-14 11:15:07 +03:00
Kefu Chai
a82cfbecad test: perf_sstable: close frag_stream before destoying it
the underlying reader should be closed before being destroyed. otherwise
we'd have following failure when testing the "full_scan_streaming":

```
$ scylla perf-sstable --parallelism 1 --iterations 20 --partitions 20 --testdir /tmp/sstable --mode full_scan_streaming

ERROR 2025-03-13 15:04:26,321 [shard  0:main] mutation_reader - N8sstables2mx27mx_sstable_full_scan_readerE [0x60015a36b650]: permit *.*:test: was not closed before destruction, at: 0x235931e 0x2359470 0x239deb3 0x62a1ed3 0x89fd156 0x89c3fba 0x22a6ed3 0x22a8fea 0x22aae17 0x22a9928 0x26bb7d0 0x26bbe3e 0x89bca67 0x246bd8d /lib64/libc.so.6+0x3247 /lib64/libc.so.6+0x330a 0x1657774
------
   seastar::internal::coroutine_traits_base<double>::promise_type
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23270
2025-03-14 11:12:44 +03:00
Piotr Smaron
d365d9b2ad test/ldap: assign non-busy ports to ldap
It may happen that the ports we randomly choose for LDAP are busy, and
that'd fail the test suite, so once we randomly select ports, now we'll
see if they're busy or not, and if they're busy, we'll select next ones,
until we finally have some free ports for LDAP.
Tested with: `./test.py ldap/ldap_connection_test --repeat 1000 -j 10`:
before the fix, this command fails after ~112 runs, and of course it
passes with the fix.

Fixes: scylladb/scylla-enterprise#5120
Fixes: scylladb/scylladb#23149
Fixes: scylladb/scylladb#23242

Closes scylladb/scylladb#23275
2025-03-14 11:09:19 +03:00
Kefu Chai
31320399e8 test: sstable_test: use auto instead of statistics to avoid name collision
Replace explicit `statistics` type with `auto` in sstable_test to
resolve name collision. This addresses ambiguity introduced by commit
87c221cb which added `struct statistics` in
`seastar/include/seastar/net/api.hh`, conflicting with the existing
definition in `scylladb/sstables/types.hh` when the `seastar` namespace
is opened.

The `auto` keyword avoids the need to explicitly reference either type,
cleanly resolving the collision while maintaining functionality.
This change prepares for the upcoming change to bump up seastar
submodule.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23249
2025-03-13 22:51:21 +02:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This is series starts conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series. On of which is dropping code
related to old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new node can co-exist in the same
cluster,
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Dawid Mędrek
0a6137218a db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811
2025-03-13 11:55:15 +02:00
Paweł Zakrzewski
d483051e44 cql3/select_statement: reject aggregate functions when PER PARTITION LIMIT is present
Before this patch we silently allowed and ignored PER PARTITION LIMIT.
While using aggregate functions in conjunction with PER PARTITION LIMIT
can make sense, we want to disable it until we can offer proper
implementation, see #9879 for discussion.

We want to match Cassandra, and for queries with aggregate functions it
behaves as follows:
- it silently ignores PER PARTITION LIMIT if GROUP BY is present, which
  matches our previous implementation.
- rejects PER PARTITION LIMIT when GROUP BY is *not* present.

This patch adds rejection of the second group.

Fixes #9879

Closes scylladb/scylladb#23086
2025-03-13 10:29:53 +02:00
Pavel Emelyanov
f50bcbf4d0 test/perf/s3: Don't forget to stop sharded<tester> on error
In case invoke_on_all(tester::start) throws, the sharded<tester>
instance remains non-stopped and calltrace is reported on test stop. Not
nice, fix it so that sharded<> thing is stopped in any case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23244
2025-03-13 09:54:09 +02:00
Kefu Chai
68fc067106 perf/perf_sstable: fix the indent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-12 19:00:50 +08:00
Kefu Chai
4f62f79622 perf/perf_sstable: stop using at_exit()
seastar::at_exit() was marked deprecated recently. so let's use
the recommended approach to perform cleanups.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-12 19:00:50 +08:00
Nadav Har'El
3ca2e6ddda Merge 's3_client: Add retries to Security Token Service/EC2 instance metadata credentials providers' from Ernest Zaslavsky
Several updates and improvements to the retryable HTTP client functionality, as well as enhancements to error handling and integration with AWS services, as part of this PR. Below is a summary of the changes:

- Moved the retryable HTTP client functionality out of the S3 client to improve modularity and reusability across other services like AWS STS.

- Isolated the retryable_http_client into its own file, improving clarity and maintainability.

- Added a make_request method that introduces a response-skipping handler.

- Introduced a custom error handler constructor, providing greater flexibility in handling errors.

- Updated the STS and Instance Metadata Service credentials providers to utilize the new retryable HTTP client, enhancing their robustness and reliability.

- Extended the AWS error list to handle errors specific to the STS service, ensuring more granular and accurate error management for STS operations.

- Enhanced error handling for system errors returned by Seastar’s HTTP client, ensuring smoother operations.

- Properly closed the HTTP client in instance_profile_credentials_provider and sts_assume_role_credentials_provider to prevent resource leaks.

- Reduced the log severity in the retry strategy to avoid SCT test failures that occur when any log message is tagged as an ERROR.

No backport needed since we dont have any s3 related activity on the scylla side been released

Closes scylladb/scylladb#21933

* github.com:scylladb/scylladb:
  s3_client: Adjust Log Severity in Retry Strategy
  aws_error: Enhance error handling for AWS HTTP client
  aws_error: Add STS specific error handling
  credentials_providers: Close retryable clients in Credentials Providers
  credentials_providers: Integrate retryable_http_client with Credentials Providers
  s3_client: enhance `retryable_http_client` functionality
  s3_client: isolate `retryable_http_client`
  s3_client: Prepare for `retryable_http_client` relocation
  s3_client: Remove `is_redirect_status` function
  s3_client: Move retryable functionality out of s3 client
2025-03-12 10:19:15 +02:00
Avi Kivity
b1d9f80d85 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
2025-03-11 14:34:27 +02:00
Gleb Natapov
8425c26462 gossiper: start using host ids to send messages earlier
Send digest ack and ack2 by host ids as well now since the id->ip
mapping is available after receiving digest syn. It allows to convert
more code to host id here.
2025-03-11 12:09:21 +02:00
Gleb Natapov
f0af3f261e messaging_service: add temporary address map entry on incoming connection
We want to move to use host ids as soon as possible. Currently it is
possible only after the full gossiper exchange (because only at this
point gossiper state is added and with it address map entry). To make it
possible to move to host ids earlier this patch adds address map entries
on incoming communication during CLIENT_ID verb processing. The patch
also adds generation to CLIENT_ID to use it when address map is updated.
It is done so that older gossiper entries can be overwritten with newer
mapping in case of IP change.
2025-03-11 12:09:21 +02:00
Nikos Dragazis
76b31a3acc cql3: secondary index: Limit the size of partition range vectors
The partition range vector is an std::vector, which means it performs
contiguous allocations. Large allocations are known to cause problems
(e.g., reactor stalls).

For paged queries, limit the vector size to 1000. If more partition keys
are available in the query result, discard them. Ideally, we should not
be fetching them at all, but this is not possible without knowing the
size of each partition.

Currently, each vector element is 120 bytes and the standard allocator's
max preferred contiguous allocation is 128KiB. Therefore, the chosen
value of 1000 satisfies the constraint (128 KiB / 120 = 1092 > 1000).
This should be good enough for most cases. Since secondary index queries
involve one base table query per partition key, these queries are slow.
A higher limit would only make them slower and increase the probability
of a timeout. For the same reason, saving a follow-up paged request from
the client would not increase the efficiency much.

For unpaged queries, do not apply any limit. This means they remain
susceptible to stalls, but unpaged queries are considered unoptimized
anyway.

Finally, update the unit test reproducer since the bug is now fixed.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-10 12:18:42 +02:00
Ernest Zaslavsky
8e46929474 aws_error: Enhance error handling for AWS HTTP client
- Seastar's HTTP client is known to throw exceptions for various reasons, including network errors, TLS errors and other transient issues.
- Update error handling to correctly capture and process all exceptions from Seastar's HTTP client.
- Previously, only aws_exception was handled, causing retryable errors to be missed and `should_retry` not invoked.
- Now, all exceptions trigger the appropriate retry logic per the intended strategy.
- Add tests for the S3 proxy to ensure robustness and reliability of these enhancements.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
6a3cef5703 metadata: Correct "DESCRIBE" output for keyspace metadata
Update the "DESCRIBE" command output to accurately display `tablet` settings in keyspace metadata.

Closes scylladb/scylladb#23056
2025-03-09 14:50:08 +02:00
Ernest Zaslavsky
050c3cdbc2 tests: Add Tests for Scylla-SSTable S3 Functionality
Extended existing Scylla Tools tests to cover the new functionality of
reading SSTables from S3. This ensures that the new S3 integration is
thoroughly tested and performs as expected.
2025-03-09 10:17:48 +02:00