Commit Graph

8462 Commits

Author SHA1 Message Date
Botond Dénes
969b07fdfd test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
The class actually implements the FlattenedConsumer, so fix the comment.
This eliminates the only reference to the StreamedMutationConsumer
concept.
2025-03-18 07:57:04 -04:00
Raphael S. Carvalho
e9944f0b7c service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, so T1 will temporarily be at
only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack
    (see the sketch after this list). This is similar to how we
    match replicas within the same host. It reduces the number of
    across-rack migrations even if RF != #racks, which the original
    patch didn't handle. Unlike the original patch, it also avoids
    rack overload in case RF != #racks.

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations
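
For illustration, a minimal sketch of the matching idea from point 1, with
hypothetical types and names (the real logic lives in the tablet load
balancer and considers more than host and rack):

```
#include <optional>
#include <string>
#include <vector>

struct replica { std::string host; std::string rack; };

// For each replica of the sibling tablet T1, prefer a T0 replica on the
// same host, then one in the same rack; only when neither exists does the
// merge fall back to an across-rack co-locating migration.
std::optional<replica> match_replica(const replica& r1, std::vector<replica>& t0_candidates) {
    for (auto it = t0_candidates.begin(); it != t0_candidates.end(); ++it) {
        if (it->host == r1.host) {
            replica r = *it;
            t0_candidates.erase(it);
            return r;
        }
    }
    for (auto it = t0_candidates.begin(); it != t0_candidates.end(); ++it) {
        if (it->rack == r1.rack) {
            replica r = *it;
            t0_candidates.erase(it);
            return r;
        }
    }
    return std::nullopt; // an across-rack migration is needed to finalize the merge
}
```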

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>

Closes scylladb/scylladb#23247
2025-03-16 22:45:00 +02:00
Pavel Emelyanov
604fdd86e9 test: Count mutation fragments verbosely in scoped restore test
Sometimes after scoped restore a key is not found in nodes' mutation
fragments. This patch makes the counting more verbose to get a better
understanding of what's going on in case of a test failure.

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23296
2025-03-14 21:31:36 +02:00
Botond Dénes
39bcf99f8e Merge 'Apply hard limit to partition range vectors in secondary index queries' from Nikos Dragazis
Secondary index queries fetch partition keys from the index view and store them in an `std::vector`. The vector size is currently limited by the user's page size and the page memory limit (1MiB). These are not enough to prevent large contiguous allocations (which can lead to stalls).

This series introduces a hard limit to the vector size to ensure it does not exceed the allocator's preferred max contiguous allocation size (128KiB). With the size of each element being 120 bytes, this allows for 1092 partition keys. The limit was set to 1000. Any partitions above this limit are discarded.

Discarding partitions breaks the querier cache on the replicas, causing a performance regression, as can be seen from the following measurements:
```
* Cluster: 3 nodes (local Docker containers), 1 vCPU, 4GB memory, dev mode
* Schema:
  CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false};
  CREATE TABLE ks.t1 (pk1 int, pk2 int, ck int, value int, PRIMARY KEY ((pk1, pk2), ck));
  CREATE INDEX t1_pk2_idx ON ks.t1(pk2);
* Query: CONSISTENCY LOCAL_QUORUM; SELECT * FROM ks.t1 where pk2 = 1;

+------------+-------------------+-------------------+
|  Page Size |      Master       |   Vector Limit    |
+============+===================+===================+
|            |   Latency (sec)   |   Latency (sec)   |
+------------+-------------------+-------------------+
|     100    |  5.80 ± 0.13      |  5.64 ± 0.10      |
+------------+-------------------+-------------------+
|    1000    |  4.77 ± 0.07      |  4.62 ± 0.06      |
+------------+-------------------+-------------------+
|    2000    |  4.67 ± 0.07      |  5.13 ± 0.03      |
+------------+-------------------+-------------------+
|    5000    |  4.82 ± 0.09      |  6.25 ± 0.06      |
+------------+-------------------+-------------------+
|   10000    |  4.89 ± 0.36      |  7.52 ± 0.13      |
+------------+-------------------+-------------------+
|     -1     |  4.90 ± 0.67      |  4.79 ± 0.33      |
+------------+-------------------+-------------------+
```
We expect this to be fixed with adaptive paging in a future PR. Until then, users can avoid regressions by adjusting their page size.

Additionally, this series changes the `untyped_result_set` to store rows in a `chunked_vector` instead of an `std::vector`, similarly to the `result_set`. Secondary index queries use an `untyped_result_set` to store the raw result from the index view before processing. With 1MiB results, the `std::vector` would cause a large allocation of this magnitude.
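
For illustration, a minimal sketch of the chunked-storage idea (simplified
hypothetical type; ScyllaDB's real implementation is `utils::chunked_vector`):

```
#include <cstddef>
#include <memory>
#include <vector>

// Simplified sketch: elements live in fixed-size chunks, so appending N rows
// performs many bounded allocations instead of one allocation of N * sizeof(T).
// (The outer vector only grows by one pointer per chunk.)
template <typename T, std::size_t ChunkSize = 1024>
class chunked_vector_sketch {
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
public:
    void push_back(T v) {
        if (_chunks.empty() || _chunks.back()->size() == ChunkSize) {
            auto chunk = std::make_unique<std::vector<T>>();
            chunk->reserve(ChunkSize); // bounded contiguous allocation
            _chunks.push_back(std::move(chunk));
        }
        _chunks.back()->push_back(std::move(v));
    }
    T& operator[](std::size_t i) {
        return (*_chunks[i / ChunkSize])[i % ChunkSize];
    }
};
```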

Finally, a unit test is added to reproduce the bug.

Fixes #18536.

The PR fixes stalls of up to 100ms, but there is an easy workaround: adjust the page size. No need to backport.

Closes scylladb/scylladb#22682

* github.com:scylladb/scylladb:
  cql3: secondary index: Limit page size for single-row partitions
  cql3: secondary index: Limit the size of partition range vectors
  cql3: untyped_result_set: Store rows in chunked_vector
  test: Reproduce bug with large allocations from secondary index
2025-03-14 15:06:07 +02:00
Botond Dénes
83ea1877ab Merge 'scylla-sstable: add native S3 support' from Ernest Zaslavsky
scylla-sstable: Enable support for S3-stored sstables

Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532)

This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst

ref: https://github.com/scylladb/scylladb/issues/20532
fixes: https://github.com/scylladb/scylladb/issues/20535

No backport needed since the S3 functionality was never released

Closes scylladb/scylladb#22321

* github.com:scylladb/scylladb:
  tests: Add Tests for Scylla-SSTable S3 Functionality
  docs: Update Scylla Tools Documentation for S3 SSTable Support
  scylla-sstable: Enable Support for S3 SSTables
  s3: Implement S3 Fully Qualified Name Manipulation Functions
  object_storage: Refactor `object_storage.yaml` parsing logic
2025-03-14 15:05:52 +02:00
Nadav Har'El
de7c1d526a test/cqlpy: test DESC doesn't list an index as a view
Issue #6058 complained that "DESCRIBE TABLE" or "DESCRIBE KEYSPACE" lists
a secondary index as a materialized view (the view used to back the index
in Scylla's implementation of secondary indexes). This patch adds a test
to verify that this issue no longer exists in server-side describe - so we
can mark the issue as fixed.

While preparing this test, I noticed that Scylla and Cassandra behave
differently on whether DESC TABLE should list materialized views or not,
so this patch also includes a test for that as well - and I opened
issue #23014 on Scylla and CASSANDRA-20365 on Cassandra to further
discuss that new issue.

Fixes #6058
Refs #23014.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23015
2025-03-14 14:40:19 +03:00
Nadav Har'El
c0821842de alternator: document the state of tablet support in Alternator
In commit c24bc3b we decided that creating a new table in Alternator
will by default use vnodes - not tablets - because of all the missing
features in our tablets implementation that are important for
Alternator, namely - LWT, CDC and Alternator TTL.

We never documented this, or the fact that we support a tag,
`experimental:initial_tablets`, which allows overriding this decision
and creating an Alternator table using tablets. We also never documented
what exactly doesn't work when Alternator uses tablets.

This patch adds the missing documentation in docs/alternator/new-apis.md
(which is a good place for describing the `experimental:initial_tablets`
tag). The patch also adds a new test file, test_tablets.py, which
includes tests for all the statements made in the document regarding
how `experimental:initial_tablets` works and what works or doesn't
work when tablets are enabled.

Two existing tests - for TTL and Streams non-support with tablets -
are moved to the new test file.

When the tablets feature is finally completed, both the document
and the tests will need to be modified (some of the tests should be
outright deleted). But it seems this will not happen for at least
several months, and that is too long to wait without accurate
documentation.

Fixes #21629

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22462
2025-03-14 14:03:15 +03:00
Pavel Emelyanov
2bb455ec75 Merge 'Main: stop system_keyspace' from Benny Halevy
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.

This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.

* Enhancement, no backport needed

Closes scylladb/scylladb#23113

* github.com:scylladb/scylladb:
  main: stop system keyspace
  system_keyspace: call shutdown from stop
  system_keyspace: shutdown: allow calling more than once
  database, compaction_manager, large_data_handler: use pluggable<system_keyspace>
  utils: add class pluggable
2025-03-14 13:23:28 +03:00
Nadav Har'El
a72dde2ee6 test/cqlpy: add test for long table names
Scylla inherited a 48-character limit on the length of table (and
keyspace) names from Cassandra 3. It turns out that Cassandra 4 and
5 unintentionally dropped this limit (see history lesson in
CASSANDRA-20425), and now Cassandra accepts longer table names.
Some Cassandra users are using such longer names and are disappointed
that Scylla doesn't allow them.

This patch includes tests for this feature. One test tries a
48-character table name - it passes on Scylla and all versions
of Cassandra. A second test tries a 100-character table name - this
one passes on Cassandra version 4 and above (but not on 3), and
fails on Scylla, so it is marked "xfail". A third test tries a
500-character table name. This one fails badly on Cassandra (see
CASSANDRA-20389), but passes on Scylla today. This test is important
because we need to be sure that it continues to pass on Scylla even
after Scylla is fixed to allow the 100-character test.

Refs #4480 - an issue we already have about supporting longer names

Note on the test implementation:
Ideally, the test for a particular table-name length shouldn't just
create the table - it should also make sure we can write data to it
and flush it, i.e., that sstables can get written correctly. But in
practice, these complications are not needed, because in modern Scylla
it is the directory name which contains the table's name, and the
individual sstable files do not contain the table's name. Just creating
the table already creates the long directory name, so that is the part
that needs to be tested. If we created this directory successfully,
later creating the short-named sstables inside it can't fail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23229
2025-03-14 11:15:07 +03:00
Kefu Chai
a82cfbecad test: perf_sstable: close frag_stream before destroying it
the underlying reader should be closed before being destroyed. otherwise
we'd get the following failure when testing "full_scan_streaming":

```
$ scylla perf-sstable --parallelism 1 --iterations 20 --partitions 20 --testdir /tmp/sstable --mode full_scan_streaming

ERROR 2025-03-13 15:04:26,321 [shard  0:main] mutation_reader - N8sstables2mx27mx_sstable_full_scan_readerE [0x60015a36b650]: permit *.*:test: was not closed before destruction, at: 0x235931e 0x2359470 0x239deb3 0x62a1ed3 0x89fd156 0x89c3fba 0x22a6ed3 0x22a8fea 0x22aae17 0x22a9928 0x26bb7d0 0x26bbe3e 0x89bca67 0x246bd8d /lib64/libc.so.6+0x3247 /lib64/libc.so.6+0x330a 0x1657774
------
   seastar::internal::coroutine_traits_base<double>::promise_type
```
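
A minimal sketch of the fix's pattern, assuming a Seastar coroutine context
and a reader-like callable with a `close()` returning `future<>` (names are
illustrative):

```
#include <exception>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Close the stream on every path, including the exceptional one, so the
// underlying reader never reaches its destructor while still open.
template <typename Reader>
seastar::future<> consume_all(Reader frag_stream) {
    std::exception_ptr ex;
    try {
        while (auto frag = co_await frag_stream()) {
            // ... consume the fragment ...
        }
    } catch (...) {
        ex = std::current_exception();
    }
    co_await frag_stream.close(); // must run before the reader is destroyed
    if (ex) {
        std::rethrow_exception(ex);
    }
}
```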

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23270
2025-03-14 11:12:44 +03:00
Piotr Smaron
d365d9b2ad test/ldap: assign non-busy ports to ldap
It may happen that the ports we randomly choose for LDAP are busy, which
would fail the test suite. So now, after randomly selecting ports, we
check whether they're busy; if they are, we select the next ones, until
we finally have some free ports for LDAP.
Tested with: `./test.py ldap/ldap_connection_test --repeat 1000 -j 10`:
before the fix, this command fails after ~112 runs, and of course it
passes with the fix.

Fixes: scylladb/scylla-enterprise#5120
Fixes: scylladb/scylladb#23149
Fixes: scylladb/scylladb#23242

Closes scylladb/scylladb#23275
2025-03-14 11:09:19 +03:00
Kefu Chai
31320399e8 test: sstable_test: use auto instead of statistics to avoid name collision
Replace explicit `statistics` type with `auto` in sstable_test to
resolve name collision. This addresses ambiguity introduced by commit
87c221cb which added `struct statistics` in
`seastar/include/seastar/net/api.hh`, conflicting with the existing
definition in `scylladb/sstables/types.hh` when the `seastar` namespace
is opened.

The `auto` keyword avoids the need to explicitly reference either type,
cleanly resolving the collision while maintaining functionality.
This change prepares for the upcoming change to bump up seastar
submodule.
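
A distilled illustration of the collision and the fix, with minimal stand-in
namespaces (not the real Seastar/ScyllaDB headers):

```
namespace net { struct statistics { unsigned packets = 0; }; }   // stand-in for seastar::net
namespace sstables { struct statistics { unsigned rows = 0; }; } // stand-in for sstables/types.hh

sstables::statistics read_statistics() { return {}; }

void test() {
    using namespace net;      // the test opens the seastar namespace...
    using namespace sstables; // ...and its own, so `statistics` is ambiguous
    // statistics s = read_statistics();  // error: reference to 'statistics' is ambiguous
    auto s = read_statistics();           // OK: deduced as sstables::statistics
    (void)s.rows;
}
```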

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23249
2025-03-13 22:51:21 +02:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This series starts the conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host ids. There are also some unrelated cleanups that
were done while working on the series. One of these is dropping code
related to the old shadow round. We replaced the shadow round with the explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new nodes can co-exist in the same
cluster.
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Dawid Mędrek
0a6137218a db/hints: Cancel draining when stopping node
Draining hints may occur in one of two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To compensate for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.
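
A minimal sketch of the changed semantics, assuming Seastar's `abort_source`
and illustrative names (the real hint manager logic is more involved):

```
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>

// Hypothetical sketch: drain loops periodically check an abort_source;
// stop() aborts them unless the node is being decommissioned.
struct hint_manager_sketch {
    seastar::abort_source _draining_abort;
    seastar::future<> _drains_done = seastar::make_ready_future<>();

    seastar::future<> stop(bool decommissioning) {
        if (!decommissioning) {
            // Plain shutdown: cancel in-flight drains. The hints stay on
            // disk, and draining restarts on the next boot for endpoint
            // managers of nodes that already left the cluster.
            _draining_abort.request_abort();
        }
        // Decommission: no abort, so this waits for the drains to finish
        // and the hinted data is replayed before the node goes away.
        return std::move(_drains_done);
    }
};
```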

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay the hints it might have
replayed before these changes, but that's an edge case. We expect this
commit to bring more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811
2025-03-13 11:55:15 +02:00
Paweł Zakrzewski
d483051e44 cql3/select_statement: reject aggregate functions when PER PARTITION LIMIT is present
Before this patch we silently allowed and ignored PER PARTITION LIMIT.
While using aggregate functions in conjunction with PER PARTITION LIMIT
can make sense, we want to disable it until we can offer proper
implementation, see #9879 for discussion.

We want to match Cassandra, and for queries with aggregate functions it
behaves as follows:
- it silently ignores PER PARTITION LIMIT if GROUP BY is present, which
  matches our previous implementation.
- it rejects PER PARTITION LIMIT when GROUP BY is *not* present.

This patch adds rejection of the second group.
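
A minimal sketch of the added validation rule (illustrative flags; the real
check lives in cql3's select statement preparation):

```
#include <stdexcept>

// Sketch of the validation rule added by this patch.
void validate(bool has_aggregates, bool has_per_partition_limit, bool has_group_by) {
    if (has_aggregates && has_per_partition_limit && !has_group_by) {
        throw std::invalid_argument(
            "PER PARTITION LIMIT is not allowed with aggregate queries");
    }
    // With GROUP BY present, PER PARTITION LIMIT is silently ignored,
    // matching Cassandra and our previous behavior.
}
```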

Fixes #9879

Closes scylladb/scylladb#23086
2025-03-13 10:29:53 +02:00
Pavel Emelyanov
f50bcbf4d0 test/perf/s3: Don't forget to stop sharded<tester> on error
In case invoke_on_all(tester::start) throws, the sharded<tester>
instance remains non-stopped and a call trace is reported on test stop.
Not nice; fix it so that the sharded<> instance is stopped in any case.
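
A minimal sketch of the pattern, with illustrative names (note that C++
coroutines cannot co_await inside a catch block, hence the exception_ptr):

```
#include <exception>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sharded.hh>

// If starting the per-shard testers throws partway through, stop the
// sharded<> service before propagating the error, so no instance is
// left non-stopped when the test ends.
template <typename Tester>
seastar::future<> run_testers(seastar::sharded<Tester>& testers) {
    std::exception_ptr ex;
    try {
        co_await testers.invoke_on_all(&Tester::start);
        co_await testers.invoke_on_all(&Tester::run);
    } catch (...) {
        ex = std::current_exception();
    }
    co_await testers.stop(); // stop in any case
    if (ex) {
        std::rethrow_exception(ex);
    }
}
```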

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23244
2025-03-13 09:54:09 +02:00
Nadav Har'El
3ca2e6ddda Merge 's3_client: Add retries to Security Token Service/EC2 instance metadata credentials providers' from Ernest Zaslavsky
This PR includes several updates and improvements to the retryable HTTP client functionality, as well as enhancements to error handling and integration with AWS services. Below is a summary of the changes:

- Moved the retryable HTTP client functionality out of the S3 client to improve modularity and reusability across other services like AWS STS.

- Isolated the retryable_http_client into its own file, improving clarity and maintainability.

- Added a make_request method that introduces a response-skipping handler.

- Introduced a custom error handler constructor, providing greater flexibility in handling errors.

- Updated the STS and Instance Metadata Service credentials providers to utilize the new retryable HTTP client, enhancing their robustness and reliability.

- Extended the AWS error list to handle errors specific to the STS service, ensuring more granular and accurate error management for STS operations.

- Enhanced error handling for system errors returned by Seastar’s HTTP client, ensuring smoother operations.

- Properly closed the HTTP client in instance_profile_credentials_provider and sts_assume_role_credentials_provider to prevent resource leaks.

- Reduced the log severity in the retry strategy to avoid SCT test failures that occur when any log message is tagged as an ERROR.

No backport needed, since no S3-related activity on the Scylla side has been released.

Closes scylladb/scylladb#21933

* github.com:scylladb/scylladb:
  s3_client: Adjust Log Severity in Retry Strategy
  aws_error: Enhance error handling for AWS HTTP client
  aws_error: Add STS specific error handling
  credentials_providers: Close retryable clients in Credentials Providers
  credentials_providers: Integrate retryable_http_client with Credentials Providers
  s3_client: enhance `retryable_http_client` functionality
  s3_client: isolate `retryable_http_client`
  s3_client: Prepare for `retryable_http_client` relocation
  s3_client: Remove `is_redirect_status` function
  s3_client: Move retryable functionality out of s3 client
2025-03-12 10:19:15 +02:00
Avi Kivity
b1d9f80d85 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, the per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below the limit and nodes with more capacity will
be slightly above it. This shouldn't be a significant problem in practice; we could
compensate for this by increasing the limit.
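
For illustration, a minimal sketch of the utilization-based criterion
(hypothetical types and threshold; the real planner also handles racks,
co-location and migration limits):

```
#include <algorithm>
#include <cstdint>
#include <vector>

struct shard_stats {
    uint64_t used_bytes;     // tablet_count * average tablet size
    uint64_t capacity_bytes; // reported per node via the load_stats RPC
};

// Before: equalize tablet count. After: equalize utilization, so a shard
// with twice the capacity is expected to hold twice the data.
double utilization(const shard_stats& s) {
    return double(s.used_bytes) / double(s.capacity_bytes);
}

bool should_migrate(const std::vector<shard_stats>& shards) {
    auto [lo, hi] = std::minmax_element(shards.begin(), shards.end(),
        [](const shard_stats& a, const shard_stats& b) {
            return utilization(a) < utilization(b);
        });
    return utilization(*hi) - utilization(*lo) > 0.05; // illustrative threshold
}
```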

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
2025-03-11 14:34:27 +02:00
Gleb Natapov
8425c26462 gossiper: start using host ids to send messages earlier
Send digest ack and ack2 by host ids as well now since the id->ip
mapping is available after receiving digest syn. It allows to convert
more code to host id here.
2025-03-11 12:09:21 +02:00
Gleb Natapov
f0af3f261e messaging_service: add temporary address map entry on incoming connection
We want to move to use host ids as soon as possible. Currently it is
possible only after the full gossiper exchange (because only at this
point gossiper state is added and with it address map entry). To make it
possible to move to host ids earlier this patch adds address map entries
on incoming communication during CLIENT_ID verb processing. The patch
also adds generation to CLIENT_ID to use it when address map is updated.
It is done so that older gossiper entries can be overwritten with a newer
mapping in case of an IP change.
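
A minimal sketch of the update rule, with illustrative types (the real
address map is keyed by host_id and holds more state):

```
#include <cstdint>
#include <map>
#include <string>

struct mapping { std::string ip; int64_t generation; };

// A mapping learned from an incoming CLIENT_ID may only overwrite an
// existing entry if it carries a newer generation, so a stale gossiper
// entry cannot clobber the address of a restarted node whose IP changed.
void update_address_map(std::map<std::string, mapping>& address_map,
                        const std::string& host_id, mapping m) {
    auto [it, inserted] = address_map.try_emplace(host_id, m);
    if (!inserted && m.generation > it->second.generation) {
        it->second = std::move(m);
    }
}
```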
2025-03-11 12:09:21 +02:00
Nikos Dragazis
76b31a3acc cql3: secondary index: Limit the size of partition range vectors
The partition range vector is an std::vector, which means it performs
contiguous allocations. Large allocations are known to cause problems
(e.g., reactor stalls).

For paged queries, limit the vector size to 1000. If more partition keys
are available in the query result, discard them. Ideally, we should not
be fetching them at all, but this is not possible without knowing the
size of each partition.

Currently, each vector element is 120 bytes and the standard allocator's
max preferred contiguous allocation is 128KiB. Therefore, the chosen
value of 1000 satisfies the constraint (128 KiB / 120 = 1092 > 1000).
This should be good enough for most cases. Since secondary index queries
involve one base table query per partition key, these queries are slow.
A higher limit would only make them slower and increase the probability
of a timeout. For the same reason, saving a follow-up paged request from
the client would not increase the efficiency much.
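
A minimal sketch of the sizing argument, restating the constants quoted
above (illustrative, not pulled from the actual allocator or key headers):

```
#include <cstddef>

constexpr std::size_t max_preferred_contiguous_alloc = 128 * 1024; // 128 KiB
constexpr std::size_t element_size = 120;                          // bytes per key
constexpr std::size_t max_elements = max_preferred_contiguous_alloc / element_size;
static_assert(max_elements == 1092);

constexpr std::size_t partition_range_vector_limit = 1000; // chosen round number
static_assert(partition_range_vector_limit < max_elements);
```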

For unpaged queries, do not apply any limit. This means they remain
susceptible to stalls, but unpaged queries are considered unoptimized
anyway.

Finally, update the unit test reproducer since the bug is now fixed.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-10 12:18:42 +02:00
Ernest Zaslavsky
8e46929474 aws_error: Enhance error handling for AWS HTTP client
- Seastar's HTTP client is known to throw exceptions for various reasons, including network errors, TLS errors and other transient issues.
- Update error handling to correctly capture and process all exceptions from Seastar's HTTP client.
- Previously, only aws_exception was handled, causing retryable errors to be missed and `should_retry` not to be invoked.
- Now, all exceptions trigger the appropriate retry logic per the intended strategy (see the sketch below).
- Add tests for the S3 proxy to ensure robustness and reliability of these enhancements.
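
A minimal sketch of the widened retry loop, with illustrative types (Send is
a callable returning future<>; Strategy supplies should_retry and backoff):

```
#include <exception>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Catch *any* exception -- network, TLS, system errors -- not only
// aws_exception, so the retry strategy always gets a say.
template <typename Send, typename Strategy>
seastar::future<> do_with_retries(Send send, Strategy& strategy) {
    for (unsigned attempt = 0; ; ++attempt) {
        std::exception_ptr ex;
        try {
            co_await send();
            co_return;
        } catch (...) {
            ex = std::current_exception();
        }
        if (!co_await strategy.should_retry(ex, attempt)) {
            std::rethrow_exception(ex);
        }
        co_await strategy.backoff(attempt);
    }
}
```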
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
6a3cef5703 metadata: Correct "DESCRIBE" output for keyspace metadata
Update the "DESCRIBE" command output to accurately display `tablet` settings in keyspace metadata.

Closes scylladb/scylladb#23056
2025-03-09 14:50:08 +02:00
Ernest Zaslavsky
050c3cdbc2 tests: Add Tests for Scylla-SSTable S3 Functionality
Extended existing Scylla Tools tests to cover the new functionality of
reading SSTables from S3. This ensures that the new S3 integration is
thoroughly tested and performs as expected.
2025-03-09 10:17:48 +02:00
Ernest Zaslavsky
88c4fa6569 s3: Implement S3 Fully Qualified Name Manipulation Functions
Added utility functions to handle S3 Fully Qualified Names (FQN). These
functions enable parsing, splitting, and identification of S3 paths,
enhancing our ability to work with S3 object storage more effectively.
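
A minimal sketch of what such FQN helpers could look like (illustrative
signatures; the real ones live in the s3 utility code):

```
#include <optional>
#include <string>
#include <string_view>

// Helpers for names like "s3://bucket/prefix/sstable_name".
inline bool is_s3_fqn(std::string_view path) {
    return path.starts_with("s3://");
}

struct s3_location { std::string bucket; std::string object; };

inline std::optional<s3_location> split_s3_fqn(std::string_view fqn) {
    if (!is_s3_fqn(fqn)) {
        return std::nullopt;
    }
    fqn.remove_prefix(5); // drop "s3://"
    auto slash = fqn.find('/');
    if (slash == std::string_view::npos) {
        return std::nullopt; // a bucket alone is not a valid sstable FQN
    }
    return s3_location{std::string(fqn.substr(0, slash)),
                       std::string(fqn.substr(slash + 1))};
}
```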
2025-03-09 09:50:36 +02:00
Robert Bindar
27f2d64725 Remove object storage config credentials provider
During development of #22428 we decided that we have
no need for `object-storage.yaml`, and we'd rather store
the endpoints in `scylla.yaml` and get a REST api to exopose
the endpoints for free.
This patch removes the credentials provider used to read the
aws keys from this yaml file.
Followup work will remove the `object-storage.yaml` file
altogether and move the endpoints to `scylla.yaml`.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#22951
2025-03-07 10:40:58 +03:00
Tomasz Grabiec
c4714180cc tablets: Make load balancing capacity-aware
Before this patch the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer needs
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to the above, so it is not implemented.
2025-03-06 13:35:38 +01:00
Tomasz Grabiec
d6f8810e66 topology_coordinator: Allow capacity stats to be refreshed with some nodes down
With capacity-aware balancing, if we're missing capacity for a normal
node, we won't be able to proceed with tablet drain. Consider the
following scenario:

1. Nodes: A, B
2. refresh stats with A and B
3. Add node C
4. Node B goes down
5. removenode B starts
6. stats refreshing fails because B is down

If we don't have capacity stats for node C, the load balancer cannot make
decisions and removenode is blocked indefinitely. A reproducer is
added in this patch.

To alleviate that, we allow capacity stats to be collected for nodes
which are reachable; we just don't update the table size part.

To keep table stats monotonic, we cache previous results per node, so
even if it's unreachable now, we use its last reported sizes. It's
still more accurate than not refreshing stats at all. A node can be
down for a long period, and other replicas can grow in size. It's not
perfect, because the stale node can skew the stats in its direction,
but ignoring it completely has its pitfalls too. A better solution is
left for later.
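
A minimal sketch of the refresh rule, with illustrative types (the real code
works with host ids and per-table size maps):

```
#include <cstdint>
#include <map>
#include <optional>
#include <string>

using table_sizes = std::map<std::string, uint64_t>;

// Reachable nodes report fresh table sizes and refresh the cache; for an
// unreachable node we fall back to its last reported sizes (possibly
// stale, but monotonic) instead of failing the whole refresh.
std::map<std::string, table_sizes> refresh_stats(
        const std::map<std::string, std::optional<table_sizes>>& rpc_results,
        std::map<std::string, table_sizes>& cache) {
    std::map<std::string, table_sizes> out;
    for (const auto& [node, fresh] : rpc_results) {
        if (fresh) {
            cache[node] = *fresh; // remember for future refreshes
            out[node] = *fresh;
        } else if (auto it = cache.find(node); it != cache.end()) {
            out[node] = it->second; // node is down: reuse last known sizes
        }
    }
    return out;
}
```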
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
69c49fb1a7 test: boost: tablets_test: Always provide capacity in load_stats
Move shared_load_stats to topology_builder.hh so that topology_builder
can maintain it. It will set capacity for all created nodes. Needed
now that the load balancer requires capacity to make decisions.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
dfc9101dfd test: perf_load_balancing: Set node capacity
Otherwise, the load balancer will not make any plan once it becomes
capacity-aware.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
6169401dbc test: perf_load_balancing: Convert to topology_builder
The test no longer worked because the load balancer now requires a proper
schema in the database. Convert to topology_builder, which builds the
topology in the database and creates the schema (which needs a proper
topology).
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
d01cc16d1e config, disk_space_monitor: Allow overriding capacity via config
Intended for testing, or hot-fixing out-of-space issues in production.

Tablet load balancer uses this information for determining per-shard load
so reducing capacity will cause tablets to be migrated away from the node.
2025-03-06 13:35:37 +01:00
Avi Kivity
28906c9261 Merge 'scylla-sstable: introduce the query command' from Botond Dénes
The scylla-sstable dump-* command suite has proven invaluable in many investigations. In certain cases, however, I found that `dump-data` is quite cumbersome. An example would be trying to find certain values in an sstable, or trying to read the content of system tables when a node is down. For these cases, one has to trudge through tons of uninteresting metadata and do compaction in their head. This PR introduces the new scylla-sstable query command, specifically targeted at situations like this: it allows executing queries on sstables, exposing to the user all the power of CQL to tailor the output as they see fit.

Select everything from a table:

    $ scylla sstable query --system-schema /path/to/data/system_schema/keyspaces-*/*-big-Data.db
     keyspace_name                 | durable_writes | replication
    -------------------------------+----------------+-------------------------------------------------------------------------------------
            system_replicated_keys |           true |                         ({class : org.apache.cassandra.locator.EverywhereStrategy})
                       system_auth |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 1})
                     system_schema |           true |                              ({class : org.apache.cassandra.locator.LocalStrategy})
                system_distributed |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 3})
                            system |           true |                              ({class : org.apache.cassandra.locator.LocalStrategy})
                                ks |           true | ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})
                     system_traces |           true |   ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 2})
     system_distributed_everywhere |           true |                         ({class : org.apache.cassandra.locator.EverywhereStrategy})

Select everything from a single SSTable, use the JSON output (filtered through [jq](https://jqlang.github.io/jq/) for better readability):

    $ scylla sstable query --system-schema --output-format=json /path/to/data/system_schema/keyspaces-*/me-3gm7_127s_3ndxs28xt4llzxwqz6-big-Data.db | jq
    [
      {
        "keyspace_name": "system_schema",
        "durable_writes": true,
        "replication": {
          "class": "org.apache.cassandra.locator.LocalStrategy"
        }
      },
      {
        "keyspace_name": "system",
        "durable_writes": true,
        "replication": {
          "class": "org.apache.cassandra.locator.LocalStrategy"
        }
      }
    ]

Select a specific field in a specific partition using the command-line:

    $ scylla sstable query --system-schema --query "select replication from scylla_sstable.keyspaces where keyspace_name='ks'" ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
     replication
    -------------------------------------------------------------------------------------
     ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})

Select a specific field in a specific partition using ``--query-file``:

    $ echo "SELECT replication FROM scylla_sstable.keyspaces WHERE keyspace_name='ks';" > query.cql
    $ scylla sstable query --system-schema --query-file=./query.cql ./scylla-workdir/data/system_schema/keyspaces-*/*-Data.db
     replication
    -------------------------------------------------------------------------------------
     ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1})

New functionality: no backport needed.

Closes scylladb/scylladb#22007

* github.com:scylladb/scylladb:
  docs/operating-scylla: document scylla-sstable query
  test/cqlpy/test_tools.py: add tests for scylla-sstable query
  test/cqlpy/test_tools.py: make scylla_sstable() return table name also
  scylla-sstable: introduce the query command
  tools/utils: get_selected_operation(): use std::string for operation_options
  utils/rjson: streaming_writer: add RawValue()
  cql3/type_json: add to_json_type()
  test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()
2025-03-06 13:42:45 +02:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
A new RPC is introduced because load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Botond Dénes
1139cf3a98 Merge 'Speed up (and generalize) the way API calculates sstable disk usage' from Pavel Emelyanov
There are several API endpoints that walk a specific list of sstables and sum up their bytes_on_disk() values. All those endpoints accumulate a map of sstable names to their sizes, then squash the maps together and, finally, sum up the map values to report the total back. Maintaining these intermediate collections is a waste of CPU and memory; the usage values can be summed up directly.

Also add a test for per-cf endpoints to validate the change, and generalize the helper functions while at it.

Closes scylladb/scylladb#23143

* github.com:scylladb/scylladb:
  api: Generalize disk space counting for table and system
  api: Use map_reduce_cf_raw() overload with table name
  api: Don't collect sstables map to count disk space usage
  test: Add unit test for total/live sstable sizes
2025-03-06 11:26:35 +02:00
Raphael S. Carvalho
fedd838b9d replica: Fix race of some operations like cleanup with snapshot
There are two semaphores in the table for synchronizing changes to the sstable list:

sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.

sstable_deletion_sem: A deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.

They're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.

problem:

A = tablet cleanup
B = take_snapshot()

1) A acquires sstable_set_mutation_sem for updating the list
2) A acquires sstable_deletion_sem, then deletes the sstable before updating the list
3) A releases sstable_deletion_sem, then yields
4) B acquires sstable_deletion_sem
5) B iterates through the list and bumps into the sstable deleted in step 2
6) B fails since it cannot find the file on disk

The initial reaction is to say that no procedure must delete an sstable
before updating the list; that's true.

But we also want an iteration, running concurrently with cleanup, to not
find sstables being removed from the system. Otherwise, e.g. snapshot works
with sstables of a tablet that was just cleaned up. That's achieved by
serializing iteration with list update.
Since sstable_deletion_sem is used within the scope of deletion only, it's
useless for achieving this. Cleanup could acquire the deletion sem when
preparing list updates, and then pass the "permit" to the deletion function, but
then sstable_deletion_sem would essentially become sstable_set_mutation_sem,
which was created exactly to protect the list update.

That being said, it makes sense to merge both semaphores. Also things become
easier to reason about, and we don't have to worry about deadlocks anymore.

The deletion goes through sstable_list_builder, which holds a permit throughout
its lifetime, guaranteeing that list updates and deletion are atomic with
respect to other concurrent operations. The interface becomes less error prone
with that. It allowed us to find that discard_sstables() was doing deletion
without any permit, meaning another race could happen between truncate and
snapshot.
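
A minimal sketch of the permit-holding pattern, assuming Seastar semaphores
and illustrative types (the real builder does much more):

```
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>

// A single semaphore guards both list updates and deletion; the builder
// holds the units for its whole lifetime, so "update list + delete files"
// is atomic with respect to concurrent iteration (e.g. snapshot).
class list_builder_sketch {
    seastar::semaphore_units<> _permit; // released when the builder dies
public:
    explicit list_builder_sketch(seastar::semaphore_units<> permit)
        : _permit(std::move(permit)) {}
    // ... build the new list and delete replaced sstables under the
    // same permit ...
};

seastar::future<list_builder_sketch> make_builder(seastar::semaphore& sem) {
    auto units = co_await seastar::get_units(sem, 1);
    co_return list_builder_sketch(std::move(units));
}
```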

So we're fixing the race of (truncate|cleanup) with take_snapshot, as far as
we know. It's possible that other unknown races are fixed as well.

Fixes #23049.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23117
2025-03-06 11:00:48 +02:00
Benny Halevy
7a624e3df8 system_keyspace: call shutdown from stop
and use that to replace the explicit shutdown when stopped
in cql_test_env.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Benny Halevy
13a22cb6fd utils: add class pluggable
A wrapper around a shared service allowing
safe plug and unplug of the service from its user
using a phased-barrier operation permit guarding
the service while in use.

Also add a unit test for this class.
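
A minimal sketch of the idea, using Seastar's `gate` as a stand-in for the
phased-barrier permit (illustrative; the real class lives in utils/):

```
#include <utility>
#include <seastar/core/coroutine.hh>
#include <seastar/core/gate.hh>

// Holders obtain the service only together with a permit, so unplug()
// can wait until every in-flight use completes before the pointer is
// dropped. (Simplified: a closed gate cannot be reopened, while the
// real class can be re-plugged.)
template <typename Service>
class pluggable_sketch {
    Service* _service = nullptr;
    seastar::gate _gate;
public:
    void plug(Service& s) { _service = &s; }

    // Returns {nullptr, ...} when unplugged; the holder keeps unplug() waiting.
    std::pair<Service*, seastar::gate::holder> get() {
        if (!_service || _gate.is_closed()) {
            return {nullptr, seastar::gate::holder()};
        }
        return {_service, _gate.hold()};
    }

    seastar::future<> unplug() {
        _service = nullptr;
        co_await _gate.close();
    }
};
```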

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:25:50 +02:00
Nadav Har'El
e0f24c03e7 Merge 'test.py: merge all 'Topology' suite types into one folder 'cluster'' from Artsiom Mishuta
Now that we support suite subfolders, there is no
need to create a separate suite for object_store, auth_cluster, topology and topology_custom.
This PR merges all these folders into one: 'cluster'.

This PR also introduces and applies a 'prepare_3_nodes_cluster' fixture that allows preparing a non-dirty 3-node cluster
that can be reused between tests (for tests that were in the topology folder).

number of tests in master
release -3461
dev       -3472
debug   -3446

number of tests in this PR
release -3460
dev       -3471
debug   -3445

There is one test fewer in each mode because there were 2 test_topology_failure_recovery files (topology and topology_custom) with the same utility functions but different test cases. This PR merged them into one.

Closes scylladb/scylladb#22917

* github.com:scylladb/scylladb:
  test.py: merge object_store into cluster folder
  test.py: merge auth_cluster into cluster folder
  test.py: rename topology_custom folder to cluster
  test.py: merge topology test suite into topology_custom
  test.py delete conftest in topology_custom
  test.py apply prepare_3_nodes_cluster in topology
  test.py: introduce prepare_3_nodes_cluster marker
2025-03-04 19:26:32 +02:00
Pavel Emelyanov
a8fc1d64bc test: Add unit test for total/live sstable sizes
The pair of column_family/metrics/(total|live)_disk_space_used/{name}
endpoints reports the disk usage by sstables. The test creates a table,
populates it, flushes, and checks that the size corresponds to what
stat(2) reports for the respective files.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-04 19:52:33 +03:00
Nikos Dragazis
892690b953 test: Reproduce bug with large allocations from secondary index
Secondary index queries which fetch partitions from the base table can
cause large allocations that can lead to reactor stalls.

Reproduce this with a unit test that runs an indexed query on a table
with thousands of single-row partitions, and checks the memory stats for
any large contiguous allocations.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-04 18:39:28 +02:00
Patryk Jędrzejczak
c13b6c91d3 Merge 'raft topology: drop changing the raft voters config via storage_service' from Emil Maskovsky
For the limited voters feature to work properly we need to make sure that we are only managing the voter status through the topology coordinator. This means that we should not change the node votership from the storage_service module for the raft topology directly.

We can drop the voter status changes from the storage_service module because the topology coordinator will handle the votership changes eventually. The calls in the storage_service module were not essential and were only used for optimization (improving the HA under certain conditions).
Furthermore, another bundled commit reacts to the node `on_up()` and `on_down()` events, which shortens the reaction time further and improves the HA.

The change affects the timing in the tablets migration test though, as it previously relied on the node being made non-voter from the storage_service `raft_removenode()` function. The fix is to add another server to the topology to make sure we will keep the quorum.

Previously the test worked because it waits for an injection to be reached, and it was ensured that the injection (log line) would only be triggered after the node had been made non-voter from `raft_removenode()`. This is not the case anymore. An alternative fix would be to wait for the first node to be made non-voter before stopping the second server, but this would make the test more complex (and it is not strictly required to only use 4 servers in the test; that was only done for optimization purposes).

Fixes: scylladb/scylladb#22860

Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969

No backport: part of the new limited voters feature, so this shouldn't be backported.

Closes scylladb/scylladb#22847

* https://github.com/scylladb/scylladb:
  raft: use direct return of future for `run_op_with_retry`
  raft: adjust the voters interface to allow atomic changes
  raft topology: drop removing the node from raft config via storage_service
  raft topology: drop changing the raft voters config via storage_service
2025-03-04 13:59:47 +01:00
Nadav Har'El
d096aac200 test/cqlpy/run: reduce number of tablets
In commit 2463e524ed, Scylla's default changed
from starting with one tablet per shard to starting 10 per shard. The
functional tests don't need more tablets and it can only slow down the
tests, so the patch added --tablets-initial-scale-factor=1 to test/*/suite.yaml
but forgot to add it to test/cqlpy/run.py (to affect test/cqlpy/run) so
this patch does this now.

This patch should *only* be about making tests faster, although to be
honest, I don't see any measurable improvement in test speed (10 isn't
so many). But, unfortunately, this is only part of the story. Over time
we allowed a few cqlpy tests to be written in a way that relies on having
only a small number of tablets or even exactly one tablet per shard (!).
These tests are buggy and should be fixed - see issues #23115 and #23116
as examples. But adding the option --tablets-initial-scale-factor=1 also
to run.py will make these bugs not affect test/cqlpy/run in the same way
as it doesn't affect test.py.

These buggy tests will still break with `pytest cqlpy` against a Scylla
you ran yourself manually, so eventually we will still need to fix those
test bugs.

Refs #23115
Refs #23116

Closes scylladb/scylladb#23125
2025-03-04 15:39:21 +03:00
Artsiom Mishuta
97a620cda9 test.py: merge object_store into cluster folder
Now that we support suite subfolders, there is no
need to create a separate suite for object_store
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
a283b391c2 test.py: merge auth_cluster into cluster folder
Now that we support suite subfolders, there is no
need to create a separate suite for auth_cluster
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
d1198f8318 test.py: rename topology_custom folder to cluster
rename topology_custom folder to cluster
as it contains not only topology test cases
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
d8e17c4356 test.py: merge topology test suite into topology_custom
Now that we support suite subfolders, there is no
need to create a separate suite for topology
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
ef62dfa6a9 test.py delete conftest in topology_custom
delete conftest in a separate commit for better diff listing during the
merge of topology_custom and topology
2025-03-04 10:32:43 +01:00
Artsiom Mishuta
cf48444e3b test.py apply prepare_3_nodes_cluster in topology
apply prepare_3_nodes_cluster for all tests in the topology folder
by applying the mark at the test module level using pytestmark:
https://docs.pytest.org/en/stable/example/markers.html#marking-whole-classes-or-modules

set the initial_size for the topology folder to 0
2025-03-04 10:32:43 +01:00
Artsiom Mishuta
20777d7fc6 test.py: introduce prepare_3_nodes_cluster marker
The prepare_3_nodes_cluster marker will allow preparing a non-dirty 3-node cluster
that can be reused between tests.
2025-03-04 10:32:43 +01:00