After recent changes, #18640 and #19151 started to reproduce for the
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.
The solution is the same: deselect the tests.
Fixes #23302
Closes scylladb/scylladb#23405
During messaging_service object creation the remove_rpc_client function may
be called if the prefer_local snitch setting is true. The caller does not
provide a host id, so _address_to_host_id_mapper is called to obtain it,
but at this point that function is not initialized yet.
The patch fixes the code to not call the function if it is not initialized.
This is not a problem, since during messaging_service creation there
are no connections to drop.
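A minimal sketch of the guard described above (the names and types here are illustrative stand-ins, not the actual messaging_service code):
```
#include <functional>
#include <optional>

struct host_id {};
struct inet_address {};

struct messaging_service_like {
    // Stand-in for _address_to_host_id_mapper; empty until fully initialized.
    std::function<std::optional<host_id>(inet_address)> address_to_host_id_mapper;

    void remove_rpc_client(inet_address addr) {
        if (!address_to_host_id_mapper) {
            // Still constructing: there are no connections to drop yet,
            // so returning early is safe.
            return;
        }
        auto id = address_to_host_id_mapper(addr);
        // ... drop the RPC client(s) associated with *id ...
        (void)id;
    }
};
```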
Fixes: #23353
Message-ID: <Z-J2KbBK8NoFNYZZ@scylladb.com>
When we rename columns in a table which has materialized views depending
on it, we need to also rename them in the materialized views' WHERE
clauses.
Currently, we do that by creating a new WHERE clause after each rename,
with the updated column. This is later converted to a mutation that
overwrites the WHERE clause. After multiple renames, we have multiple
mutations, each overwriting the WHERE clause with one column renamed.
As a result, the final WHERE clause is one of the modified clauses with
one column renamed.
Instead, we should prepare one new WHERE clause which includes all the
renamed columns. This patch accomplishes this by processing all the
column renames first, and only preparing the new view schema with the
new WHERE clause afterwards.
This patch also includes a test reproducer for this scenario.
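As a toy illustration of the one-pass approach (the real code operates on the parsed WHERE clause and schema mutations, not on token strings):
```
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Apply all accumulated column renames to the view's WHERE clause at once,
// instead of regenerating the clause separately after each rename.
std::string apply_renames(const std::vector<std::string>& where_tokens,
                          const std::map<std::string, std::string>& renames) {
    std::string out;
    for (const auto& tok : where_tokens) {
        auto it = renames.find(tok);
        out += (it != renames.end() ? it->second : tok);
        out += ' ';
    }
    return out;
}

int main() {
    // Both renamed columns show up in the single rebuilt clause.
    std::cout << apply_renames({"a", "=", "1", "AND", "b", "=", "2"},
                               {{"a", "x"}, {"b", "y"}}) << '\n';
}
```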
Fixes scylladb/scylladb#22194
Closes scylladb/scylladb#23152
This fixes an issue where materialized view tablets are not split
because they are not registered as split candidates by the storage
service.
The code in storage_service::replicate_to_all_cores was changed in
4bfa3060d0 to handle normal tables and view tables separately, but with
that change register_tablet_split_candidate is applied only to normal
tables and not every table like before. We fix it by registering view
tables as well.
We add a test to verify that split of MV tables works.
Closes scylladb/scylladb#23335
fmt 11.1 apparently marks to_string() as [[nodiscard]]. Here we aren't
interested in the result, so explicitly ignore it to avoid an error.
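The pattern used for the fix, sketched with a stand-in function (the real call site is a to_string() in the sstables code):
```
#include <tuple>

[[nodiscard]] int to_string_like() { return 42; }  // stands in for the [[nodiscard]] to_string()

void caller() {
    std::ignore = to_string_like();  // explicitly discard the unwanted result
    // equivalently: (void)to_string_like();
}
```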
Closes scylladb/scylladb#23403
Clang 20 complains when it sees a user-defined literal operator
defined with a space before the underscore. Assume it's adhering
to the standard and comply.
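For illustration (the `_kib` suffix here is hypothetical, not the operator from the tree), the two spellings look like this, with the space-free form being the one the standard prefers:
```
#include <cstddef>

// Deprecated form Clang 20 warns about:
//   constexpr std::size_t operator"" _kib(unsigned long long v);
// Accepted form:
constexpr std::size_t operator""_kib(unsigned long long v) {
    return static_cast<std::size_t>(v) * 1024;
}
static_assert(4_kib == 4096);
```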
Closes scylladb/scylladb#23401
Fixes #23225
Fixes #23185
Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves
extension wrapping of file and sink objects to storage level.
(Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not).
This ensures we apply encryption on both read and write, whereas we previously only did so on read, which caused failures.
Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR.
Unit tests for io objects and a macro test for S3 encrypted storage included.
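A rough sketch of the shape of the new extension point; the names and signatures here are assumptions based on the commit titles, not copies from the tree:
```
#include <seastar/core/iostream.hh>

class file_io_extension_like {
public:
    virtual ~file_io_extension_like() = default;
    // Default implementation: return the sink unchanged.
    virtual seastar::data_sink wrap_sink(seastar::data_sink sink) {
        return sink;
    }
};

class encryption_extension_like : public file_io_extension_like {
public:
    seastar::data_sink wrap_sink(seastar::data_sink sink) override {
        // The EAR extension would wrap `sink` in an encrypting data_sink_impl
        // here, so that writes go through the cipher as well as reads.
        return sink;
    }
};
```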
Closes scylladb/scylladb#23261
* github.com:scylladb/scylladb:
encryption: Add "wrap_sink" to encryption sstable extension
encrypted_file_impl: Add encrypted_data_sink
sstables::storage: Move wrapping sstable components to storage provider
sstables::file_io_extension: Add a "wrap_sink" method.
sstables::file_io_extension: Make sstable argument to "wrap" const
utils: Add "io-wrappers", useful IO helper types
Currently, to stream data from an sstable component the sstables code uses file_data_source_impl. In case the component is on S3, an s3::readable_file is put into that data source. The data source is configured with 128k buffers and at most 4 read-aheads. With that configuration, downloading a full object from S3 becomes too slow -- GET-ing a file with 128k requests is not nice even with 4 parallel read-aheads.
A better solution for S3 downloading is to request a much larger chunk with one GET and then produce smaller, 128k-or-so, buffers as the data arrives. This is what the newly introduced data source impl does -- it spawns a background GET and lets the upper input stream read buffers directly from the arriving body.
This PR doesn't yet make the sstable layer use the new source, just introduces it and adds unit and perf tests.
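A very rough sketch of the idea (assumed shapes, not the actual s3 client code): a background GET pushes body chunks into a queue and get() hands them to the input stream as they arrive.
```
#include <seastar/core/iostream.hh>
#include <seastar/core/queue.hh>
#include <seastar/core/temporary_buffer.hh>

class download_data_source_like : public seastar::data_source_impl {
    // Bounded queue of body chunks; the background GET waits when it is full.
    seastar::queue<seastar::temporary_buffer<char>> _bufs{4};
public:
    // Called from the background GET fiber as body chunks arrive.
    seastar::future<> feed(seastar::temporary_buffer<char> buf) {
        return _bufs.push_eventually(std::move(buf));
    }
    // The upper input_stream pulls buffers from here; EOF and error handling
    // are elided in this sketch.
    seastar::future<seastar::temporary_buffer<char>> get() override {
        return _bufs.pop_eventually();
    }
};
```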
Testing
|Test|Download speed, MB/s|
|-|-|
|file_input_stream (*), 1 socket | 4.996|
|file_input_stream (*), 2 sockets | 9.403|
|s3_data_source (**) | 93.164|
(*) The file_input_stream test issues 128k GETs and is configured to do at most 4 read-aheads
(**) The s3_data_source uses at most 1 socket regardless of what perf-test configures it to
refs: #22458
Closes scylladb/scylladb#22907
* github.com:scylladb/scylladb:
test: Extend s3-perf test with stream download one
test/perf: Tune-up s3 test options parsing
test: Add unit test for newly introduced download source
s3/client: Introduce data_source_impl for object downloading
s3/client: Detach format_range_header() helper
Rename the `--upload` bool option into the `--operation` string one, so that
new tests can be added in the future. Also rename run_download() to
run_contiguous_get() because this is what the internals of this method
do -- just GET contiguous ranges sequentially.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.
The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
so that they never lead to RF-rack invalid keyspaces,
* verifying, during the initialization of a node, that all existing
keyspaces are RF-rack-valid; if not, the initialization fails.
We provide tests verifying that the changes behave as intended.
---
Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.
---
Implementation strategy (going beyond the scope of this PR):
1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.
---
Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300
---
Backport: this is part of the requirements for releasing 2025.1.
Closes scylladb/scylladb#23138
* github.com:scylladb/scylladb:
main: Refuse to start node when RF-rack-invalid keyspace exists
cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
db/config: Introduce RF-rack-valid keyspaces
Adds a sibling type to the encrypted file: a data_sink that
writes a data stream in the same block format as a file
object would, including end padding.
This makes writing encrypted data sinks less cumbersome.
Fixes #23225
Fixes #23185
Moved wrapping of component files/sinks to the storage provider. Also ensures
that data_sinks are wrapped as well as actual files, so that we actually
write encrypted data when encryption is active.
This matches the signature of call sites. Since the only "real"
extension to actually make a marker in the sstable will do so in
the scylla component, which is writable even in a const sstable,
this is ok.
Mainly to add a somewhat functional file-impl wrapping
a data_sink. This can implement a rudimentary, write-only
file based on any output sink.
For testing, and because they fit there, place memory
sink and source types there as well.
This PR introduces several key improvements to bolster the reliability of our S3 client, particularly in handling intermittent authentication and TLS-related issues. The changes include:
1. **Automatic Credential Renewal and Request Retry**: When credentials expire, the new retry strategy resets the credentials and sets the client to a retryable state, so the client re-authenticates and automatically retries the request. This change prevents transient authentication failures from propagating as fatal errors.
2. **Enhanced Exception Unwrapping**: The client now extracts the embedded std::system_error from std::nested_exception instances that may be raised by the Seastar HTTP client when using TLS. This allows for more precise error reporting and handling.
3. **Expanded TLS Error Handling**: We've added support for retryable TLS error codes within the std::system_error handler. This modification enables the client to detect and recover from transient TLS issues by retrying the affected operations.
Together, these enhancements improve overall client robustness by ensuring smoother recovery from both credential and TLS-related errors.
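As an illustration of point 2, an assumed helper (not the actual client code) that digs a std::system_error out of a nested exception chain might look like this:
```
#include <exception>
#include <optional>
#include <system_error>

std::optional<std::error_code> unwrap_system_error(std::exception_ptr ep) {
    while (ep) {
        try {
            std::rethrow_exception(ep);
        } catch (const std::system_error& e) {
            return e.code();           // found the embedded system_error
        } catch (const std::nested_exception& nested) {
            ep = nested.nested_ptr();  // keep unwrapping the chain
        } catch (...) {
            break;                     // no system_error in this chain
        }
    }
    return std::nullopt;
}
```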
No backport needed since it is an enhancement
Closes scylladb/scylladb#22150
* github.com:scylladb/scylladb:
aws_error: Add GNU TLS codes
s3_client: Handle nested std::system_error exceptions
s3_client: Start using new retry strategy
retry_strategy: Add custom retry strategy for S3 client
retry_strategy: Make `should_retry` awaitable
There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print paths into logs, and sometimes to put them into exception messages. And there are places that use these strings as file names. Since sstables can now also be stored on S3, generic code shouldn't treat those strings as on-disk file names.
Other than that, even when the methods are used to put component names into logs, in many cases these log messages come at debug or trace level, so the generated strings are immediately dropped on the floor, and generating them is not exactly cheap. The code would benefit from lazily-printed names.
This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit.
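An illustrative sketch of the lazy-printing idea (the types here are stand-ins; the real struct wraps an sstable reference and the component enum):
```
#include <fmt/format.h>
#include <string>
#include <string_view>

enum class component_type { TOC, Data, Index };

struct component_name {
    std::string sstable_prefix;  // stands in for the wrapped sstable reference
    component_type component;
};

template <>
struct fmt::formatter<component_name> : fmt::formatter<std::string_view> {
    auto format(const component_name& n, fmt::format_context& ctx) const {
        // The (not-so-cheap) name generation only runs when the message is
        // actually formatted, e.g. when the log level is enabled.
        const char* suffix =
            n.component == component_type::TOC  ? "TOC.txt" :
            n.component == component_type::Data ? "Data.db" : "Index.db";
        return fmt::format_to(ctx.out(), "{}-{}", n.sstable_prefix, suffix);
    }
};
```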
refs: #14122 (previous ugly attempt to achieve the same goal)
Closes scylladb/scylladb#23194
* github.com:scylladb/scylladb:
sstable: Remove unused malformed_sstable_exception(string filename)
sstables: Make filename() return component_name
sstables: Make file_writer keep component_name on board
sstables: Make get_filename() return component_name
sstables: Make toc_filename() return component_name
sstables: Make sstable::index_filename() return component_name
sstables: Introduce struct component_name
sstables: Remove unused sstable::component_filenames() method
sstables: Do not print component filenames on load-and-stream wrap-up
sstables: Explicitly format prefix in S3 object name making
sstables: Don't include directory name in exception
sstables: Use fmt::format instead of string concatenation
sstables: Rename filename($component) calls to ${component}_filename()
sstables: Rename local filename variable to component_name
schema_extension allows making invisible changes to system_schema
that evade upgrade rollback tests. They appear in system_schema
as an encoded blob which reduces serviceability, as they cannot
be read.
Deprecate it and point users to adding explicit columns in scylla_tables.
We could probably make use of the data structure, after we teach it
to encode its payload into properly named and typed columns instead of
using IDL.
Closes scylladb/scylladb#23151
For a long time now, we've been seeing (see #17564), once in a while,
Alternator tests crashing with the Python process getting killed on
SIGSEGV after the tests have already finished successfully and all
pytest had to do is exit. We have not been able to figure out where the
bug is. Unfortunately, we've never been able to reproduce this bug
locally - and only rarely do we see it in CI runs, and when it happens
we don't get any information on why it happened.
So the goal of this patch is to print more information that might
hopefully help us next time we see this problem in CI (this patch
does NOT fix the bug). This patch adds to test/alternator's conftest.py
a call to faulthandler.enable(). This traps SIGSEGV and prints a stack
trace (for each thread, if there are several) showing what Python was
trying to do while it is crashing. Hopefully we'll see in this output
some specific cleanup function belonging to boto3 or urllib or whatever,
and be able to figure out where the bug is and how to avoid it.
We could have added this faulthandler.enable() call to the top-level
conftest.py or to test.py, but since we only ever had this Python
crash in Alternator tests, I think it is more suitable that we limit
this desperate debugging attempt only to Alternator tests.
Refs #17564
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23340
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.
Fixes scylladb/scylladb#23300
In this commit, we refuse to create or alter a keyspace when that operation
would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is
enabled.
We provide two tests verifying that the changes work as intended.
Fixes scylladb/scylladb#23276
The reader consumer concept hierarchy is a sprawling, confusing jungle of deeply nested concepts. Looking at `FlattenedConsumer[V2]` -- the subject of this PR: this consumer is defined in terms of `StreamedMutationConsumer[V2]`, which in turn is defined in terms of `FragmentConsumer[V2]`.
This amount of nesting makes it really hard to see what a concept actually comes down to, made even more difficult by the fact that the concepts are scattered across two header files.
In theory, this nesting allows for greater flexibility: some code can use a lower-level concept directly while it can also serve as the basis for the higher-level concepts. But the fact of the matter is that none of the lower-level concepts are used directly, so we pay the price in hard-to-follow code for no benefit.
This PR cuts down the complexity by folding up the entire hierarchy into the top-level `FlattenedConsumer[V2]` and `FlattenedConsumerReturning[V2]` concepts.
Doing this immediately reveals just how similar the two major consumer concepts (`FlattenedConsumer[V2]` and `MutationFragmentConsumer[V2]`) supported by `mutation_reader` are. In a follow-up PR, we will attempt to unify the two.
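A toy illustration of the folding pattern (simplified stand-in names and requirements, not the real Scylla consumer concepts):
```
// Before: requirements layered across nested concepts.
template <typename T>
concept FragmentConsumerLike = requires(T t, int fragment) {
    t.consume(fragment);
};
template <typename T>
concept FlattenedConsumerLike = FragmentConsumerLike<T> && requires(T t) {
    t.consume_end_of_stream();
};

// After: the same requirements spelled out in one place, so a reader can see
// the whole contract without chasing definitions across headers.
template <typename T>
concept FlattenedConsumerFolded = requires(T t, int fragment) {
    t.consume(fragment);
    t.consume_end_of_stream();
};
```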
Refactoring, no backport needed.
Closes scylladb/scylladb#23344
* github.com:scylladb/scylladb:
mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2]
mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2]
test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
Similarly to the toc_, index_ and data filenames, make the generic component
name getter return a wrapper object instead of a string. Most of the
callers are log messages and exception generation. Other than that
there are tests, the filesystem storage driver and a few more places in
generic code that "know" they work with real files, so make them use
explicit fmt::to_string().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to previous patches -- mostly the result is used as log
argument. The remaining users include
- scylla sstable tool that dumps component names to json output
- API endpoint that returns component names to user
- tests
these are all fine to explicitly convert component_names to strings.
There are a few more places that expect strings instead of component name
objects. For now they also use fmt::to_string() explicitly; some of this
will be fixed later, mostly as future follow-ups.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the callers use the returned value as log message parameter,
some construct malformed_sstable_exception that was prepared by previous
patch.
The remaining callers explicitly use fmt::to_string(), these are
- pending deletion log creation
- filesystem storage code
- tests
- stream-blob code that re-loads sstable
All but the last one are OK to use the TOC name as a string; the last one is
not quite correct in its usage of the toc_filename string, but it needs more
care to be fixed properly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the dedicated ones.
It's shorter, nicer and makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Do not hold erm during repair of a tablet that is started with tablet
repair scheduler. This way two different tablets can be repaired
and migrated concurrently. The same tablet won't be migrated while
being repaired, as this is guaranteed by the topology coordinator.
Use topology_guard to maintain safety.
Fixes: https://github.com/scylladb/scylladb/issues/22408.
Needs backport to 2025.1, which introduces the tablet repair scheduler.
Closes scylladb/scylladb#22842
* github.com:scylladb/scylladb:
test: add test to check concurrent tablets migration and repair
repair: do not hold erm for repair scheduled by scheduler
repair: get total rf based on current erm
repair: make shard_repair_task_impl::erm private
repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
repair: pass session_id to repair_writer_impl::create_writer
repair: keep materialized topology guard in shard_repair_task_impl
repair: pass session_id to repair_meta
This PR introduces the new Raft-based recovery procedure for group 0
majority loss.
The Raft-based recovery procedure works with tablets. The old
gossip-based recovery procedure does not because we have no code
for tablet migrations after the gossip-based topology changes.
The Raft-based procedure requires the Raft-based topology to be
enabled in the cluster. If the Raft-based topology is not enabled, the
gossip-based procedure must be used.
We will be able to get rid of the gossip-based procedure when we make
the Raft-based topology mandatory (we can do both in the same version,
2025.2 is the plan). Before we do it, we will have to keep both procedures
and explain when each of them should be used.
The idea behind the new procedure is to recreate group 0 without
touching the topology structures. Once we create a new group 0, we
can remove all dead nodes using the standard `removenode` and
`replace` operations.
For the procedure to be safe, we must ensure that each member of the
new group 0 moves to the same initial group 0 state. Also, the only safe
choice for the state is the latest persistent state available among the
live nodes.
The solution to the problem above is to ensure that the leader of the new
group 0 (called the recovery leader) is one of the nodes with the latest
state available. Other members will receive the snapshot from the
recovery leader when they join the new group 0 and move to its state.
Below is the shortened description of the new recovery procedure from
the perspective of the administrator. For the full description, refer to the
design document.
1. Find the set of live nodes.
2. Kill any live node that shouldn't be a member of the new group 0.
3. Ensure the full network connectivity between live nodes.
4. Rolling restart live nodes to ensure they are healthy and ready for
recovery.
5. Check if some data could have been lost. If yes, restore it from
backup after the recovery procedure.
6. Find the recovery leader (the node with the largest `group0_state_id`).
7. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the
recovery leader on each live node.
9. Rolling restart all live nodes, but the recovery leader must be
restarted first.
10. Remove all dead nodes using `removenode` or `replace`.
11. Unset `recovery_leader` on all nodes.
12. Delete data of the old group 0 from `system.raft`,
`system.raft_snapshots`, and `system.raft_snapshot_config`.
In the future, we could automate some of these steps or even introduce
a tool that will do all (or most) of them by itself. For now, we are fine with
a procedure that is reliable and simple enough.
This PR makes using 2025.1 with tablets much safer. We want to
backport it to 2025.1. We will also want to backport a few follow-ups.
Fixes scylladb/scylladb#20657
Closes scylladb/scylladb#22286
* github.com:scylladb/scylladb:
test: mark tests with the gossip-based recovery procedure
test: add tests for the Raft-based recovery procedure
test: topology: util: fix the tokens consistency check for left nodes
test: topology: util: extend start_writes
gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
raft_group0: modify_raft_voter_status: do not add new members
treewide: allow recreating group 0 in the Raft-based recovery procedure
When something goes wrong, it's impossible to find anything out without
the s3 and http logs, so increase their levels for boost tests.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#23245
Fixes #23017
When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recovery boot), we can cause two deletion passes to be running at the same time. This is because deletion is triggered by either
a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback
where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed.
Now, eventually, this should be released, once we do deletion again, but this can take a while.
Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure.
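A minimal sketch of the serialization idea, using a seastar semaphore as the mutual-exclusion primitive (illustrative only; the actual commitlog code may use a different mechanism):
```
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

class segment_manager_like {
    seastar::semaphore _deletion_sem{1};   // at most one deletion pass at a time
public:
    seastar::future<> delete_segments() {
        return seastar::with_semaphore(_deletion_sem, 1, [this] {
            return do_delete_pass();
        });
    }
private:
    seastar::future<> do_delete_pass() {
        // ... evaluate which segments may be recycled/deleted ...
        return seastar::make_ready_future<>();
    }
};
```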
As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause the shard 0
db commitlog to reach its footprint limit and then remain there (because we never release segments below the limit). This is wasteful of disk space. But deleting segments early here is also wasteful; a better solution is
to simply give the segments to all CL shards, thus distributing the available space.
Closes scylladb/scylladb#23150
* github.com:scylladb/scylladb:
main/commitlog: wait for file deletion and distribute recycled segments to shards
commitlog: Serialize file deletion
* Previously, token expiration was considered a fatal error. With this change,
the `s3_client` uses a new retry strategy that tries to renew expired
credentials.
* Added a related test to the `s3_proxy`.
`seastar::at_exit()` was marked deprecated recently, so let's use the recommended approach to perform cleanups.
---
it's a cleanup, hence no need to backport.
Closes scylladb/scylladb#23253
* github.com:scylladb/scylladb:
perf/perf_sstable: fix the indent
perf/perf_sstable: stop using at_exit()
The pool is not aware of the cluster configuration, so it can return a cluster
that is not suitable for the test. Removing reuse eliminates this
possibility, so there will be fewer flaky tests.
Closes scylladb/scylladb#23277
Fixes #23017
When deleting segments while our footprint is over the limit,
mainly when recycling/deleting segments after replay (recovery
boot), we can cause two deletion passes to be running at the same
time. This is because deletion is triggered by either
a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback
where the last one is in fact not even waited for. If we are
considering many files for delete/recycle, we can, due to task
switch, end up considering segments ok to keep, in parallel,
even though one of them should be deleted. The end result
will be us keeping one more segment than should be allowed.
Now, eventually, this should be released, once we do deletion
again, but this can take a while.
Solution is to simply ensure we serialize deletion. This might
cause some delay in processing cycles for recycle, but in
practice, this should never happen when we are in fact under
pressure.
Small unit test included.
Currently, pytest truncates long objects in assertions.
This makes understanding the failure message difficult.
This patch increases verbosity so that pytest stops truncating messages.
Closes scylladb/scylladb#23263
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.
Given the replica sets of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]
Merge will co-locate T1:rack2 into T0:rack1, so T1 will temporarily be present
at only a subset of racks, reducing availability.
This is the main problem fixed by this patch.
It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.
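A toy sketch of the rack-matching idea (illustrative types and logic, not the load balancer code): for each replica of T0, prefer a co-location target among T1's replicas in the same rack, and fall back to a cross-rack target only when unavoidable.
```
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct replica { std::string host; std::string rack; };

// Returns, for each T0 replica index, the index of the T1 replica chosen as
// its co-location target, preferring same-rack matches.
std::vector<std::optional<std::size_t>>
match_by_rack(const std::vector<replica>& t0, const std::vector<replica>& t1) {
    std::vector<std::optional<std::size_t>> match(t0.size());
    std::vector<bool> used(t1.size(), false);
    // First pass: same-rack matches (no across-rack migration needed).
    for (std::size_t i = 0; i < t0.size(); ++i) {
        for (std::size_t j = 0; j < t1.size(); ++j) {
            if (!used[j] && t1[j].rack == t0[i].rack) {
                match[i] = j;
                used[j] = true;
                break;
            }
        }
    }
    // Second pass: fall back to any remaining replica, which implies an
    // across-rack co-locating migration (done only to finalize the merge).
    for (std::size_t i = 0; i < t0.size(); ++i) {
        if (match[i]) {
            continue;
        }
        for (std::size_t j = 0; j < t1.size(); ++j) {
            if (!used[j]) {
                match[i] = j;
                used[j] = true;
                break;
            }
        }
    }
    return match;
}
```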
Fixes #22994.
Refs #17265.
This patch is based on Raphael's work done in PR #23081. The main differences are:
1) Instead of sorting replicas by rack, we try to find
replicas in sibling tablets which belong to the same rack.
This is similar to how we match replicas within the same host.
It reduces number of across-rack migrations even if RF!=#racks,
which the original patch didn't handle.
Unlike the original patch, it also avoids rack overload in case
RF!=#racks.
2) We emit across-rack co-locating migrations if we have no other choice
in order to finalize the merge
This is ok, since views are not supported with tablets yet. Later,
we will disallow this for tables which have views, and we will
allow creating views in the first place only when no such migrations
can happen (RF=#racks).
3) Added boost unit test which checks that rack overload is avoided during merge
in case RF<#racks
4) Moved logging of across-rack migration to debug level
5) Exposed metric for across-rack co-locating migrations
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Closes scylladb/scylladb#23247
Sometimes after a scoped restore a key is not found in the nodes' mutation
fragments. This patch makes the counting more verbose to get a better
understanding of what's going on in case of a test failure.
refs: #23189
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#23296
Secondary index queries fetch partition keys from the index view and store them in an `std::vector`. The vector size is currently limited by the user's page size and the page memory limit (1MiB). These are not enough to prevent large contiguous allocations (which can lead to stalls).
This series introduces a hard limit to the vector size to ensure it does not exceed the allocator's preferred max contiguous allocation size (128KiB). With the size of each element being 120 bytes, this allows for 1092 partition keys. The limit was set to 1000. Any partitions above this limit are discarded.
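The numbers above pin the limit down; as a small sanity-check sketch (constants taken from the text, names are illustrative):
```
#include <cstddef>

constexpr std::size_t preferred_max_contiguous_alloc = 128 * 1024; // 128 KiB
constexpr std::size_t partition_key_entry_size = 120;              // bytes per element
// 128 KiB / 120 B ≈ 1092 entries; the hard limit was rounded down to 1000.
constexpr std::size_t max_index_partitions_per_page = 1000;

static_assert(max_index_partitions_per_page * partition_key_entry_size
              < preferred_max_contiguous_alloc);
```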
Discarding partitions breaks the querier cache on the replicas, causing a performance regression, as can be seen from the following measurements:
```
* Cluster: 3 nodes (local Docker containers), 1 vCPU, 4GB memory, dev mode
* Schema:
CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false};
CREATE TABLE ks.t1 (pk1 int, pk2 int, ck int, value int, PRIMARY KEY ((pk1, pk2), ck));
CREATE INDEX t1_pk2_idx ON ks.t1(pk2);
* Query: CONSISTENCY LOCAL_QUORUM; SELECT * FROM ks.t1 where pk2 = 1;
+------------+-------------------+-------------------+
| Page Size | Master | Vector Limit |
+============+===================+===================+
| | Latency (sec) | Latency (sec) |
+------------+-------------------+-------------------+
| 100 | 5.80 ± 0.13 | 5.64 ± 0.10 |
+------------+-------------------+-------------------+
| 1000 | 4.77 ± 0.07 | 4.62 ± 0.06 |
+------------+-------------------+-------------------+
| 2000 | 4.67 ± 0.07 | 5.13 ± 0.03 |
+------------+-------------------+-------------------+
| 5000 | 4.82 ± 0.09 | 6.25 ± 0.06 |
+------------+-------------------+-------------------+
| 10000 | 4.89 ± 0.36 | 7.52 ± 0.13 |
+------------+-------------------+-------------------+
| -1 | 4.90 ± 0.67 | 4.79 ± 0.33 |
+------------+-------------------+-------------------+
```
We expect this to be fixed with adaptive paging in a future PR. Until then, users can avoid regressions by adjusting their page size.
Additionally, this series changes the `untyped_result_set` to store rows in a `chunked_vector` instead of an `std::vector`, similarly to the `result_set`. Secondary index queries use an `untyped_result_set` to store the raw result from the index view before processing. With 1MiB results, the `std::vector` would cause a large allocation of this magnitude.
Finally, a unit test is added to reproduce the bug.
Fixes #18536.
The PR fixes stalls of up to 100ms, but there is an easy workaround: adjust the page size. No need to backport.
Closes scylladb/scylladb#22682
* github.com:scylladb/scylladb:
cql3: secondary index: Limit page size for single-row partitions
cql3: secondary index: Limit the size of partition range vectors
cql3: untyped_result_set: Store rows in chunked_vector
test: Reproduce bug with large allocations from secondary index
scylla-sstable: Enable support for S3-stored sstables
Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532)
This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst
ref: https://github.com/scylladb/scylladb/issues/20532
fixes: https://github.com/scylladb/scylladb/issues/20535
No backport needed since the S3 functionality was never released
Closes scylladb/scylladb#22321
* github.com:scylladb/scylladb:
tests: Add Tests for Scylla-SSTable S3 Functionality
docs: Update Scylla Tools Documentation for S3 SSTable Support
scylla-sstable: Enable Support for S3 SSTables
s3: Implement S3 Fully Qualified Name Manipulation Functions
object_storage: Refactor `object_storage.yaml` parsing logic
This patch makes it clear which Raft recovery procedure is used in
each test.
Tests with "This test uses the gossip-based recovery procedure." are
the tests that use the gossip-based topology. These tests should be
deleted once we make the Raft-based topology mandatory.
Tests with the new FIXME are the tests that use the Raft-based
topology. They should be changed to use the Raft-based recovery
procedure or removed if they don't test anything important with
the new procedure.
When we remove a node in the Raft-based topology
(by remove/replace/decommission), we remove its
tokens from `system.topology`, but we do not
change `num_tokens`. Hence, the old check could
fail for left nodes.