Define create_new_test_keyspace that can be used in
cases we cannot automatically drop the newly created keyspace
due to e.g. loss of raft majority at the end of the test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Following patch will convert topology tests to use
new_test_keyspace and friends.
Some tests restart server and reset the driver connection
so we cannot use the original cql Session for
dropping the created keyspace in the `finally` block.
Pass the ManagerClient instead to get a new cql
session for dropping the keyspace.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we can not have more than one global topology operation at the same time. This means that we can not have concurrent truncate operations because truncate is implemented as a global topology operation.
Truncate excludes with other topology operations, and has to wait for those to complete before truncate starts executing. This can lead to truncate timeouts. In these cases the client retries the truncate operation, which will check for ongoing global topology operations, and will fail with
an "Another global topology request is ongoing, please retry." error.
This can be avoided by truncate checking if the ongoing global topology operation is a truncate running for the same table who's truncate has just been requested again. In this case, we can wait for the ongoing truncate to complete instead of immediately failing the operation, and
provide a better user experience.
This is an improvement, backport is not needed.
Closes#22166Closesscylladb/scylladb#22371
* github.com:scylladb/scylladb:
test: add test for re-cycling ongoing truncate operations
truncate: add additional logging and improve error message during truncate
storage_proxy: wait on already running truncate for the same table
storage_proxy: allow multiple truncate table fibers per shard
Return after executing the global metadata barrier to allow the topology
handler to handle any transitions that might have started by a
concurrect transaction.
Fixes#22792
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#22793
Refs #22628
Adds exception handler + cleanup for the case where we have a bad config/env vars (hint minio) or similar, such that we fail with exception during setting up the EAR context. In a normal startup, this is ok. We will report the exception, and the do a exit(1).
In tests however, we don't and active context will instead be freed quite proper, in which case we need to call stop to ensure we don't crash on shared pointer destruction on wrong shard. Doing so will hide the real issue from whomever runs the test.
Adds some verbosity to track issues with the network proxy used to test EAR connector difficulties. Also adds an earlier close in input stream to help network usage.
Note: This is a diagnostic helper. Still cannot repro the issue above.
Closesscylladb/scylladb#22810
* github.com:scylladb/scylladb:
gcp/aws kms: Promote service_error to recoverable + use malformed_response_error
encryption_at_rest_test: Add verbosity + earlier stream close to proxy
encryption: Add exception handler to context init (for tests)
Replace boost::accumulate() with the standard library's alternatives
to reduce external dependencies and simplify the codebase. This
change eliminates the requirement for boost::range and makes the
implementation more maintainable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22856
Disable seastar's built in handlers for SIGINT and SIGTERM and thus
fall-back to the OS's default handlers, which terminate the process.
This makes tool applications interruptable by SIGINT and SIGTERM.
The default handler just terminates the tool app immediately and
doesn't allow for cleanup, but this is fine: the tools have no important
data to save or any critical cleanup to do before exiting.
Fixes: scylladb/scylladb#16954Closesscylladb/scylladb#22838
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.
This PR first refactors the reclaim and reload logic into a single
background fiber. It then updates the sstable manager to subscribe to
changes in the `components_memory_reclaim_threshold` configuration value
and immediately triggers the reclaim/reload fiber when a change is
detected.
Fixes#21947
This is an improvement and does not need to be backported.
Closesscylladb/scylladb#22725
* github.com:scylladb/scylladb:
sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update
sstables_manager: maybe_reclaim_components: yield between iterations
sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()`
sstables_manager: move reclaim logic into `components_reclaim_reload_fiber()`
sstables_manager: rename `_sstable_deleted_event` condition variable
sstables_manager: rename `components_reloader_fiber()`
sstables_manager: fix `maybe_reclaim_components()` indentation
sstables_manager: reclaim components memory until usage falls below threshold
sstables_manager: introduce `get_components_memory_reclaim_threshold()`
sstables_manager: extract `maybe_reclaim_components()`
sstables_manager: fix `maybe_reload_components()` indentation
sstables_manager: extract out `maybe_reload_components()`
The config variable `components_memory_reclaim_threshold` limits the
memory available to the sstable bloom filters. Any change to its value
is not immediately propagated to the sstable manager, despite it being
a LiveUpdate variable. The updated value takes effect only when a new
sstable is created or deleted.
This patch updates the sstable manager to subscribe to any changes in
the above mentioned config value and immediately trigger the
reclaim/reload fiber when a change occurs. Also, adds a testcase to
verify the fix.
Fixes#21947
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Refs #22628
Mark problems parsing response (partial message, network error without exception etc
- hello testing), as "malformed_response_error", and promote this as well as
general "service_error" to recoverable exceptions (don't isolate node on error).
This to better handle intermittent network issues as well as making error-testing
more deterministic.
Refs #22628
Adds some verbosity to track issues with the network proxy used to test
EAR connector difficulties. Also adds an earlier close in input stream
to help network usage.
Note: This is a diagnostic helper. Still cannot repro the issue above.
Adds exception handler + cleanup for the case where we have a
bad config/env vars (hint minio) or similar, such that we fail
with exception during setting up the EAR context.
In a normal startup, this is ok. We will report the exception,
and the do a exit(1).
In tests however, we don't and active context will instead be
freed quite proper, in which case we need to call stop to ensure
we don't crash on shared pointer destruction on wrong shard.
Doing so will hide the real issue from whomever runs the test.
Serializing `raft::append_request` for transmission requires approximately the same amount of memory as its size. This means when the Raft library replicates a log item to M servers, the log item is effectively copied M times. To prevent excessive memory usage and potential out-of-memory issues, we limit the total memory consumption of in-flight `raft::append_request` messages.
Fixesscylladb/scylladb#14411Closesscylladb/scylladb#22835
* github.com:scylladb/scylladb:
raft_rpc::send_append_entries: limit memory usage
fms: extract entry_size to log_entry::get_size
Reorder member variable initialization sequence to ensure `pw` is accessed
before being moved. While the current use-after-move warning from clang-tidy
is a false positive, this change:
- Makes the initialization order more logical
- Eliminates misleading static analysis warnings
- Prevents potential future issues if class structure changes
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22830
This patch adds to the fetch_scylla.py script, used by the "--release"
option of test/{cqlpy,alternator}/run, the ability to download the new
2025.1 releases.
In the new single-stream releases, the number looks like the old
Scylla Enterprise releases, but the location of the artifacts in the
S3 bucket look like the old open-source releases (without the word
"-enterprise" in the paths). So this patch introduces a new "if"
for the (major >= 2025) case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22778
This change adds two log messages. One for the creation of the truncate
global topology request, and another for the truncate timeout. This is
added in order to help with tracking truncate operation events.
It also extends the "Another global topology request is ongoing, please
retry." error message with more information: keyspace and table name.
Currently, we can not have more than one global topology operation at
the same time. This means that we can not have concurrent truncate
operations because truncate is implemented as a global topology
operation.
Truncate excludes with other topology operations, and has to wait for
those to complete before truncate starts executing. This can lead to
truncate timeouts. In these cases the client retries the truncate operation,
which will check for ongoing global topology operations, and will fail with
an "Another global topology request is ongoing, please retry." error.
This can be avoided by truncate checking if we have a truncate for the same
table already queued. In this case, we can wait for the ongoing truncate to
complete instead of immediatelly failing the operation, and provide a better
user experience.
Both, repair and streaming depend on view builder, but since the builder is started too late, both keep sharded<> reference on it and apply `if (view_builder.local_is_initialized())` safety checks.
However, view builder can do its sharded start much earlier, there's currently nothing that prevents it from doing so. This PR moves view builder start up together with some other of its dependencies, and relaxes the way repair and streaming use their view-builder references, in particular -- removes those ugly initialization checks.
refs: scylladb/scylladb#2737Closesscylladb/scylladb#22676
* github.com:scylladb/scylladb:
streaming: Relax streaming::make_streamig_consumer() view builder arg
streaming: Keep non-sharded view_builder dependency reference
streaming: Remove view_builder.local_is_initialized() checks
repair: Keep non-sharded view_builder dependency reference
repair: Remove view_builder.local_is_initialized() checks
main: Start sharded<view_builder> earlier
test/cql_env: Move stream manager start lower
Replace legacy shell test operator (-o) with more portable OR (||) syntax.
Fix fragile file handling in find loop by using while read loop instead.
Warnings fixed:
- SC2166: Replace [ p -o q ] with [ p ] || [ q ]
- SC2044: Replace for loop over find with while read loop
While no issues were observed with the current code, these changes improve
robustness and portability across different shell environments.
also, set the pipefail option, so that we can catch the unexpected
failure of `find` command call.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22385
It is redundant with reader_permit::impl::_ttl_timer. Use the latter for
TTL of inactive reads too. The usage of the two exclude each other, at
any point in time, either one or the other is used, so no reason to keep
both.
Closesscylladb/scylladb#22863
Demote do-nothing decisions to debug level, but keep them at info
if we did decide to do nothing (such as migrate a tablet). Information
about more major events (like split/merge) is kept at info level.
Once log line that logs node information now also logs the datacenter,
which was previously supplied by a log line that is now debug-only.
Closesscylladb/scylladb#22783
Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support.
Fixes https://github.com/scylladb/scylladb/issues/22417
New feature. No backport is needed.
Closesscylladb/scylladb#22621
* github.com:scylladb/scylladb:
test: add test to check dcs and hosts repair filter
test: add repair dc selection to test_tablet_metadata_persistence
repair: Introduce Host and DC filter support
docs: locator: update the docs and formatter of tablet_task_info
result_set_row is a heavyweight object containing multiple cell types:
regular columns, partition keys, and static values. To prevent expensive
accidental copies, delete the copy constructor and replace it with:
1. A move constructor for efficient vector reallocation
2. An explicit copy() method when copies are actually needed
This change reduces overhead in some non-hot paths by eliminating implicit
deep copies. Please note, previously, in `create_view_from_mutation()`,
we kept a copy of `result_set_row`, and then reused `table_rs` for
holding the mutation for `scylla_tables`. Because we don't copy
the `result_set_row` in this change, in order to avoid invalidating
the `row` after reusing `table_rs` in the outer scope, we define a
new `table_rs` shadowing the one in the out scope.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22741
The existing test measures latencies of object GET-s. That's nice (though incomplete), but we want to measure upload performance. Here it is.
refs: #22460Closesscylladb/scylladb#22480
* github.com:scylladb/scylladb:
test/perf/s3: Add --part-size-mb option for upload test
test/perf/s3: Add uploading test
test/perf/s3: Some renames not to be download-centric
test/perf/s3: Make object/file name configurable
test/perf/s3: Configure maximum number of sockets
test/perf/s3: Remove parallelizm
s3/client: Make http client connections limit configurable
Given two sets of equivalent types, return the set
intersection.
This is a generic function which adapts to the actual
input type.
A unit test is added.
Closesscylladb/scylladb#22763
this PR is propper(pythonic) chance of commit 288a47f815
Creating an own folder used to be needed for two reasons:
we want a separate test suite, with its own settings
we want to structure tests, e.g. tablets, raft, schema, gossip.
We've been creating many folders recently. However, test suite
infrastructure is expensive in test.py - each suite has its own
pool of servers, concurrency settings and so on.
Make it possible to structure tests without too many suites,
by supporting subfolders within a suite.
As an example, this PR move mv tests into a separate folder
custom test.py lookup also works.
tests can be run as:
1. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets/test_mv_tablets_empty_ip
2. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets
3. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv
Fixes https://github.com/scylladb/scylladb/issues/20570Closesscylladb/scylladb#22816
* github.com:scylladb/scylladb:
test.py: move mv tests into a separate folder
test.py: suport subfolders
Seems tox is not used anywhere, so there is no need to have it then.
Especially when it messes with pytest. In some cases it can change the
config dir in pytest run.
Closesscylladb/scylladb#22819
Table updates that try to enable stream (while changing or not the
StreamViewType) on a table that already has the stream enabled
will result in ValidationError.
Table updates that try to disable stream on a table that does not
have the stream enabled will result in ValidationError.
Add two tests to verify the above.
Mark the test for changing the existing stream's StreamViewType
not to xfail.
Fixesscylladb/scylladb#6939Closesscylladb/scylladb#22827
Switch to using schema_ptr wrapper when handling schema references in
scylla_read_stats function. The existing fallback for older versions
(where schema is already a raw pointer) remains preserved.
Fixes#18700
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#22726
Fixes#22688
If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of keyspace will fail,
because the union of new/old options will now contained an unknown keyword.
Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy objects dc mapping, but the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.
v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some
Closesscylladb/scylladb#22693
The test
topology_custom/test_alternator::test_localnodes_broadcast_rpc_address
sets up nodes with a silly "broadcast rpc address" and checks that
Alternator's "/localnodes" requests returns it correctly.
The problem is that although we don't use CQL in this test, the test
framework does open a CQL connection when the test starts, and closes
it when it ends. It turns out that when we set a silly "broadcast RPC
address", the driver tends to try to connect to it when shutting down,
I'm not even sure why. But the choice of the silly address was 1.2.3.4
is unfortunate, because this IP address is actually routable - and
the driver hangs until it times out (in practice, in a bit over two
minutes). This trivial patch changes 1.2.3.4 to 127.0.0.0 - and equally
silly address but one to which connections fail immediately.
Before this patch, the test often takes more than 2 minutes to finish
on my laptop, after this patch, it always finishes in 4-5 seconds.
Fixes#22744
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22746
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.
Fixes: #22770Closesscylladb/scylladb#22771
This PR addresses two related issues in our task system:
1. Prepares for asynchronous resource cleanup by converting release_resources() to a coroutine. This refactoring enables future improvements in how we handle resource cleanup.
2. Fixes a cross-shard resource cleanup issue in the SSTable loader where destruction of per-shard progress elements could trigger "shared_ptr accessed on non-owner cpu" errors in multi-shard environments. The fix uses coroutines to ensure resources are released on their owner shards.
Fixes#22759
---
this change addresses a regression introduced by d815d7013c, which is contained by 2025.1 and master branches. so it should be backported to 2025.1 branch.
Closesscylladb/scylladb#22791
* github.com:scylladb/scylladb:
sstable_loader: fix cross-shard resource cleanup in download_task_impl
tasks: make release_resources() a coroutine
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22857
In a rolling upgrade, nodes that weren't upgraded yet will not recognize
the new tablet_resize_finalization state, that serves both split and
merges, leading to a crash. To fix that, coordinator will pick the
old tablet_split_finalization state for serving split finalization,
until the cluster agrees on merge, so it can start using the new
generic state for resize finalization introduced in merge series.
Regression was introduced in e00798f.
Fixes#22840.
Reported-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#22845
The script fetch_scylla.py is used by the "--release" option of
test/cqlpy/run and test/alternator/run to fetch a given release of
Scylla. The release is fetched from S3, and the script assumed that the
user properly set up $HOME/.aws/config and $HOME/.aws/credentials
to determine the source of that download and the credentials to do this.
But this is unnecessary - Scylla's "downloads.scylladb.com" bucket
actually allows **anonymous** downloads, and this is what we should use.
After this patch, fetch_scylla.py (and the "--release" option of the
run scripts) work correctly even for a user that doesn't have $HOME/.aws
set up at all.
This fix is especially important to new developers, who might not even
have AWS credentials to put into these files.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22773
Two callers of it -- repair and stream-manager -- both have non-sharded
reference and can just use it as argument. The helper in question gets
sharded<> one by itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous path -- view builder is started early
enough and construction of stream manager can happen with non-sharded
reference on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>