Store the most important schema fields, like id, version, keyspace
name, table name and the list of all columns, along with their kind,
name and type. This will serve as an alternative schema source to the
one stored in the statistics component. The latter doesn't store any
of the metadata, nor the primary key column names (just the types), so
it leads to confusion when used as a schema source for scylla-sstable.
This new schema stored in the scylla-metadata component is not intended
to be a full schema, equivalent to the one stored in the schema tables;
it is only intended to be good enough for scylla-sstable to parse
sstables in a self-sufficient manner.
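For illustration, here is a minimal sketch of what such a stored schema
could hold; the struct layout and field names are assumptions, not the
actual serialization format (sstring and utils::UUID are the usual
Scylla types):
enum class column_kind { partition_key, clustering_key, static_column, regular };
struct stored_column {
    sstring name;
    sstring type_name;   // the column's type, as a type name string
    column_kind kind;
};
struct stored_schema {
    utils::UUID id;                     // table id
    utils::UUID version;                // schema version
    sstring keyspace_name;
    sstring table_name;
    std::vector<stored_column> columns; // includes primary key column names
};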
The test starts a 3-node cluster and immediately creates a big file
on one of the nodes, to trigger the out-of-space prevention to start
rejecting writes on this node. Then a write is executed and it is
checked that it did not reach the node with critical disk utilization
but did reach the remaining nodes (it should, since RF=3 is set).
However, when not specified, a default LOCAL_ONE consistency level
is used. This means that only one node is required to acknowledge the
write.
After the write, the test checks that the write
+ did NOT reach the node with critical disk utilization (works)
+ did reach the remaining nodes
This can cause the test to fail sporadically, as the write might not
yet be on the last node.
Use CL=QUORUM instead.
Fixes: https://github.com/scylladb/scylladb/issues/26004
Closes scylladb/scylladb#26030
vector_store_client: Add support for multiple IPs in DNS responses
The DNS resolution logic now processes all IP addresses returned in a DNS
response, not just the primary one.
The client will iterate through the list of resolved IPs, attempting to
query the next one if a request fails. This improves high availability
by allowing the client to query other available nodes if one is down.
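A rough sketch of the retry-over-resolved-IPs idea (the function names
and types here are illustrative, not the actual vector_store_client
code; future<> is Seastar's):
// Try each resolved address in turn; fail only when all of them failed.
// Assumes addrs is non-empty.
future<reply> query_any_node(std::vector<net::inet_address> addrs, request req) {
    std::exception_ptr last_error;
    for (const auto& addr : addrs) {
        try {
            co_return co_await do_query(addr, req);
        } catch (...) {
            last_error = std::current_exception(); // node down, try the next IP
        }
    }
    std::rethrow_exception(last_error);
}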
References: VECTOR-187
As this is a new feature no backport is needed.
Closes scylladb/scylladb#26055
* github.com:scylladb/scylladb:
vector_store_client: Rename HTTP_REQUEST_RETRIES to ANN_RETRIES
vector_store_client: Format with clang-format
vector_store_client: Add support for multiple IPs in DNS responses
vector_store_client_test: Extract `make_vs_server` helper function
vector_store_client_test: Ensure cleanup on exception
vector_store_client_test: Fix unreliable unavailable port tests
Consider the following:
The tablet load balancer is working on:
- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity than node1
- node3: a node being decommissioned, still containing tablet replicas
In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:
// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};
load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.
Let's assume dst is a shard on node2.
Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:
// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;
If pick_candidate() selects some other empty destination node (node1,
due to its larger capacity), and that node has not yet been seen by
load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will look the node up with
std::unordered_map::at(), and because the node is not found it will
throw a std::out_of_range exception, crashing the load balancer.
This problem is fixed by changing load_sketch::populate() to initialize
its internal maps with all the nodes which populate()'s arguments
filter for.
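A minimal illustration of the failure mode and the fix, using a plain
std::unordered_map in place of load_sketch's internal state:
#include <unordered_map>
#include <vector>

struct load_sketch_like {
    std::unordered_map<int, unsigned> _load_per_node; // host -> load

    // The fix: register every candidate node up front, even empty ones,
    // so pick() never calls at() on a missing key.
    void populate(const std::vector<int>& candidate_nodes) {
        for (int host : candidate_nodes) {
            _load_per_node.try_emplace(host, 0);
        }
    }

    unsigned pick(int host) const {
        // Throws std::out_of_range if `host` was never registered --
        // this is what crashed the load balancer before the fix.
        return _load_per_node.at(host);
    }
};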
Fixes: #26203
Closes scylladb/scylladb#26207
In f3dccc2215 ("interval: change start()/end() not to return
references to data members"), we introduced interval_bound_const_ref
as a lightweight alternative to interval_bound that does not carry
a T. This was needed because interval no longer contains
interval_bound:s.
This interval_bound_const_ref was just an interval_bound<const T&>,
and converting constructors and operators were added to move between
the interval_bound<T> and interval_bound<const T&>.
However, these are in fact illegal in C++ and just happened to work
in clang 20. Clang 21 tightened its checks and now flags them.
The problem is that when instantiating interval_bound<const T&> the
converting constructor looks like a copy constructor; and while it's
behind a constraint (that evaluates to false) the rules don't care
about that.
Fix this by having a separate interval_bound_const_ref template.
The new template is slightly better as it allows assignment (since
the payload is a pointer rather than a reference). Not that it's really
needed.
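A minimal sketch of the new template's shape (simplified; the real
class carries more of interval_bound's interface):
template <typename T>
class interval_bound_const_ref {
    const T* _value;    // pointer payload, so assignment works
    bool _inclusive;
public:
    interval_bound_const_ref(const T& value, bool inclusive)
            : _value(&value), _inclusive(inclusive) {}
    const T& value() const { return *_value; }
    bool is_inclusive() const { return _inclusive; }
};
Because this is a distinct class template rather than an instantiation
of interval_bound<const T&>, none of its converting constructors can
ever look like interval_bound's copy constructor.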
The C++ rule was reported [1] as too restrictive, but there is no
resolution yet.
[1] https://cplusplus.github.io/CWG/issues/2837.html
Closes scylladb/scylladb#26081
This PR refactors the can_vote function in the Raft implementation for improved clarity and maintainability, by providing safer strong boolean types to the Raft algorithm.
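As an illustration of the strong-boolean approach (the flag names below
are made up, not necessarily the ones used in the patch), Seastar's
bool_class turns each flag into its own type:
#include <seastar/util/bool_class.hh>

using is_leader = seastar::bool_class<struct is_leader_tag>;
using has_voted = seastar::bool_class<struct has_voted_tag>;

// Callers can no longer swap the two flags by accident:
//   f(has_voted::no, is_leader::yes)  -- does not compile
void f(is_leader leader, has_voted voted);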
Fixes: #21937
Backport: No backport required
Closes scylladb/scylladb#25787
This PR improves the handling of malformed SSTables during scrub and adds tests to validate the updated behavior.
When scrub is used, there is an increased chance of encountering malformed SSTables. These should not be retried as in regular compaction. Instead, they must be handled according to the selected scrub mode: in skip mode, on malformed_sstable_exception, the invalid data or the whole SSTable should be removed; in abort and segregate modes, the scrub process should abort.
Previously, all modes treated malformed_sstable_exception the same way, causing scrub to abort even when skip mode was selected. This PR updates the scrub logic to properly handle malformed SSTable exceptions based on the selected mode.
Unit tests are added to verify the intended behavior.
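A rough sketch of the mode dispatch (illustrative only; the real scrub
code is structured differently):
#include <exception>

enum class scrub_mode { abort, skip, segregate, validate };

void handle_malformed(scrub_mode mode, std::exception_ptr eptr) {
    switch (mode) {
    case scrub_mode::skip:
        // Drop the invalid data (or the whole sstable) and continue.
        return;
    case scrub_mode::abort:
    case scrub_mode::segregate:
    default:
        // Do not retry as regular compaction would; abort the scrub.
        std::rethrow_exception(eptr);
    }
}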
Fixes scylladb/scylladb#19059
Backport is not required, it is an improvement
Closes scylladb/scylladb#25828
* github.com:scylladb/scylladb:
sstable_compaction_test: add scrub tests for malformed SSTables
scrub: skip sstable on malformed sstable exception in skip mode
Since we generate pgo profiles once a fortnight (and not every build),
pgo hash mismatches are expected as the code built diverges from the
code measured. The warnings about hash mismatches don't provide any
value (they cannot be acted upon) and are only distracting.
A possible downside is that we might miss the pgo training process
failing (it would be visible in the warnings list getting longer and
longer), but that's not a proper indicator anyway.
Suppress them with the appropriate linker switch.
Ref #26010.
Closes scylladb/scylladb#26162
As requested in #22104, moved the files and fixed the affected includes and the build system.
Moved files:
- combine.hh
- collection_mutation.hh
- collection_mutation.cc
- converting_mutation_partition_applier.hh
- converting_mutation_partition_applier.cc
- counters.hh
- counters.cc
- timestamp.hh
Fixes: #22104
This is a cleanup, no need to backport
Closes scylladb/scylladb#25085
Currently, test.py always runs in verbose mode in non-TTY environments
due to limitations of TabularConsoleOutput (a custom test reporter
implemented in test.py): in non-verbose mode it uses ANSI escape
sequences to manipulate the console, which is not possible without
a TTY.
This also affects the new pure pytest runner (currently only boost
tests run under it), but it should not, since standard pytest output
does not depend on a TTY.
This commit changes the logic to increase verbosity only for
TabularConsoleOutput in CI (non-TTY) runs, instead of for the whole
framework.
Closes scylladb/scylladb#26202
The -fvisibility=hidden flag removes a shared library's symbols from
the dynamic symbol table, reducing the shared object size and the
dynamic linking time. In return, the user promises not to rely on
the uniqueness of objects declared with exactly the same name in the
shared library and the main executable.
However, we violate this assumption. In Seastar's noncopyable_function,
we compare _vtable to _s_empty_vtable (in operator bool()). The full name
of this _s_empty_vtable is
seastar::noncopyable_function<Signature>::_s_empty_vtable. Since it
can be instantiated in both the main executable and in libseastar.so,
the comparison can fail even though we're comparing what is, in the
C++ view, the address of a unique object to itself.
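The failure pattern, distilled into a simplified sketch (this is not
Seastar's actual code):
// With -fvisibility=hidden, each DSO can end up with its own copy of a
// template's static data member, so this address comparison may give
// different answers in libseastar.so and in the main executable:
template <typename Signature>
struct noncopyable_function_like {
    struct vtable {};
    static const vtable _s_empty_vtable;     // may be instantiated twice
    const vtable* _vtable = &_s_empty_vtable;

    explicit operator bool() const {
        // In the C++ view, this compares a unique object's address to
        // itself; with hidden visibility the addresses may differ.
        return _vtable != &_s_empty_vtable;
    }
};
template <typename S>
const typename noncopyable_function_like<S>::vtable
noncopyable_function_like<S>::_s_empty_vtable{};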
To solve the problem, we can either:
- reimplement noncopyable_function::operator bool in a way that does not
depend on the object's name (for example, 'return !_vtable->empty;',
where 'empty' is a boolean initialized to true for the empty vtable), and
be careful not to repeat the mistake elsewhere
- drop -fvisibility=hidden
Here, we choose the second option. The benefit of -fvisibility=hidden is
important, but we only use shared libraries in debug mode, and the time
spent chasing these apparent one-definition-rule violations is more
important than some milliseconds during launch time and a few megabytes
in libseastar.so.
We do trade it in for -fvisibility-inlines-hidden. This is similar to
-fvisibility=hidden, but applies only to addresses of inline functions.
Since comparing functions by address is very rare, and inline functions are
very common, it seems like a reasonable tradeoff to make.
Fixes #25286.
Closes scylladb/scylladb#26154
Rename `HTTP_REQUEST_RETRIES` to `ANN_RETRIES` in `vector_store_client`,
as it now applies to all vector store nodes, not just HTTP requests.
Also, remove an unused test setter function.
The DNS resolution logic now processes all IP addresses returned in a DNS
response, not just the primary one.
The client will iterate through the list of resolved IPs, attempting to
query the next one if a request fails. This improves high availability
by allowing the client to query other available nodes if one is down.
The `generate_unavailable_localhost_port` function is not robust because it
can suffer from a race condition. It finds an available port but does not
keep it occupied, meaning another process could bind to it before the test
can use it.
The `unavailable_server` helper is a more robust solution. It creates a
server that listens on a port for its entire lifetime and immediately
closes any incoming connections. This guarantees the port remains
unavailable, making the test more reliable.
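A sketch of the helper's idea in Seastar terms (the class name is
illustrative; the actual test helper may differ):
#include <seastar/core/seastar.hh>
#include <seastar/core/future.hh>
#include <seastar/net/api.hh>

// Keep a listening socket open for the object's lifetime and drop every
// incoming connection immediately: the port stays bound (nobody else
// can take it), yet never provides a usable service.
class unavailable_server_like {
    seastar::server_socket _listener;
public:
    explicit unavailable_server_like(uint16_t port)
            : _listener(seastar::listen(seastar::make_ipv4_address({port}))) {}
    seastar::future<> run() {
        while (true) {
            // The accept_result is destroyed immediately, which closes
            // the accepted connection.
            co_await _listener.accept();
        }
    }
};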
After https://github.com/scylladb/scylladb/pull/22034, staging status of sstables streamed
via file streaming was ignored and view updates were never generated.
This patch fixes it: now staging sstables are registered with
`view_building_worker`. Then, the worker creates view building tasks
for those sstables, so the view building coordinator can schedule them
once the tablet migration is finished.
Fixes https://github.com/scylladb/scylla-enterprise/issues/4572
This fix affects only views on tablets, which are still experimental, so no backport needed.
Closes scylladb/scylladb#25776
* github.com:scylladb/scylladb:
test/test_view_building_coordinator: add reproducer for file streaming
streaming/stream_blob: register staging sstables to process them
When the `time_window_compaction_strategy::to_timestamp_type` encounters
an invalid timestamp resolution it throws an `std::runtime_error`. This
patch replaces it with `on_internal_error()` and also logs the invalid
timestamp resolution for easier debugging.
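Roughly, the change has this shape (the logger, enum and conversion
below are assumptions for illustration):
#include <seastar/util/log.hh>
#include <seastar/core/format.hh>

static seastar::logger twcs_logger("TimeWindowCompactionStrategy");
enum class timestamp_resolution { microsecond, millisecond };

int64_t to_timestamp_type(timestamp_resolution r, int64_t ts) {
    switch (r) {
    case timestamp_resolution::microsecond:
        return ts;
    case timestamp_resolution::millisecond:
        return ts * 1000;
    default:
        // on_internal_error logs and aborts (or throws, depending on
        // configuration); we now also include the offending value.
        seastar::on_internal_error(twcs_logger, seastar::format(
            "Invalid timestamp resolution: {}", static_cast<int>(r)));
    }
}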
Refs #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#26138
When a node is being replaced, it enters a "left" state while still
owning tokens. Before this patch, this is also the time when we start
draining hints targeted to this node, so the hints may get sent before
the token ownership gets migrated to another replica, and these hints
may get lost.
In this patch we postpone the hint draining for the "left" nodes to
the time when we know that the target nodes no longer hold ownership
of any tokens - so they're no longer referenced in topology. I'm
calling such nodes "released".
Before this change, when we started draining hints, we knew the
IP addresses of the target nodes. We lose this information after a node
enters the "left" stage, so when draining hints after a node is
"released", we can't drain hints that are targeted at a specific IP
rather than at a host_id.
We may have hints targeted to IPs if the migration from IP-based to
host_ID-based hints didn't happen yet. The migration happens upon
enabling a cluster feature introduced in 2024.2.0, so such hints can
only exist if we perform a direct upgrade from a version before
2024.2.0 to a version that has this change (2025.4.0+).
To avoid losing hints completely when such an upgrade is combined
with a node removal/replacement, we still drain hints when the node
enters a "left" state and the migration of hints to host_id wasn't
performed yet. For these drains, the problematic scenario can't
occur because it only affects tablets, and when upgrading from
a version before 2024.2.0, no tablets can exist yet.
If we perform such a drain, we no longer need to drain hints when
entering the "released" state, so we only drain when entering that
state if the migration was already completed.
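Condensed into pseudocode, the resulting drain policy looks like this
(the function and flag names are assumptions):
struct host_id {};                // stand-in for locator::host_id
void drain_for(host_id);          // drains all hints targeted at a node

void on_left(host_id node, bool hints_keyed_by_host_id) {
    if (!hints_keyed_by_host_id) {
        // Old IP-keyed hints: drain now, while the node's IP is still
        // known. Safe, because tablets cannot exist before the migration.
        drain_for(node);
    }
}

void on_released(host_id node, bool hints_keyed_by_host_id) {
    if (hints_keyed_by_host_id) {
        // The node owns no tokens anymore, so draining cannot race with
        // token ownership migration.
        drain_for(node);
    }
}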
With this setup, we'll always drain hints at least once when a node
is leaving. However, if the migration to host_ids finishes between
entering the "left" state and the "released" state, we'll attempt
to drain the hints twice. This shouldn't be a problem though, because
each `drain_for()` is performed with the `_drain_lock`, and after
a `hint_endpoint_manager` is drained it's removed, so we won't try
to drain it twice.
This patch also includes a test for verifying that hints are properly
replayed after a node replace.
Fixes https://github.com/scylladb/scylladb/issues/24980
Closes scylladb/scylladb#24981
Passing `0` as the `initial_tablets` argument causes `schema_loader`'s
placeholder keyspace to be a tablet keyspace.
This causes `scylla sstable` to reject some table schemas which
are legitimate in this context. For example, `scylla sstable`
refuses to work with sstables which contains `counter` columns,
because tablets don't support counters.
This is undesirable. Let's make `schema_loader`'s keyspace
a non-tablet keyspace.
Closes scylladb/scylladb#26192
Currently, while stopping the compaction_manager, we stop task_manager
compaction module and concurrently run compaction_manager::really_do_stop.
really_do_stop stops and waits for all task_executors that are kept
in compaction_manager::_tasks, but nothing ensures that no more tasks will
be added there. Due to leftover tasks, we trigger on_fatal_internal_error.
Modify the order of compaction_manager::stop. After the change, we stop
compaction tasks in the following order:
- abort module abort source;
- close module gate in the background;
- stop_ongoing_compactions (kept in compaction_manager::_tasks);
- wait until module gate is closed.
Check module abort source before creating compaction executor and
adding it to _tasks.
Thanks to the above, we can be sure that:
- after module::stop there will be no tasks in _tasks;
- compaction_manager::stop aborts all tasks; we don't wait for any whole
compaction to finish.
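In code form, the sequence above looks roughly like this (method names
are simplified assumptions):
seastar::future<> compaction_manager::stop_sketch() {
    _module->abort_source().request_abort();        // abort module abort source
    auto gate_closed = _module->gate().close();     // close gate in the background
    co_await stop_ongoing_compactions("shutdown");  // stop tasks kept in _tasks
    co_await std::move(gate_closed);                // wait until the gate is closed
    // Safe now: make_task fails on a stopped module, so _tasks stays empty.
}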
Fixes: https://github.com/scylladb/scylladb/issues/25806.
Fixes a shutdown bug; needs backports to all versions.
Closes scylladb/scylladb#25885
* github.com:scylladb/scylladb:
compaction: move _tasks check
compaction: stop compaction module in really_do_stop
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.
The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that, and the test finishes successfully
after the 10-minute timeout.
Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
- new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.
Remove nightly mark.
Fixes: #26148.
Closes scylladb/scylladb#26209
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.
Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.
Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to reballance the cluster (plan making
only, sum of all iterations, no streaming):
Before: 16 min 25 s
After: 0 min 25 s
Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):
testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)
The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.
After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls.
Fixes #26016
Closes scylladb/scylladb#26017
* github.com:scylladb/scylladb:
test: perf: perf-load-balancing: Add parallel-scaleout scenario
test: perf: perf-load-balancing: Convert to tool_app_template
tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. Next cycle will recognize that the tablet is in
transition and will not be selected by plan-maker. But it makes
plan-making less efficient.
It may also surprise consumers of the plan, like we saw in #25912.
So we should make plan-maker be aware of already scheduled transitions
and not consider those tablets as candidates.
Fixes #26038
Closes scylladb/scylladb#26048
Byte comparable format is not supported for counter types. This patch
adds explicit handling for them for completeness, allowing the default
abstract type handler to be removed.
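The explicit handling amounts to rejecting counters up front instead of
falling through to the generic handler; a hedged sketch (the real
dispatch is structured differently):
// Counters have no byte-comparable encoding, so say so explicitly
// rather than relying on a catch-all abstract_type handler.
void to_comparable_bytes(const counter_type_impl&) {
    throw std::runtime_error(
        "byte-comparable format is not supported for counter types");
}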
Refs #19407
New features. No need to backport.
Closes scylladb/scylladb#26206
* github.com:scylladb/scylladb:
types/comparable_bytes: remove default abstract type handler
types/comparable_bytes: handle counter type
Includes a fix for scylla-gdb -- the fair_queue is now hierarchical and
priority queues are no longer accessible via the _handles member. However,
the fix is incomplete -- it silently assumes that the hierarchy is flat
(and it _is_ flat now; scylla doesn't yet create nested groups), but this
should be handled eventually.
* seastar b6be384e...c8a3515f (8):
> Merge 'Nested scheduling groups (IO classes)' from Pavel Emelyanov
test: Add test case for wake-from-idle accumulator fixups
test: Add fair_queue test for queues activations
test: Expand fair queue random run test with groups
test: Add test for basic fair-queue nested linkage
test: Cleanup scheduling groups after io_queue_test cases
code: Update IO priority group shares from supergroup shares change
io_queue: Register class groups in fair-queue
fair_queue: Add test class friendship
fair_queue: Move nr_classes into group_data
fair_queue: Fix indentation after previous patch
fair_queue: Implement hierarchical queue activation (wakeup)
fair_queue: Remove now unused push/pop helpers
fair_queue: Implement hierarchical priority_entry::pop_front()
fair_queue: Implement hierarchical priority_entry::top()
fair_queue: Link priority_entries into tree
fair_queue: Add priority_class_group_data::reserve()
fair_queue: Move more bits onto priority_entry
fair_queue: Move shares on priority_entry
fair_queue: Move last_accumulated on priority_class_group_data
fair_queue: Introduce priority_class_group_data
fair_queue: Inherit priority_class_data from priority_entry
fair_queue: Rename priority_class_ptr
ioqueue: Opencode get_class_info() helper
ioq: Move fair_queue class registration down
> tls: Rework session termination
> http/request: get_query_param(): add default_value argument
> http/request: add has_query_param()
> sharded: Deprecate distributed alias
> io_tester: Allow configuring sloppy_size_hint for files
> file: Remove duplicating static_assert-ions
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26143
The test reproduces issue scylladb/scylla-enterprise#4572.
Before the fix, file-streaming didn't respect staging status of a
sstable and view updates weren't generated, leading to base-view
inconsistency.
The test creates the inconsistency in the view, triggers file-streaming
of staging sstables and verifies that the updates are generated.
After scylladb/scylladb#22034, staging status of sstables streamed
via file streaming was ignored and view updates were never generated.
This patch fixes it: now staging sstables are registered with
`view_building_worker`. Then, the worker creates view building tasks
for those sstables, so the view building coordinator can schedule them
once the tablet migration is finished.
Fixes scylladb/scylla-enterprise#4572
There's a bunch of /storage_service/... endpoints that start compaction manager tasks and wait for them. Most of them have an async peer in /tasks/... that starts the very same task but returns to the caller with the task ID.
This patch moves those handlers' code from storage_service.cc to tasks.cc, next to the corresponding async peers, to keep handlers that need compaction_manager in one place.
That's preparation for more future changes. Later all those endpoints will stop using database from http_context and will capture the compaction_manager they need from main, like it was done in #20962 for /compaction_manager/... endpoints. Even "more later", the former and the latter blocks of endpoints will be registered and unregistered together, e.g. like database endpoints were collected in one reg/unreg sequence by #25674.
Part of http_context dependencies cleanup effort, no need to backport.
Closes scylladb/scylladb#26140
* https://github.com/scylladb/scylladb:
api: Move /storage_service/compact to tasks.cc
api: Move /storage_service/keyspace_upgrade_sstables to tasks.cc
api: Move /storage_service/keyspace_offstrategy_compaction to tasks.cc
api: Move /storage_service/keyspace_cleanup to tasks.cc
api: Move /storage_service/keyspace_compaction to tasks.cc
Add unit tests for scrub behavior with malformed SSTables:
- sstable_scrub_abort_mode_malformed_sstable_test (verifies scrub aborts on malformed SSTables)
- sstable_scrub_skip_mode_malformed_sstable_test (verifies scrub skips malformed SSTables without aborting)
Previously, malformed_sstable_exception during scrub was handled the same way in
abort, skip and segregate modes, causing the scrub process to abort even when skip was specified.
This commit updates the behavior to correctly handle malformed_sstable_exception in
skip mode by removing the invalid data or the whole malformed SSTable instead of aborting the entire scrub.
In compaction_manager::really_do_stop we check whether _tasks list
is empty after the compactions are stopped. However, a new task may
still sneak in, causing the assertion failure. Such a task won't
be there for long - module::make_task will fail as the module is
already stopped.
Move the assertion that checks if _tasks is empty to after the
compaction_states' gates are closed.
Fixes: #25806.
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently with really_do_stop. We can't predict the order of the two.
Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with a memory leak.
Stop compaction module in really_do_stop, after ongoing compactions
are stopped.
It's a preparation for further patches.
Byte comparable format is not supported for counter types. This patch
adds explicit handling for them for completeness, allowing the default
abstract type handler to be removed in the next patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
`test_long_query_timeout_erm` is slow because it has many parameterized
variants, and it verifies timeout behavior during ERM operations, which
are slow by nature.
This change speeds the test up by roughly 3× (319s -> 114s) by:
- Removing two of the five scenarios that were near duplicates.
- Shortening timeout values to reduce waiting time.
- Parallelizing waiting on server_log with asyncio.TaskGroup().
The two removed scenarios (`("SELECT", True, False)`,
`("SELECT_WHERE", True, False)`) were near duplicates to
`("SELECT_COUNT_WHERE", True, False)` scenario, because all three
scenarios use non-mapreduce query and triggers basically the same
system behavior. It is sufficient to keep only one of them, so the test
verifies three cases:
- One with nodes shutdown
- One with mapreduce query
- One with non-mapreduce query
Fixes: scylladb/scylla#24127
Closes scylladb/scylladb#25987
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there are more non-pending
base replicas than view replicas, we don't need to do anything, because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there are more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.
This patch adds a workaround for this scenario. If after one migration
we have more non-pending view replicas than base replicas, we add the
extra view replica to the pending replica list so that it gets updates anyway.
This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.
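In rough pseudocode, the workaround looks like this (types and names
are illustrative, including the metric name):
#include <cstdint>
#include <vector>

struct replica { int host; int shard; };
uint64_t mismatched_base_view_pairings = 0; // the new metric from this patch

// If a non-pending view replica has no base counterpart after pairing,
// move it to the pending list so it still receives view updates.
void fix_unpaired_view_replicas(const std::vector<replica>& base,
                                std::vector<replica>& view,
                                std::vector<replica>& pending_view) {
    while (view.size() > base.size()) {
        pending_view.push_back(view.back());
        view.pop_back();
        ++mismatched_base_view_pairings;
    }
}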
This patch also includes a test for this exact scenario, which is enforced by an injection.
Fixes https://github.com/scylladb/scylladb/issues/21492
Closes scylladb/scylladb#24396
* github.com:scylladb/scylladb:
mv: handle mismatched base/view replica count caused by RF change
mv: save the nodes used for pairing calculations for later reuse
mv: move the decision about simple rack-aware pairing later
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.
To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.
A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.
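The shape of the fix, as a simplified sketch (the helper names are
made up):
seastar::future<> replay_one_batch(batch b) {
    try {
        co_await replay_mutations(b);
    } catch (const db::no_such_column_family&) {
        // The table was dropped; retrying can never succeed. Drop the
        // batch, just as we already do for a missing keyspace.
        co_await delete_batch(b.id);
    }
}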
Fixes scylladb/scylladb#24806
Closes scylladb/scylladb#26057
Simulates rebalancing on a single scale-out involving simultaneous
addition of multiple nodes per rack.
Default parameters create a cluster with 2 racks, 70 tables, 256
tablets/table, 10 nodes, 88 shards/node.
Adds 6 nodes in parallel (3 per rack).
Current result on my laptop:
testlog - Rebalance took 21.874 [s] after 82 iteration(s)
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.
Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.
Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes, 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):
Before: 16 min 25 s
After: 0 min 25 s
Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):
Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)
The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.
After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making).
The plan is twice as large because it combines the output of two subsequent (pre-patch)
plan-making calls.
Fixes #26016
All three tests could hit
https://github.com/scylladb/python-driver/issues/295. We use the
standard workaround for this issue: reconnecting the driver after
the rolling restart, and before sending any requests to local tables
(that can fail if the driver closes a connection to the node that
restarted last).
All three tests perform two rolling restarts, but the latter ones
already have the workaround.
Fixes #26005
Closes scylladb/scylladb#26056
`line_to_row` is a test function that converts `syslog` audit log to
the format of `table` audit log so tests can use the same checks
for both types of audit. Because `syslog` audit doesn't have `date`
information, the field was filled with the current date. This behavior
broke the tests running at 23:59:59 because `line_to_row` returned
different results on different days.
Fixes: scylladb/scylladb#25509
Closes scylladb/scylladb#26101
Issue #26079 noted that multiple Alternator tests fail when run against DynamoDB. This pull request fixes many of them, in several small patches. In one case we need to avoid a DynamoDB bug that wasn't even the point of the original test (and we create a new test specifically for that DynamoDB bug). Another test exposed a real incompatibility with Alternator (#26103), but it didn't need to be exposed in this specific test, so again we split the test into one that passes and another that xfails on Alternator (not on DynamoDB). A bigger change had to be made to the tags feature test: since August 2024, the TagResource operation became asynchronous, which broke our tests, so we fix this.
Each of these changes is described in more detail in the individual patches.
Refs #26079. It doesn't fix it completely because there are some tests which remain flaky, and some tests which, surprisingly, pass on us-east-1 but fail on eu-north-1. We'll need to address the rest later.
No backports needed; we only run tests against DynamoDB from master (on the rare occasions that we do), not on old branches.
Closes scylladb/scylladb#26114
* github.com:scylladb/scylladb:
test/alternator: fix test_list_tables_paginated on DynamoDB
test/alternator: fix tests in test_tag.py on DynamoDB
test/alternator: fix test_health_only_works_for_root_path on DynamoDB
test/alternator: reproducer tests for faux GSI range key problem
test/alternator: fix test "test_17119a" to pass on DynamoDB
test/alternator: fix test to pass on DynamoDB
The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong.
Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.
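Roughly (a sketch; the way the streaming group is obtained is an
assumption, and `start_backgroud_fibers` keeps the spelling the code
has at this point -- the typo is fixed in a later commit):
// Run the worker's background fibers under the streaming scheduling
// group instead of the inherited "main" group.
seastar::scheduling_group sg = streaming_scheduling_group();
co_await seastar::with_scheduling_group(sg, [this] {
    return start_backgroud_fibers();
});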
No need to backport, view build coordinator is not a part of any release yet.
Closes scylladb/scylladb#26122
* github.com:scylladb/scylladb:
mv: fix typo in start_backgroud_fibers
mv: run view building worker fibers in streaming group
`purgeable.lua` was written for a specific investigation a few years ago.
`writetime-histogram.lua` is an sstable-script transcription of the former scylla-sstable writetime-histogram command. This was also written for an investigation (before the script command existed) and is too specific to be a native command, so it was removed by edaf67edcb.
Add both scripts to the sample script library; they can be useful, either for a future investigation, or as samples to copy+edit to write new scripts (and train AI).
New sstable scripts, no backport
Closes scylladb/scylladb#26137
* github.com:scylladb/scylladb:
tools/scylla-sstable-scripts: introduce writetime-histogram.lua
tools/scylla-sstable-scripts: introduce purgable.lua