Fixes the following scenario:
1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed
We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:
1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.
Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.
Fixes#23631
(cherry picked from commit 1e407ab4d2)
The test `test_mv_write_to_dead_node` currently uses a timeout of 60
seconds for remove_node, after it was increased from 30 seconds to fix
scylladb/scylladb#22953. Apparently it is still too low, and it was
observed to fail in debug mode.
Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000
seconds, but the test requires a timeout which is shorter than 5
minutes, because it is a regression test for an issue where MV updates
hold topology changes for more than 5 minutes, and we want to verify in
the test that the topology change completes in less than 5 minutes.
To resolve the issue, we set the test to skip in debug mode, because the
remove node operation is unpredictably slow, and we increase the timeout
to 180 seconds which is hopefully enough time for remove_node in
non-debug modes, and still sufficient to satisfy the test requirements.
Fixesscylladb/scylladb#22530Closesscylladb/scylladb#23833
(cherry picked from commit 5c1d24f983)
Closesscylladb/scylladb#23874
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).
In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.
2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
This is a regression fix, should be backported to all affected releases.
- (cherry picked from commit 4e2f62143b)
- (cherry picked from commit 6c1889f65c)
Parent PR: #23782Closesscylladb/scylladb#23810
* github.com:scylladb/scylladb:
managed_bytes_test: add a reproducer for #23781
managed_bytes: in the copy constructor, respect the target preferred allocation size
Said method passes down its diff input to mutate_internal(), after some std::ranges massaging. Said massaging is destructive -- it moves items from the diff. If the output range is iterated-over multiple times, only the first time will see the actual output, further iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens: mutate_internal() iterates over the range multiple times, first to log its content, then to pass it down the stack. This ends up resulting in an empty range being pased down and write handlers being created with nullopt optionals.
Fixes: scylladb/scylladb#21907Fixes: scylladb/scylladb#21714
A follow-up stability fix for the test is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23513
Fixes: https://github.com/scylladb/scylladb/issues/23512
Based on code-inspection, all versions are vulnerable, although <=6.2 use boost::ranges, not std::ranges.
- (cherry picked from commit 7150442f6a)
Parent PR: #21910Closesscylladb/scylladb#23791
* github.com:scylladb/scylladb:
test/cluster/test_read_repair.py: increase read request timeout
service/storage_proxy: schedule_repair(): materialize the range into a vector
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.
In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.
Fixes: https://github.com/scylladb/scylladb/issues/17174.
Needs backport to 2025.1 to prevent out of space during streaming
- (cherry picked from commit b80e957a40)
- (cherry picked from commit ed7b8bb787)
- (cherry picked from commit 5d6041617b)
- (cherry picked from commit 4a847df55c)
- (cherry picked from commit eb17af6143)
- (cherry picked from commit acd32b24d3)
- (cherry picked from commit 372b562f5e)
Parent PR: #23187Closesscylladb/scylladb#23682
* github.com:scylladb/scylladb:
test: add test for rebuild with repair
locator: service: move to rebuild_v2 transition if cluster is upgraded
locator: service: add transition to rebuild_repair stage for rebuild_v2
locator: service: add rebuild_repair tablet transition stage
locator: add maybe_get_primary_replica
locator: service: add rebuild_v2 tablet transition kind
gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode and the test read times out.
Increase timeout to 1minute to avoid this.
Fixes: #23513Closesscylladb/scylladb#23558
(cherry picked from commit 7bbfa5293f)
This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator.
It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes.
Need backporting to 2025.1, as RCU and WCU are not fully supported
Fixes#23690
- (cherry picked from commit 0eabf8b388)
- (cherry picked from commit 88095919d0)
- (cherry picked from commit 3acde5f904)
Parent PR: #23691Closesscylladb/scylladb#23790
* github.com:scylladb/scylladb:
test_returnconsumedcapacity.py: test RCU for batch get item
alternator/executor: Add RCU support for batch get items
alternator/consumed_capacity: make functionality public
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being pased down and consequently write
handlers being created with nullopt mutations.
Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.
Fixes: scylladb/scylladb#21907Fixes: scylladb/scylladb#21714Closesscylladb/scylladb#21910
(cherry picked from commit 7150442f6a)
This patch adds tests for consumed capacity in batch get item. It tests
both the simple case and the multi-item, multi-table case that combines
consistent and non-consistent reads.
(cherry picked from commit 3acde5f904)
Adds a sibling type to encrypted file, a data_sink, that
will write a data stream in the same block format as a file
object would. Including end padding.
For making encrypted data sink writing less cumbersome.
(cherry picked from commit 9ac9813c62)
Fixes#23225Fixes#23185
Moved wrapping component files/sinks to storage provider. Also ensures
to wrap data_sinks as well as actual files. This ensures that we actually
write encryption if active.
(cherry picked from commit e02be77af7)
This matches the signature of call sites. Since the only "real"
extension to actually make a marker in the sstable will do so in
the scylla component, which is writable even in a const sstable,
this is ok.
(cherry picked from commit e100af5280)
Mainly to add a somewhat functional file-impl wrapping
a data_sink. This can implement a rudimentary, write-only,
file based on any output sink.
For testing, and because they fit there, place memory
sink and source types there as well.
(cherry picked from commit 98a6d0f79c)
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.
This PR fixes the issue by updating the load balancer to no schedule any
migrations and to not make any repair plans when there a resize
finalization is pending in any table.
Also added a testcase to verify the fix.
Fixes#21762
- (cherry picked from commit 8cabc66f07)
- (cherry picked from commit 5b47d84399)
- (cherry picked from commit dccce670c1)
Parent PR: #22148Closesscylladb/scylladb#23633
* github.com:scylladb/scylladb:
topology_coordinator: fix indentation in generate_migration_updates
topology_coordinator: do not schedule migrations when there are pending resize finalizations
load_balancer: make repair plans only when there is no pending resize finalization
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.
This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to to query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.
This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.
A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.
Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
underlying consumer's on_end_of_stream(), so when consuming multiple
frozen mutations, the underlying's on_end_of_stream() is called for
each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().
Fixes: #20084Closesscylladb/scylladb#23657
(cherry picked from commit d67202972a)
Closesscylladb/scylladb#23694
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.
The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.
The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.
Fixes: https://github.com/scylladb/scylladb/issues/22409.
Needs backport to 2025.1 that introduces the new tablet repair API
- (cherry picked from commit cbde835792)
- (cherry picked from commit b81c81c7f4)
- (cherry picked from commit aa3973c850)
- (cherry picked from commit 8bbc5e8923)
- (cherry picked from commit 02fb71da42)
- (cherry picked from commit 9769d7a564)
Parent PR: #22905Closesscylladb/scylladb#23672
* github.com:scylladb/scylladb:
docs: nodetool: update repair and add tablet-repair docs
test: nodetool: add tests for cluster repair command
nodetool: add cluster repair command
nodetool: repair: extract getting hosts and dcs to functions
nodetool: repair: warn about repairing tablet keyspaces
nodetool: repair: move keyspace_uses_tablets function
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.
Status quo (around the assert in truncate procedure):
1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.
Note: truncated_at is likely above low_mark_at due to steps 2 and 3.
The interesting part of the assert is:
(truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().
If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.
truncated_at and low_mark_at uses millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.
Reproducer:
To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.
If a shard has 2 or more tablets, the table's highest RP refer to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).
So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.
Proposed solution:
The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.
We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).
But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.
Fixes#18059.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23560
(cherry picked from commit 0f59deffaa)
(cherry picked from commit 7554d4bbe09967f9b7a55575b5dfdde4f6616862)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23649
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;
In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252
The fix will need backport to all live release.
- (cherry picked from commit c2518cdf1a)
- (cherry picked from commit 6b5b563ef7)
- (cherry picked from commit 7e600a0747)
- (cherry picked from commit d126ea09ba)
- (cherry picked from commit cb76cafb60)
- (cherry picked from commit df09b3f970)
- (cherry picked from commit e5afd9b5fb)
- (cherry picked from commit 34b18d7ef4)
- (cherry picked from commit f7938e3f8b)
- (cherry picked from commit 6c1f6427b3)
- (cherry picked from commit 0d39091df2)
Parent PR: #23255Closesscylladb/scylladb#23673
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add memtable overlap check tests
replica/table: add error injection to memtable post-flush phase
utils/error_injection: add a way to set parameters from error injection points
test/cluster: add test_data_resurrection_in_memtable.py
test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
replica/mutation_dump: don't assume cells are live
replica/database: do_apply() add error injection point
replica: improve memtable overlap checks for the cache
replica/memtable: add is_merging_to_cache()
db/row_cache: add overlap-check for cache tombstone garbage collection
mutation/mutation_compactor: copy key passed-in to consume_new_partition()
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.
To fix this, do not schedule tablet migrations on any tables when there
are pending resize finalizations. This ensures that migrations from the
same table and other unrelated tables do not block resize finalization.
Also added a testcase to verify the fix.
Fixes#21762
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 5b47d84399)
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.
(cherry picked from commit 0d39091df2)
Such that a given index in the return hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
(cherry picked from commit e5afd9b5fb)
The cache should not garbage-collect tombstone which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.
(cherry picked from commit 6b5b563ef7)
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.
Fixes#23378
- (cherry picked from commit d6232a4f5f)
- (cherry picked from commit 6bff596fce)
Parent PR: #23478Closesscylladb/scylladb#23635
* github.com:scylladb/scylladb:
tablets: Make tablet allocation equalize per-shard load
tablets: load_balancer: Fix reporting of total load per node
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in assert failure. This was hit on the field where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than maxint.
To avoid such bugs in the future, replace all usage of GetInt() in
nodetool of GetInt64(), just to be sure.
A reproducer is added to the nodetool netstats crash.
Fixes: scylladb/scylladb#23394Closesscylladb/scylladb#23395
(cherry picked from commit bd8973a025)
Closesscylladb/scylladb#23476
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.
This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.
Fixes scylladb/scylladb#23325
Fixes scylladb/scylladb#23305
Fixes scylladb/scylladb#21815
Backport: looks like this is quite a frequent issue, therefore backport to 2025.1.
- (cherry picked from commit 6abfed9817)
- (cherry picked from commit b1e89246d4)
- (cherry picked from commit 0d9d0fe60e)
- (cherry picked from commit d448f3de77)
Parent PR: #23336Closesscylladb/scylladb#23470
* github.com:scylladb/scylladb:
database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`
database: Unify exception handling in `do_apply` and `apply_with_commitlog`
storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down.
exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.
Fixes: scylladb/scylladb#22919
Bug is present in all live versions, backports are required.
- (cherry picked from commit 4d8eb02b8d)
- (cherry picked from commit 7ba29ec46c)
Parent PR: #23044Closesscylladb/scylladb#23145
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
The test `test_mv_topology_change` is a regression test for
scylladb/scylladb#19529. The problem was that CL=ANY writes issued when
all replicas were down would be kept in memory until the timeout. In
particular, MV updates are CL=ANY writes and have a 5 minute timeout.
When doing topology operations for vnodes or when migrating tablet
replicas, the cluster goes through stages where the replica sets for
writes undergo changes, and the writes started with the old replica set
need to be drained first.
Because of the aforementioned MV updates, the removenode operation could
be delayed by 5 minutes or more. Therefore, the
`test_mv_topology_change` test uses a short timeout for the removenode
operation, i.e. 30s. Apparently, this is too low for the debug mode and
the test has been observed to time out even though the removenode
operation is progressing fine.
Increase the timeout to 60s. This is the lowest timeout for the
removenode operation that we currently use among the in-repo tests, and
is lower than 5 minutes so the test will still serve its purpose.
Fixes: scylladb/scylladb#22953Closesscylladb/scylladb#22958
(cherry picked from commit 43ae3ab703)
Closesscylladb/scylladb#23053
Fixes#22688
If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of keyspace will fail,
because the union of new/old options will now contained an unknown keyword.
Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy objects dc mapping, but the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.
v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some
Closesscylladb/scylladb#22693
(cherry picked from commit 342df0b1a8)
Closesscylladb/scylladb#22877
Warn about an attempt to repair tablet keysapce with nodetool repair.
A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.
(cherry picked from commit b81c81c7f4)
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentations says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.
The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.
Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).
Fixes#23534
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23547
(cherry picked from commit 84fd52315f)
Closesscylladb/scylladb#23643
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`.
Refs scylladb/scylla-enterprise#4355
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 62aeba759b)
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
while will enable tablets by default for new keyspace but
without the posibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c62865df90)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogenous clusters. Nodes with fewer shards will end up with
overloaded shards.
Refs #23378
(cherry picked from commit 6bff596fce)
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed.
Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.
Fixes: #23453.
Needs backport to 2025.1 that introduces the tablet repair scheduler.
- (cherry picked from commit 1dc29ddc86)
- (cherry picked from commit bae6711809)
Parent PR: #23455Closesscylladb/scylladb#23580
* github.com:scylladb/scylladb:
\test: add test to check concurrent migration and repair of two different tablets
repair: release erm in repair_writer_impl::create_writer when possible
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces.
Fixes: scylladb/scylladb#23218
(cherry picked from commit fee50f287c)
Parent PR: #23219Closesscylladb/scylladb#23594
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.
We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests, which is configured via CQL and therefore the Alternator
test unusually uses CQL.
So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.
Fixes#23569.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23578
(cherry picked from commit a9a6f9eecc)
Closesscylladb/scylladb#23601
Draining hints may occur in one of the two scenarios:
* a node leaves the cluster and the local node drains all of the hints
saved for that node,
* the local node is being decommissioned.
Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.
There are two reasons for that:
* Generally, in situations like that, we'd like to be able to shut down
nodes as fast as possible. The data stored in the hints won't
disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
have the highest priority and it's reflected in the scheduling groups we
use as well as the explicitly enforced throughput. If there are a large
number of hints to be replayed, it might affect our tests.
It's already happened, see: scylladb/scylladb#21949.
To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.
That should ensure all data is retained, while possibly speeding up
the shutdown process.
There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.
We also provide tests verifying that the implementation works as intended.
Fixesscylladb/scylladb#21949Closesscylladb/scylladb#22811
(cherry picked from commit 0a6137218a)
Closesscylladb/scylladb#23370
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
1) tablets have the same size
2) shards have the same capacity
That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.
After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.
Fixes#23042
* github.com:scylladb/scylladb:
tablets: Make load balancing capacity-aware
topology_coordinator: Fix confusing log message
topology_coordinator: Refresh load stats after adding a new node
topology_coordinator: Allow capacity stats to be refreshed with some nodes down
topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
test: boost: tablets_test: Always provide capacity in load_stats
test: perf_load_balancing: Set node capacity
test: perf_load_balancing: Convert to topology_builder
config, disk_space_monitor: Allow overriding capacity via config
storage_service, tablets: Collect per-node capacity in load_stats
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
Closesscylladb/scylladb#23443
* github.com:scylladb/scylladb:
Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
The test fails sporadically with:
cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
That's becase a server is stopped in the middle of the workload.
The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.
Fixes#20492Closesscylladb/scylladb#23451
(cherry picked from commit 8e506c5a8f)
Closesscylladb/scylladb#23469
After recent changes #18640 and #19151 started to reproduce for
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.
The solution is the same: deselect the tests.
Fixes#23302Closesscylladb/scylladb#23405
(cherry picked from commit 574c81eac6)
Closesscylladb/scylladb#23460
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
1) tablets have the same size
2) shards have the same capacity
That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.
After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.
Refs #23042Closesscylladb/scylladb#23079
* github.com:scylladb/scylladb:
tablets: Make load balancing capacity-aware
topology_coordinator: Fix confusing log message
topology_coordinator: Refresh load stats after adding a new node
topology_coordinator: Allow capacity stats to be refreshed with some nodes down
topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
test: boost: tablets_test: Always provide capacity in load_stats
test: perf_load_balancing: Set node capacity
test: perf_load_balancing: Convert to topology_builder
config, disk_space_monitor: Allow overriding capacity via config
storage_service, tablets: Collect per-node capacity in load_stats
(cherry picked from commit b1d9f80d85)
rebalance_tablets() was performing migrations and merges automatically
but not splits, because splits need to be acked by replicas via
load_stats. It's inconvenient in tests which want to rebalance to the
equilibrium point. This patch changes rebalance_tablets() to split
automatically by default, can be disabled for tests which expect
differently.
shared_load_stats was introduced to provide a stable holder of
load_stats which can be reused across rebalance_tablets() calls.
(cherry picked from commit 5e471c6f1b)