The skipped ranges should be multiplied by the number of tables
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174
(cherry picked from commit cffe3dc49f)
(cherry picked from commit 1392a6068d)
(cherry picked from commit 9868ccbac0)
Refs #21252Closesscylladb/scylladb#21313
* github.com:scylladb/scylladb:
test: Add test_node_ops_metrics.py
repair: Make the ranges more consistent in the log
repair: Fix finished ranges metrics for removenode
Despite OSS doesn't limit number of created service levels, match the
enterprise limit to decrease divergence in the test between OSS and
enterprise.
Fixesscylladb/scylladb#21044
(cherry picked from commit 846d94134f)
Closesscylladb/scylladb#21282
Fixes#21159
When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.
Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.
(cherry picked from commit c28a5173d9)
Closesscylladb/scylladb#21393
When a compaction_group is removed via `compaction_manager::remove`,
it is erase from `_compaction_state`, and therefore compaction
is definitely not enabled on it.
This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.
Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be consider to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.
Fixesscylladb/scylladb#20060
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 1c55747637)
Closesscylladb/scylladb#21404
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.
Fix it by taking a copy instead. This is a slow path anyway.
Fixes: scylladb/scylladb#21220
(cherry picked from commit fb38bfa35d)
Closesscylladb/scylladb#21361
In the current scenario, the nodetool status doesn’t display information regarding zero token nodes. For example, if 5 nodes are spun by the administrator, out of which, 2 nodes are zero token nodes, then nodetool status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the “/storage_service/host_id ” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
A test is also added in nodetool/test_status.py to verify this logic. This test fails without this commit’s zero token node support logic, hence verifying the behavior.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.
Fixes: scylladb/scylladb#19849Fixes: scylladb/scylladb#17857
(cherry picked from commit 72f3c95a63)
(cherry picked from commit 39dfd2d7ac)
(cherry picked from commit c00d40b239)
Refs scylladb/scylladb#20909Closesscylladb/scylladb#21334
* github.com:scylladb/scylladb:
fix nodetool status to show zero-token nodes
test: move `wait_for_first_completed` to pylib/util.py
token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
In the current scenario, the nodetool status doesn’t display information
regarding zero token nodes. For example, if 5 nodes are spun by the
administrator, out of which, 2 nodes are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.
This commit intends to fix this issue by leveraging the “/storage_service/host_id
” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.
Robust topology tests are added, which spins up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.
The test `test_status_keyspace_joining_node` has been removed. This test is
based on case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, the nodes with "host_id=None"
go undetected. Since this case is anyway impossible, we can get rid of this.
This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions dont support zero token nodes.
Fixes: scylladb/scylladb#19849
(cherry picked from commit c00d40b239)
Rename host_id map getter, 'get_endpoint_to_host_id_map_for_reading' to 'get_endpoint_to_host_id_map_'
Also modify the getter to return information regarding joining nodes as well.
This getter will later be used for retrieving the nodes in nodetool status, hence it needs to show all nodes,
including joining ones.
The function name suffix `_for_reading` suggests that the function was used
in some other places in the past, and indeed if we need endpoints
"for reading" then we cannot show joining endpoints. But it was confirmed
that this function is currently only used by "/storage_service/host_id" endpoint,
hence it can be modified as required.
Fixes: scylladb/scylladb#17857
(cherry picked from commit 72f3c95a63)
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.
By default, the monitoring stack queries it once every 60s.
(cherry picked from commit 93777fa907)
Closesscylladb/scylladb#21304
Consider the number of tables for the number of ranges logging. Make it
more consistent with the log when the ops starts.
(cherry picked from commit 1392a6068d)
The skipped ranges should be multiplied by the number of tables.
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174
(cherry picked from commit cffe3dc49f)
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.
Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.
Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264
(cherry picked from commit dbb26da2aa)
Closesscylladb/scylladb#21303
ALTER tablets-enabled KEYSPACES (KS) may fail due to
group0_concurrent_modification, in which case it's repeated by a for
loop surrounding the code. But because raft's add_entry consumes the
raft's guard (by std::move'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned for loop altogether and rethrow the exception, as the rf_change event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.
Fixes: https://github.com/scylladb/scylladb/issues/21102
Should be backported to every 6.x branch, as it may lead to a crash.
(cherry picked from commit de511f56ac)
(cherry picked from commit 3f4c8a30e3)
(cherry picked from commit 522bede8ec)
Refs https://github.com/scylladb/scylladb/pull/21121Closesscylladb/scylladb#21256
* github.com:scylladb/scylladb:
test: topology: add disable_schema_agreement_wait utility function
test: add UT to test retrying ALTER tablets KEYSPACE
cql/tablets: fix indentation in `rf_change` event handler
cql/tablets: fix retrying ALTER tablets KEYSPACE
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.
After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.
Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
Fixes: https://github.com/scylladb/scylladb/issues/21263
This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2
(cherry picked from commit e1d8cddd09)
Refs scylladb/scylladb#21058Closesscylladb/scylladb#21178
* github.com:scylladb/scylladb:
test/boost/mutation_test: add test for multishard permit safety
test/lib/reader_lifecycle_policy: add semaphore factory to constructor
test/lib/reader_lifecycle_policy: rename factory_function
repair/row_level: drop now unneeded release_base_resource() calls
readers/multishard: make multishard reader safe to create with admitted permits
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after peforming the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.
However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.
In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.
Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.
Fixes: scylladb/scylladb#21060
(cherry picked from commit a380a2efd9)
Closesscylladb/scylladb#21129
When there are zero tablets, tablet_metadata::_balancing_enabled
is ignored in the copy.
The property not being preserved can result in balancer not
respecting user's wish to disable balancing when a replica is
created later on.
Fixes#21175.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit dfc217f99a)
Closesscylladb/scylladb#21190
Add a test checking that the multishard reader will not deadlock, when
created with an admitted permit, on a semaphore with a single count
resource.
(cherry picked from commit e1d8cddd09)
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.
(cherry picked from commit 5a3fd69374)
To reader_factor_function. We are about to add a new factory function
parameters, so the current factory_function has to be renamed to
something more specific.
(cherry picked from commit c8598e21e8)
Passing an admitted permit -- i.e. one with count resources on it -- to
the multishard reader, will possibly result in a deadlock, because the
permit of the multishard reader is destroyed after the permits of its
child readers. Therefore its semaphore resources won't be automatically
released until children acquire their own resources.
This creates a dependency (an edge in the "resource allocation graph"),
where the semaphore used by the multishard reader depends on the
semaphores used by children. When such dependencies create a cycle, and
permits are acquired by different reads in just the right order, a
deadlock will happen.
Users of the multishard reader have to be aware of this gotcha -- and of
course they aren't. This is small wonder, considering that not even the
documentation on the multishard reader mentions this problem.
To work around this, the user has to call
`reader_permit::release_base_resources()` on the permit, before passing
it to the multishard reader.
On multiple occasions, developers (including the very author of the
multishard reader), forgot or didn't know about this and this resulted
in deadlocks down the line.
This is a design-flaw of the multishard reader, which is addressed in
this patch, after which, it is safe to pass admitted or not admitted
permits to the multishard reader, it will handle the call to
`release_base_resources()` if needed.
(cherry picked from commit 218ea449a5)
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
Fixes#20916
`perf-simple-query` stats before and after this fix :
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closesscylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
(cherry picked from commit 519e167611)
Backported from #20985 to 6.2.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#21251
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.
(cherry picked from commit 522bede8ec)
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.
Fixes: scylladb/scylladb#21102
(cherry picked from commit de511f56ac)
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.
Fixes#20043
(cherry picked from commit f09fe4f351)
(cherry picked from commit e5bf376cbc)
(cherry picked from commit 1863ccd900)
Refs #21020Closesscylladb/scylladb#21111
* github.com:scylladb/scylladb:
tablets: Validate system.tablets update
group0_client: Introduce change validation
group0_client: Add shared_token_metadata dependency
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.
If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#19325
(cherry picked from commit 56c971373c)
(cherry picked from commit 0ac450de05)
Refs #20528Closesscylladb/scylladb#21211
* github.com:scylladb/scylladb:
scylla_raid_setup: configure SELinux file context
scylla_coredump_setup: fix SELinux configuration for RHEL9
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#20573
(cherry picked from commit 0ac450de05)
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.
Fixes#19325
(cherry picked from commit 56c971373c)
deselect remove_data_dir_of_dead_node event from test_random_failures
due to ussue #20751
(cherry picked from commit 9b0e15678e)
Closesscylladb/scylladb#21138
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.
Fixes: #21034
Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.
(cherry picked from commit f8475915fb)
Closesscylladb/scylladb#21107
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.
If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.
However, the check assumes that the RELEASE is in the format
<build identifier>.<date>.<commit hash>
and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.
Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.
Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.
Fixes: scylladb/scylladb#21027
(cherry picked from commit 64ca58125e)
Closesscylladb/scylladb#21103
This is a manual backport of #20788
When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:
1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1
If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica.
This change fixes this problem by disabling tombstone garbage collection for pending replicas.
This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.
Fixes#21090Closesscylladb/scylladb#21061
* github.com:scylladb/scylladb:
test: test tombstone GC disabled on pending replica
tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
database::table: add tombstone_gc_enabled(locator::tablet_id)
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```
in this change, we redistribute `addr2line.py` as well. this
should address the issue above.
Fixesscylladb/scylladb#21077
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit da433aad9d)
Closesscylladb/scylladb#21085
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in filter_for_query(): the map is considered incorrect if the list of replicas contains a node from a data center whose replication factor is 0.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282; it only adds condition checks to prevent undefined behavior in cases of inconsistent inputs.
Refs scylladb/scylladb#20625
As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.
(cherry picked from commit 132358dc92)
(cherry picked from commit ae23d42889)
(cherry picked from commit ad93cf5753)
(cherry picked from commit 8db6d6bd57)
(cherry picked from commit c373edab2d)
Refs #20851Closesscylladb/scylladb#21067
* github.com:scylladb/scylladb:
Add conditions checking for get_read_executor
Avoid an extra call to block_for in db::filter_for_query.
Improve code readability in consistency_level.cc and storage_proxy.cc
tools: Add build_info header with functions providing build type information
tests: Add tests for alter table with RF=1 to RF=0
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.
The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.
However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.
Refs scylladb/scylladb#17575
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)
Closesscylladb/scylladb#20722
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.
This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.
(cherry picked from commit 6cf3747c5f)
Closesscylladb/scylladb#21030
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
get_endpoints_for_reading(): the map is considered incorrect the number of
read replica nodes is higher than replication factor. The check is
applied only when built in non release mode.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.
Refs scylladb/scylladb#20625
(cherry picked from commit c373edab2d)