before this change, we specify the KillMode of the scylla-service
service unit explicitly to "process". according to
according to
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html,
> If set to process, only the main process itself is killed (not recommended!).
and the document suggests use "control-group" over "process".
but scylla server is not a multi-process server, it is a multi-threaded
server. so it should not make any difference even if we switch to
the recommended "control-group".
in the light that we've been seeing "defunct" scylla process after
stopping the scylla service using systemd. we are wondering if we should
try to change the `KillMode` to "control-group", which is the default
value of this setting.
in this change, we just drop the setting so that the systemd stops the
service by stopping all processes in the control group of this unit
are stopped.
Fixesscylladb/scylladb#21507
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21508
(cherry picked from commit 961a53f716)
Closesscylladb/scylladb#23177
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.
Fixes: #22620Closesscylladb/scylladb#22622
(cherry picked from commit 7ce932ce01)
Closesscylladb/scylladb#22717
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.
A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.
The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.
To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.
Fixesscylladb/scylladb#21829Closesscylladb/scylladb#22493
(cherry picked from commit 6d34125eb7)
Closesscylladb/scylladb#22605
In main.cc storage_service is started before and stopped after
repair_service. storage_service keeps a reference to sharded
repair_service and calls its methods, but nothing ensures that
repair_service's local instance would be alive for the whole
execution of the method.
Add a gate to repair_service and enter it in storage_service
before executing methods on local instances of repair_service.
Fixes: #21964.
Closesscylladb/scylladb#22145
(cherry picked from commit 32ab58cdea)
Closesscylladb/scylladb#22317
Currently task_manager_module::is_aborted checks the tasks local
to caller's shard on a given shard.
Fix the method to check the task map local to the given shard.
Fixes: #22156.
Closesscylladb/scylladb#22161
(cherry picked from commit a91e03710a)
Closesscylladb/scylladb#22196
Said fields in statistics are of type
`disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled
as array of regular strings. However these fields store exploded
clustering keys, so the elements store binary data and converting to
string can yield invalid UTF-8 characters that certain JSON parsers (jq,
or python's json) can choke on. Fix this by treating them as binary and
using `to_hex()` to convert them to string. This requires some massaging
of the json_dumper: passing field offset to all visit() methods and
using a caller-provided disk-string to sstring converter to convert disk
strings to sstring, so in the case of statistics, these fields can be
intercepted and properly handled.
While at it, the type of these fields is also fixed in the
documentation.
Before:
"min_column_names": [
"��Z���\u0011�\u0012ŷ4^��<",
"�2y\u0000�}\u007f"
],
"max_column_names": [
"��Z���\u0011�\u0012ŷ4^��<",
"}��B\u0019l%^"
],
After:
"min_column_names": [
"9dd55a92bc8811ef12c5b7345eadf73c",
"80327900e2827d7f"
],
"max_column_names": [
"9dd55a92bc8811ef12c5b7345eadf73c",
"7df79242196c255e"
],
Fixes: #22078Closesscylladb/scylladb#22225
(cherry picked from commit f899f0e411)
Closesscylladb/scylladb#22295
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.
Fixes: #21480Closesscylladb/scylladb#22251
(cherry picked from commit 55963f8f79)
Closesscylladb/scylladb#22378
Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be
initialized before compressor::namespace_prefix due to undefined
global variable initialization order across translation units. This
was causing ZstdCompressor to be unregistered in release builds,
making it impossible to create tables with Zstd compression.
Replace the global namespace_prefix variable with a function that
returns the fully qualified compressor name. This ensures proper
initialization order and fixes the registration of the ZstdCompressor.
Fixesscylladb/scylladb#22444
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22451
(cherry picked from commit 4a268362b9)
Closesscylladb/scylladb#22509
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.
Skip the keyspace repair if no_such_keyspace is thrown during preparations.
Fixes: #22073.
Requires backport to 6.1 and 6.2 as they contain the bug
- (cherry picked from commit bfb1704afa)
- (cherry picked from commit 54e7f2819c)
Parent PR: #22473Closesscylladb/scylladb#22540
* github.com:scylladb/scylladb:
test: add test to check if repair handles no_such_keyspace
repair: handle keyspace dropped
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.
Another issue is when we fail with a "fatal" exception and not retrying
to read, the cdc metadata is left in an inconsistent state that causes
further attempts to insert this CDC generation to fail.
What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.
To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.
Fixesscylladb/scylladb#21227Closesscylladb/scylladb#22093
(cherry picked from commit 4f5550d7f2)
Closesscylladb/scylladb#22544
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.
Skip the keyspace repair if no_such_keyspace is thrown during preparations.
(cherry picked from commit bfb1704afa)
When a node is bootstrapped and joined a cluster as a non-voter and changes it's role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.
Fixes scylladb/scylladb#20814
Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.
- (cherry picked from commit 775411ac56)
- (cherry picked from commit 16053a86f0)
- (cherry picked from commit 8c48f7ad62)
- (cherry picked from commit 3da4848810)
- (cherry picked from commit 228a66d030)
Parent PR: #22253Closesscylladb/scylladb#22357
* github.com:scylladb/scylladb:
raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
raft: Handle non-critical config update errors in when changing status to voter.
test: Add test to check that a node does not fail on unknown commit status error when starting up.
raft: Add run_op_with_retry in raft_group0.
To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call.
This ensures the operation doesn't get stuck indefinitely.
(cherry picked from commit 228a66d030)
for retrying.
There are several places in `raft_group0` where almost identical code is
used for retrying `modify_config` in case of `commit_status_unknown`
error. To avoid code duplication all these places were changed to
use a new wrapper `run_op_with_retry`.
(cherry picked from commit 3da4848810)
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.
Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.
Fixesscylladb/scylladb#20638
(cherry picked from commit b1be2d3c41)
Closesscylladb/scylladb#22355
to voter.
When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing
a new Raft record, for instance, if the Raft leader changes during this time. These errors are not
critical and should not cause a node crash, as the action can be retried.
Fixesscylladb/scylladb#20814
(cherry picked from commit 8c48f7ad62)
error when starting up.
Test that a node is starting successfully if while joining a cluster and becoming a voter, it
receives an unknown commit status error.
Test for scylladb/scylladb#20814
(cherry picked from commit 16053a86f0)
Since when calling `modify_config` it's quite often we need to do
retries, to avoid code duplication, a function wrapper that allows
a function to be called with automatic retries in case of failures
was added.
(cherry picked from commit 775411ac56)
This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values.
Fixes scylladb/scylladb#19101
Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2
- (cherry picked from commit e577f1d141)
- (cherry picked from commit 39785c6f4e)
- (cherry picked from commit 155480595f)
Parent PR: #21996Closesscylladb/scylladb#22297
* github.com:scylladb/scylladb:
storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
test: Add test case for checking read repair diff calculation when having conflicting keys.
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.
Fixesscylladb/scylladb#21807
Backport: not needed since this is a fix in the testing scripts.
Closesscylladb/scylladb#22192
(cherry picked from commit 2f1731c551)
Closesscylladb/scylladb#22248
function.
The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.
(cherry picked from commit 1554805)
diff calculation hashmap.
This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.
Fixesscylladb/scylladb#19101
(cherry picked from commit 39785c6)
conflicting keys.
The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.
(cherry picked from commit e577f1d141)
This series attempts to get read of flakiness in cache_algorithm_test by solving two problems.
Problem 1:
The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form: 0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.
Problem 2:
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
Fixesscylladb/scylladb#21536
(cherry picked from commit 1fffd976a4)
(cherry picked from commit 6caaead4ac)
Parent PR: scylladb/scylladb#21948Closesscylladb/scylladb#22227
* github.com:scylladb/scylladb:
cache_algorithm_test: harden against stats being confused by background activity
cache_algorithm_test: fix a use of an uninitialized variable
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no idea if this PR is with conflicts or not (in Scylla it's important since CI will not start on draft PR)
Let's add a comment to notify the user we have conflicts
Closesscylladb/scylladb#21939
(cherry picked from commit 2e6755ecca)
Closesscylladb/scylladb#22189
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.
Added a testcase to verify the fix.
Fixes#21887
This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.
Closesscylladb/scylladb#21895
* github.com:scylladb/scylladb:
sstables_manager: reclaim memory from sstables on unlink
sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
sstables: introduce disable_component_memory_reload()
sstables_manager: log sstable name when reclaiming components
(cherry picked from commit d4129ddaa6)
Closesscylladb/scylladb#21997
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.
This can cause the test to fail spuriously.
With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.
This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)
(cherry picked from commit 6caaead4ac)
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...
Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.
(cherry picked from commit 1fffd976a4)
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.
Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.
As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.
We want to backport this change to all supported branches so that
it helps us in all tests.
Fixesscylladb/scylladb#22160Closesscylladb/scylladb#22157
The series contains small fixes to the gossiper one of which fixes#21930. Others I noticed while debugged the issue.
Fixes: #21930
- (cherry picked from commit 91cddcc17f)
Parent PR: #21956Closesscylladb/scylladb#21990
* github.com:scylladb/scylladb:
gossiper: do not reset _just_removed_endpoints in non raft mode
gossiper: do not call apply for the node's old state
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.
This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.
A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.
Fixes: scylladb/scylladb#20082Closesscylladb/scylladb#21600
(cherry picked from commit 6c90a25014)
Closesscylladb/scylladb#21821
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.
The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.
Fixesscylladb/scylladb#21763Closesscylladb/scylladb#21773
(cherry picked from commit 373855b493)
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.
The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.
The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.
Fixesscylladb/scylladb#20483Closesscylladb/scylladb#21748
(cherry picked from commit 53224d90be)
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.
I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).
Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.
The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.
Fixes#21325
(cherry picked from commit 868f5b59c4)
(cherry picked from commit 0f2101b055)
Refs #21323Closesscylladb/scylladb#21359
* github.com:scylladb/scylladb:
utils: cached_file: Mark permit as awaiting on page miss
utils: cached_file: Push resource_unit management down to cached_file
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.
I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.
The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.
(cherry picked from commit 0f2101b055)
It saves us permit operations on the hot path when we hit in cache.
Also, it will lay the ground for marking the permit as awaiting later.
(cherry picked from commit 868f5b59c4)
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.
Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
the step to always execute, regardless of `github.event_name`
Despite GitHub's documentation suggesting that ${{ }} can be omitted,
it recommends using explicit ${{}} expressions for compound conditions.
Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements
Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.
Example Error:
```
python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
shell: /usr/bin/bash -e {0}
env:
DEFAULT_BRANCH: master
GITHUB_TOKEN: ***
Traceback (most recent call last):
File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
main()
File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
commits = repo.compare(start_commit, end_commit).commits
File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
headers, data = self._requester.requestJsonAndCheck(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
return self.__check(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```
Fixesscylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21809
(cherry picked from commit e04aca7efe)
Closesscylladb/scylladb#21819
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.
Fixes#20030
This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.
Closesscylladb/scylladb#21582
* github.com:scylladb/scylladb:
compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
compaction_group: rename `update_main_sstable_list_on_compaction_completion`
compaction_group: update maintenance sstable set on scrub compaction completion
compaction_group: store table::sstable_list_builder::result in replacement_desc
table::sstable_list_builder: remove old sstables only from current list
table::sstable_list_builder: return removed sstables from build_new_list
(cherry picked from commit 58baeac0ad)
Closesscylladb/scylladb#21789
This fixes a use-after-free bug when parsing clustering key across
pages.
Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).
Details of the first problem:
Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.
To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.
In 93482439, the promoted index cursor was optimized to avoid
fully page copy when parsing index blocks. Instead, parser is
given a temporary_buffer which is a view on the page.
A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.
If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.
The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.
This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
Details of the solution:
We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.
We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold to the
LSA buffer even across allocating sections by the means of cached_file::page_ptr.
Fixes#20766Closesscylladb/scylladb#20837
* github.com:scylladb/scylladb:
sstables: bsearch_clustered_cursor: Add trace-level logging
sstables: bsearch_clustered_cursor: Move definitions out of line
test, sstables: Verify parsing stability when allocating section is retried
test, sstables: Verify parsing stability when buffers cross page boundary
sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
cached_file: Adapt page_view to ContiguousSharedBuffer
cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
sstables, utils: Allow parsers to work with different buffer types
sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
(cherry picked from commit fb8743b2d6)
Closesscylladb/scylladb#20906
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):
forward_jump_clocks(std::chrono::seconds(60*60*24*31));
The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.
When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:
ttl_opt topology_request_tracking_mutation_builder::ttl() const {
return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
- std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
}
_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.
The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.
The test doesn't fail in CI becuase there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().
Closes scylladb/scylladb#21558
(cherry picked from commit 1d0c6aa26f)
Closesscylladb/scylladb#21583Fixes#21581
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.
Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.
Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.
Fixes: #21503.
Closesscylladb/scylladb#21504
(cherry picked from commit 572b005774)
Closesscylladb/scylladb#21621
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.
Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency
Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions
Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios
Fixesscylladb/scylladb#21724
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21726
(cherry picked from commit 65949ce607)
Closesscylladb/scylladb#21733
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.
A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is the most severe for tablet_repair_task_impl that checks the _aborted_pending_repairs content from different shards, that do not see the top-level task. Hence the repair isn't stopped but it creates shard repair tasks on all shards but the one that initialized repair.
Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.
Fixes: #21612.
Needs backport to 6.1 and 6.2 as they contain the bug.
Closesscylladb/scylladb#21616
* github.com:scylladb/scylladb:
test: add test to check if repair is properly aborted
repair: add shard param to task_manager_module::is_aborted
repair: use task abort source to abort repair
repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
repair: fix task_manager_module::abort_all_repairs
(cherry picked from commit 5ccbd500e0)
Closesscylladb/scylladb#21641
Alternator's "/localnodes" HTTP requests is supposed to return the list
of nodes in the local DC to which the user can send requests.
Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.
However, it turns out that checking is_normal() is not enough, because
if node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().
This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.
Fixes#21538
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#21540
(cherry picked from commit 7607f5e33e)
Closesscylladb/scylladb#21633
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list. So concurrent
operations traversing the list could work with stopped groups.
During a test which emitted schema changes during migrations,
a failure happened when updating the compaction strategy of a table,
but since the group was stopped, the compaction manager was unable
to find the state for that group.
In order to fix it, we'll skip stopped groups when traversing the
list since they're unused at this stage of migration and going away
soon.
Fixes#20699.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b8d6f864bc)
Closesscylladb/scylladb#21203
Fixes#21159
When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.
Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.
(cherry picked from commit c28a5173d9)
Closesscylladb/scylladb#21394
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.
Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.
Fixes: scylladb/scylladb#19959Closesscylladb/scylladb#21477
(cherry picked from commit 92db2eca0b)
Closesscylladb/scylladb#21555
After merged 5a470b2bfb, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.
Fixes https://github.com/scylladb/scylladb/issues/21441
Also, since Amazon Linux 2 has different package name for semange, we need to
adjust package name.
Fixes https://github.com/scylladb/scylladb/issues/21351Closesscylladb/scylladb#21474
* github.com:scylladb/scylladb:
scylla_raid_setup: support installing semanage on Amazon Linux 2
scylla_raid_setup: fix failure on SELinux package installation
(cherry picked from commit 1c212df62d)
Closesscylladb/scylladb#21546
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.
Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.
Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264Fixes: scylladb/scylladb#21570Closesscylladb/scylladb#21059
(cherry picked from commit 7c75fc599f)
Closesscylladb/scylladb#21571
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them by continue with shutdown.
stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.
Fixes scylladb/scylladb#21159
* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shudown when the exceptions are thrown in noexcept context in the deferred stop destructor body
(cherry picked from commit e942c074f2)
(cherry picked from commit d8500472b3)
(cherry picked from commit c08ba8af68)
(cherry picked from commit a7a55298ea)
(cherry picked from commit 6cce67bec8)
Refs #21299Closesscylladb/scylladb#21435
* github.com:scylladb/scylladb:
compaction_manager: stop: await _stop_future if engaged
compaction_manager: really_do_stop: assert that no tasks are left behind
compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
compaction/compaction_manager: stop_tasks(): unlink stopped tasks
compaction/compaction_manager: make _tasks an intrusive list
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental. What matters is that `_stop_future` is engaged.
While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6cce67bec8)
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7a55298ea)
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.
Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.
Fixesscylladb/scylladb#21298
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c08ba8af68)
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).
(cherry picked from commit d8500472b3)
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.
Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
(cherry picked from commit e942c074f2)
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.
For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.
This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.
Fixesscylladb/scylladb#13757Closesscylladb/scylladb#21459
(cherry picked from commit 35921eb67e)
Closesscylladb/scylladb#21496
Since Scylla is a public repo, when we create a fork, it doesn't fork the team and permissions (unlike private repos where it does).
When we have a backport PR with conflicts, the developers need to be able to update the branch to fix the conflicts. To do so, we modified the logic of the backport automation as follows:
- Every backport PR (with and without conflicts) will be open directly on the `scylladbbot` fork repo
- When there are conflicts, an email will be sent to the original PR author with an invitation to become a contributor in the `scylladbbot` fork with `push` permissions. This will happen only once if Auther is not a contributor.
- Together with sending the invite, all backport labels will be removed and a comment will be added to the original PR with instructions
- The PR author must add the backport labels after the invitation is accepted
Fixes: https://github.com/scylladb/scylladb/issues/18973Closesscylladb/scylladb#21401
(cherry picked from commit 77604b4ac7)
Closesscylladb/scylladb#21465
Adding an auto-backport.py script to handle backport automation instead of Mergify.
The rules of backport are as follows:
* Merged or Closed PRs with any backport/x.y label (one or more) and promoted-to-master label
* Backport PR will be automatically assigned to the original PR author
* In case of conflicts the backport PR will be open in the original autoor fork in draft mode. This will give the PR owner the option to resolve conflicts and push those changes to the PR branch (Today in Scylla when we have conflicts, the developers are forced to open another PR and manually close the backport PR opened by Mergify)
* Fixing cherry-pick the wrong commit SHA. With the new script, we always take the SHA from the stable branch
* Support backport for enterprise releases (from Enterprise branch)
Fixes: https://github.com/scylladb/scylladb/issues/18973
(cherry picked from commit f9e171c7af)
Closesscylladb/scylladb#21470
The skipped ranges should be multiplied by the number of tables
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174
(cherry picked from commit cffe3dc49f)
(cherry picked from commit 1392a6068d)
(cherry picked from commit 9868ccbac0)
Refs #21252Closesscylladb/scylladb#21314
* github.com:scylladb/scylladb:
test: Add test_node_ops_metrics.py
repair: Make the ranges more consistent in the log
repair: Fix finished ranges metrics for removenode
When a compaction_group is removed via `compaction_manager::remove`,
it is erase from `_compaction_state`, and therefore compaction
is definitely not enabled on it.
This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.
Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be consider to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.
Fixesscylladb/scylladb#20060
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 78ceaeabca)
Closesscylladb/scylladb#21405
ALTER tablets-enabled KEYSPACES (KS) may fail due to
group0_concurrent_modification, in which case it's repeated by a for
loop surrounding the code. But because raft's add_entry consumes the
raft's guard (by std::move'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned for loop altogether and rethrow the exception, as the rf_change event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.
Fixes: https://github.com/scylladb/scylladb/issues/21102
Should be backported to every 6.x branch, as it may lead to a crash.
(cherry picked from commit de511f56ac)
(cherry picked from commit 3f4c8a30e3)
(cherry picked from commit 522bede8ec)
Refs https://github.com/scylladb/scylladb/pull/21121Closesscylladb/scylladb#21340
* github.com:scylladb/scylladb:
test: topology: add disable_schema_agreement_wait utility function
test: add UT to test retrying ALTER tablets KEYSPACE
cql/tablets: fix indentation in `rf_change` event handler
cql/tablets: fix retrying ALTER tablets KEYSPACE
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.
By default, the monitoring stack queries it once every 60s.
(cherry picked from commit 93777fa907)
Closesscylladb/scylladb#21305
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.
(cherry picked from commit 522bede8ec)
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.
Fixes: scylladb/scylladb#21102
(cherry picked from commit de511f56ac)
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.
Fix it by taking a copy instead. This is a slow path anyway.
Fixes: scylladb/scylladb#21220
(cherry picked from commit fb38bfa35d)
Closesscylladb/scylladb#21373
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in filter_for_query(): the map is considered incorrect if the list of replicas contains a node from a data center whose replication factor is 0.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282; it only adds condition checks to prevent undefined behavior in cases of inconsistent inputs.
Refs scylladb/scylladb#20625
As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.
(cherry picked from commit 132358dc92)
(cherry picked from commit ae23d42889)
(cherry picked from commit ad93cf5753)
(cherry picked from commit 8db6d6bd57)
(cherry picked from commit c373edab2d)
Refs #20851Closesscylladb/scylladb#21068
* github.com:scylladb/scylladb:
Add conditions checking for get_read_executor
Avoid an extra call to block_for in db::filter_for_query.
Improve code readability in consistency_level.cc and storage_proxy.cc
tools: Add build_info header with functions providing build type information
tests: Add tests for alter table with RF=1 to RF=0
Consider the number of tables for the number of ranges logging. Make it
more consistent with the log when the ops starts.
(cherry picked from commit 1392a6068d)
The skipped ranges should be multiplied by the number of tables.
Otherwise the finished ranges ratio will not reach 100%.
Fixes#21174
(cherry picked from commit cffe3dc49f)
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.
Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
Fixes#20916
`perf-simple-query` stats before and after this fix :
`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59393 insns/op, 24029 cycles/op, 0 errors)
97551.14 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59376 insns/op, 23966 cycles/op, 0 errors)
96599.92 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59367 insns/op, 23998 cycles/op, 0 errors)
97774.91 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59370 insns/op, 23968 cycles/op, 0 errors)
97796.13 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59368 insns/op, 23947 cycles/op, 0 errors)
throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19
// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59392 insns/op, 24058 cycles/op, 0 errors)
97311.48 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59375 insns/op, 24005 cycles/op, 0 errors)
98043.10 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59381 insns/op, 23941 cycles/op, 0 errors)
96750.31 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59396 insns/op, 24025 cycles/op, 0 errors)
93381.21 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 59390 insns/op, 24097 cycles/op, 0 errors)
throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```
This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.
Closesscylladb/scylladb#20985
* github.com:scylladb/scylladb:
topology-custom: add test to verify tombstone gc in read path
replica/table: check memtable before discarding tombstone during read
compaction_group: track maximum timestamp across all sstables
(cherry picked from commit 519e167611)
Backported from #20985 to 6.1.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#21250
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.
If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.
However, the check assumes that the RELEASE is in the format
<build identifier>.<date>.<commit hash>
and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.
Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.
Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.
Fixes: scylladb/scylladb#21027
(cherry picked from commit 64ca58125e)
Closesscylladb/scylladb#21104
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.
The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.
However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.
Refs scylladb/scylladb#17575
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)
Closesscylladb/scylladb#20723
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes https://github.com/scylladb/scylladb/issues/20044.
Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.
(cherry picked from commit bcd358595f)
(cherry picked from commit 93815e0649)
Refs https://github.com/scylladb/scylladb/pull/20939Closesscylladb/scylladb#21205
* github.com:scylladb/scylladb:
replica: Fix tombstone GC during tablet split preparation
service: Improve error handling for split
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.
fixes#20043
(cherry picked from commit f09fe4f351)
(cherry picked from commit e5bf376cbc)
(cherry picked from commit 1863ccd900)
Refs #21020Closesscylladb/scylladb#21110
* github.com:scylladb/scylladb:
tablets: Validate system.tablets update
group0_client: Introduce change validation
group0_client: Add shared_token_metadata dependency
replica/tablets: Add to_tablet_metadata_(row_)?key helpers
replica/tablets: extract tablet_replica_set_from_cell()
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.
If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extraceted from larger patch f5976aa87b (replica/tablets: add
get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint())
by Botond. The helpers are needed to decode mutations with tablets
update to validate them later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).
The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups. Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.
Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.
Fixes scylladb/scylladb#18862
Needs backport to 6.0 since tablets require this capability
(cherry picked from commit a7b92d7b6f)
(cherry picked from commit 027e64876a)
(cherry picked from commit 44bd183187)
(cherry picked from commit f47b5e60bc)
Refs #19555Closesscylladb/scylladb#20644
* github.com:scylladb/scylladb:
sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
sstables: storage: keep base directory in base class
sstables: storage: define opened_directory in header file
sstable_directory: use only dirlog
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.
Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.
Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.
To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
Fixes#20044.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.
Note that this requires loading and processing
the base directory first.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f47b5e60bc)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
# Conflicts:
# sstables/sstable_directory.hh
so we can use the base (table) directory for
e.g. pending_delete logs, in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 44bd183187)
So it can be used outside the storage module
in the following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 027e64876a)
Currently, there are leftover log messages using
sstlog rather than dirlog, that was introduced
in aebd965f0e,
and that makes debugging harder.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7b92d7b6f)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
# Conflicts:
# sstables/sstable_directory.cc
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.
So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.
This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.
It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.
Fixes#20626.
(cherry picked from commit 999f1f1318)
(cherry picked from commit 38ce2c605d)
Refs #20737Closesscylladb/scylladb#20802
* github.com:scylladb/scylladb:
tablet: Fix single-sstable split when attaching new unsplit sstables
replica: Fix tablet split execute after restart
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.
Fixes: #21034
Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.
(cherry picked from commit 3969ffb39f)
Closesscylladb/scylladb#21106
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.
Fixes: #20240
(cherry picked from commit 5ac16e29e6)
Closesscylladb/scylladb#21023
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```
in this change, we redistribute `addr2line.py` as well. this
should address the issue above.
Fixesscylladb/scylladb#21077
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit da433aad9d)
Closesscylladb/scylladb#21087
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:
- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
get_endpoints_for_reading(): the map is considered incorrect the number of
read replica nodes is higher than replication factor. The check is
applied only when built in non release mode.
Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.
Refs scylladb/scylladb#20625
(cherry picked from commit c373edab2d)
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.
(cherry picked from commit ae23d42889)
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.
Test for scylladb/scylladb#20625
(cherry picked from commit 132358dc92)
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.
This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.
(cherry picked from commit c2ba300f1c)
Closesscylladb/scylladb#21031
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.
Fixes: #20039
Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.
(cherry picked from commit 042825247f)
(cherry picked from commit adf453af3f)
(cherry picked from commit 9c5950533f)
(cherry picked from commit 47acdc1f98)
(cherry picked from commit 93d61d7031)
(cherry picked from commit 6676e47371)
(cherry picked from commit 2aabe7f09c)
(cherry picked from commit ee56bbfe61)
Refs #20208Closesscylladb/scylladb#21010
* github.com:scylladb/scylladb:
cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
cql: join new and old KS options in ALTER tablets KS
cql: fix validation of ALTERing RFs in tablets KS
cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
cql: validate RF change for new DCs in ALTER tablets KS
cql: extend test_alter_tablet_keyspace_rf
cql: refactor test_tablets::test_alter_tablet_keyspace
cql: remove unused helper function from test_tablets
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweight the
benefit, remove the timeout altogether.
The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.
Refs: #18269
(cherry picked from commit 3ebb124eb2)
Closesscylladb/scylladb#20956
in a3db5401, we introduced the TLS certi authenticator, which is
configured using `auth_certificate_role_queries` option . the
value of this option contains a regular expression. so there are
chances the regular expression is malformatted. in that case,
when converting its value presenting the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.
since we decided to use Boost.Regex, let's catch `boost::regex_error`.
Refs a3db5401Fixesscylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit c7eafc4dc1)
Closesscylladb/scylladb#20953
Refs #20686
Refs #15607
In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.
(cherry picked from commit 60f8a9f39d)
Closesscylladb/scylladb#20704
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#19325
(cherry picked from commit 56c971373c)
(cherry picked from commit 0ac450de05)
Refs #20528Closesscylladb/scylladb#20871
* github.com:scylladb/scylladb:
scylla_raid_setup: configure SELinux file context
scylla_coredump_setup: fix SELinux configuration for RHEL9
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.
The patch makes sure that there is no end iterator in _live_iterators.
Fixesscylladb/scylladb#20874
(cherry picked from commit da084d6441)
Closesscylladb/scylladb#21004
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absoulte differences of RFs specified in the statement is
not greter than 1.
(cherry picked from commit ee56bbfe61)
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was ommitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.
(cherry picked from commit 2aabe7f09c)
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
needs their RFs to be validated.
(cherry picked from commit 6676e47371)
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, among rf-per-dc mapping, returns also
`class:replication_strategy` pair. Second pair's member has been casted
into an `int` type and somehow the code was still running fine, but only
extra testing added later discovered a bug in here.
(cherry picked from commit 93d61d7031)
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds a multi-dc testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.
Refs: #20039
(cherry picked from commit 47acdc1f98)
1. Renamed the testcase to emphasize that it only focuses on testing
changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8
(cherry picked from commit adf453af3f)
`change_default_rf` is not used anywhere, moreover it uses
`replication_factor` tag, which is forbidden in ALTER tablets KS
statement.
(cherry picked from commit 042825247f)
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.
Fixes#20890.
(cherry picked from commit bcd358595f)
There are two bits that control whenter replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:
if (strategy.supports_tablets) {
if (cql.with_tablets) {
if (cfg.enable_tablets) {
return create_keyspace_with_tablets();
} else {
throw "tablets are not enabled";
}
} else if (cql.with_tablets = off) {
return create_keyspace_without_tablets();
} else { // cql.with_tablets is not specified
if (cfg.enable_tablets) {
return create_keyspace_with_tablets();
} else {
return create_keyspace_without_tablets();
}
}
} else { // strategy doesn't support tablets
if (cql.with_tablets == on) {
throw "invalid cql parameter";
} else if (cql.with_tablets == off) {
return create_keyspace_without_tablets();
} else { // cql.with_tablets is not specified
return create_keyspace_without_tablets();
}
}
closes: #20088
In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already highjacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.
There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20928
Fixes#20862
With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.
Test included.
(cherry picked from commit ee5e71172f)
Closesscylladb/scylladb#20914
For each new node added to the raft config populate it's ID to IP mapping in raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration long after it first appears in the gossiper.
Fixes scylladb/scylladb#20600
Backport to all supported versions since the bug may cause bootstrapping failure.
(cherry picked from commit bddaf498df)
(cherry picked from commit 9e4cd32096)
Refs #20601Closesscylladb/scylladb#20848
* github.com:scylladb/scylladb:
test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
group0: make sure that address map has an entry for each new node in the raft configuration
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.
Fixes#20573
(cherry picked from commit 0ac450de05)
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.
Fixes#19325
(cherry picked from commit 56c971373c)
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
Fixes: https://github.com/scylladb/scylladb/issues/20629
Need to be backported since this is a regression
(cherry picked from commit 644e7a2012)
(cherry picked from commit c0939d86f9)
(cherry picked from commit 1b4c255ffd)
Closesscylladb/scylladb#20834
* github.com:scylladb/scylladb:
test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
topology coordinator:: mark node as being replaced earlier
topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
Increase workers for that used in method async_rmtree() that is used for
cleaning directories. This should help to reduce flakiness.
Increasing the workers count was introduced in f54b7f5427
but there is no need to backport the whole commit.
Closesscylladb/scylladb#20795
What it called "leader" is actually the destination of the RPC.
Trivial fix, should be backported to all affected versions.
(cherry picked from commit 84dd0e922b)
Closesscylladb/scylladb#20827
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as expiring entry. It becomes
non expiring when a node is added to raft configuration. But when a node
joins those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and mappings may
expire already from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.
(cherry picked from commit bddaf498df)
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.
(cherry picked from commit c0939d86f9)
During replace with the same IP a node may get queries that were intended
for the node it was replacing since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts replacing procedure the old node is marked as "being replaced"
and queries no longer sent there. It is important to do so before the
new node start to get raft snapshot since the snapshot application is
not atomic and queries that run parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because schema is empty, so they will not find any tables in the first
place. The is pre-existing and not addressed by this patch.
(cherry picked from commit 644e7a2012)
to explain for instance which setting takes effect if both
command line options and `scylla.yaml` configures the same parameter.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1aa030a8cd)
Closesscylladb/scylladb#20775
This commit fixes a link to the Manager by adding a missing underscore
to the external link.
(cherry picked from commit aa0c95c95c)
Closesscylladb/scylladb#20707
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301Closesscylladb/scylladb#20471
* github.com:scylladb/scylladb:
cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
database::get_all_tables_flushed_at: fix return value
(cherry picked from commit 0e5b444777)
Backported from #20471 to 6.1.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#20581
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.
Fixesscylladb/scylladb#20791
Should be backported to all branches containing this test to reduce
flakiness.
(cherry picked from commit f390d4020a)
Closesscylladb/scylladb#20809
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.
This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#20271
(cherry picked from commit b25b8dccbd)
Closesscylladb/scylladb#20800
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.
So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.
This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.
It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.
Fixes#20626.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 38ce2c605d)
let's assume there are 2 nodes, n1, n2. n1 is the coordinator.
1) n1 emits split
2) n1 and n2 complete split work
3) n1 becomes aware all replicas are ready for split
4) n2 restarts, but places split sstable into main group[1]
5) n1 executes split
6) n2 handles split completion, but see the main group is not empty
[1]: During split, main group should only contain unsplit sstables.
If all sstables are split, main must be empty.
This is a result of replica not setting storage group to split mode on restart
(using tablet map) and therefore sstables are incorrectly placed on main group.
The fix is about looking at tablet map and setting group to split mode before
sstables are populated into it.
Refs #20626.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 999f1f1318)
The test configures write timeout to much smaller value to make the test
run faster since for some writes sleep is inserted to hit the timeout,
but it makes aarch64 debug flaky since timeout happens when it should
not because of a natural slowness.
(cherry picked from commit 71a5b1c6dd)
Closesscylladb/scylladb#20777
Bind variables in CQL have two formats: positional (?) where a variable is referred to by its relative position in the statement, and named (:var), where the user is expected to supply a name->value mapping.
In 19a6e69001 we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the dialect and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes https://github.com/scylladb/scylladb/issues/15559
This may be useful to users transitioning from Cassandra, so merits a backport.
(cherry picked from commit f9322799af)
(cherry picked from commit d69bf4f010)
(cherry picked from commit ea8441dfa3)
Refs https://github.com/scylladb/scylladb/pull/19493Closesscylladb/scylladb#20590
* github.com:scylladb/scylladb:
cql3: add option to not unify bind variables with the same name
cql3: introduce dialect infrastructure
cql3: prepared_statement_cache: drop cache key default constructor
Merge 'config: round-trip boolean configuration variables' from Avi Kivity
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.
Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.
Fixesscylladb/scylladb#20608
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 39ce358d82)
Closesscylladb/scylladb#20663
In https://github.com/scylladb/scylladb/pull/18729, we introduced a new statement tenant $maintenance, but the change wasn't protected by any cluster feature.
This wasn't a problem for OSS, since unknown isolation cookie just uses default scheduling group. However, in enterprise that leads to creating a service level on not-upgraded nodes, which may end up in an error if user create maximum number of service levels.
This patch adds a cluster feature to guard adding the new tenant. It's done in the way to handle two upgrade scenarios:
version without $maintenance tenant -> version with $maintenance tenant guarded by a feature
version with $maintenance tenant but not guarded by a feature -> version with $maintenance tenant guarded by a feature
The PR adds enabled flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection.
The $maintenance tenant is added to the config as disabled and it gets enabled once the corresponding feature is enabled.
Fixes https://github.com/scylladb/scylladb/issues/20070
Refs https://github.com/scylladb/scylla-enterprise/issues/4403
(cherry picked from commit d44844241d)
(cherry picked from commit 71a03ef6b0)
(cherry picked from commit b4b91ca364)
Refs https://github.com/scylladb/scylladb/pull/19802Closesscylladb/scylladb#20674
* github.com:scylladb/scylladb:
message/messaging_service: guard adding maintenance tenant under cluster feature
message/messaging_service: add feature_service dependency
message/messaging_service: add `enabled` flag to statement tenants
This is a manual backport of #20212 to 6.1, superseding #20345 (which had run into conflicts).
Please see the individual commit messages for backport notes.
Fixes#10305Closesscylladb/scylladb#20355
* github.com:scylladb/scylladb:
generic_server: make server::stop() idempotent
generic_server: coroutinize server::shutdown()
generic_server: make server::shutdown() idempotent
test/generic_server: add test case
configure, cmake: sort the lists of boost unit tests
generic_server: convert connection tracking to seastar::gate
Adding a new tenant needs to be done under cluster feature protection.
However it wasn't the case for adding `$maintenance` statement tenant
and to fix it we need to support an upgrade from node which doesn't
know about maintenance tenant at all and from one which uses it without
any cluster feature protection.
This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.
(cherry-picked from d44844241d)
Consider the following:
```
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.
Fixes https://github.com/scylladb/scylladb/issues/19378.
Fixes https://github.com/scylladb/scylladb/issues/19416.
Please replace this line with justification for the backport/* labels added to this PR
(cherry picked from commit 239344ab55)
(cherry picked from commit 74612ad358)
Refs https://github.com/scylladb/scylladb/pull/19427Closesscylladb/scylladb#20595
* github.com:scylladb/scylladb:
tablets: Fix race between repair and split
compaction: Allow "offline" sstable to be split
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.
Ignore cleanup of deallocated storage groups.
Fixes: https://github.com/scylladb/scylladb/issues/19752.
Needs to be backported to all branches with tablets (6.0 and later)
(cherry picked from commit 20d6cf55f2)
(cherry picked from commit 2c4b1d6b45)
Refs https://github.com/scylladb/scylladb/pull/20584Closesscylladb/scylladb#20627
* github.com:scylladb/scylladb:
test: check if cleanup of deallocated sg is ignored
replica: ignore cleanup of deallocated storage group
To drop a semaphore it should not be held by anyone, so we need to
release out units before checking if a semaphore can be dropped.
Fixes: scylladb/scylladb#20602
(cherry picked from commit 9cc54932ae)
Closesscylladb/scylladb#20621
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.
Ignore cleanup of deallocated storage group.
(cherry picked from commit 20d6cf55f2)
Consider the following:
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 74612ad358)
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.
In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.
However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.
Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.
A unit test is added.
Fixes#15559
(cherry picked from commit ea8441dfa3)
(cherry picked from commit edb3068ecf)
A dialect is a different way to interpret the same CQL statement.
Examples:
- how duplicate bind variable names are handled (later in this series)
- whether `column = NULL` in LWT can return true (as is now) or
whether it always returns NULL (as in SQL)
Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.
The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.
(cherry picked from commit d69bf4f010)
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.
Not a regression, so a backport isn't strictly necessary.
Closesscylladb/scylladb#19792
* github.com:scylladb/scylladb:
config: specialize from-string conversion for bool
config: wrap boost::lexical_cast<> when converting from strings
(cherry picked from commit 9eb47b3ef0)
c70f321c6f added an extra check if KS
exists. This check can throw `data_dictionary::no_such_keyspace`
exception, which is supposed to be caught and a more user-friendly
exception should be thrown instead.
This commit fixes the above problem and adds a testcase to validate it
doesn't appear ever again.
Also, I moved the check for the keyspace outside of the `for` loop, as
it doesn't need to be checked repeatedly.
Additionally, I added an extra comment to both `no_such_keyspace` and
`no_such_column_family` exceptions explaining they should not be
returned directly to the caller, as they lack error code, which may not
trigger correct exceptions handling mechanisms on the driver side.
Fixes: #20097
(cherry picked from commit f1e8976fbe)
Closesscylladb/scylladb#20553
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.
Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylladb#20558Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
This is a backport of the original PR without the tests, done avoid the need of resolving merge conflicts in that area.
Closesscylladb/scylladb#20557
* github.com:scylladb/scylladb:
hints: send hints with CL=ALL if target is leaving
hints: inline do_send_one_mutation
ScyllaDB doesn't support custom compressors. The available compressors
are the only available ones, not the default ones.
Adjust the text to reflect this.
(cherry picked from commit 08f109724b)
Closesscylladb/scylladb#20524
Even after 13caac7, we still have more files incorrect permission, since
we use "cp -r" and creating new file with redirect.
To fix this, we need to replace "cp -r" with "cp -pr", and "chmod <perm>" on
newly created files.
Fixes#14383
Related #19775
(cherry picked from commit 9d7fed40b5)
Closesscylladb/scylladb#20432
Currently, when attempting to send a hint, we might choose its
recipients in one of two ways:
- If the original destination is a natural endpoint of the hint, we only
send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.
There is a problem when we decommission a node: while data is streamed
away from that node, it is still considered to be a natural endpoint of
the data that it used to own. Because of that, it might happen that a
hint is sent directly to it but streaming will miss it, effectively
resulting in the hint being discarded.
As sending the hint _only_ to the leaving replica is a rather bad idea,
send the hint to all replicas also in the case when the original
destiantion of the hint is leaving.
Note that this is a conservative fix written only with the decommission
+ vnode-based keyspaces combo in mind. In general, such "data loss" can
occur in other situations where the replica set is changing and we go
through a streaming phase, i.e. other topology operations in case of
vnodes and tablet load balancing. However, the consistency guarantees of
hinted handoff in the face of topology changes are not defined and it is
not clear what they should be, if there should be any at all. The
picture is further complicated by the fact that hints are used by
materialized views, and sending view updates to more replicas than
necessary can introduce inconsistencies in the form of "ghost rows".
This fix was developed in response to a failing test which checked the
hint replay + decommission scenario, and it makes it work again.
Fixesscylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
(cherry picked from commit 61ac0a336d)
It's a small method and it is only used once in send_one_mutation.
Inlining it lets us get rid of its declaration in the header - now, if
one needs to change the variables passed from one function to another,
it is no longer necessary to change the header.
(cherry picked from commit 8abb06ab82)
Because of https://github.com/scylladb/scylladb/issues/9285 hit weighted
load balancer may sometimes return same node twice. It may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is not easy to fix and it is rare lets introduce a
workaround. We will check for duplicates and will use non HWLB one if
one is found.
(cherry picked from commit e06a772b87)
Closesscylladb/scylladb#20468
The test cases in this file use an error injection to reduce raft group
0 timeouts (from the default 1 minute), in order to speed up the tests;
the scenarios expect these timeouts to happen, so we want them to happen
as quick as possible, but we don't want to reduce timeouts so much that
it will make other operations fail when we don't expect them to (e.g.
when the test wants to add a node to the cluster).
Unfortunately the selected 5 seconds in debug mode was not enough and
made the tests flaky: scylladb/scylladb#20111.
Increase it to 10 seconds. This unfortunately will slow down these tests
as they have to sometimes wait for 10 seconds for the timeout to happen.
But better to have this than a flaky test.
Fixes: scylladb/scylladb#20111
(cherry picked from commit 52fdf5b4c9)
Closesscylladb/scylladb#20477
for following reasons:
1. the ppa in question does not provide the build for the latest ubuntu's LTS release. it only builds for trusty, xenial, bionic and jammy. according to https://wiki.ubuntu.com/Releases, the latest LTS release is ubuntu noble at the time of writing.
2. the ppa in question does not provide the packages used in production. it does provides the package for *building* scylla
3. after we introduced the relocatable package, there is no need to provide extra user space dependencies apart from scylla packages.
so, in this change, we remove all references to enabling the Scylla/PPA repository.
Fixesscylladb/scylladb#20449
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit fe0e961856)
Closesscylladb/scylladb#20453
The Alternator TTL scanning code uses an object "scan_ranges_context"
to hold the scanning context. One of the members of this object is
a service::query_state, and that in turn holds a reference to a
service::client_state. The existing constructor created a temporary
client_state object and saved a reference to it - which can result
in use after free as the temporary object is freed as soon as the
constructor ends.
The fix is to save a client_state in the scan_ranges_context object,
instead of a temporary object.
Fixes#19988
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 15f8046fcb)
Closesscylladb/scylladb#20436
in 372a4d1b79, we introduced a change
which was for debugging the logging message. but the logging message
intended for printing the temp_dir not prints an `optional<int>`. this
is both confusing, and more importantly, it hurts the debuggability.
in this change, the related change is reverted.
Fixesscylladb/scylladb#20408
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit d26bb9ae30)
Closesscylladb/scylladb#20434
before this change, if user does not have `/bin/sh` around, when
installing scylla packages, the script in `%pretrans" is executed,
and fails due to missing `/bin/sh`. per
https://docs.fedoraproject.org/en-US/packaging-guidelines/Scriptlets/#pretrans
> Note that the %pretrans scriptlet will, in the particular case of
> system installation, run before anything at all has been installed.
> This implies that it cannot have any dependencies at all. For this
> reason, %pretrans is best avoided, but if used it MUST (by necessity)
> be written in Lua. See
> https://rpm-software-management.github.io/rpm/manual/lua.html for more
> information.
but we were trying to warn users upgrading from scylla < 1.7.3, which
was released 7 years ago at the time of writing.
in this change, we drop the `%pretrans` section. hopefuly they will
find their way out if they still exist.
Fixesscylladb/scylladb#20321
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 6970c502c9)
Closesscylladb/scylladb#20384
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.
during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.
Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.
Fixes#20339.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 26facd807e)
Closesscylladb/scylladb#20343
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.
Throw if batchlog manager isn't initialized.
Fixes: #20236.
Needs backport to 6.0 and 6.1 as they suffer from the uninitialized bm access.
(cherry picked from commit d8e4393418)
(cherry picked from commit f38bb6483a)
Refs #20251Closesscylladb/scylladb#20351
* github.com:scylladb/scylladb:
test: add test to ensure repair won't fail with uninitialized bm
repair: throw if batchlog manager isn't initialized
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.
The series adds the test for this scenario and the fix for the
chicken&egg problem above.
The series (or at least the fix itself) is needs to be backported because
this is a serious regression.
Fixes: https://github.com/scylladb/scylladb/issues/19025
(cherry picked from commit 84757a4ed3)
(cherry picked from commit b98282a976)
(cherry picked from commit 2f1b1fd45e)
(cherry picked from commit 17f4a151ce)
(cherry picked from commit 32a59ba98f)
Refs https://github.com/scylladb/scylladb/pull/20290Closesscylladb/scylladb#20374
* github.com:scylladb/scylladb:
topology coordinator: fix indentation after the last patch
topology coordinator: do not add replacing node without a ring to topology
test: add test for replace in clusters with encryption enabled
test.py: add server encryption support to cluster manager
.gitignore: fix pattern for resources to match only one specific directory
When only inter dc encryption is enabled a non encrypted connection
between two nodes is allowed only if both nodes are in the same dc.
If a nodes that initiates the connection knows that dst is in the same
dc and hence use non encrypted connection, but the dst not yet knows the
topology of the src such connection will not be allowed since dst cannot
guaranty that dst is in the same dc.
Currently, when topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to the
group0. The coordinator will try to send raft message to the new node
and (assuming only inter dc encryption is enabled and replacing node and
the coordinator are in the same dc) it will try to open regular, non encrypted,
connection to it. But the replacing node will not have the coordinator
in it's topology yet (it needs to sync the raft state for that). so it
will reject such connection.
To solve the problem the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens will be assigned to it. At this point a replacing node will
already make sure that its topology state is up-to-date (since it will
execute a raft barrier in join_node_response_params handler) and it knows
coordinator's topology. This aligns replace behaviour with bootstrap
since bootstrap also does not add a node without a ring to the topology.
The patch effectively reverts b8ee8911caFixes: scylladb/scylladb#19025
(cherry picked from commit 17f4a151ce)
Check whether we can stop a generic server without first asking it to
listen.
The test fails currently; the failure mode is a hang, which triggers the 5
minute timeout set in the test:
> unknown location(0): fatal error: in "stop_without_listening":
> seastar::timed_out_error: timedout
> seastar/src/testing/seastar_test.cc(43): last checkpoint
> test/boost/generic_server_test.cc(34): Leaving test case
> "stop_without_listening"; testing time: 300097447us
Backport notes for 6.1:
- Replace
#include "utils/assert.hh"
SCYLLA_ASSERT(false);
with
#include <cassert>
assert(false);
due to 6.1 lacking commit aa1270a00c ("treewide: change assert() to
SCYLLA_ASSERT()", 2024-08-05). The header file "utils/assert.hh"
wouldn't be difficult to backport, but separating it from the treewide
changes in commit aa1270a00c might not be the best idea.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit dbc0ca6354)
Both lists were obviously meant to be sorted originally, but by today
we've introduced many instances of disorder -- thus, inserting a new test
in the proper place leaves the developer scratching their head. Sort both
lists.
Backport notes for 6.1:
- Conflicts in "configure.py", unsurprisingly. For the backport, I sorted
the boost unit test list manually, from scratch.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 931f2f8d73)
If we call server::stop() right after "server" construction, it hangs:
With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,
co_await _all_connections_stopped.get_future();
at the end of server::stop() deadlocks.
Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when
- cserver->start() (sharded<cql_server>::start()) constructs a
"server"-derived object,
- start_listening_on_tcp_sockets() throws an exception before reaching
listen_on_all_shards() (for example because it fails to set up client
encryption -- certificate file is inaccessible etc.),
- the "deferred_action"
cserver->stop().get();
is invoked during cleanup.
(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)
Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.
NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:
- server::shutdown() + server::stop()
- or just server::stop().
Fixes#10305
Backport notes for 6.1:
- Conflict in "generic_server.hh", due to 6.1 not having commit
324b3c43c0 ("generic_server: use async function in
`for_each_gently()`", 2024-08-08), which is part of the feature series
"service levels: update connections parameters automatically"
<https://github.com/scylladb/scylladb/pull/19085>.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 5a04743663)
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.
Batchlog manager cannot be initialized before repair as we have the
dependencies chain:
repair_service -> storage_service::join_cluster -> batchlog_manager.
Throw if batchlog manager isn't initialized. That won't cause repair
to fail.
(cherry picked from commit d8e4393418)
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.
Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.
Fixes#16826
Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.
* Requires backport to live versions. Issue hit in the filed with 2022.2.14
(cherry picked from commit 8b1877f3ca)
(cherry picked from commit 0419b1d522)
(cherry picked from commit b5d0ab092c)
(cherry picked from commit 9729dd21c3)
(cherry picked from commit 8665eef98c)
(cherry picked from commit 5f655e41e3)
Refs #16827Closesscylladb/scylladb#20228
* github.com:scylladb/scylladb:
raft_rebuild: propagate source_dc force option to rebuild_option
repair: do_rebuild_replace_with_repair: use source_dc only when safe
repair: replace_with_repair: pass the replace_node downstream
repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
repair: replace_rebuild_with_repair: pass ks_erms from caller
nodetool: rebuild: add force option
Add and use utils::optional_param to pass source_dc
The `keyspace_compaction` method incorrectly appends the column family
parameter to the URL using a regular string, `"?cf={table}"`, instead of
an f-string, `f"?cf={table}"`. As a result, the column family name is
sent as `{table}` to the server, causing the compaction request to fail.
Fix this issue by passing the parameter to the POST request using a
dictionary instead of appending it to the URL.
Fixes#20264
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit dc5c45e803)
Closesscylladb/scylladb#20273
Attempting to read a partition via `SELECT * FROM MUTATION_FRAGMENTS()`, which the node doesn't own, from a table using tablets causes a crash.
This is because when using tablets, the replica side simply doesn't handle requests for un-owned tokens and this triggers a crash.
We should probably improve how this is handled (an exception is better than a crash), but this is outside the scope of this PR.
This PR fixes this and also adds a reproducer test.
Fixes: https://github.com/scylladb/scylladb/issues/18786
Fixes a regression introduced in 6.0, so needs backport to 6.0 and 6.1
(cherry picked from commit de5329157c)
(cherry picked from commit 46563d719f)
(cherry picked from commit 4e2d7aa2a2)
Refs #20109Closesscylladb/scylladb#20313
* github.com:scylladb/scylladb:
test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
replica/mutation_dump: enfore pinning of effective replication map
replica/mutation_dump: handle un-owned tokens (with tablets)
This series fixes an issue where histogram Summaries return an infinite value.
It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.
The series adds a test for summaries with a specific test case for this scenario.
Fixes#20255
Need backport to 6.0, 6.1 and 2023.1 and above
(cherry picked from commit 011aa91a8c)
(cherry picked from commit 644e6f0121)
Refs #20257Closesscylladb/scylladb#20303
* github.com:scylladb/scylladb:
test/estimated_histogram_test Add summary tests
utils/histogram.hh: Make summary support inifinite bucket.
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757
(cherry picked from commit d385219a12)
(cherry picked from commit 333c0d7c88)
(cherry picked from commit b2abbae24b)
(cherry picked from commit 824bdf99d2)
(cherry picked from commit ea5a0cca10)
(cherry picked from commit 2bbbe2a8bc)
(cherry picked from commit 686a8f2939)
Refs #19758Closesscylladb/scylladb#20297
* github.com:scylladb/scylladb:
abstract_replication_strategy: make get_ranges async
database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
alternator: ttl: can pass const gms::gossiper& to ranges_holder
alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
alternator: ttl: refactor token_ranges_owned_by_this_shard
Commit 9f93dd9fa3 changed `tablet_sstable_set::_sstable_sets` to be a `absl::flat_hash_map` and in addition, `std::set<size_t> _sstable_set_ids` was added. `_sstable_set_ids` is set up in the `tablet_sstable_set(schema_ptr s, const storage_group_manager& sgm, const locator::tablet_map& tmap)` constructor, but it is not copied in `tablet_sstable_set(const tablet_sstable_set& o)`.
This affects the `tablet_sstable_set::tablet_sstable_set` method as it depends on the copy constructor. Since sstable set can be cloned when a new sstable set is added, the issue will cause ids not being copied into the new sstable set. It's healed only after compaction, since the sstable set is rebuilt from scratch there.
This PR fixes this issue by removing the existing copy constructor of `tablet_sstable_set` to enable the implicit default copy constructor.
Fixes#19519
(cherry picked from commit 44583eed9e)
(cherry picked from commit ec47b50859)
Refs #20115Closesscylladb/scylladb#20201
* github.com:scylladb/scylladb:
boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
replica: fix copy constructor of tablet_sstable_set
Currently it doesn't, one of the node crashes with std::out_of_range
exception and meaningless calltrace
[Botond]: this test checks the case of reading a partition via
MUTATION_FRAGMENTS from a node which doesn't own said partition.
refs: #18786
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4e2d7aa2a2)
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.
(cherry picked from commit 46563d719f)
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.
(cherry picked from commit de5329157c)
…utations vector
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls when freed.
For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
(inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
(inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```
This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time, and by that, reducing memory footprint and preventing reactor stalls.
Fixes#18173
(cherry picked from commit 95a5fba0ea)
(cherry picked from commit 52234214e5)
Refs #18174Closesscylladb/scylladb#20246
* github.com:scylladb/scylladb:
schema_tables: calculate_schema_digest: filter the key earlier
schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
Currently, the `force` property of the `source_dc` rebuild option
is lost and `raft_topology_cmd_handler` has no way to know
if it was given or not.
This in turn can cause rebuild to fail, even when `--force`
is set by the user, where it would succeed with gossip
topology changes, based on the source_dc --force semantics.
\Fixes scylladb/scylladb#20242
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
\Closes scylladb/scylladb#20249
(cherry picked from commit 18c45f7502)
Closesscylladb/scylladb#20311
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.
Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.
Analogically for remove_table.
Fixes: #19833.
(cherry picked from commit 483d89ed6d)
Closesscylladb/scylladb#20244
This patch adds tests for summary calculation. It adds two tests, the
first is a basic calculation for P50, P95, P99 by adding 100 elements
into 20 buckets.
The second test look that if elements are found in the infinite bucket,
the result would be the lower limit (33s) and not infinite.
Relates to #20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 644e6f0121)
This patch handles an edge cases related to The infinite bucket
limit.
Summaries are the P50, P95, and P99 quantiles.
The quantiles are calculated from a histogram; we find the bucket and
return its upper limit.
In classic histograms, there is a notion of the infinite bucket;
anything that does not fall into the last bucket is considered to be
infinite;
with quantile, it does not make sense. So instead of reporting infinite
we'll report the bucket lower limit.
Fixes#20255
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 011aa91a8c)
This change fixes#17237, fixes#5361 and fixes#5362 by passing the limit value down the call chain in cql3. A test is also added.
fixes: #17237fixes: #5361fixes: #5362
The regression happened in 5.4 as we changed the way GROUP BY is processed in 432cb02 - to force aggregation when it is used. The LIMIT value was not passed to aggregations and thus we failed to adhere to it.
W want to backport this fix to 5.4 and 6.0 to have continuous correct results for the test case from #17237
This patch consists of 4 commits:
- fa4225ea0fac2057b7a9976f57dc06bcbd900cd4 - cql3: respect the user-defined page size in aggregate queries - a precondition for this patch to be implementable
- 8fbe69e74dca16ed8832d9a90489ca47ba271d0b - cql3/select_statement: simplify the get_limit function - the `do_get_limit()` function did a lot of legwork that should not be associated with it. This change makes it trivial and makes its callers do additional checks (for unset guards, or for an aggregate query)
- 162828194a2b88c22fbee335894ff045dcc943c9 - cql3: process LIMIT for GROUP BY queries - pass the limit value down the chain and make use of it. This is the actual fix to #17237
- b3dc6de6d6cda8f5c09b01463bb52f827a6a00b4 - test/cql-pytest: Add test for GROUP BY queries with LIMIT - tests
(cherry picked from commit 08f3219cb8)
(cherry picked from commit 3838ad64b3)
(cherry picked from commit e7ae7f3662)
(cherry picked from commit 9db272c949)
Refs: #18842Closesscylladb/scylladb#20154
* github.com:scylladb/scylladb:
test/cql-pytest: Add test for GROUP BY queries with LIMIT
cql3: process LIMIT for GROUP BY queries
cql3/select_statement: simplify the get_limit function
cql3: respect the user-defined page size in aggregate queries
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 686a8f2939)
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2bbbe2a8bc)
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.
Prepare for making get_keyspace_local_ranges async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ea5a0cca10)
Add static `make` methods to ranges_holder_{primary,secondary}
and use them to make the ranges objects and pass them
to `token_ranges_owned_by_this_shard`, rather than letting
token_ranges_owned_by_this_shard invoke the right constructor
of the ranges_holder class.
Prepare for making `make` async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 824bdf99d2)
To allow the class to be nothrow_move_constructable.
Prepare for returning it as a future value.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 333c0d7c88)
Rather than holding a variant member (and defining
both ranges_holder_{primary,secondary} in both
specilizations of the class, just make the internal
ranges_holder class first-class citizens
and parameterize the `token_ranges_owned_by_this_shard`
template by this class type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d385219a12)
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.
(cherry picked from commit d729d1b272)
Closesscylladb/scylladb#20156
Currently, each frozen mutation we get from
system_keyspace::query_mutations is unfrozen in whole
to a mutation and only then we check its key with
the provided `accept_keyspace` function.
This is wasteful, since they key can be processed
directly form the frozen mutation, before taking
the toll of unfreezing it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 52234214e5)
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls
when freed.
For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
(inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
(inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```
This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time,
and by that, reducing memory footprint and preventing
reactor stalls.
Fixes#18173
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 95a5fba0ea)
It is unsafe to restrict the sync nodes for repair to
the source data center if we cannot guarantee a quorum
in the data center with network-topology replication strategy.
This change restricts the usage of source_dc in the following cases:
1. For SimpleStrategy - source_dc is ignored since there is no guarantee
that it contains remaining replicas for all tokens.
2. For EverywhereStrategy - use source_dc if there are remaining
live nodes in the datacenter.
3. For NetworkTopologyStrategy:
a. It is considered unsafe to use source_dc if number of nodes
lost in that DC (replaced/rebuilt node + additional ignored nodes)
is greater than 1, or it has 1 lost node and rf <= 1 in the DC.
b. If the source_dc arg is forced, as with the new
`nodetool rebuild --force <source_dc>` option,
we use it anyway, even if it's considered to be unsafe.
A warning is printed in this case.
c. If the source_dc arg is user-provided, (using nodetool rebuild),
an error exception is thrown, advising to use an alternative dc,
if available, omit source_dc to sync with all nodes, or use the
--force option to use the given source_dc anyhow.
d. Otherwise, we look for an alternative source datacenter,
that has not lost any node. If such datacenter is found
we use it as source_dc for the keyspace, and log a warning.
e. If no alternative dc is found (and source_dc is implicit), then:
log a warning and fall back to using replicas from all nodes in the cluster.
Fixes#16826
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5f655e41e3)
To be used by the next path to count how many nodes
are lost in each datacenter.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 8665eef98c)
The callers already pass ignore_nodes as host_id:s
and we translate them into inet_address only for repair
so delay the translation as much as posible,
Refs scylladb/scylladb#6403
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9729dd21c3)
The keyspaces replication maps must be in sync with the
token_metadata_ptr passed already to the functions,
so instead of getting it in the callee, let the caller
get the ks_erms along with retrieving the tmptr.
Note that it's already done on the rebuild path
for streaming based rebuild.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b5d0ab092c)
This commit extracts the information about the default for tables in keyspace creation
to a separate file in the _common folder. The file is then included using
the scylladb_include_flag directive.
The purpose of this commit is to make it possible to include a different file
in the scylla-enterprise repo - with a different default.
Refs https://github.com/scylladb/scylla-enterprise/issues/4585
(cherry picked from commit 107708434c)
Closesscylladb/scylladb#20220
The include flag directive now treats missing content as info logs instead of warnings. This prevents build failures when the enterprise-specific content isn't yet available.
If the enterprise content is undefined, the directive automatically loads the open-source content. This ensures the end user has access to some content.
address comments
(cherry picked from commit 30887d096f)
Closesscylladb/scylladb#20226
To be used to force usage of source_dc, even
when it is unsafe for rebuild.
Update docs and add test/nodetool/test_rebuild.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0419b1d522)
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.
This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 8b1877f3ca)
In order to fix the race between split and repair, we must introduce
the ability to split an "offline" sstable, one that wasn't added
to any of the table's sstable set yet.
It's not safe to split a sstable after adding it to the set, because
a failure to split can result in unsplit data left in the set, causing
split to fail down the road, since the coordinator thinks this replica
has only split data in the set.
Refs #19378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 239344ab55)
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.
Fixes#19519
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 44583eed9e)
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.
(cherry picked from commit 6a7fb18b52)
Closesscylladb/scylladb#20093
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.
Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.
The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.
The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
--------
seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
--------
seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::internal::coroutine_traits_base<void>::promise_type
--------
seastar::(anonymous namespace)::thread_wake_task
--------
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...
From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.
Fixes: #19873.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 5af1f41ecd)
Closesscylladb/scylladb#20107
repair_service::insert_repair_meta gets the reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.
Use find_column_family at each table occurrence in insert_repair_meta
instead.
Fixes: #20057
(cherry picked from commit 719999b34c)
Refs #19953Closesscylladb/scylladb#20076
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required to execute this req,
2. global topo req is executed by topo coordinator, which reads data attached to the req.
The KS name is among the data attached to the req. There's a time window between these steps where a to-be-altered KS could have been DROPped, which results in topo coordinator forever trying to ALTER a non-existing KS. In order to avoid it, the code has been changed to first check if a to-be-altered KS exists, and if it's not the case, it doesn't perform any schema/tablets mutations, but just removes the global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected changes, which is due to the fact that the code is written badly and needs to be refactored - an effort that's already planned under #19126
(I suggest to disable displaying whitespace differences when reviewing this PR).
Fixes: #19576
Requires 6.0 backport
(cherry picked from commit 5b089d8e10)
(cherry picked from commit 0ea2128140)
(cherry picked from commit ddb5204929)
Refs #19666Closesscylladb/scylladb#20143
* github.com:scylladb/scylladb:
tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
cql: refactor rf_change indentation
Prevent ALTERing non-existing KS with tablets
If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.
Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.
Fixesscylladb/scylla-enterprise#4380
(cherry picked from commit eeaf4c3443)
Closesscylladb/scylladb#20145
Remove xfail from all tests for #5361, as the issue is fixed.
Remove xfail from test_group_by_clustering_prefix_with_limit
It references #5362, but is fixed by #17237.
Refs #17237
(cherry picked from commit 9db272c949)
Currently LIMIT not passed to the query executor at all and it was just
an accident that it worked for the case referenced in #17237. This
change passes the limit value down the chain.
(cherry picked from commit e7ae7f3662)
The get_limit() function performed tasks outside of its scope - for
example checked if the statement was an aggregate. This change moves the
onus of the check to the caller.
(cherry picked from commit 3838ad64b3)
The comment in the code already states that we should use the
user-defined page size if it's provided. To avoid OOM conditions we'll
use the internally defined limit as the upper bound or if no page size
is provided.
This change lays ground work for fixing #5362 and is necessary to pass
the test introduced in #19392 once it is implemented.
(cherry picked from commit 08f3219cb8)
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node
sleeps after the rf_change event has been scheduled, but before it is
started to be executed. During that time the second node executes a DROP
KS statement, which is propagated to the leader node. Once leader node
wakes up and resumes processing of ALTER tablets KS, the KS won't exist
and the node cannot crash, which was the case before.
(cherry picked from commit ddb5204929)
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required
to execute this req,
2. global topo req is executed by topo coordinator, which reads data
attached to the req.
The KS name is among the data attached to the req.
There's a time window between these steps where a to-be-altered KS could
have been DROPped, which results in topo coordinator forever trying to
ALTER a non-existing KS. In order to avoid it, the code has been changed
to first check if a to-be-altered KS exists, and if it's not the case,
it doesn't perform any schema/tablets mutations, but just removes the
global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected
changes, which is due to the fact that the code is written badly and
needs to be refactored - an effort that's already planned under #19126Fixes: #19576
(cherry picked from commit 5b089d8e10)
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.
Turn exceptions from aborts into WARN level.
Also improve logging by printing the command that failed.
Fixesscylladb/scylladb#19754
(cherry picked from commit 7506709573)
Closesscylladb/scylladb#20071
If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.
This also violates invariants about replica placement and this state
cannot be fixed by topology operations.
One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:
load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at: ...
Fixes#20032
(cherry picked from commit f5c74a5df2)
Closesscylladb/scylladb#20066
Add more logging for raft-based topology operations in INFO and DEBUG
levels.
Improve the existing logging, adding more details.
Fix a FIXME in test_coordinator_queue_management (by readding a log
message that was removed in the past -- probably by accident -- and
properly awaiting for it to appear in test).
Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.
These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).
Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945
(cherry picked from commit e8d5974961)
Closesscylladb/scylladb#20048
This commit extracts the information about the configuration the user should do
right after installation (especially running scylla_setup) to a separate file.
The file is included in the relevant pages, i.e., installing with packages
and installing with Web Installer.
In addition, the examples on the Web Installer page are updated
with supported versions of ScyllaDB.
Fixes https://github.com/scylladb/scylladb/issues/19908
(cherry picked from commit 849856b964)
Closesscylladb/scylladb#20050
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.
As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.
In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.
Fixes: scylladb/scylladb#19439
This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.
(cherry picked from commit 6af7882c59)
(cherry picked from commit 5ec8c06561)
Refs #19765Closesscylladb/scylladb#19895
* github.com:scylladb/scylladb:
test: regression test for MV crash with tablets during decommission
db/view: drop view updates to replaced node marked as left
This commit adds OS support for version 6.1 and removes OS support for 5.4
(according to our support policy for versions).
(cherry picked from commit eca2dfd8c3)
Closesscylladb/scylladb#20019
In commit bac7c33313 we introduced a new
test for the Alternator "/localnodes" request, checking that a node
that is still joining does not get returned. The tests used what I
thought were "very high" timeouts - we had a timeout of 10 seconds
for starting a single node, and injected a 20 second sleep to leave
us 10 seconds after the first sleep.
But the test failed in one extremely slow run (a debug build on
aarch64), where starting just a single node took more than 15 seconds!
So in this patch I increase the timeouts significantly: We increase
the wait for the node to 60 seconds, and the sleeping injection to
120 seconds. These should definitely be enough for anyone (famous
last words...).
The test doesn't actually wait for these timeouts, so the ridiculously
high timeouts shouldn't affect the normal runtime of this test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit ca8b91f641)
Closesscylladb/scylladb#19940
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.
We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".
This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.
Fixes#19911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit d293a5787f)
Closesscylladb/scylladb#19938
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
single tablet replica
In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.
Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes. This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.
Refs scylladb/scylladb#18862
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 26abad23d9)
Closesscylladb/scylladb#19919
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.
Fixes#19879
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 27b305b9d1)
Closesscylladb/scylladb#19897
scylla_raid_setup may fail on Ubuntu minimal image since it calls
update-initramfs without installing.
(cherry picked from commit 02b20089cb)
Closesscylladb/scylladb#19869
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.
Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.
Fixes: #19351.
(cherry picked from commit dfe3af40ed)
Closesscylladb/scylladb#19839
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.
Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.
This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.
Fixesscylladb/scylladb#19772
Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.
(cherry picked from commit 781eb7517c)
Closesscylladb/scylladb#19814
This PR adds the 6.0-to-6.1 upgrade guide (including metrics) and removes the 5.4-to-6.0 upgrade guide.
Compared 5.4-to-6.0, the the 6.0-to-6.1 guide:
- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
so now there's no scenario that would require the user to follow the validation procedure.
- Removed the references to the Enable Consistent Topology Updates page (which was in version 6.0 and is removed with this PR) across the docs.
See the individual commits for more details.
Fixes https://github.com/scylladb/scylladb/issues/19853
Fixes https://github.com/scylladb/scylladb/issues/19933
This PR must be backported to branch-6.1 as it is critical in version 6.1.
(cherry picked from commit 9972e50134)
(cherry picked from commit 32fa5aa938)
Refs #19983Closesscylladb/scylladb#20038
* github.com:scylladb/scylladb:
doc: remove the 5.4-to-6.0 upgrade guide
doc: add the 6.0-to-6.1 upgrade guide
This commit removes the 5.4-to-6.0 upgrade guide and all references to it.
It mainly removes references to the Enable Consistent Topology Updates page,
which was added as enabling the feature was optional.
In rare cases, when a reference to that page is necessary,
the internal link is replaced with an external link to version 6.0.
Especially the Handling Cluster Membership Change Failures page was modified
for troubleshooting purposes rather than removed.
(cherry picked from commit 32fa5aa938)
This commit adds the 6.0-to-6.1 upgrade guide.
Compared to the previous upgrade guide:
- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
so now there's no scenario that would require the user to follow the validation procedure.
(cherry picked from commit 9972e50134)
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.
Fixesscylladb/scylla-enterprise#4475Closesscylladb/scylladb#20018
Some of the calls inside the `raft_group0_client::start_operation()` method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.
This was in particular the case of operations that try to take the ownership of the raft group semaphore (`get_units(semaphore)`) - these waits should be cancelled when the abort source is triggered.
This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
Fixes scylladb/scylladb#19223
(cherry picked from commit 2dbe9ef2f2)
(cherry picked from commit 5dfc50d354)
Refs #19860Closesscylladb/scylladb#19970
* github.com:scylladb/scylladb:
raft: fix the shutdown phase being stuck
raft: use the abort source reference in raft group0 client interface
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.
This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.
This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
Fixesscylladb/scylladb#19223
(cherry picked from commit 5dfc50d354)
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.
(cherry picked from commit 2dbe9ef2f2)
The Raft-topology upgrade procedure must not be run concurrently with
version upgrade.
(cherry picked from commit bb0c3cdc65)
Closesscylladb/scylladb#19836
When a node that is permanently down is replaced, it is marked as "left"
but it still can be a replica of some tablets. We also don't keep IPs of
nodes that have left and the `node` structure for such node returns an
empty IP (all zeros) as the address.
This interacts badly with the view update logic. The base replica paired
with the left node might decide to generate a view update. Because
storage proxy still uses IPs and not host IDs, it needs to obtain the
view replica's IP and tell the storage proxy to write a view update to
that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to
write a hint towards this address - hinted handoff on the other hand
operates on host IDs and not IPs, so it attempts to translate the IP
back, which triggers an assertion as there is no replica with IP
0.0.0.0.
As a quick workaround for this issue just drop view updates towards
nodes which seem to have IPs that are all zeros. It would be more proper
to keep the view updates as hints and replay them later to the new
paired replica, but achieving this right now would require much more
significant changes. For now, fixing a crash is more important than
keeping views consistent with base replicas.
Fixes: scylladb/scylladb#19439
(cherry picked from commit 6af7882c59)
Alternator allows authentication into the existing CQL roles, but
roles which have the flag "login=false" should be refused in
authentication, and this patch adds the missing check.
The patch also adds a regression test for this feature in the
test/alternator test framework, in a new test file
test/alternator/cql_rbac.py. This test file will later include more
tests of how the CQL RBAC commands (CREATE ROLE, GRANT, REVOKE)
affect authentication and authorization in Alternator.
In particular, these tests need to use not just the DynamoDB API but
also CQL, so this new test file includes the "cql" fixture that allows
us to run CQL commands, to create roles, to retrieve their secret keys,
and so on.
Fixes#19735
(cherry picked from commit 14cd7b5095)
Closesscylladb/scylladb#19863
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.
The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.
The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().
But the hard part of this test is the testing:
1. An existing multi-node test for "/localnodes" assummed that right after
a new node was created, it appears on "/localnodes". But after this
patch, it may take a bit more time for the bootstrapping to complete
and the new node to appear in /localnodes - so I had to add a retry loop.
2. I added a test that reproduces the bug fixed here, and verifies its
fix. The test is in the multi-node topology framework. It adds an
injection which delays the bootstrap, which leaves a new node in JOINING
state for a long time. The test then verifies that the new node is
alive (as checked by the REST API), but is not returned by "/localnodes".
3. The new injection for delaying the bootstrap is unfortunately not
very pretty - I had to do it in three places because we have several
code paths of how bootstrap works without repair, with repair, without
Raft and with Raft - and I wanted to delay all of them.
Fixes#19694.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0d1aa399f9)
Closesscylladb/scylladb#19855
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722
Closes scylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
(cherry picked from commit 591876b44e)
Backported from #19717 to 6.1
Closesscylladb/scylladb#19828
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.
Should be backported to all supported versions.
(cherry picked from commit 87beebeed0)
(cherry picked from commit 4178589826)
Refs #19699Closesscylladb/scylladb#19819
* github.com:scylladb/scylladb:
test: add test to check that coordinator lwt semaphore continues functioning after locking failures
paxos: do not signal semaphore if it was not acquired
Use the rolling restart to avoid spurious driver reconnects.
This can be eventually reverted once the scylladb/python-driver#295 is fixed.
Fixes scylladb/scylladb#19154
(cherry picked from commit ef3393bd36)
(cherry picked from commit a89facbc74)
Refs #19771Closesscylladb/scylladb#19820
* github.com:scylladb/scylladb:
test: raft: fix the flaky `test_raft_recovery_stuck`
test: raft: code cleanup in `test_raft_recovery_stuck`
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.
Fixes: https://github.com/scylladb/scylladb/issues/19698
(cherry picked from commit 87beebeed0)
Fixes#19753
SSTable file open provides an `io_error_handler` instance which is applied to a file-wrapper to process any IO errors happing during read/write via the handler in `storage_service`, which in turn will effectively disable the node. However, this is not applied to the actual open operation itself, i.e. any exception generated by the file open call itself will instead just escape to caller.
This PR adds filtering via the `error_handler` to sstable open + makes `storage_service` "isolate" mechanism non-module-static (thus making it testable) and adds tests to check we exhibit the same behaviour in both cases.
The main motivation for this issue it discussions that secondary level IO issues (i.e. caused by extensions) should trigger the same behaviour as, for example, running out of disk space.
Closesscylladb/scylladb#19766
* github.com:scylladb/scylladb:
memtable_test: Add test for isolate behaviour on exceptions during flush
cql_test_env: Expose storage service
storage_service: Make isolate guard non-static and add test accessor
sstable: apply error_handler on open exceptions
cql-pytest's config_value_context is used to run a code sequence with
different ScyllaDB configuration applied for a while. When it reads
the original value (in order to restore it later), it applies
ast.literal_eval() to it. This is strange, since the config variable isn't
a Python literal.
It was added in 8c464b2ddb ("guardrails: restrict replication
strategy (RS)"). Presumably, as a workaround for #19604 - it sufficiently
massaged the input we read via SELECT to be acceptable later via UPDATE.
Now that #19604 is fixed, we can remove the call to ast.literal_eval,
but have to fix up the parameters to config_value_context to something
that will be accepted without further massaging.
This is a step towards fixing #15559, where we want to run some tests
with a boolean configuration variable changed, and literal_eval is
transforming the string representation of integers to integers and
confusing the driver.
Closesscylladb/scylladb#19696
`GitHub Actions / Analyze #includes in source files` keeps reporting
that the include shouldn't be present in the file. The reason is
that we use FMT with version >10, so the fragment of the code that
uses the include is not compiled. We move the include to a place
where it's used, which should fix the warnings.
Closesscylladb/scylladb#19776
Makes storage service isolate repeatable in same process and more testable.
Note, since the test var now is shard-local we need to check twice: once
on error, once on reaching shard zero for actual shutdown.
The python driver might currently trigger spurios reconnects that cause
the `NoHostAvailable` to be thrown, which is not expected.
This patch adds a retry mechanism to the test to make skip this failure
if it occurs, as a work-around.
The proper fix is expected to be done in the scylladb/python-driver#295,
once fixed there this work-around can be reverted.
Fixes: scylladb/scylla#18547Closesscylladb/scylladb#19759
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
This series fixes the known mixups between the two — when setting up compression,
and when setting up the bloom filters.
Fixes#16065
The bug is present in all supported versions, so the patch has to be backported to all of them.
Closesscylladb/scylladb#19695
* github.com:scylladb/scylladb:
sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
```
DEBUG 2024-07-03 00:59:58,291 [shard 0:main] compaction_manager - Compaction task 0x51800002a480 for table tests.3 compaction_group=0 [0x503000062050]: switch_state: none -> pending: pending=2 active=0 done=0 errors=0
DEBUG 2024-07-03 01:00:02,868 [shard 0:main] compaction - Checking droppable sstables in tests.3, candidates=0
DEBUG 2024-07-03 01:00:02,868 [shard 0:main] compaction - time_window_compaction_strategy::newest_bucket:
now 1720314000000000
buckets = {
key=1720314000000000, size=2
key=1720310400000000, size=2
1720314000000000: GMT: Sunday, July 7, 2024 1:00:00 AM
1720310400000000: GMT: Sunday, July 7, 2024 12:00:00 AM
```
the test failed to complete when ran across different clock hours, as it
expected all sstables produced to belong to same window of 1h size.
let's fix it by reusing timestamps, so it's always consistent.
Fixes#13280.
Fixes#18564.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19749
The leader_host API handler was eventually using the `req` unique_ptr
after it has been already destroyed (passed down to the future lambda
by reference). This was causing an occassional crash in some tests.
Reworked the leader_host handler to use the req only outside of the
future lambda.
Also updated the code to handle the possibility that the non-default
leader group (other than Group 0) might reside on a different shard
than the shard 0 - using the same concept of calling on all shards via
`invoke_on_all()` as done for the other requests.
Fixesscylladb/scylladb#19714Closesscylladb/scylladb#19715
utils::in uses std::aligned_storage, which is deprecated. Rather than fixing it, replace its only
user with simpler code and remove it.
No backport needed as this isn't fixing a bug.
Closesscylladb/scylladb#19683
* github.com:scylladb/scylladb:
utils: remove utils/in.hh
gossiper: remove initializer-list overload of add_local_application_state()
Since Python 3.12, version parsing becomes strict, parse_version() does
not accept the version string like '6.1.0~dev'.
To fix this, we need to pass acceptable version string to parse_version() like
'6.1.0.dev0', which is allowed on Python version scheme.
Also, release canditate version like '6.0.0~rc3' has same issue, it
should be replaced to '6.0.0rc3' to compare in parse_version().
reference: https://packaging.python.org/en/latest/specifications/version-specifiers/Fixes#19564Closesscylladb/scylladb#19572
This reverts commit 65fbf72ed0, since
it breaks scylla-housekeeping and SCT because the patch modified
version string.
We shoudn't modify version string directly, need to pass
modified string just for parse_version() instead.
Setting the error condition for all nodes in the cluster to avoid
having to check which one is the coordinator. This should make the test
more stable and avoid the flakiness observed when the coordinator node
is the one that got the error condition injected.
Randomizing the retrieved running servers to reproduce the issue more
frequently and to avoid making any assumptions about the order of the
servers.
Note that only the "raft_topology_barrier_fail" needs to run
on a non-coordinator node, the other error "stream_ranges_fail" can be
injected on any node (including the coordinator).
Fixes: scylladb/scylladb#18614Closesscylladb/scylladb#19663
This patch is a follow-up to scylladb/scylladb#16585.
Once we have service levels on raft, we can get rid of update loop, which updates the configuration in a configured interval (default is 10s).
Instead, this PR introduces methods to `group0_state_machine` which look through table ids in mutations in `write_mutation` and update submodules based on that ids.
Fixes: scylladb/scylladb#18060Closesscylladb/scylladb#18758
* github.com:scylladb/scylladb:
test: remove `sleep()`s which were required to reload service levels configuration
test/cql_test_env: remove unit test service levels data accessors
service/storage_service: reload SL cache on topology_state_load()
service/qos/service_level_controller: move semaphore breaking to stop
service/qos/service_level_controller: maybe start and stop legacy update loop
service/qos/service_level_controller: make update loop legacy
raft/group0_state_machine: update submodules based on table_id
service/storage_service: add a proxy method to reload sl cache
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
based on the sstable's schema the sstable's schema.
This patch forces the writer to use the sstable's schema for both.
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.
This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.
As of this patch, those static_casts are actually invalid in some cases
(they cast to the wrong type) because of an oversight.
A later patch will fix that. But to even write a reliable reproducer
for the problem, we must force the invalid casts to manifest as a crash
(instead of weird results).
This patch both allows writing a reproducer for the bug and serves
as a bit of defensive programming for the future.
If the index was created on collection (both frozen or not), its description wasn't a correct create statement.
This patch fixes the bug and includes functions like `full()`, `keys()`, `values()`, ... used to create index on collections.
Fixesscylladb/scylladb#19278Closesscylladb/scylladb#19381
* github.com:scylladb/scylladb:
cql-pytest/test_describe: add a test for describe indexes
schema/schema: fix column names in index description
instead of hardwiring the toolchain image in github workflows, read it
from `tools/toolchain/image`. a dedicated reusable workflow is added to
read from this file, and expose its content with an output parameter.
also, switch iwyu.yaml workflow to this image, more maintainable this
way. please note, before this change, we are also using the latest
stable build of clang, and since fedora 40 is also using the clang 18,
so the behavior is not change. but with this change, we don't have
the flexibility of using other clang versions provided
https://apt.llvm.org in future.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19655
this changeset includes two changes:
- service: move storage_service::~storage_service() into .cc
- transport: move the cql_server::~cql_server() into .cc
they intends to address the compile failures when building scylladb with clang-19. clang-19 is more picky when generating the defaulted destructors with incomplete types. but its behavior makes sense regarding to standard compliance. so let's update accordingly.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#19668
* github.com:scylladb/scylladb:
transport: move the cql_server::~cql_server() into .cc
service: move storage_service::~storage_service() into .cc
`sstables::write()` has multiple overloads, which are defined in
`sstables/writer.hh`. two of these overloads are template functions,
which have a template parameter named `W`, which has a type constraint
requiring it to fulfill the `Writer` concept. but in `types.hh`, when
the compiler tries to instantiate the template function with signature
of `write(sstable_version_types v, W& out, const T& t)` with
`file_writer` as the template parameter of `w`, `file_writer` is only
forward-declared using `class file_writer` in the same header file, so
this type is still an incomplete type at that moment. that's why the
compiler is not able to determine if `file_writer` fulfills the
constraint or not. actually, the declaration of `file_writer` is located
in `sstables/writer.hh`, which in turn includes `types.hh`. so they
form a cyclic dependency.
in this change, in order to break this cycle, we extract file_writer out
into a separate header file, so that both `sstables/writer.hh` and
`sstables/types.hh` can include it. this address the build failure.
Fixesscylladb/scylladb#19667
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19669
Adds a convenience function for inspecting the coroutine frame of a given
seastar task.
Short example of extracting a coroutine argument:
```
(gdb) p *$coro_frame(seastar::local_engine->_current_task)
$1 = {
__resume_fn = 0x2485f80 <sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&)>,
...
PointerType_7 = 0x601008e67880,
...
__coro_index = 0 '\000'
...
(gdb) p $downcast_vptr($->PointerType_7)
$2 = (schema *) 0x601008e67880
```
Closesscylladb/scylladb#19479
The default configuration for replication_strategy_warn_list is
["SimpleStrategy"], but one cannot set this via CQL:
cqlsh> select * from system.config where name = 'replication_strategy_warn_list';
name | source | type | value
--------------------------------+---------+---------------------------+--------------------
replication_strategy_warn_list | default | replication strategy list | ["SimpleStrategy"]
(1 rows)
cqlsh> update system.config set value = '[NetworkTopologyStrategy]' where name = 'replication_strategy_warn_list';
cqlsh> select * from system.config where name = 'replication_strategy_warn_list';
name | source | type | value
--------------------------------+--------+---------------------------+-----------------------------
replication_strategy_warn_list | cql | replication strategy list | ["NetworkTopologyStrategy"]
(1 rows)
cqlsh> update system.config set value = '["NetworkTopologyStrategy"]' where name = 'replication_strategy_warn_list';
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for system.config - received 0 responses and 1 failures from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 1}
Fix by allowing quotes in enum_set parsing.
Bug present since 8c464b2ddb ("guardrails: restrict
replication strategy (RS)", 6.0).
Fixes#19604.
Closesscylladb/scylladb#19605
we got a failure during check-commit action:
```
Run python .github/scripts/label_promoted_commits.py --commit_before_merge 30e82a81e8 --commit_after_merge f31d5e3204 --repository scylladb/scylladb --ref refs/heads/master
Commit sha is: d5a149fc01
Commit sha is: 415457be2b
Commit sha is: d3b1ccd03a
Commit sha is: 1fca341514
Commit sha is: f784be6a7e
Commit sha is: 80986c17c3
Commit sha is: 492d0a5c86
Commit sha is: 7b3f55a65f
Commit sha is: 78d6471ce4
Commit sha is: 7a69d9070f
Commit sha is: a9e985fcc9
master branch, pr number is: 19213
Traceback (most recent call last):
File "/home/runner/work/scylladb/scylladb/.github/scripts/label_promoted_commits.py", line 87, in <module>
main()
File "/home/runner/work/scylladb/scylladb/.github/scripts/label_promoted_commits.py", line 81, in main
pr = repo.get_pull(pr_number)
File "/usr/lib/python3/dist-packages/github/Repository.py", line 2746, in get_pull
headers, data = self._requester.requestJsonAndCheck(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
return self.__check(
File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/pulls/pulls#get-a-pull-request", "status": "404"}
Error: Process completed with exit code 1.
```
The reason for this failure is since in one of the promoted commits
(a9e985fcc9) had a reference of `Closes`
to an issue.
Fixes: https://github.com/scylladb/scylladb/issues/19677Closesscylladb/scylladb#19678
Tablets are no longer in experimental_features since 83d491a, so remove them from the experimental_features section documentation.
Also, expand the documentation for the `enable_tablets` option.
Fixes#19456
Needs backport to 6.0
Closesscylladb/scylladb#19516
* github.com:scylladb/scylladb:
conf: scylla.yaml: enable_tablets: expand documentation
conf: scylla.yaml: remove tablets from experimental_features doc comment
The initializer_list overload uses a too-clever technique to avoid copies.
While copies here are unlikely to pose any real problem (we're allocating
map nodes anyway), it's simple enough to provide a copy-less replacement
that doesn't require questionable tricks.
We replace the initializer_list<..., in<>> overload with a variadic
template that constructs a temporary map.
Previously, some service levels tests requires to sleep in order to
ensure in-memory configuration of service levels was updated.
Now, when we are updating the configuration as the raft log is applied,
doing read barrier (for instance to execute `DROP TABLE IF EXISTS
non_existing_table`) is enough and the sleeps are not needed.
Unit test data accessors were created to avoid starting update loop in
unit test and to update controller's configuration directly.
With raft data accessor and configuration updates on applying raft log,
we can get rid of unit test data accessors and use the raft one.
This also make unit test env a bit like real Scylla environment.
Since SL cache is no longer updated in a loop, it needs to be
initialized on startup and because we are updating the cache while
applying raft commands, we can initialize it on topology_state_load().
Before this, the notification semaphore was broken() in do_abort(),
which was triggered by early abort source.
However we are going to reload sl cache on topology state reload
and it can happen after the early abort source is triggered, so
it may throw broken_semaphore exception.
We can move semaphore breaking to stop() method. Legacy update loop
is still stopped in do_abort(), so it doesn't change the order of
service level controller shutdown.
loop
In previous commit, we marked the update loop as legacy.
For compatibility reasons, we need to start legacy update loop
when the cluster is in recovery mode or it hasn't been upgraded to raft topology.
Then, in the update loop we check if all conditions are met and stop the
loop.
This commit also moves start of update loop later (after topology state is loaded) in main.cc.
There is no risk in doing it later.
Rename method which started update loop to better reflect
what it does.
Previously the method was named `update_from_distributed_data`,
however it doesn't update anything but only start the update loop,
which we are making legacy.
We want to update service levels cache when any new mutations are
applied to service levels table.
To not create new raft command type, this commit changes design of
`write_mutations` to updated in-memory structures based on
mutations' table_id.
In this series of patches, we want to reload service levels cache
when any changes to SL table are applied.
Firstly we need to have a way to trigger reload of the cache from
`group0_state_machines`.
To not introduce another dependency, we can use `storage_service` (which
has access to SL controller) and add a proxy method to it.
Counter updates break under tablet migration (#18180), and for this reason counters need to be disabled until the problem is fixed. It's enough to forbid creating a table with counters, as altering a table without counters already cannot result in the table having counters:
1) Adding a counter column to a table without counters:
```
cqlsh> ALTER TABLE temp.cf ADD (col_name counter);
ConfigurationException: Cannot add a counter column (col_name) in a non counter column family
```
2) Altering a column to be of the counter type:
```
cqlsh> ALTER TABLE temp.cf ALTER col_name TYPE counter;
ConfigurationException: Cannot change col_name from type int to type counter: types are incompatible.
```
Fixes: #19449
Fixes: https://github.com/scylladb/scylladb/issues/18876
Need to backport to 6.0, as this is broken there.
Closesscylladb/scylladb#19518
* github.com:scylladb/scylladb:
doc: add notes to feature pages which don't support tablets
cql: adjust warning about tablets
cql: forbid having counter columns in tablets tables
because transport/server.cc has the complete definition of event_notifier, the
compiler can default-generate the destructor of `cql_server` with the necessary
information. otherwise, clang-19 would fail to build, like:
```
FAILED: CMakeFiles/scylla.dir/Dev/main.cc.o
/home/kefu/.local/bin/clang++ -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DDEVEL -DFMT_SHARED -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Dev\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/build/rust -isystem /home/kefu/dev/scylladb/abseil -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -MD -MT CMakeFiles/scylla.dir/Dev/main.cc.o -MF CMakeFiles/scylla.dir/Dev/main.cc.o.d -o CMakeFiles/scylla.dir/Dev/main.cc.o -c /home/kefu/dev/scylladb/main.cc
In file included from /home/kefu/dev/scylladb/main.cc:11:
In file included from /usr/include/yaml-cpp/yaml.h:10:
In file included from /usr/include/yaml-cpp/parser.h:11:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/memory:78:
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'cql_transport::cql_server::event_notifier'
91 | static_assert(sizeof(_Tp)>0,
| ^~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:398:4: note: in instantiation of member function 'std::default_delete<cql_transport::cql_server::event_notifier>::operator()' requested here
398 | get_deleter()(std::move(__ptr));
| ^
/home/kefu/dev/scylladb/transport/server.hh:135:7: note: in instantiation of member function 'std::unique_ptr<cql_transport::cql_server::event_notifier>::~unique_ptr' requested here
135 | class cql_server : public seastar::peering_sharded_service<cql_server>, public generic_server::server {
| ^
/home/kefu/dev/scylladb/transport/server.hh:135:7: note: in implicit destructor for 'cql_transport::cql_server' first required here
/home/kefu/dev/scylladb/transport/server.hh:149:11: note: forward declaration of 'cql_transport::cql_server::event_notifier'
149 | class event_notifier;
| ^
1 error generated.
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
as repair/repair.cc has the complete definition of node_ops_meta_data, the
compiler can default-generate the destructor of `storage_service` with the necessary
information. otherwise, clang-19 would fail to build, like:
```
FAILED: repair/CMakeFiles/repair.dir/Dev/repair.cc.o
/home/kefu/.local/bin/clang++ -DDEVEL -DFMT_SHARED -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Dev\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -MD -MT repair/CMakeFiles/repair.dir/Dev/repair.cc.o -MF repair/CMakeFiles/repair.dir/Dev/repair.cc.o.d -o repair/CMakeFiles/repair.dir/Dev/repair.cc.o -c /home/kefu/dev/scylladb/repair/repair.cc
In file included from /home/kefu/dev/scylladb/repair/repair.cc:9:
In file included from /home/kefu/dev/scylladb/repair/repair.hh:11:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/unordered_map:41:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unordered_map.h:33:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable.h:35:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:34:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/tuple:38:
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_pair.h:291:11: error: field has incomplete type 'service::node_ops_meta_data'
291 | _T2 second; ///< The second member
| ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/ext/aligned_buffer.h:93:28: note: in instantiation of template class 'std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>' requested here
93 | : std::aligned_storage<sizeof(_Tp), __alignof__(_Tp)>
| ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:334:43: note: in instantiation of template class '__gnu_cxx::__aligned_buffer<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>>' requested here
334 | __gnu_cxx::__aligned_buffer<_Value> _M_storage;
| ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:373:7: note: in instantiation of template class 'std::__detail::_Hash_node_value_base<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>>' requested here
373 | : _Hash_node_value_base<_Value>
| ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable.h:1662:21: note: in instantiation of template class 'std::__detail::_Hash_node_value<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>, false>' requested here
1662 | ._M_bucket_index(declval<const __node_value_type&>(),
| ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unordered_map.h:109:11: note: in instantiation of member function 'std::_Hashtable<utils::tagged_uuid<node_ops_id_tag>, std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>, std::allocator<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>>, std::__detail::_Select1st, std::equal_to<utils::tagged_uuid<node_ops_id_tag>>, std::hash<utils::tagged_uuid<node_ops_id_tag>>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>>::~_Hashtable' requested here
109 | class unordered_map
| ^
/home/kefu/dev/scylladb/service/storage_service.hh:109:7: note: forward declaration of 'service::node_ops_meta_data'
109 | class node_ops_meta_data;
| ^
In file included from /home/kefu/dev/scylladb/repair/repair.cc:9:
In file included from /home/kefu/dev/scylladb/repair/repair.hh:11:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/unordered_map:41:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unordered_map.h:33:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable.h:35:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:34:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/tuple:38:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_pair.h:60:
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
rwlock was added to protect iterations against concurrent updates to the map.
the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).
the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).
to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.
Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.
Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.
Fixes#18821.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.
Fixes#18506.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There's already a page which lists which features are not working with
tablets: architecture/tablets.html#limitations-and-unsupported-features,
but it's also helpful for users to be warned about this when visiting a
specific feature doc page.
We currently disable tombstone GC for compaction done on the read path of streaming and repair, because those expired tombstones can still prevent data resurrection. With time-based tombstone GC, missing a repair for long enough can cause data resurrection because a tombstone is potentially GC'd before it could be spread to every node by repair. So repair disseminating these expired tombstones helps clusters which missed repair for long enough. It is not a guarantee because compaction could have done the GC itself, but it is better than nothing.
This last resort is getting less important with repair-based tombstone GC. Furthermore, we have seen this cause huge repair amplification in a cluster, where expired tombstones triggered repair replicating otherwise identical rows.
This series makes tombstone GC on the streaming/repair compaction path configurable with a config item. This new config item defaults to `false` (current behaviour), setting it to `true`, will enable tombstone GC.
Fixes: https://github.com/scylladb/scylladb/issues/19015
Not a regression, no backport needed
Closesscylladb/scylladb#19016
* github.com:scylladb/scylladb:
test/topology_custom/test_repair: add test for enable_tombstone_gc_for_streaming_and_repair
replica/table: maybe_compact_for_streaming(): toggle tombstone GC based on the control flag
replica: propagate enable_tombstone_gc_for_streaming_and_repair to maybe_compact_for_streaming()
db/config: introduce enable_tombstone_gc_for_streaming_and_repair
Counter updates break under tablet migration (#18180), and for this
reason they need to be disabled until the problem is fixed.
It's enough to forbid creating a table with counters, as altering a
table without counters already cannot result in the table having
counters:
1) Adding a counter column to a table without counters:
```
cqlsh> ALTER TABLE temp.cf ADD (col_name counter);
ConfigurationException: Cannot add a counter column (col_name) in a non counter column family
```
2) Altering a column to be of the counter type:
```
cqlsh> ALTER TABLE temp.cf ALTER col_name TYPE counter;
ConfigurationException: Cannot change col_name from type int to type counter: types are incompatible.
```
Fixes: #19449
The following command had been executed to get the
list of headers that did not contain '#pragma once':
'grep -rnw . -e "#pragma once" --include *.hh -L'
This change adds missing include guard to headers
that did not contain any guard.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#19626
Previously optimized clang installation was not used standard build
script, it overwrites preinstalled Fedora's clang binaries instead.
However this breaks on clang-18.1.8, since libLTO versioning convention.
To avoid such problem, let's switch to standard installation method and
swith install prefix to /usr/local.
Fixes#19203Closesscylladb/scylladb#19505
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.
This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.
Fixes https://github.com/scylladb/scylladb/issues/19552Closesscylladb/scylladb#19617
* github.com:scylladb/scylladb:
mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
logalloc: add hold_reserve
logalloc: generalize refill_emergency_reserve()
When writing a mutation, it might happen that there are no live targets
to send the mutation to, yet the request can be satisfied. For example,
when writing with CL=ANY to a dead node, the request is completed by
storing a local hint.
Currently, in that case, a write response handler is created for the
request and it remains active until it timeouts because it is not
removed anywhere, even though the write is completed successfuly after
storing the hint. The response handler should be removed usually when
receiving responses from all targets, but in this case there are no
targets to trigger the removal.
In this commit we check if we don't have live targets to send the
mutation to. If so, we remove the response handler immediately.
Fixesscylladb/scylladb#19529Closesscylladb/scylladb#19586
Introduce REST API for triggering a read barrier.
This is to make sure the database schema is up to date on the node where
the read barrier is triggered. One of the use cases is the database
backup via the Scylla Manager, which requires that the schema backed up
is matching the data or newer (data can be migrated, but an older schema
would cause issues).
Fixesscylladb/scylladb#19213Closesscylladb/scylladb#19597
* github.com:scylladb/scylladb:
raft: add the read barrier REST API
raft: use `raft_timeout` in trigger_snapshot
raft: use bad_param_exception for consistency
test: raft: verify schema updated after read barrier
for better debugging experience.
before this change, we have
```
fatal error: in "sstable_directory_test_generation_sanity": critical check sst->generation() == sst1->generation() has failed
```
after this change, we have
```
fatal error: in "sstable_directory_test_generation_sanity": critical
check sst->generation() == sst1->generation() has failed [3ghm_0ntw_29vj625yegw7jodysc != 3ghm_0ntw_29vj625yegw7jodysd]
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19639
before this change, we provide `boost_test_print_type()` for all types
which can be formatted using {fmt}. these types includes those who
fulfill the concept of range, and their element can be formatted using
{fmt}. if the compilation unit happens to include `fmt/ranges.h`.
the ranges are formatted with `boost_test_print_type()` as well. this
is what we expect. in other words, we use {fmt} to format types which
do not natively support {fmt}, but they fulfill the range concept.
but `boost::unit_test::basic_cstring` is one of them
- it can be formatted using operator<<, but it does not provide
fmt::format specialization
- it fulfills the concept of range
- and its element type is `char const`, which can be formatted using
{fmt}
that's why it's formatted like:
```
test/boost/sstable_directory_test.cc(317): fatal error: in "sstable_directory_test_generation_sanity": critical check ['s', 's', 't', '-', '>', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'i', 'o', 'n', '(', ')', ' ', '=', '=', ' ', 's', 's', 't', '1', '-', '>', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'i', 'o', 'n', '(', ')'] has failed`
```
where the string is formatted as a sequence-alike container. this
is far from readable.
so, in this change, we do not define `boost_test_print_type()` for
the types which natively support `operator<<` anymore. so they can
be printed with `operator<<` when boost::test prints them.
Fixesscylladb/scylladb#19637
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19638
The equivalent of small-objects, but for large objects (spans).
Allows listing object of a large-class, and therefore investigating a
run-away class, by attempting to identify the owners of the objects in
it.
Written to investigate #16493Closesscylladb/scylladb#16711
This will allow to trigger the read barrier directly via the API,
instead of doing work-arounds (like dropping a non-existent table).
The intended use-case is in the Scylla Manager, to make sure that
the database schema is up to date after the data has been backed up
and before attempting to backup the database schema.
The database schema in particular is being backed up just on a single
node, which might not yet have the schema at least as new as the data
(data can be migrated to a newer schema, but not a vice-versa).
The read barrier issued on the node should ensure that the node should
have the schema at least as new as the data or newer.
Closes#19213
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.
This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.
Fixesscylladb/scylladb#19552
mutation_partition_v2::apply_monotonically() needs to perform some allocations
in a destructor, to ensure that the invariants of the data structure are
restored before returning. But it is usually called with reclaiming disabled,
so the allocations might fail even in a perfectly healthy node with plenty of
reclaimable memory.
This patch adds a mechanism which allows to reserve some LSA memory (by
asking the allocator to keep it unused) and make it available for allocation
right when we need to guarantee allocation success.
When debugging the issue of high LWT contention metric, we (the
drivers team) discovered that at least 3 drivers (Go, Java, Rust)
cause high numbers in that metrics in LWT workloads - we doubted that
all those drivers route LWT queries badly. We tried to understand that
metric and its semantics. It took 3 people over 10 hours to figure out
what it is supposed to count.
People from core team suspected that it was the drivers sending
requests to different shards, causing contention. Then we ran the
workload against a single node single shard cluster... and observed
contention. Finally, we looked into the Scylla code and saw it.
**Uninitialized stack value.**
The core member was shocked. But we, the drivers people, felt we always
knew it. It's yet another time that we are blamed for a server-side
issue. We rebuilt scylla with the variable initialized to 0 and the
metric kept being 0.
To prevent such errors in the future, let's consider some lints that
warn against uninitialized variables. This is such an obvious feature
of e.g. Rust, and yet this has shown to be cause a painful bug in 2024.
Closesscylladb/scylladb#19625
On Ubuntu/Debian, we have to install systemd-coredump before
running has_ztd(), since it detect ZSTD support by running coredumpctl.
Move pkg_install('systemd-coredump') to the head of the script.
Fixes#19643Closesscylladb/scylladb#19648
in cccec07581, we started using a featured introduced by {fmt} v10.
but we are still using the {fmt} cooked using seastar, and it is
9.1.0, so this breaks the build when running the clang-tidy workflow.
in this change, instead of building on ubuntu jammy, we use the
scylladb/scylla-toolchain image based on fedora 40, which provides
{fmt} v10.2.1. since we are have clang 18 in fedora 40, this change
does not sacrifice anything.
after this change, clang-tidy workflow should be back to normal.
Fixesscylladb/scylladb#19621
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19628
Although Scylla already exposes metrics keeping track of various information related to hinted handoff, all of them correspond to either storing or sending hints. However, when debugging, it's also crucial to be aware of how many hints are coming to a given node and what their size is. Unfortunately, the existing metrics are not enough to obtain that information.
This PR introduces the following new metrics:
* `sent_bytes_total` – the total size of the hints that have been sent from a given shard,
* `received_hints_total` – the total number of hints that a given shard has received,
* `received_hints_bytes_total` – the total size of the hints a given shard has received.
It also renames `hints_manager_sent` to `hints_manager_sent_total` to avoid conflicts of prefixes between that metric and `sent_bytes_total` in tests.
Fixesscylladb/scylladb#10987Closesscylladb/scylladb#18976
* github.com:scylladb/scylladb:
db/hints: Add a metric for the size of sent hints
service/storage_proxy: Add metrics for received hints
The view builder is doing write operations to the database.
In order for the view builder to shutdown gracefully without errors, we
need to ensure the database can handle writes while it is drained.
The commit changes the drain order, so that view builder is drained
before the database shuts down.
Fixesscylladb/scylladb#18929Closesscylladb/scylladb#19609
Recently, the code in paxos_state::prepare(), paxos_state::accept() and
paxos_state::learn() was coroutinized by 58912c2cc1, 887a5a8f62 and
2b7acdb32c respectively. This introduced a regression: the latency
histogram updater code, was moved from a finally() to a defer(). Unlike
the former, the latter runs in a noexcept context so the possible
replica::no_such_column_family raised from the latency update code now
crashes the node, instead of failing just the paxos operation as before.
Fix by only updating the latency histogram if the table still exists.
Fixes: scylladb/scylladb#19620Closesscylladb/scylladb#19623
The page was missing from the docs. I created the page based on
the information in the download center (which will be closed down soon)
and other ScyllaDB resources.
Closesscylladb/scylladb#19577
In 2446cce, we stopped trying to attempt to create
endpoint managers for invalid hint directories
even when their names represented IP addresses or
host IDs. In this commit, we add logging informing
the user about it.
Refs scylladb/scylladb#19173Closesscylladb/scylladb#19618
Now that the CPU concurency limit is configurable, new reads might be
ready to execute right after the current one was executed. So move the
poll for admitting new reads into the inner loop, to prevent the
situation where the inner loop yields and a concurrent
do_wait_admission() finds that there are waiters (queued because at the
time they arrived to the semaphore, the _ready_list was not empty) but it
is is possible to admit a new read. When this happens the semaphore will
dump diagnostics to help debug the apparent contradiction, which can
generate a lot of log spam. Moving the poll into the inner loop prevents
the false-positive contradiction detection from firing.
Refs: scylladb/scylladb#19017Closesscylladb/scylladb#19600
This is the first patch from series which would allow us to unify raft command code. Property we want to achieve is that all modifications performed by a single raft command can be made visible atomically. This helps to exclude accidental dependencies across subsystem updates and make easier to reason about state.
Here we alter functions schema code so that changes are first applied to a copy of declared functions and then made visible atomically. Later work will apply similar strategy to the whole schema.
Relates scylladb/scylladb#19153Closesscylladb/scylladb#19598
* github.com:scylladb/scylladb:
cql3: functions: make modification functions accessible only via batch class
db: replica: batch functions schema modifications
cql3: functions: introduce class for batching functions modifications
cql3: functions: make functions class non-static
cql3: functions: remove reduntant class access specifiers
cql3: functions: remove unused java snippet
Before each function change was immediately visible as
during event notification logic yielded.
Now we first gather the modifications and then commit them.
Further work will broaden the scope of atomicity to the whole
schema and even across other subsystems.
In the next patch, we will want to do the thing as
refill_emergency_reserve() does, just with a quantity different
than _emergency_reserve_max. So we split off the shareable part
to a new function, and use it to implement refill_emergency_reserve().
When balancer fails to find a node to balance drained tablets into, it
throws an exception with tablet id and node id, but it's also good to
know more details about the balancing state that lead to failure
refs: #19504
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19588
It will hold a temporary shallow copy of declared functions.
Then each modification adds/removes/replaces stored function object.
At the end change is commited by moving temporary copy to the
main functions class instance.
This series is another approach of https://github.com/scylladb/scylladb/pull/18646 and https://github.com/scylladb/scylladb/pull/19181. In this series we only change where the view backlog gets
updated - we do not assure that the view update backlog returned in a response is necessarily the backlog
that increased due to the corresponding write, the returned backlog may be outdated up to 10ms. Because
this series does not include this change, it's considerably less complex and it doesn't modify the common
write patch, so no particular performance considerations were needed in that context. The issue being fixed
is still the same, the full description can be seen below.
When a replica applies a write on a table which has a materialized view
it generates view updates. These updates take memory which is tracked
by `database::_view_update_concurrency_sem`, separate on each shard.
The fraction of units taken from the semaphore to the semaphore limit
is the shard's view update backlog. Based on these backlogs, we want
to estimate how busy a node is with its view updates work. We do that
by taking the max backlog across all shards.
To avoid excessive cross-shard operations, the node's (max) backlog isn't
calculated each time we need it, but up to 1 time per 10ms (the `_interval`) with an optimization where the backlog of the calculating shard is immediately up-to-date (we don't need cross-shard operations for it):
```
update_backlog node_update_backlog::fetch() {
auto now = clock::now();
if (now >= _last_update.load(std::memory_order_relaxed) + _interval) {
_last_update.store(now, std::memory_order_relaxed);
auto new_max = boost::accumulate(
_backlogs,
update_backlog::no_backlog(),
[] (const update_backlog& lhs, const per_shard_backlog& rhs) {
return std::max(lhs, rhs.load());
});
_max.store(new_max, std::memory_order_relaxed);
return new_max;
}
return std::max(fetch_shard(this_shard_id()), _max.load(std::memory_order_relaxed));
}
```
For the same reason, even when we do calculate the new node's backlog,
we don't read from the `_view_update_concurrency_sem`. Instead, for
each shard we also store a update_backlog atomic which we use for
calculation:
```
struct per_shard_backlog {
// Multiply by 2 to defeat the prefetcher
alignas(seastar::cache_line_size * 2) std::atomic<update_backlog> backlog = update_backlog::no_backlog();
need_publishing need_publishing = need_publishing::no;
update_backlog load() const {
return backlog.load(std::memory_order_relaxed);
}
};
std::vector<per_shard_backlog> _backlogs;
```
Due to this distinction, the update_backlog atomic need to be updated
separately, when the `_view_update_concurrency_sem` changes.
This is done by calling `storage_proxy::update_view_update_backlog`, which reads the `_view_update_concurrency_sem` of the shard (in `database::get_view_update_backlog`)
and then calls node`_update_backlog::add` where the read backlog
is stored in the atomic:
```
void storage_proxy::update_view_update_backlog() {
_max_view_update_backlog.add(get_db().local().get_view_update_backlog());
}
void node_update_backlog::add(update_backlog backlog) {
_backlogs[this_shard_id()].backlog.store(backlog, std::memory_order_relaxed);
_backlogs[this_shard_id()].need_publishing = need_publishing::yes;
}
```
For this implementation of calculating the node's view update backlog to work,
we need the atomics to be updated correctly when the semaphores of corresponding
shards change.
The main event where the view update backlog changes is an incoming write
request. That's why when handling the request and preparing a response
we update the backlog calling `storage_proxy::get_view_update_backlog` (also
because we want to read the backlog and send it in the response):
backlog update after local view updates (`storage_proxy::send_to_live_endpoints` in `mutate_begin`)
```
auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable {
return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state())
.then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] {
// make mutation alive until it is processed locally, otherwise it
// may disappear if write timeouts before this future is ready
got_response(response_id, my_address, get_view_update_backlog());
});
};
backlog update after remote view updates (storage_proxy::remote::handle_write)
auto f = co_await coroutine::as_future(send_mutation_done(netw::messaging_service::msg_addr{reply_to, shard}, trace_state_ptr,
shard, response_id, p->get_view_update_backlog()));
```
Now assume that on a certain node we have a write request received on shard A,
which updates a row on shard B (A!=B). As a result, shard B will generate view
updates and consume units from its `_view_update_concurrency_sem`, but will
not update its atomic in `_backlogs` yet. Because both shards in the example
are on the same node, shard A will perform a local write calling `lmutate` shown
above. In the `lmutate` call, the `apply_locally` will initiate the actual write on
shard B and the `storage_proxy::update_view_update_backlog` will be called back
on shard A. In no place will the backlog atomic on shard B get updated even
though it increased in size due to the view updates generated there.
Currently, what we calculate there doesn't really matter - it's only used for the
MV flow control delays, so currently, in this scenario, we may only overload
a replica causing failed replica writes which will be later retried as hints. However,
when we add MV admission control, the calculated backlog will be the difference
between an accepted and a rejected request.
Fixes: https://github.com/scylladb/scylladb/issues/18542
Without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none
Closesscylladb/scylladb#19341
* github.com:scylladb/scylladb:
test: add test for view backlog not being updated on correct shard
test: move auxiliary methods for waiting until a view is built to util
mv: update view update backlog when it increases on correct shard
This is done to ease code reuse in the following commit.
It'd also help should we ever want properly mount functions
class to schema object instead of static storage.
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to use fmt::streamed() when
the corresponding format_as() as available.
simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19594
to avoid warning like
```
DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
```
and to be future-proof, let's use the offset-aware timestamp.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19536
zstd.h is a header provided by libzstd, so let's include it with
brackets, more consistent this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19538
We disabled coredump compression by default because it was too slow,
but recent versions of systemd-coredump supports faster zstd based compression,
so let's enable compression by default when zstd support detected.
Related scylladb/scylla-machine-image#462Closesscylladb/scylladb#18854
This short series fixes Alternator's "/localnodes" request to allow a node's external IP address - configured with `broadcast_rpc_address` - to be listed instead of its usual, internal, IP address.
The first patch fixes a bug in gossiper::get_rpc_address(), which the second patch needs to implement the feature. The second patch also contains regression tests.
Fixes#18711.
Closesscylladb/scylladb#18828
* github.com:scylladb/scylladb:
alternator: fix "/localnodes" to use broadcast_rpc_address
gossiper: fix get_rpc_address() for this node
They don't need to modify the captured objects. In fact, they must not
do it in the first place, because the request can be called more than
once and the buffers must not change between those invocations.
For the memory_sink_buffers there must be const method to get the vector
of temporary_buffers themselves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19599
Log the sstable origin when its bloom filter is being rebuilt. The
origin has to be passed to the method by the caller as it is not
available in the sstable object when the filter is rebuilt.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#19601
This patch adds a test for reproducing issue https://github.com/scylladb/scylladb/issues/18542
The test performs writes on a table with a materialized view and
checks that the view backlog increases. To get the current view
update backlog, a new metric "view_update_backlog" is added to
the `storage_proxy` metrics. The metric differs from the metric
from `database` metric with the same name by taking the backlog
from the max_view_update_backlog which keeps view update backlogs
from all shards which may be a bit outdated, instead of taking
the backlog by checking the view_update_semaphore which the backlog
is based on directly.
In many materialized view tests we need to wait until a view is built before
actually working on it, future tests will also need it. In existing tests
we use the same, duplicated method for achieving that.
In this patch the method is deduplicated and moved to pylib/util.py
and existing tests are modified to use it instead.
When performing a write, we should update the view update backlog
on the shard where the mutation is actually applied. Instead,
currently we only update it on the shard that initially received
the write request (which didn't change at all) and as a result,
the backlog on the correct shard and the aggregated max view update
backlog are not updated at all.
This patch enables updating the backlog on the correct shard. The
update is now performed just after the view generation and propagation
finishes, so that all backlog increases are noted and the backlog is
ready to be used in the write response.
Additionally, after this patch, we no longer (falsely) assume that
the backlog is modified on the same shard as where we later read it
to attach to a response. However, we still compare the aggregated
backlog from all shards and the backlog from the shard retrieving
the max, as with a shard-aware driver, it's likely the exact shard
whose backlog changed.
forward_service is nondescriptive and misnamed, as it does more than
forward requests. It's a classic map/reduce algorithm (and in fact one
of its parameters is "reducer"), so name it accordingly.
The name "forward" leaked into the wire protocol for the messaging
service RPC isolation cookie, so it's kept there. It's also maintained
in the name of the logger (for "nodetool setlogginglevel") for
compatibility with tests.
Closesscylladb/scylladb#19444
before this change, when linking scylla-main, the linker discards
the unreferenced symbols defined by zstd.cc. but we use constructor
of static variable `registerator` to register the zstd compressor,
this variable is not used from the linker's point of view. but we
do rely on the side effect of its constructor.
that's why the rules generated by CMake fails to build tests and
scylla executables with zstd support. that's why we have following
test failure:
```
boost.sstable_3_x_test.test_uncompressed_collections_read
...
[Exception] - no_such_class: unable to find class 'org.apache.cassandra.io.compress.ZstdCompressor'
== [File] - seastar/src/testing/seastar_test.cc
== [Line] - 43
```
in this change, we single out zstd.cc and build it as an archive,
so that scylla-main can include as a whole. an alternative is to
link scylla-main as a whole archive, but that might increase the disk
foot print when building lots of tests -- some of them do not use all
symbols exposed by scylla-main, and can potentially have smaller
size if linker can discard the unused symbols.
Refs https://github.com/scylladb/scylladb/issues/2717
---
cmake related change, hence no need to backport.
Closesscylladb/scylladb#19539
* github.com:scylladb/scylladb:
build: cmake: include the whole archive of zstd.a
build: cmake: find libzstd before using it
If an httpd body writer is called with output_stream<>, it mist close the stream on its own regardless of any exceptions it may generate while working, otherwise stream destructor may step on non-closed assertion. Stepped on with different handler, see #19541
Coroutinize the handler as the first step while at it (though the fix would have been notably shorter if done with .finally() lambda)
Closesscylladb/scylladb#19543
* github.com:scylladb/scylladb:
api: Close response stream of get_compaction_history()
api: Flush output stream in get_compaction_history() call
api: Coroutinize get_compaction_history inner function
when running `make setup`, we could have following failure:
```
Installing the current project: scylla (4.3.0)
The current project could not be installed: No file/folder found for package scylla
If you do not want to install the current project use --no-root
```
because docs is not a proper python project named "scylla",
and do not have a directory structure expected by poetry. what we
expect from poetry, is to manage the dependencies for building
the document.
so, in this change, we install in the `non-package` mode when running
`poetry install`, this skips the root package, which does not exist.
as an alternative, we could put an empty `scylla.py` under `docs`
directory, but that'd be overkill. or we could pass `--no-root`
to `poetry install`, but would be ideal if we can keep the settings
in a single place.
see also https://python-poetry.org/docs/basic-usage/#operating-modes,
and https://python-poetry.org/docs/cli/#options-2, for more
details on the settings and command line options of poetry.
please note this setting was added to poetry 1.8, so the required
poetry version is updated. we might need to upgrade poetry in existing
installation.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19498
Currently, a pending replica that applies a write on a table that has
materialized views, will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.
This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.
This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.
The metric total_view_updates_on_wrong_node is added to count the cases
that a view update is determined to be unnecessary.
The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.
Fixesscylladb/scylladb#19152Closesscylladb/scylladb#19488
The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of the finishing later then they could if they were the only active read in the cache.
This was observed to backfire in the case where there reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read keeps up the fast reads in the queue, increasing the 99th percentile latencies significantly.
This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread.
Fixes: https://github.com/scylladb/scylladb/issues/19017
This fixes a regression introduced in 5.0, so we have to backport to all currently supported releases
Closesscylladb/scylladb#19018
* github.com:scylladb/scylladb:
test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrenc Please enter the commit message for your changes. Lines starting
test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
reader_concurrency_semaphore: wire in the configurable cpu concurrency
reader_concurrency_semaphore: add cpu_concurrency constructor parameter
db/config: introduce reader_concurrency_semahore_cpu_concurrency
Co-routinize paxos_state functions to make them more readable.
* 'gleb/coroutineze-paxos-state' of github.com:scylladb/scylla-dev:
paxos: simplify paxos_state::prepare code to not work with raw futures
paxos: co-routinize paxos_state::learn function
paxos: remove no longer used with_locked_key functions
paxos: co-routinize paxos_state::accept function
paxos: co-routinize paxos_state::prepare function
paxos: introduce get_replica_lock() function to take RAII guard for local paxos table access
this changeset adds a filter to customize the rendering of default
values, and enables the `scylladb_cc_properties` extension to display
the logging message related options. it prepares for the further
improvements in
https://opensource.docs.scylladb.com/master/reference/configuration-parameters.html.
this changeset also prepare for the improvements requested by #19463
---
it's an improvement in the document, hence no need to backport.
Closesscylladb/scylladb#19483
* github.com:scylladb/scylladb:
config: add descriptions for default_log_level and friends
config: define log_to_syslog in a different line
docs: parse log_legacy_value as declarations of config option
A node may wait in the topology coordinator queue for awhile before been
joined. Since the local address is added as expiring entry to the raft
address map it may expire meanwhile and the bootstrap will fail. The
series makes the entry non expiring.
Fixes scylladb/scylladb#19523
Needs to be backported to 6.0 since the bug may cause bootstrap to fail.
Closesscylladb/scylladb#19557
* github.com:scylladb/scylladb:
test: add test that checks that local address cannot expire between join request placemen and its processing
storage_service: make node's entry non expiring in raft address map
before this change, when linking scylla-main, the linker discards
the unreferenced symbols defined by zstd.cc. but we use constructor
of static variable `registerator` to register the zstd compressor,
this variable is not used from the linker's point of view. but we
do rely on the side effect of its constructor.
that's why the rules generated by CMake fails to build tests and
scylla executables with zstd support. that's why we have following
test failure:
```
boost.sstable_3_x_test.test_uncompressed_collections_read
...
[Exception] - no_such_class: unable to find class 'org.apache.cassandra.io.compress.ZstdCompressor'
== [File] - seastar/src/testing/seastar_test.cc
== [Line] - 43
```
in this change, we single out zstd.cc and build it as an archive,
so that scylla-main can include as a whole. an alternative is to
link scylla-main as a whole archive, but that might increase the disk
foot print when building lots of tests -- some of them do not use all
symbols exposed by scylla-main, and can potentially have smaller
size if linker can discard the unused symbols.
Refs #2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we use libzstd in zstd.cc. so let's find this library before using
it. this helps user to identify problem when preparing the building
environment, instead of being greeted by a compile-time failure.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, docs/_ext/scylladb_cc_properties.py parses
the options line by line, because `log_to_stdout` and `log_to_syslog`
are defined in a single line, this script is not able to parse them,
hence fails to display them on the `reference/configuration-parameters/`
web page.
after this change, these two member variables are defined on different
lines. both of them can be displayed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we only consider "named_value<type>" as the
declaration of option, and the "Type" field of the corresponding
option is displayed if its declaration is found. otherwise, "Type"
field is not rendered. but some logging related options are declared
using `log_legacy_value`, so they are missing.
after this change, they are displayed as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the compiler to use the
definition of `cql3::attributes` to generate the defaulted
destructor in .cc file. but with clang-19, it insists that
we should have a complete definition available for defining
the defaulted destructor, otherwise it fails the build:
```
/home/kefu/.local/bin/clang++ -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT CMakeFiles/scylla-main.dir/RelWithDebInfo/table_helper.cc.o -MF CMakeFiles/scylla-main.dir/RelWithDebInfo/table_helper.cc.o.d -o CMakeFiles/scylla-main.dir/RelWithDebInfo/table_helper.cc.o -c /home/kefu/dev/scylladb/table_helper.cc
In file included from /home/kefu/dev/scylladb/table_helper.cc:10:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/coroutine.hh:25:
In file included from /home/kefu/dev/scylladb/seastar/include/seastar/core/future.hh:30:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/memory:78:
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'cql3::attributes'
91 | static_assert(sizeof(_Tp)>0,
| ^~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:398:4: note: in instantiation of member function 'std::default_delete<cql3::attributes>::operator()' requested here
398 | get_deleter()(std::move(__ptr));
| ^
/home/kefu/dev/scylladb/cql3/statements/modification_statement.hh:40:7: note: in instantiation of member function 'std::unique_ptr<cql3::attributes>::~unique_ptr' requested here
40 | class modification_statement : public cql_statement_opt_metadata {
| ^
/home/kefu/dev/scylladb/cql3/statements/modification_statement.hh:40:7: note: in implicit destructor for 'cql3::statements::modification_statement' first required here
/home/kefu/dev/scylladb/cql3/statements/modification_statement.hh:28:7: note: forward declaration of 'cql3::attributes'
28 | class attributes;
| ^
```
so, in this change, we define the destructor in .cc file, where
the complete definition of `cql3::attributes` is available.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19545
If client stops reading response early, the server-side stream throws but must be closed anyway. Seen in another endpoint and fixed by #19541Closesscylladb/scylladb#19542
* github.com:scylladb/scylladb:
api: Fix indentation after previous patch
api: Close response stream on error
api: Flush response output stream before closing
All streams used by httpd handlers are to be closed by the handler itself, caller doesn't take care of that.
fixes: #19494Closesscylladb/scylladb#19541
* github.com:scylladb/scylladb:
api: Fix indentation after previous patch
api: Close output_stream on error
api: Flush response output stream before closing
in 3c7af287, cqlsh's reloc package was marked as "noarch", and its
filename was updated accordingly in `configure.py`, so let's update
the CMake building system accordingly.
this change should address the build failure of
```
08:48:14 [3325/4124] Generating ../Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 FAILED: Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 cd /jenkins/workspace/scylla-master/scylla-ci/scylla/build/dist && /usr/bin/cmake -E copy /jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14 Error copying file "/jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz" to "/jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz".
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19544
Alternator's non-standard "/localnodes" HTTP request returns a list of
live nodes on this DC, to consider for load balancing. The returned
node addresses should be external IP addresses usable by the clients.
Scylla has a configuration parameter - broadcast_rpc_address - which
defines for a node an external IP address. If such a configuration
exists, we need to use those external IP addresses, not the internal
ones.
Finding these broadcast_rpc_address of all nodes is easy, because the
gossiper already gossips them.
This patch also tests the new feature:
1. The existing single-node test is extended to verify that without
broadcast_rpc_address we get the usual IP address.
2. A new two-node test is added to check that when broadcast_rpc_address
is configured, we get that address and not the usual internal IP
addresses.
Fixes#18711.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Commit dd46a92e23 introduced a function gossiper::get_rpc_address()
as a shortcut for get_application_state_ptr(endpoint, RPC_ADDRESS) -
i.e., it fetches the endpoint's configured broadcast_rpc_address
(despite its confusing name, this is the endpoint's external IP address
that clients can use to make CQL connections).
But strangely, the implementation get_rpc_address() made an exception
for asking about the *current* host - where instead of getting this
node's broadcast_rpc_address, it returns its internal address, which
is not what this function was supposed to do - it's not useful for
it to do one thing for this node, and a different thing for other
nodes, and when I wrote code that uses this function (see the next
patch), this resulted in wrong results for the current node.
The fix is simple - drop the wrong if(), and get the
broadcast_rpc_address stored by the gossiper unconditionally - the
gossiper knows it for this node just like for other nodes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
podman does not allocate a tty by default, so without `-t` or `--tty`,
one cannot use a functional terminal when interacting with the
container. that what one can expect when running `dbuild -i --`, and
we are greeted with :
```
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
```
after this change, one can enjoy the good-old terminal as usual
after being dropped to the container provided by `dbuild -i --`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19550
Switch the C++ standard from C++20 to C++23. This is straightforward, but there are a few
fallouts (mostly due to std::unique_ptr that became constexpr) that need to be fixed first.
Internal enhancement - no backport required
Closesscylladb/scylladb#19528
* github.com:scylladb/scylladb:
build: switch to C++23
config: avoid binding an lvalue reference to an rvalue reference
readers: define query::partition_slice before using it in default argument
test: define table_for_tests earlier
compaction: define compaction_group::table_state earlier
compaction: compaction_group: define destructor out-of-line
compaction_manager: define compaction_manager::strategy_control earlier
In 90a6c3bd7a ("build: reduce release mode inline tuning on aarch64") we
reduced inlining on aarch64, due to miscompiles.
In 224a2877b9 ("build: disable -Og in debug mode to avoid coroutine
asan breakage") we disabled optimization in debug mode, due to miscompiles.
With clang 18.1, it appears the miscompiles are gone, and we can remove
the two workarounds.
Closesscylladb/scylladb#19531
It's currently implicitly flushed on its close, but in that case close
can throw while flusing. Next patch wants close not to throw and that's
possible if flushing the stream in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The handler returns a function which is then invoked with output_stream
argument to render the json into. This function is converted into
coroutine. It has yet another inner lambda that's passed into
compaction_manager::get_compaction_history() as consumer lambda. It's
coroutinized too.
The indentation looks weird as preparation for future patching.
Hopefullly it's still possible to understand what's going on.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The handler's lambda is called with && stream object and must close the
stream on its own regardless of what.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The .close() method flushes the stream, but it may throw doing it. Next
patch will want .close() not to throw, for that stream must be flushed
explicitly before closing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the get_snapshot_details() lambda throws, the output stream remains
non-closed which is bad. Close it regardless of what.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this PR has 2 commits
- [test: pass Scylla extra CMD args from test.py args](6b367a04b5)
- [test: adjust scylla_cluster.merge_cmdline_options behavior](c60b36090a)
the main goal is to solve [test.py: provide an easy-to-remember, univeral way to run scylla with trace level logging](https://github.com/scylladb/scylladb/issues/14960) issue
but also can be used to easily apply additional arguments for all UnitTests and PythonTests on the fly from the test.py CMD
Closesscylladb/scylladb#19509
* github.com:scylladb/scylladb:
test: adjust scylla_cluster.merge_cmdline_options behavior
test: pass scylla extra CMD args from test.py args
This short series removed some ancient legacy code from
migration_manager and schema_tables, before I make further changes in this area.
We have more such code under the cql3 hierarchy but it can be dealt with as a follow up.
No backport required
Closesscylladb/scylladb#19530
* github.com:scylladb/scylladb:
schema_tables: remove dead code
migration_manager: remove dead code
This check is already in place, but isn't fully working, i.e.
switching from a vnode KS to a tablets KS is not allowed, but
this check doesn't work in the other direction. To fix the
latter, `ks_prop_defs::get_initial_tablets()` has been changed
to handle 3 states: (1) init_tablets is set, (2) it was skipped,
(3) tablets are disabled. These couldn't fit into std::optional,
so a new local struct to hold these states has been introduced.
Callers of this function have been adjusted to set init_tablets
to an appropriate value according to the circumstances, i.e. if
tablets are globally enabled, but have been skipped in the CQL,
init_tablets is automatically set to 0, but if someone executes
ALTER KS and doesn't provide tablets options, they're inherited
from the old KS.
I tried various approaches and this one resulted in the least
lines of code changed. I also provided testcases to explain how
the code behaves.
Fixes: #18795Closesscylladb/scylladb#19368
Well, even after 10 years, the c++ compilers still
do not compile Java...
And having that legacy code laying around
not only it doesn't help anyone understand what's
going on, but on the contrary, it's confusing and distracting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Well, even after 10 years, the c++ compilers still
do not compile Java...
And having that legacy code laying around
not only it doesn't help anyone understand what's
going on, but on the contrary, it's confusing and distracting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
config_file::add_deprecated_options() returns an lvalue reference
to a parameter which itself is an rvalue reference. In C++20 this
is bad practice (but not a bug in this case) as rvalue references
are not expected to live past the call. In C++23, it fails to compile.
Fix by accepting an lvalue reference for the parameter, and adjust the
caller.
C++23 made std::unique_ptr constexpr. A side effect of this (presumably)
is that the compiler compiles it more eagerly, requiring the full definition
of the class in std::make_unique, while it previously was content with
finding the definition later.
One victim of this change is the default argument of make_reversing_reader;
define it earlier (by including its header) to build with C++23.
Those tests are sometimes failing on CI and we have two hypothesis:
1. Something wrong with consistency of statements
2. Interruption from another test run (e.g. same queries performed concurrently or data remained after previous run)
To exclude or confirm 2. we add random marker to avoid potential collision, in such case it will be clearly visible that wrong data comes from a different run.
Related scylladb/scylladb#18931
Related scylladb/scylladb#18319
backport: no, just a test fix
Closesscylladb/scylladb#19484
* github.com:scylladb/scylladb:
test: auth: add random tag to resources in test_auth_v2_migration
test: extend unique_name with random sufix
C++23 made std::unique_ptr constexpr. A side effect of this (presumably)
is that the compiler compiles it more eagerly, requiring the full definition
of the class in std::make_unique, while it previously was content with
finding the definition later.
One victim of this change is table_for_tests; define it earlier to
build with C++23.
C++23 made std::unique_ptr constexpr. A side effect of this (presumably)
is that the compiler compiles it more eagerly, requiring the full definition
of the class in std::make_unique, while it previously was content with
finding the definition later.
One victim of this change is compaction_group::table_state; define
it earlier to build with C++23.
Define compaction_group::~compaction_group() out-of-line to prevent
problems instantiating compaction_group::_table_state, which is an
std::unique_ptr. In C++23, std::unique_ptr is constexpr, which means
its methods (in this case the destructor) require seeing the definition
of the class at the point of instantiation.
C++23 made std::unique_ptr constexpr. A side effect of this (presumably)
is that the compiler compiles it more eagerly, requiring the full definition
of the class in std::make_unique, while it previously was content with
finding the definition later.
One victim of this change is compaction_manager::strategy_control; define
it earlier to build with C++23.
Fixes: https://github.com/scylladb/scylladb/issues/19489
There is already a check that Scylla binary is executable, but it's done on later stage. So in logs for specific test file there will be a message about something wrong with binary, but in console there will be now signs of that. Moreover, there will be an error that completely misleads what actually happened and why test run failed. With this check test will fail earlier providing the correct reason why it's failed
Closesscylladb/scylladb#19491
Before this patch, the semaphore was hard-wired to stop admission, if
there is even a single permit, which is in the need_cpu state.
Therefore, keeping the CPU concurrency at 1.
This patch makes use of the new cpu_concurrency parameter, which was
wired in in the last patches, allowing for a configurable amount of
concurrent need_cpu permits. This is to address workloads where some
small subset of reads are expected to be slow, and can hold up faster
reads behind them in the semaphore queue.
The exiting documentation comment for `enable_tablets`
is very terse and lacks details about the effect of enabling
or disabling tablets.
This change adds more details about the impact of `enable_tablets`
on newly created keyspaces, and hot to disable tablets when
keyspaces are created.
Also, a note was added to warn about the irreversibility
of the tablets enablement per keyspace.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`prs = response.json().get("items", [])` will return empty when there are no merged PRs, and this will just skip the all-label replacement process.
This is a regression following the work done in #19442
Adding another part to handle closed PRs (which is the majority of the cases we have in Scylla core)
Fixes: https://github.com/scylladb/scylladb/issues/19441Closesscylladb/scylladb#19497
Currently proxy initialization is pretty disperse, in particular it's
stopped in several steps -- first drain_on_shutdown() then
stop_remote(). In between there's nothing that needs proxy in any
particular sate, so those two steps can be merged into one.
refs: scylladb/scylladb#2737
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19344
Real S3 server is known to actively close connections, thus breaking S3
storage backend at random places. The recent http client update is more
robust against that, but the needed feature is OFF by default.
refs: scylladb/seastar#1883
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19461
adjust merge_cmdline_options behaviour to
append --logger-log-level option instead of merge
this behaviour can be changed(if needed)
to previour version(all merge):
merge_cmdline_options(list1, list2, appending_options=[])
or, to append different cmd options:
merge_cmdline_options(list1, list2, appending_options=[option1,option2])
this commit introduces a test.py option --extra-scylla-cmdline-options
to pass extra scylla cmdline options for all tests.
Options should be space-separated:
'--logger-log-level raft=trace --default-log-level error'
The nodetool tests does not set the asan/ubsan options
to abort on error and create core dumps
Fix by setting the environment variables in nodetool tests.
Closesscylladb/scylladb#19503
Those tests are sometimes failing on CI and we have two hypothesis:
1. Something wrong with consistency of statements
2. Interruption from another test run (e.g. same queries performed
concurrently or data remained after previous run)
To exclude or confirm 2. we add random marker to avoid potential collision,
in such case it will be clearly visible that wrong data comes from
a different run.
Related scylladb/scylladb#18931
Related scylladb/scylladb#18319
This commit updates the instuctions on how to download and run Scylla Doctor,
following the changes in how Scylla Doctor is released.
Closesscylladb/scylladb#19510
This short series enhances utils::chunked_vector so it could be used more easily to convert dht::partition_range_vector to chunked_vector, for example.
- utils: chunked_vector: document invalidation of iterators on move
- utils: chunked_vector: add ctor from std::initializer_list
- utils: chunked_vector: add ctor from a single value
No backport required
Closesscylladb/scylladb#19462
* github.com:scylladb/scylladb:
chunked_vector_test: add tests for value-initialization constructor
utils: chunked_vector: add ctor from std::initializer_list
utils: chunked_vector: document invalidation of iterators on move
This commit adds a page listing the ScyllDB limits
we know today.
The page can and should be extended when other limits
are confirmed.
Closesscylladb/scylladb#19399
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19508
this change updates the cqlsh submodule:
* tools/cqlsh/ ba83aea3...73bdbeb0 (4):
> install.sh: replace tab with spaces
> define the the debug packge is empty
> tests: switch from using cqlsh bash to the test the python file
> package python driver as wheels
it also includes follow change to package cqlsh as a regular
rpm instead of as a "noarch" rpm:
so far cqlsh bundles the python-driver in, but only as source.
meaning the package wasn't architecture, and also didn't
have the libev eventloop compiled in.
Since from python 3.12 and up, that would mean we would
fallback into asyncio eventloop (which still exprimental)
or into error (once we'll sync with the driver upstream)
so to avoid those, we are change the packaging of cqlsh
to be architecture specific, and get cqlsh compiled, and bundle
all of it's requirements as per architecture installed bundle of wheels.
using `shiv`, i.e. one file virtualenv that we'll be packing
into our artifacts
Ref: https://github.com/scylladb/scylla-cqlsh/issues/90
Ref: https://github.com/scylladb/scylla-cqlsh/pull/91
Ref: https://github.com/linkedin/shivClosesscylladb/scylladb#19385
* tools/cqlsh ba83aea...242876c (1):
> Merge 'package python driver as wheels' from Israel Fruchter
Update tools/cqlsh/ submodule
in which, the change of `define the the debug packge is empty`
should address the build failure like
```
Processing files: scylla-cqlsh-debugsource-6.1.0~dev-0.20240624.c7748f60c0bc.aarch64
error: Empty %files file /jenkins/workspace/scylla-master/next/scylla/tools/cqlsh/build/redhat/BUILD/scylla-cqlsh/debugsourcefiles.list
RPM build errors:
Empty %files file /jenkins/workspace/scylla-master/next/scylla/tools/cqlsh/build/redhat/BUILD/scylla-cqlsh/debugsourcefiles.list
```
Closesscylladb/scylladb#19473
Now enable_tombstone_gc_for_streaming_and_repair is wired in all the way
to maybe_compact_for_streaming(), so we can implement the toggling of
tombstone GC based on it.
To control whether the compacting reader (if enabled) for streaming and
repair can garbage-collect tombstones.
Default is false (previous behaviour).
Not wired yet.
When preparing statement, the server code first does it on non-local
shards, then on local one. The former call is done the hard way, while
there's a short sugar sharded<> class method doing it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19485
The series adds a step during node's boot process, just before completing
the initialization, in which the node sends a notification to all other
normal nodes in the cluster that it is UP now. Other nodes wait for this
node to be UP and in normal state before replying. This ensures that,
in a healthy cluster, when a node start serving queries the entire
cluster knows its up-to-date state. The notification is a best effort
though. If some nodes are down or do not reply in time the boot process
continues. It is somewhat similar to shutdown notification in this regard.
* 'gleb/notify-up-v2' of github.com:scylladb/scylla-dev:
gossiper: wait for a bootstrapping node to be seen as normal on all nodes before completing initialization
Wait for booting node to be marked UP before complete booting.
gossiper: move gossip verbs to the idl
Delete 10s timeout from read barrier in table_sync_and_check,
so that the function always considers all previous group0 changes.
Fixes: #18490.
Closesscylladb/scylladb#18752
The `system.batchlog` table has a partition for each batch that failed to complete. After finally applying the batch, the partition is deleted. Although the table has gc_grace_second = 0, tombstones can still accumulate in memory, because we don't purge partition tombstones from either the memtable or the cache. This can lead to the cache and memtable of this table to accumulate many thousands of even millions of tombstones, making batchlog replay very slow. We didn't notice this before, because we would only replay all failed batches on unbootstrap, which is rare and a heavy and slow operation on its own right already.
With repair-based tombstone-gc however, we do a full batchlog replay at the beginning of each repair, and now this extra delay is noticeable.
Fix this by making sure batchlog replays don't have to scan through all the tombstones generated by previous replays:
* flush the `system.batchlog` memtable at the end of each batchlog replay, so it is cleared of tombstones
* bypass the cache
Fixes: https://github.com/scylladb/scylladb/issues/19376
Although this is not a regression -- replay was like this since forever -- now that repair calls into batchlog replay, every release which uses repair-based tombstone-gc should get this fix
Closesscylladb/scylladb#19377
* github.com:scylladb/scylladb:
db/batchlog_manager: bypass cache when scanning batchlog table
db/batchlog_manager: replace open-coded paging with internal one
db/batchlog_manager: implement cleanup after all batchlog replay
cql3/query_processor: for_each_cql_result(): move func to the coro frame
The bloom filters are built with partition estimates because the actual
partition count might not be available in all cases. If the estimate is
inaccurate, the bloom filters might end up being too large or too small
compared to their optimal sizes. This PR rebuilds bloom filters with
inaccurate partition estimates using the actual partition count before
the filter is written to disk. A bloom filter is considered to have an
inaccurate estimate if its false positive rate based on the current
bitmap size is either less than 75% or more than 125% of the configured
false positive rate.
Fixes#19049
A manual test was run to check the impact of rebuild on compaction.
Table definition used : CREATE TABLE scylla_bench.simple_table (id int PRIMARY KEY);
Setup : 3 billion random rows with id in the range [0, 1e8) were inserted as batches of 5 rows into scylla_bench.simple_table via 80 threads.
Compaction statistics :
scylla_bench.simple_table :
(a) Total number of compactions : `1501`
(b) Total time spent in compaction : `9h58m47.269s`
(c) Number of compactions which rebuilt bloom filters : `16`
(d) Total time taken by these 16 compactions which rebuilt bloom filters : `2h55m11.89s`
(e) Total time spent by these 16 compactions to rebuild bloom filters : `8m6.221s` which is
- `4.63%` of the total time taken by the compactions which rebuilt filters (d)
- `1.35%` of the total compaction time (b).
(f) Total bytes saved by rebuilding filters : `388 MB`
system.compaction_history :
(a) Total number of compactions : `77`
(b) Total time spent in compaction : `21.24s`
(c) Number of compactions which rebuilt bloom filters : `74`
(d) Time taken by these 74 compactions which rebuilt bloom filters : `20.48s`
(e) Time spent by these 74 compactions to rebuild bloom filters : `377ms` which is
- `1.84%` of the total time taken by the compactions which rebuilt filters (d)
- `1.77%` of the total compaction time (b).
(f) Total bytes saved by rebuilding filters : `20 kB`
The following tables also had compactions and the bloom filter was rebuilt in all those compactions.
However, the time taken for every rebuild was observed as 0ms from the logs as it completed within a microsecond :
system.raft :
(a) Total number of compactions : `2`
(b) Total time spent in compaction : `106ms`
(c) Total bytes saved by rebuilding filters : `960 B`
system_schema.tables :
(a) Total number of compactions : `1`
(b) Total time spent in compaction : `25ms`
(c) Total bytes saved by rebuilding filter : `312 B`
system.topology :
(a) Total number of compactions : `1`
(b) Total time spent in compaction : `25ms`
(c) Total bytes saved by rebuilding filter : `320 B`
Closesscylladb/scylladb#19190
* github.com:scylladb/scylladb:
bloom_filter_test: add testcase to verify filter rebuilds
test/boost: move bloom filter tests from sstable_datafile_test into a new file
sstables/mx/writer: rebuild bloom filters with bad partition estimates
sstables/mx/writer: add variable to track number of partitions consumed
sstable: introduce sstable::maybe_rebuild_filter_from_index()
sstable: add method to return filter format for the given sstable version
utils/i_filter: introduce get_filter_size()
API endpoints that need a particular service to get data from are registered next to this service (#2737). In /storage_proxy function there live some endpoints that work with config, so this PR moves them to the existing config.cc with config-related endpoints. The path these endpoints are registered with remains intact, so some tweak in proxy API registration is also here.
Closesscylladb/scylladb#19417
* github.com:scylladb/scylladb:
api: Use provided db::config, not the one from ctx
api: Move some config endpoints from proxy to config
api: Split storage_proxy api registration
api: Unset config endpoints
Scans should not pollute the cache with cold data, in general. In the
case of the batchlog table, there is another reason to bypass the cache:
this table can have a lot of partition tombstones, which currently are
not purged from the cache. So in certain cases, using the cache can make
batch replay very slow, because it has to scan past tombstones of
already replayed batches.
We have a commented code snippet from Origin with cleanup and a FIXME to
implement it. Origin flushes the memtables and kicks a compaction. We
only implement the flush here -- the flush will trigger a compaction
check and we leave it up to the compaction manager to decide when a
compaction is worthwhile.
This method used to be called only from unbootstrap, so a cleanup was
not really needed. Now it is also called at the end of repair, if the
table is using repair-based tombstone-gc. If the memtable is filled with
tombstones, this can add a lot of time to the runtime of each repair. So
flush the memtable at the end, so the tombstones can be purged (they
aren't purged from memtables yet).
Said method has a func parameter (called just f), which it receives as
rvalue ref and just uses as a reference. This means that if caller
doesn't keep the func alive, for_each_cql_result() will run into
use-after-free after the first suspention point. This is unexpected for
callers, who don't expect to have to keep something alive, which they
passed in with std::move().
Adjust the signature to take a value instead, value parameters are moved
to the coro frame and survive suspention points.
Adjust internal callers (query_internal()) the same way.
There are no known vulnerable external callers.
Today after Mergify opened a Backport PR, it will stay open until someone manually close the backport PR , also we can't track using labels which backport was done or not since there is no indication for that except digging into the PR and looking for a comment or a commit ref
The following changes were made in this PR:
* trigger add-label-when-promoted.yaml also when the push was made to `branch-x.y`
* Replace label `backport/x.y` with `backport/x.y-done` in the original PR (this will automatically update the original Issue as well)
* Add a comment on the backport PR and close it
Fixes: https://github.com/scylladb/scylladb/issues/19441Closesscylladb/scylladb#19442
Currently, the returned `ranges` vector is first initialized
to `product_size` and then the returned partition ranges are
copied into it.
Instead, we can simply reserve the vector capacity,
without initializing it, and then emplace all partition ranges
onto it using std::back_inserter.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19457
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
the callers in alternator/streams.cc is updated to use `fmt::print()`
to format the `bytes` instances.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19448
chunked_vector differs from std::vector where
the latter's move constructor is required to preserve
and iterators to the moved-from vector.
In contrast, chunked_vector::iterator keeps a pointer
to the chunked_vector::_chunks data, which is
a utils::small_vector, and when moved, it might
invalidate the iterator since the moved-to _chunks
might copy the contents of the internal capacity
rather than moving the allocated capacity.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The Ninja makefile (build.ninja) generated by the ./configure.py script
is smart enough to notice when the configure.py script is modified and
re-runs the script in order to regenerate itself. However, this
operation is currently not idempotent and quickly breaks because
information about the Ninja makefile's name is not passed properly.
This is the rule used for makefile's regeneration:
```
rule configure
command = {python} configure.py --out={buildfile}.new $configure_args && mv {buildfile}.new {buildfile}
generator = 1
description = CONFIGURE $configure_args
```
The `buildfile` variable holds the value of the `--out` option which is
set to `build.ninja` if not provided explicitly.
Note that regenerating the makefile passes a name with the `.new` suffix
added to the end; we want to first write the file in full and then
overwrite the old file via a rename. However, notice that the script was
called with `--out=build.ninja.new`; the `configure` rule in the
regenerated file will have `configure.py --out=build.ninja.new.new` and
then `mv build.ninja.new.new build.ninja.new`. So, second regeneration
will just leave a build.ninja.new file which is not useful.
Fix this by introducing an additional parameter `--out-final-name`.
This parameter is only supposed to be used in the regeneration rule and
its purpose is to preserve information about the original file name.
After this change I no longer see `build.ninja.new` being created after
a sequence of `touch configure.py && ninja` calls.
Closesscylladb/scylladb#19428
The docs [1] clearly say "install-dependencies.sh" should be run as
"root"; however, the script silently assumes that the umask inherited from
the calling environment is 0022. That's not necessarily the case, and
there's an argument to be made for "root" setting umask 0077 by default.
The script behaves unexpectedly under such circumstances; files and
directories it creates under /opt and /usr/local are then not accessible
to unprivileged users, leading to compilation failures later on.
Set the creation mask explicitly to 0022.
[1] https://github.com/scylladb/scylladb/blob/master/HACKING.md#dependencies
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Closesscylladb/scylladb#19464
Since some time executing our ninja build
targets generates also build.ninja.new file.
Adding it to .gitignore for convenience as we
won't commit this file.
Closesscylladb/scylladb#19367
Since 3afbd21f, we are able to selectively choose a single test
in a boost test executable which represents a test suite, and to
choose a single test in a pytest script with the syntax of
"test_suite::test_case". it's very handy for manual testing.
so let's document in the command line help message as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19454
before this change, when linking an executable referencing `marker`,
we could have following error:
```
13:58:02 ld.lld: error: undefined symbol: alternator::event_id::marker
13:58:02 >>> referenced by streams.cc
13:58:02 >>> build/dev/alternator/streams.o:(from_string_helper<rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>, alternator::event_id>::Set(rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>&, alternator::event_id, rjson::internal::throwing_allocator&))
13:58:02 clang-16: error: linker command failed with exit code 1 (use -v to see invocation)
```
it turns out `event_id::marker` is only declared, but never defined.
please note, the non-inline static member variable in its class
definition is not considered as a definition, see
[class.static.data](https://eel.is/c++draft/class.static.data#3)
> The declaration of a non-inline static data member in its class
> definition is not a definition and may be of an incomplete type
> other than cv void.
so, let's declare it as a `constexpr` instead. it implies `inline`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19452
before this change, when running test like:
```console
./test.py --mode release topology_experimental_raft/test_tablets
/home/kefu/dev/scylladb/test/pylib/scylla_cluster.py:333: SyntaxWarning: invalid escape sequence '\('
deleted_sstable_re = f"^.*/{keyspace}/{table}-[0-9a-f]{{32}}/.* \(deleted\)$"
```
we could have the warning above. because `\(` is not a valid escape
sequence, but the Python interpreter accepts it as two separated
characters of `\(` after complaining. but it's still annoying.
so, let's use a raw string here, as we want to match "(deleted)".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19451
The bloom filters are built with partition estimates, as the actual
partition count might not be available in all the cases. If the estimate
was bad, the bloom filters might end up too large or too small than
their optimal sizes. Rebuild such bloom filters with actual partition
count before the filter is written to disk and the sstable is sealed.
Fixes#19049
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add method sstable::maybe_rebuild_filter_from_index() that rebuilds
bloom filters which had bad partition estimates when they were built.
The method checks the false positive rate based on the current bitset
size against the configured false positive rate to decide whether a
filter needs to be rebuilt. If the current false positive rate is within
75% to 125% of the configured false positive rate, the bloom filter will
not be rebuilt. Otherwise, the filter will be rebuilt from the index
entries. This method should only be called before an SSTable is sealed
as the bloom filter is updated in-place.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Extract out the filter format computing logic from sstable::read_filter
into a separate function. This is done so that the subsequent patches
can make use of this function.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Currently, the only way to get the size of a filter, for certain
parameters is to actually create one. This requires a seastar thread
context and potentially also allocates huge amount of memory.
Provdide a method which just calculates the size, without any of the
above mentioned baggage.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
since we are now able to use C++20, there is no need to use the
homebrew rotl64(). so in this change, we replace rotl64() with
std::rotl(), and remove the former from the source tree.
the underlying implementations of these two solutions are equivalent,
so no performance changes are expected. all caller sites have been
audited: all of them pass `uint64` as the first parameter.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19447
Default role creation in auth-v1 is asynchronous and all nodes race to
create it so we'd need to delay the test and wait. Checking this particular
role doesn't bring much value to the test as we check other roles
to demonstrate correctness.
Fixesscylladb/scylladb#19039Closesscylladb/scylladb#19424
Gossiper has two blocs of endpoints, both are registered in legacy/random place in main. This PR moves them next to gossiper start and adds unregistration for both.
refs: #2737Closesscylladb/scylladb#19425
* github.com:scylladb/scylladb:
api: Remove dedicated failure_detector registration method
api: Move failure_detector endpoints set/unset to gossiper
api: Unset failure detector endpoints method
api: (Un)Register gossiper API in correct place
api: Unset gossiper endpoints on stop
asi: Coroutinize set_server_gossip()
these jobs are scheduled to verify the builds of scylla, like
if it builds with the latest Seastar, if scylla can generated
reproducible builds, and if it builds with the nightly build of
clang. the failure of these workflow are not very visible without
clicking into the corresponding workflow in
https://github.com/scylladb/scylladb/actions.
in this change, we add their badges in the testing section of README.md,
so one can identify the test failures of them if any,
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19430
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19432
The sharded<database> is used as a map_reduce0() method provider,
there's no real need in database itself. Simple smp::map_reduce()
would work just as good.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19364
SELECT's "LIMIT" feature is tested in combination with other features
in different test/cql-pytest/*.py source files - for examples the
combination of LIMIT and GROUP BY is tested in test_group_by.py.
This patch adds a new test file, test_limit.py, for testing aspects
basic usage of LIMIT that weren't already tested in other files.
The new file also has a comment saying where we have other tests
for LIMIT combined with other features.
All the new tests pass (on both Scylla and Cassandra). But they can
be useful as regression tests to test patches which modify the
behavior of LIMIT - e.g., pull reques #18842.
This patch also adds another test in test_group_by.py. This adds to
one of the tests for the combination of LIMIT and GROUP BY (in this
case, GROUP BY of clustering prefix, no aggregation) also a check
for paging, that was previously missing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19392
These two api functions both need gossiper service and only it, and thus
should have set/unset calls next to each other. It's worth putting them
into a single place
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's one more set of endpoints that need gossiper -- the
failure_detector ones. They are registered, but not unregistered, so
here's the method to do it. It's not called by any code yet, because
next patch would need to rework the caller anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One of the next patches will add more async calls here, so not to
create then-chains, convert it into a coroutine
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
`absl::headers` is a library, not the path to its headers.
before this change, the command lines of genereated build rule look
like:
```
-I/home/kefu/dev/scylladb/repair/absl::headers
```
this does not hurt, as other libraries might add the intended include
dir to the compiler command line, but this is just wrong.
so let's remove it. please note, `repair` target already links against
`absl::headers`. so we don't need to add `absl::headers` to its linkage
again.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19384
this change was created in the same spirit of ebff5f5d.
despite that we include Seastar as a submodule, Seastar is not a
part of scylla project. so we'd better include its headers using
brackets.
ebff5f5d addressed this cosmetic issue a while back. but probably
clangd's header-insertion helped some of contributor to insert
the missing headers with `"`. so this style of `include` returned
to the tree with these new changes.
unfortunately, clangd does not allow us to configure the style
of `include` at the time of writing.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19406
Before these changes, it could happen that Scylla initialized
endpoint managers for hint directories representing
* host IDs before migrating hinted handoff to using host IDs,
* IP addresses after the migration.
One scenario looked like this:
1. Start Scylla and upgrade the cluster to using host IDs.
2. Create, by hand, a hint directory representing an IP address.
3. Trigger changing the host filter in hinted handoff; it could
be achieved by, for example, restricting the set of data
centers Scylla is allowed to save hints for.
When changing the host filter, we browse the hint directories
and create endpoint managers if we can send hints towards
the node corresponding to a given hint directory. We only
accepted hint directories representing IP addresses
and host IDs. However, we didn't check whether the local node
has already been upgraded to host-ID-based hinted handoff
or not. As a result, endpoint managers were created for
both IP addresses and host IDs, no matter whether we were
before or after the migration.
These changes make sure that any time we browse the hint
directories, we take that into account.
Fixesscylladb/scylladb#19172Closesscylladb/scylladb#19173
also add `auth` and `cdc` to iwyu's `CLEANER_DIR` setting.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#19410
* github.com:scylladb/scylladb:
.github: add auth and cdc to iwyu's CLEANER_DIR
cdc: do not include unused headers
The set_server_config() already has the db::config reference for
endpoints to work with, there's no need to obtain one via ctx and
database.
This change kills two birds with one stone -- less users of database as
config provider, less places that need http context -> database
dependency.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Those getting (and setting, but these are not implemented) various
timeouts work on config, whilst register themselves in storage_proxy
function. Since the "service" they need to work with is config, move the
endpoints to config endpoints code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_server_storage_proxy() does two things -- registers
storage_proxy "function" and sets proxy routes, that depend on it. Next
patches will move some /storage_proxy/... endpoints registration to
earlier stage, so the function should be ready in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
in 7952200c, we changed the `selected_format` from `mc` to `me`,
but to be backward compatible the cluster starts with "md", so
when the nodes in cluster agree on the "ME_SSTABLE_FORMAT" feature,
the format selector believes that the node is already using "ME",
which is specified by `_selected_format`. even it is actually still
using "md", which is specified by `sstable_manager::_format`, as
changed by 54d49c04. as explained above, it was specified to "md"
in hope to be backward compatible when upgrading from an existign
installation which might be still using "md". but after a second
thought, since we are able to read sstables persisted with older
formats, this concern is not valid.
in other words, 7952200c introduced a regression which changed the
"default" sstable format from `me` to `md`.
to address this, we just change `sstable_manager::_format` to "me",
so that all sstables are created using "me" format.
a test is added accordingly.
Fixes#18995
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19293
in
b8c705bc54
i modified the even name to `pull_request_target`,
This caused skipping sync process when PR label was added/removed
Fixing it
Closesscylladb/scylladb#19408
The node booting in gossip topology waits until all NORMAL
nodes are UP. If we removed a different node just before,
the booting node could still see it as NORMAL and wait for
it to be UP, which would time out and fail the bootstrap.
This issue caused scylladb/scylladb#17526.
Fix it by recalculating the nodes to wait for in every step of the
of the `wait_alive` loop.
Although the issue fixed by this PR caused only test flakiness,
it could also manifest in real clusters. It's best to backport this
PR to 5.4 and 6.0.
Fixesscylladb/scylladb#17526Closesscylladb/scylladb#19387
* github.com:scylladb/scylladb:
join_token_ring, gossip topology: update obsolete comment
join_token_ring, gossip topology: fix indendation after previous patch
join_token_ring, gossip topology: recalculate sync nodes in wait_alive
This patch adds a check if aggregation query is doing single-partition read and if so, makes the query to not use forward_service and do not parallelize the request.
Fixesscylladb/scylladb#19349Closesscylladb/scylladb#19350
* github.com:scylladb/scylladb:
test/boost/cql_query_test: add test for single-partition aggregation
cql3/select_statement: do not parallelize single-partition aggregations
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader"
To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closesscylladb/scylladb#19356
Normally, the space overhead for TWCS is 1/N, where is number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier.
Reshaping a TWCS table that takes ~50% of available space can result in system running out of space.
That's fixed by restricting every TWCS off-strategy job to 10% of free space in disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.
Fixes#16514.
Closesscylladb/scylladb#18137
* github.com:scylladb/scylladb:
compaction: Reduce twcs off-strategy space overhead to 10% of free space
compaction: wire storage free space into reshape procedure
sstables: Allow to get free space from underlying storage
replica: don't expose compaction_group to reshape task
so that "wasm" target is built. "wasm" generates the text format
of wasm code. and these wasm applications are used by the test_wasm
tests.
the rules generated by `configure.py` adds these .wat files as a
dependency of `{mode}-build`, which is in turn a dependency of `{mode}`.
in this change, let's mirror this behavior by making `wasm` ALL,
so it is built by the default target.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19391
The service in question is pretty small one, but it has its API endpoint that lives in /storage_service group. Currently when a service starts and has any endpoints that depend on it, the endpoint registration should follow it (#2737). Here's the PR that does it for load meter. Another goal of this change is that http context now has one less dependency onboard.
Closesscylladb/scylladb#19390
* github.com:scylladb/scylladb:
api: Remove ctx->load_meter dependency
api: Use local load_meter reference in handlers
api: Fix indentation after previous patch
api: Coroutinize load_meter::get_load_map handler
api: Move load meter handlers
api: Add set/unset methods for load_meter
When a node bootstraps it may happen that some nodes still see it as
bootstrapping while the node itself already is in normal state and ready
to serve queries. We want to delay the bootstrap completion until all
nodes see the new node as normal. Piggy back on UP notification to do so
and what of the node that sent the notification to be seen as normal.
Fixes#18678
It seems that we skip the sync label process between PR and linked
Issues
Adding those debug prints will allow us to understand why
Closesscylladb/scylladb#19393
Currently a node does not wait to be marked UP by other nodes before
complete booting which creates a usability issue: during a rolling restart
it is not enough to wait for local CQL port to be opened before
restarting next node, but it is also needed to check that all other
nodes already see this node as alive otherwise if next node is restarted
some nodes may see two node as dead instead of one.
This patch improves the situation by making sure that boot process does
not complete before all other nodes do not see the booting one as alive.
This is still a best effort thing: if some nodes are unreachable or
gossiper propagation takes too much time the boot process continues
anyway.
Fixesscylladb/scylladb#19206
since we've switched almost all callers of the operator<< to {fmt}, let's drop the unused operator<<:s.
there are more occurrences of unused operator<< in the tree, but let's do the cleanup piecemeal.
---
this is a cleanup, so no need to backport
Closesscylladb/scylladb#19346
* github.com:scylladb/scylladb:
types: remove unused operator<<
node_ops: remove unused operator<<
lang: remove unused operator<<
gms: remove unused operator<<
dht: remove unused operator<<
test: do not use operator<< for std::optional
Now they are in storage service set/unset helper, but there's the
dedicated set/unset pair for meter's enpoints.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The meter is pretty small sevice and its API is also tiny. Still, it's a
standalone top-level service, and its API should come next to it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently if task_manager::task::impl::abort preempts before children are recursively aborted and then the task gets unregistered, we hit use after free since abort uses children vector which is no longer alive.
Modify abort method so that it goes over all tasks in task manager and aborts those with the given parent.
Fixes: #19304.
Requires backport to all versions containing task manager
Closesscylladb/scylladb#19305
* github.com:scylladb/scylladb:
test: add test for abort while a task is being unregistered
tasks: fix tasks abort
before this change, because it seems that we move away from `p2` in
each iteration, so the succeeding iterations are moving from an empty
`p2`, clang-tidy warns at seeing this.
but we only move from `p2._static_row` in the first iteration when the
dest `mutation_partition` instance's static row is empty. and in the
succeeding iterations, the dest `mutation_partition` instance's static
row is not empty anymore if it is set. so, this is a false alarm.
in this change, we silence this warning. another option is to extract
the single-shot mutation out of the loop, and pass the `std::move(p2)`
only for the single-shot mutation, but that'd be a much more intrusive
change. we can revisit this later.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19331
Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.
This issue caused scylladb/scylladb#17526.
Fix it by recalculating the nodes to wait for in every step of the
of the `wait_alive` loop.
This commit adds files that contain Open Source-specific information
and includes these files with the .. scylladb_include_flag:: directive.
The files include a) a link and b) Table of Contents.
The purpose of this update is to enable adding
Open Source/Enterprise-specific information in the Reference section.
Closesscylladb/scylladb#19362
Currently, when calculating the view update backlog for gossip,
we start with `db::view::update_backlog()` and compare it to backlogs
from all shards. However, this backlog can't be compared to other
backlogs - it has size 0 and we compare the fraction current/size
when comparing backlogs, causing us to compare with `NaN`.
This patch fixes it by starting the comparisons with an empty backlog.
The patch introducing this issue (f70f774e40) wasn't backported, so this one doesn't need to be either
Closesscylladb/scylladb#19247
* github.com:scylladb/scylladb:
mv: make the view update backlog unmofidiable
mv: fix value of the gossiped view update backlog
Such field is no longer needed as the information comes
directly from group0_batch.
Fixesscylladb/scylladb#19365
Backport: no, we don't backport code cleanups
Closesscylladb/scylladb#19366
* github.com:scylladb/scylladb:
cql: remove global_req_id from schema_altering_statement
cql: switch alter keyspace prepare_schema_mutations to use group0_batch
Before these changes, compilation was failing with the following
error:
In file included from test/boost/hint_test.cc:12:
/usr/include/fmt/ranges.h:298:7: error: no member named 'parse' in 'fmt::formatter<db::hints::sync_point::host_id_or_addr>'
298 | f.parse(ctx);
| ~ ^
We add the missing callback.
Closesscylladb/scylladb#19375
Currently, a view update backlog may reach an invalid state, when
its max is 0 and its relative_size() is NaN as a result. This can
be achieved either by constructing the backlog with a 0 max or by
modifying the max of an existing backlog. In particular, this
happens when creating the backlog using the default constructor.
In this patch the the default constructor is deleted and a check
is added to make sure that the max is different than 0 is added
to its constructor - if the check fails, we construct an empty
backlog instead, to handle the possibility of getting an invalid
backlog sent from a node with a version that's missing this check.
Additionally, we make the backlogs members private, exposing them
only through const getters.
The goal is the same as in 29768a2d02 (gitattributes: Mark *.svg as
binary) -- prevent grep from searching patterns in those files.
Despite those files are, in fact, javascript code, the way they are
formatted is not suitable for human reading, so it's unlikely that anyone
would be interested in grep-ing patters in it. At the same time, those
files consist of of very long lines, so if a grep finds a pattern in one
of those, the output is spoiled.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19357
Replace the reserve_partial loop in large_bitset constructor with a new function - reserve_gently() that can reserve memory without stalling by repeatedly calling reserve_partial() method of the passed container.
Closesscylladb/scylladb#19361
* github.com:scylladb/scylladb:
utils/large_bitset: replace reserve_partial with utils::reserve_gently
utils/stall_free: introduce reserve_gently
With big number of shards in the cluster (e.g. 500+) due to cache
periodic refresh we experience high load on role_permissions table
(e.g. 1k op/s). The load on roles table is amplified because to populate
single entry in the cache we do several selects on roles table. Some
of this can't be avoided because roles are arranged in a tree-like
structure where permissions can be inherited.
This patch tries to reuse queries which are simply duplicated. It should
reduce the load on roles table by up to 50%.
Fixesscylladb/scylladb#19299Closesscylladb/scylladb#19300
* github.com:scylladb/scylladb:
auth: reuse roles select query during cache population
auth: coroutinize service::get_uncached_permissions
auth: coroutinize service::has_superuser
Add reserve_gently() that can reserve memory without stalling by
repeatedly calling reserve_partial() method of the passed container.
Update the comments of existing reserve_partial() methods to mention
this newly introduced reserve_gently() wrapper.
Also, add test to verify the functionality.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Currently reads with WHERE clause which limits them to be
single-partition reads, are unnecessarily parallelized.
This commit checks this condition and the query doesn't use
forward_service in single-partition reads.
Related: https://github.com/scylladb/scylladb/issues/17851
Fix the issue that test logs were not deleted
Fix the issue that the URL to the failed test directory was incorrectly shown even when artifacts_dir_url option was not provided
Fix the issue that there were no node logs when it failed to join the cluster
Closesscylladb/scylladb#19115
* github.com:scylladb/scylladb:
[test.py] Fix logs had multiplication of lines
[test.py] Fix log not deleted
[test.py] Fix log for failed node was nod added to failed directory
[test.py] Fix URl for failed logs directory in CI
Currently if task_manager::task::impl::abort preempts before children
are recursively aborted and then the task gets unregistered, we hit
use after free since abort uses children vector which is no
longer alive.
Modify abort method so that it goes over all tasks in task manager
and aborts those with the given parent.
Fixes: #19304.
This PR removes the 5.x.y to 5.x.z upgrade guide and adds the 6.x.y to 6.x.z upgrade guide.
The previous maintenance upgrade guides, such as from 5.x.y to 5.x.z, consisted of several documents - separate for each platform.
The new 6.x.y to 6.x.z upgrade guide is one document - there are tabs to include platform-specific information (we've already done it for other upgrade guides as one generic document is more convenient to use and maintain).
I did not modify the procedures. At some point, they have been reviewed for previous upgrade guides.
Fixes https://github.com/scylladb/scylladb/issues/19322
- This PR must be backported to branch-6.0, as it adds 6.x specific content.
Closesscylladb/scylladb#19340
* github.com:scylladb/scylladb:
doc: remove the 5.x.y to 5.x.z upgrade guide
doc: add the 6.x.y to 6.x.z upgrade guide-6
Currently, when calculating the view update backlog for gossip,
we start with `db::view::update_backlog()` and compare it to backlogs
from all shards. However, this backlog can't be compared to other
backlogs - it has size 0 and we compare the fraction current/size
when comparing backlogs, causing us to compare with `NaN`.
This patch fixes it by starting the comparisons with an empty backlog.
so we can be awared that if scylla builds with seastar master HEAD,
and to be prepared if a build failure is found.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19135
Since the test name was not unique across the run and when we were using a --repeat option, there were several handlers for the same file. With this change test name and accordingly, the log name will be different for the same test but different repeat case. Remove mode from the test name since it's already in mode directory.
One of the created log files was not deleted at all, because there was no delete command. Unlink moved on later stage explicitly after removing the handler that writing to this file to avoid the possibility that something will be added after removing the file.
in 94cdfcaa94, we added commitlog_cleanup_test to `configure.py`,
but didn't add it to the CMake building system.
in this change, let's add it to the CMake building system.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19314
seastar::format() does not use operator<< under the hood, it uses
{fmt}, so update the comment accordingly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19315
In issue #15561 some doubts were raised regarding the way ScyllaDB sorts
UUID values. This patch adds a heavily-commented cql-pytest test that
helps understand - and verify that understanding - of the way Scylla sorts
UUIDs, and shows there is some reason in the madness (in particular,
Version 1 UUIDs (time uuids) are sorted like timeuuids, and not as byte
arrays.
The new tests check the different cases (see the comments in the test),
and as usual for cql-pytest tests - they passes also on Cassandra, which
allows us to confirm that the sort order we used is identical to the one
used by Cassandra and not something that Scylla mis-implemented.
Having this test in our suite will also ensure that the UUID ordering
never changes accidentally in the future. If it ever changes, it can
break access to existing tables that use UUID clustering keys, so
it shouldn't change.
Fixes#15561
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19343
Making the count resources on the maintenance (streaming) semaphore live update via config. This will allow us to improve repair speed on mixed-shard clusters, where we suspect that reader trashing -- due to the combination of high number of readers on each shard and very conservative reader count limit (10) -- is the main cause of the slowness.
Making this count limit confgurable allows us to start experimenting with this fix, without committing to a count limit increase (or removal), addressing the pain in the field.
Refs: #18269
No OSS backport needed.
Closesscylladb/scylladb#19248
* github.com:scylladb/scylladb:
replica/database: wire in maintenance_reader_concurrency_semaphore_count_limit
db/config: introduce maintenance_reader_concurrency_semaphore_count_limit
reader_concurrency_semaphore: make count parameter live-update
Fixes#19334
Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.
Note: required to make enterprise CI happy again.
Closesscylladb/scylladb#19337
* github.com:scylladb/scylladb:
schema: Make "describe" use extensions to string
schema_extensions: Add an option to string method
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we don't provide it anymore, and if any of existing type provides
constructor accepting an `optional<>`, and hence can be formatted
using operator<< after converting it, neither shall we rely on this
behavior, as it is fragile.
so, in this change, we switch to `fmt::print()` to use {fmt} to
print `optional<>`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add allure-pytest pip dependency to be able to use it for generating the allure report later.
Main benefits of the allure report:
1. Group test failures
2. Possibility to attach log files to she test itself
3. Timeline of test run
4. Test description on the report
5. Search by test name or tag
[avi: regenerate toolchain]
Closesscylladb/scylladb#19335
thrift support was deprecated since ScyllaDB 5.2
> Thrift API - legacy ScyllaDB (and Apache Cassandra) API is
> deprecated and will be removed in followup release. Thrift has
> been disabled by default.
so let's drop it. in this change,
* thrift protocol support is dropped
* all references to thrift support in document are dropped
* the "thrift_version" column in system.local table is preserved for backward compatibility, as we could load from an existing system.local table which still contains this clolumn, so we need to write this column as well.
* "/storage_service/rpc_server" is only preserved for backward compatibility with java-based nodetool.
Fixes#3811Fixes#18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- [x] not a fix, no need to backport
Closesscylladb/scylladb#18453
* github.com:scylladb/scylladb:
config: expand on rpc_keepalive's description
api: s/rpc/thrift/
db/system_keyspace: drop thrift_version from system.local table
transport: do not return client_type from cql_server::connection::make_client_key()
treewide: drop thrift support
cql3: always return created event in create ks/table/type/view statement
In case multiple clients issue concurrently CREATE KEYSPACE IF NOT EXISTS
and later USE KEYSPACE it can happen that schema in driver's session is
out of sync because it synces when it receives special message from
CREATE KEYSPACE response.
Similar situation occurs with other schema change statements.
In this patch we fix only create keyspace/table/type/view statements
by always sending created event. Behavior of any other schema altering
statements remains unchanged.
Fixes https://github.com/scylladb/scylladb/issues/16909
**backport: no, it's not a regression**
Closesscylladb/scylladb#18819
* github.com:scylladb/scylladb:
cql3: always return created event in create ks/table/type/view statement
cql3: auth: move auto-grant closer to resource creation code
cql3: extract create ks/table/type/view event code
With big number of shards in the cluster (e.g. 500+) due to cache
periodic refresh we experience high load on role_permissions table
(e.g. 1k op/s). The load on roles table is amplified because to populate
single entry in the cache we do several selects on roles table. Some
of this can't be avoided because roles are arranged in a tree-like
structure where permissions can be inherited.
This patch tries to reuse queries which are simply duplicated. It should
reduce the load on roles table by up to 50%.
Fixesscylladb/scylladb#19299
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19313
Allow an extension to describe itself as the CQL property
string that created it (and is serialized to schema tables)
Only paxos extension requires override.
in 51c53d8db6, we check `self.old_env[env]` for None, but there
are chances `self.old_env` does not contain a value with `env`.
in that case, we'd have following failure:
```
Traceback (most recent call last):
File "/home/kefu/dev/scylladb/test/pylib/minio_server.py", line 307, in <module>
asyncio.run(main())
File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/kefu/dev/scylladb/test/pylib/minio_server.py", line 304, in main
await server.stop()
File "/home/kefu/dev/scylladb/test/pylib/minio_server.py", line 274, in stop
self._unset_environ()
File "/home/kefu/dev/scylladb/test/pylib/minio_server.py", line 211, in _unset_environ
if self.old_env[env] is not None:
~~~~~~~~~~~~^^^^^
KeyError: 'S3_CONFFILE_FOR_TEST'
```
this happens if we run `pylib/minio_server.py` as a standalone
application.
in this change, instead of getting the value with index, we use
`dict.get()`, so that it does not throw when the dict does not
have the given key.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19291
If something happens during nod adding to the cluster, it will not be registered as a part of the cluster. This leads to situations during log gathering that logs for a such node will be missing.
Incorrect passing of the artifacts_dir_url parameter from test.py to pytest leads to the situation when it will pass None as a string and pytest will generate incorrect URL.
This test in topology_experimental_raft/test_alternator.py wants to
check that during Alternator TTL's expiration scans, ALL of the CPU was
used in the "streaming" scheduling group and not in the "statement"
scheduling group. But to allow for some fluke requests (e.g., from the
driver), the test actually allows work in the statement group to be
up to 1% of the work.
Unfortunately, in one test run - a very slow debug+aarch64 run - we
saw the work on the statement group reach 1.4%, failing the test.
I don't know exactly where this work comes from, perhaps the driver,
but before this bug was fixed we saw more than 58% of the work in the
wrong scheduling group, so neither 1% or 1.4% is a sign that the bug
came back. In fact, let's just change the threshold in the test to 10%,
which is also much lower than the pre-fix value of 58%, so is still a
valid regression test.
Fixes#19307
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19323
Seastar has functions implementing locking a `seastar::shared_mutex`.
We should use those now instead of reimplementing them in Scylla.
Closesscylladb/scylladb#19253
In CI test always executed with option --repeat=3 that leads to generate 3 test results with the same name. Junit plugin in CI cannot distinguish correctly the difference between these results. In case when we have two passes and one fail, the link to test result will sometimes be redirected to the incorrect one because the test name is the same. To fix this ReportPlugin added that will be responsible to modify the test case name during junit report generation adding to the test name mode and run id.
Fixes: https://github.com/scylladb/scylladb/issues/17851
Fixes: https://github.com/scylladb/scylladb/issues/15973Closesscylladb/scylladb#19235
* github.com:scylladb/scylladb:
[test.py] Add uniqueness to the test name
[test.py] Refactor alternator, nodetool, rest_api
these two arguments are critical when tablets are enabled.
Fixes https://github.com/scylladb/scylladb/issues/19296
---
6.0 is the first release with tablets support. and `nodetool ring` is an important tool to understand the data distribution. so we need to backport this document change to 6.0
Closesscylladb/scylladb#19297
* github.com:scylladb/scylladb:
doc: document `keyspace` and `table` for `nodetool ring`
doc: replace tab with space
this change is created in the same spirit of 1186ddef16, which updated the rule for generating the stripped dist pkg, but it failed to update the one for generating the unstripped dist pkg. what's why we have build failure when the workflow is looking for the unstripped tar.gz:
```
08:02:47 ++ ls /jenkins/workspace/scylla-master/scylla-ci/scylla/build/RelWithDebInfo/dist/tar/scylla-unstripped-6.1.0~dev-0.20240613.d5bdddaeb40b.x86_64.tar.gz
08:02:47 ls: cannot access '/jenkins/workspace/scylla-master/scylla-ci/scylla/build/RelWithDebInfo/dist/tar/scylla-unstripped-6.1.0~dev-0.20240613.d5bdddaeb40b.x86_64.tar.gz': No such file or directory`
```
so, in this change, we fix the path.
Refs #2717
---
* cmake related change, hence no need to backport.
Closesscylladb/scylladb#19290
* github.com:scylladb/scylladb:
build: cmake: use per-mode path for building unstripped_dist_pkg
build: cmake: use path to be compatible with CI
when `--use-cmake` option is passed to `configure.py`,
- before this change, all modes are selected if no
`--mode` options are passed to `configure.py`.
- after this change, only the modes whose `build_by_default` is
`True` are selected, if no `--mode` options are specfied.
the new behavior matches the existing behavior. otherwise,
`ninja -C build mode_list` would list the mode which
is not built by default.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19292
utils/chunked_vector::reserve_partial: fix usage in callers
The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.
Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.
This PR updates the method comment and all the callers to use the
right way.
Fixes#19254Closesscylladb/scylladb#19279
* github.com:scylladb/scylladb:
utils/large_bitset: remove unused includes identified by clangd
utils/large_bitset: use thread::maybe_yield()
test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
utils/lsa/chunked_managed_vector: fix reserve_partial()
utils/chunked_vector: return void from reserve_partial and make_room
test/boost/chunked_vector_test: fix testcase tests_reserve_partial
utils/chunked_vector::reserve_partial: fix usage in callers
before this change, we use runtime format string to format error
messages. but it does not have the compile time format check.
if we pass arguments which are not formattable, {fmt} throws at
runtime, instead of error out at compile-time. this could be very
annoying, because we format error messages at the error handling
path. but if user ends up seeing an exception for {fmt} instead
of a nice error message, it would be far from helpful.
in this change, we
- use compile-time format string
- fix two caller sites, where we pass `std::exception_ptr` to
{fmt}, but `std::exception_ptr` is not formattable by {fmt} at
the time of writing. we do have operator<< based formatter for
it though. so we delegate to `fmt::streamed` to format it.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19294
This PR fixes two problems with the `expected_error`
parameter in `server_add` and `servers_add`.
1. It didn't work in `server_add` if the cluster was empty
because of an incorrect attempt to connect the driver.
2. It didn't work in `servers_add` completely because the
`seeds` parameter was handled incorrectly.
This PR only adds improvements in the testing framework,
no need to backport it.
Closesscylladb/scylladb#19255
* github.com:scylladb/scylladb:
test: manager_client, scylla_cluster: fix type annotations in add_servers
test: manager_client: don't connect driver after failed server_{add, start}
test: scylla_cluster: pass seeds to add_servers
On each shard of each node we store the view update backlogs of
other nodes to, depending on their size, delay responses to incoming
writes, lowering the load on these nodes and helping them get their
backlog to normal if it were too high.
These backlogs are propagated between nodes in two ways: the first
one is adding them to replica write responses. The seconds one
is gossiping any changes to the node's backlog every 1s. The gossip
becomes useful when writes stop to some node for some time and we
stop getting the backlog using the first method, but we still want
to be able to select a proper delay for new writes coming to this
node. It will also be needed for the mv admission control.
Currently, the backlog is gossiped from shard 0, as expected.
However, we also receive the backlog only on shard 0 and only
update this shard's backlogs for the other node. Instead, we'd
want to have the backlogs updated on all shards, allowing us
to use proper delays also when requests are received on shards
different than 0.
This patch changes the backlog update code, so that the backlogs
on all shards are updated instead. This will only be performed
up to once per second for each other node, and is done with
a lower priority, so it won't severly impact other work.
Fixes: scylladb/scylladb#19232Closesscylladb/scylladb#19268
In CI test always executed with option --repeat=3 that leads to generate 3 test results with the same name. Junit plugin in CI cannot distinguish correctly the difference between these results. In case when we have two passes and one fail, the link to test result will sometimes be redirected to the incorrect one because the test name is the same.
To fix this ReportPlugin added that will be responsible to modify the test case name during junit report generation adding to the test name mode and run id.
Fixes: https://github.com/scylladb/scylladb/issues/17851
Fixes: https://github.com/scylladb/scylladb/issues/15973
For various reasons, a view building write may fail. When that
happens, the view building should not finish until these writes
are successfully retried and they should not interfere with any
writes that are performed to the base table while the view is
building.
The test introduced in this patch confirms that this is the case.
Refs scylladb/scylladb#19261Closesscylladb/scylladb#19263
Update the maximum size tested by the testcase. The test always created
only one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (12.8 KB). So, use twice the
max_chunk_capacity as the test size distribution upper limit to verify
that partial_reserve can reserve multiple chunks.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Fix the method comment and return types of chunked_managed_vector's
reserve_partial() similar to chunked_vector's reserve_partial() as it
has the same issues mentioned in #19254. Also update the usage in the
chunked_managed_vector_test.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Since reserve_partial does not depend on the number of remaining
capacity to be reserved, there is no need to return anything from it and
the make_room method.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Fix the usage of reserve_partial in the testcase. Also update the
maximum chunk size used by the testcase. The test always created only
one chunk as the maximum size tested by it (1 << 12 = 4KB) was less
than the default max chunk size (128 KB). So, use smaller chunk size,
512 bytes, to verify that partial_reserve can reserve multiple chunks.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
in 57c408ab, we dropped operator<< for `parsed::path`, but we forgot
to drop the friend declaration for it along with the operator. so in
this change, let's drop the friend declaration.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19287
these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning.
also, add api to iwyu github workflow's CLEANER_DIR, to prevent future violations.
---
it's a cleanup, hence no need to backport.
Closesscylladb/scylladb#19269
* github.com:scylladb/scylladb:
.github: add api to iwyu's CLEANER_DIR
api: do not include unused headers
`before this change, we use "scylla" as the dependecy of
unstripped_dist_pkg, but that's implies the scylla built with the
default mode. if the build rules is generated using the
multi-config generator, the default mode does not necessarily
identical to the current `$<CONFIG>`, so let's be more explicit.
otherwise, we could run into built failure like
```
FAILED: dist/RelWithDebInfo/scylla-unstripped-6.1.0~dev-0.20240614.5f36888e7fbd.x86_64.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/dist/RelWithDebInfo/scylla-unstripped-6.1.0~dev-0.20240614.5f36888e7fbd.x86_64.tar.gz
cd /jenkins/workspace/scylla-master/scylla-ci/scylla && scripts/create-relocatable-package.py --build-dir /jenkins/workspace/scylla-master/scylla-ci/scylla/build/RelWithDebInfo --node-exporter-dir /jenkins/workspace/scylla-master/scylla-ci/scylla/build/node_exporter --debian-dir /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debian /jenkins/workspace/scylla-master/scylla-ci/scylla/build/dist/RelWithDebInfo/scylla-unstripped-6.1.0~dev-0.20240614.5f36888e7fbd.x86_64.tar.gz
ldd: /jenkins/workspace/scylla-master/scylla-ci/scylla/build/RelWithDebInfo/scylla: No such file or directory
Traceback (most recent call last):
File "/jenkins/workspace/scylla-master/scylla-ci/scylla/scripts/create-relocatable-package.py", line 109, in <module>
libs.update(ldd(exe))
^^^^^^^^
File "/jenkins/workspace/scylla-master/scylla-ci/scylla/scripts/create-relocatable-package.py", line 37, in ldd
for ldd_line in subprocess.check_output(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ldd', '/jenkins/workspace/scylla-master/scylla-ci/scylla/build/RelWithDebInfo/scylla']' returned non-zero exit status 1.
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is created in the same spirit of 1186ddef16, which
updated the rule for generating the stripped dist pkg, but it
failed to update the one for generating the unstripped dist pkg.
what's why we have build failure when the workflow is looking for
the unstripped tar.gz:
```
08:02:47 ++ ls /jenkins/workspace/scylla-master/scylla-ci/scylla/build/RelWithDebInfo/dist/tar/scylla-unstripped-6.1.0~dev-0.20240613.d5bdddaeb40b.x86_64.tar.gz
08:02:47 ls: cannot access '/jenkins/workspace/scylla-master/scylla-ci/scylla/build/RelWithDebInfo/dist/tar/scylla-unstripped-6.1.0~dev-0.20240613.d5bdddaeb40b.x86_64.tar.gz': No such file or directory`
```
so, in this change, we fix the path.
Refs #2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
backport not needed, these are just cleanups.
Closesscylladb/scylladb#19260
* github.com:scylladb/scylladb:
replica: simplify perform_cleanup_compaction()
replica: return storage_group by reference on storage_group_for*()
replica: devirtualize storage_group_of()
Before work on tablets was completed, it was noticed that — due to some missing pieces of implementation — Scylla doesn't properly close sstables for migrated-away tablets. Because of this, disk space wasn't being reclaimed properly.
Since the missing pieces of implementation were added, the problem should be gone now. This patch adds a test which was used to reproduce the problem earlier. It's expected to pass now, validating that the issue was fixed.
Should be backported to branch-6.0, because the tested problem was also affecting that branch.
Fixes#16946Closesscylladb/scylladb#18906
* github.com:scylladb/scylladb:
test_tablets: add test_tablet_storage_freeing
test: pylib: add get_sstables_disk_usage()
with tablets, we're expected to have a worst of ~100 tablets in a given
table and shard, so let's avoid linear search when looking for the
memtable_list in a range scan. we're bounded by ~100 elements, so
shouldn't be a big problem, but it's an inefficiency we can easily
get rid of.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19286
The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.
Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.
This commit updates the method comment and all the callers to use the
right way.
Fixes#19254
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
TWCS off-strategy suffers with 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.
To not penalize TWCS tables, that take a small percentage of disk,
with increased write ampl, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.
Fixes#16514.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be used in turn to restrict reshape to 10% of available space
in underlying storage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As it turns out, each sstable carries its own schema in its serialization header (Statistics component). This schema is incomplete -- the names of the key columns are not stored, just their type. Static and regular columns do have names and types stored however. This bare-bones schema is enough to parse and display the content of the sstable. Another thing missing is schema options (the stuff after the `WITH` keyword, except the clustering order). The only options stored are the compression options (in the CompressionInfo component), this is actually needed to read the Data component.
This series adds a new method to `tools/schema_loader.cc` to extract the schema stored in the sstable itself. This new schema load method is used as the last fall-back for obtaining the schema, in case scylla-sstable is trying to autodetect the schema of the sstable. Although, right now this bare-bones schema is enough for everything scylla-sstable does, it is more future proof to stick to the "full" schema if possible, so this new method is the last resort for now.
Fixes: https://github.com/scylladb/scylladb/issues/17869
Fixes: https://github.com/scylladb/scylladb/issues/18809
New functionality, no backport needed.
Closesscylladb/scylladb#19169
* github.com:scylladb/scylladb:
tools/scylla-sstable: log loaded schema with trace level
tools/scylla-sstable: load schema from the sstable as fallback
tools/schema_loader: introduce load_schema_from_sstable()
test/lib/random_schema: remove assert on min number of regular columns
sstables: introduce load_metadata()
Making the count resources on the maintenance (streaming) semaphore live
update via config. This will allow us to improve repair speed on
mixed-shard clusters, where we suspect that reader trashing -- due to
the combination of high number of readers on each shard and very
conservative reader count limit (10) -- is the main cause of the
slowness.
Making this count limit confgurable allows us to start experimenting
with this fix, without committing to a count limit increase (or
removal), addressing the pain in the field.
So that the amount of count resources can be changed at run-time,
triggered by a e.g. a config change.
Previous constant-count based constructor is left intact, to avoid
patching all clients, as only a small subset will want the new
functionality.
This patch includes extensive testing for what happens to an ongoing
paged query when a secondary index is suddenly added or dropped.
Issue #18992 was opened suggesting that this would be broken, and indeed
the tests included here show that it is indeed broken.
The four tests included in this patch are heavily commented to explain
what they are testing and why, but here is a short summary of what is
being tested by each of them:
1. A paged query filtering on v=17 continues correctly even if an
index is created on v.
2. A paged query filtering on v1 and v2 where v2 is indexed,
continues correctly even if an index is created on v1 (remember
that Scylla prefers to use the first index mentioned in the query).
3. A paged query using an index on v continues correctly even if that
index is deleted.
4. However, if the query doesn't say "ALLOW FILTERING", it cannot
be continued after the index is deleted.
All these tests pass on Cassandra, but all of them except the fourth
fail on Scylla, reproducing issue #18992. Somewhat to my suprise, the
failure of the query in all the failed tests is silent (i.e., trying to
fetch the next page just fetches nothing and says the iteration is done).
I was expecting more dramatic failures ("marshaling error" messages,
crashes, etc.) but didn't get them.
Refs #18992
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19000
The schema of the sstable can be interesting, so log it with trace
level. Unfortunately, this is not the nice CQL statement we are used to
(that requires a database object), but the not-nearly-so-nice CFMetadata
printout. Still, it is better then nothing.
When auto-detecting the schema of the sstable, if all other methods
failed, load the schema from the sstable's serialization header. This
schema is incomplete. It is just enough to parse and display the content
of the sstable. Although parsing and displaying the content of the
sstable is all scylla-sstable does, it is more future-compatible to us
the full schema when possible. So the always-available but minimal
schema that each sstable has on itself, is used just as a fallback.
The test which tested the case when all schema load attempts fail,
doesn't work now, because loading the serialization header always
succeeds. So convert this test into two positive tests, testing the
serialization header schema fallback instead.
Allows loading the schema from an sstable's serialization header. This
schema is incomplete, but it is enough to parse and display the content
of the sstable.
Change the format of sync points to use host ID instead of IPs, to be consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores host IDs instead of IPs.
The decoding supports both formats with host IDs and IPs, so a sync point contains now a variant of either types, and in the case of new type the translation is avoided.
Fixes#18653Closesscylladb/scylladb#19134
* github.com:scylladb/scylladb:
db/hints: migrate sync point to host ID
db/hints: rename sync point structures with _v1 suffix to _v1_v2
In this commit, we add a new metric `sent_total_size`
keeping track of how many bytes of hints a node
has sent. The metric is supposed to complement its
counterpart in storage proxy that counts how many
bytes of hints a node has received. That information
should prove useful in analyzing statistics of
a cluster -- load on given nodes and where it comes
from.
We also change the name of the matric `sent`
to `sent_total` to avoid the conflict of prefixes
between the two metrics.
those functions cannot return nullptr, will throw when group is not
found, so better return ref instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If adding or starting a server fails expectedly, there is no reason
to update or connect the driver. Moreover, before this patch, we
couldn't use `server_add` and `servers_add` with `expected_error`
if the cluster was empty. After expected bootstrap failures, we
tried to connect the driver, which rightfully failed on
`assert len(hosts) > 0` in `cluster_con`.
This parameter was incorrectly missing. For this reason,
`expected_error` was passed from `add_servers` to `add_server` as
`seeds`, which caused strange crashes.
Some time ago it turned out that if unrecognized feature name is met in scylla.yaml, the whole experimental features list is ignored, but scylla continues to boot. There's UNUSED feature which is the proper way to deprecate a feature, and this PR improves its handling in several ways.
1. The recently removed "tablets" feature is partially brought back, but marked as UNUSED
2. Any UNUSED features met while parsing are printed into logs
3. The enum_option<> helper is enlightened along the way
refs: #18968Closesscylladb/scylladb#19230
* github.com:scylladb/scylladb:
config: Mark tablets feature as unused
main: Warn unused features
enum_option: Carry optional key on board
enum_option: Remove on-board _map member
The API endpoints are registered for particular services (with rare exceptions), and once the corresponding service is ready, its endpoints section can be registered too. Same but reversed is for shutdown, and it's automatic with deferred actions.
refs: #2737Closesscylladb/scylladb#19208
* github.com:scylladb/scylladb:
main: Register task manager API next to task manager itself
main: Register messaging API next to messaging service
main: Register repair API next to repair service
its declaration was removed in 84a9d2fa, which failed to remove
the implementation from .cc file.
in this change, let's remove operator<< for role_or_anonymous
completely.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19243
If we receive a message in the same term but from a different leader
than we expect, we print:
```
Got append request/install snapshot/read_quorum from an unexpected leader
```
For some reason the message did not include the details (who the leader
was and who the sender was) which requires almost zero effort and might
be useful for debugging. So let's include them.
Ref: scylladb/scylla-enterprise#4276Closesscylladb/scylladb#19238
Commit 882b2f4e9f (cql3, schema_tables: Generalize function creation)
erroneously says that optional<context> is not suitable for future<>
type, but in fact it is.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19204
since gcc-13 is packaged by ppa:ubuntu-toolchain-r, and GCC-13 was
released 1 year ago, let's use it instead. less warnings, as the
standard library from GCC-13 is more standard compliant.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19162
to be prepared for changes from clang, and enjoy the new warnings/errors from this compiler.
* it is an improvement in our CI, no need to backport.
Closesscylladb/scylladb#19164
* github.com:scylladb/scylladb:
.github: add workflow to build with clang nightly
.github: rename clang-tidy-matcher.json to clang-matcher.json
Commit 47dbf23773 (Rework view services and system-distributed-keyspace
dependencies) made streaming and repair services depend on view builder,
but missed the fact that the builder itself starts much later.
Move view builder earlier, that's safe, no activity is started upon
that, real building is kicked much later when invoke_on_all(start)
happens.
Other than than, start system distributed keyspace earlier, which also
looks safe, as it's also started "for real" later, by storage service
when it joins the ring.
fixes: #19133
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19250
upgrade/_common are document fragments included by other documents.
but quite a few the documents previously including these fragments
were removed. but we didn't remove these fragments along with them.
in this change, we drop them.
Fixes#19245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19251
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.
after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.
Fixes https://github.com/scylladb/scylladb/issues/19034
---
the issue applies to both 5.4 and 6.0, and this issue hurts the CI stability, hence we should backport it.
Closesscylladb/scylladb#19252
* github.com:scylladb/scylladb:
test: memtable_test: increase unspooled_dirty_soft_limit
test: memtable_test: replace BOOST_ASSERT with BOOST_REQURE
In this commit, we add two new metrics to storage proxy:
* `received_hints_total`,
* `received_hints_bytes_total`.
Before these changes, we had to rely solely on other
metrics indicating how many hints nodes have written,
rejected, sent, etc. Because hints are subject to
many more or less controllable factors, e.g. a target
node still being a replica for a mutation, it was
very difficult to approximate how many hints a given
node might have received or what part of its load
they were. The newly introduced metrics are supposed
to help reason about those.
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.
after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.
Fixes#19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we verify the behavior of design under test using
`BOOST_ASSERT()`, which is a wrapper around `assert()`, so if a test
fails, the test just aborts. this is not very helpful for postmortem
debugging.
after this change, we use `BOOST_REQUIRE` macro for verifying the
behavior, so that Boost.Test prints out the condition if it does not
hold when we test it.
Refs #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this changeset, we tighten the clang-include-cleaner workflow, and address the warnings in two more subdirectories in the source tree.
* it's a cleanup, no need to backport
Closesscylladb/scylladb#19155
* github.com:scylladb/scylladb:
.github: add alternator to iwyu's CLEANER_DIR
alternator: do not include unused headers
.github: change severity to error in clang-include-cleaner
exceptions: do not include unused headers
since we stopped using the generic container formatters which in turn
use operator<< for formatting the elemements. we can drop more
operator<< operators.
so, in this change, we drop operator<< for proposal.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19156
ConfigParser.readfp was deprecated in Python 3.2 and removed in Python 3.12.
Under Fedora 40, the container fails to launch because it cannot parse its
configuration.
Fix by using the newer read_file().
Closesscylladb/scylladb#19236
This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken.
This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series.
Fixes: https://github.com/scylladb/scylladb/issues/18674
- [x] This patch has to be backported because it fixes broken functionality
Closesscylladb/scylladb#18705
* github.com:scylladb/scylladb:
test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
test/pylib: rest_client: add get_injection()
api/error_injection: add getter for error_injection
utils/error_injection: add set_parameter()
replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
in 44e85c7d, we remove coverage compiling options from the cflags
when building abseil. but in 535f2b21, these options were brought
back as parts of cxx_flags.
so we need to remove them again from cxx_flags.
Fixes#19219
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19220
This features used to be there for a while, but then it was removed by
83d491af02. This patch partially takes it
back, but maps to UNUSED, so that if met in config, it's warned, but
other features are parsed as well.
refs: #18968
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When seeing an UNUSED feature -- print it into log. This is where the
enum_option::key is in use. The thing is that experimental features map
different unused feature names into the single UNUSED feature enum
value, so once the feature is parsed its configured name only persists
in the option's key member (saved by previous patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It facilitates option formatting, but the main purpose is to be able to
find out the exact keys, not values, later (see next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The map in question is immutable and can obtained from the Mapper type
at any time, there's no need in keeping its copy on each enum_option
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Change the format of sync points to use host ID instead of IPs, to be
consistent with the use of host IDs in hinted handoff module.
Introduce sync point v3 format which is the same as v2 except it stores
host IDs instead of IPs.
The encoding of sync points now always uses the new v3 format with host
IDs.
The decoding supports both formats with host IDs and IPs, so a sync point
contains now a variant of either types, and in the case of the new
format the translation from IP to host ID is avoided.
* tools/java 88809606c8...01ba3c196f (3):
> Revert "build: don't add nonexistent directory 'lib' to relocatable packages"
> build: run antlr in a separate process
> build: don't add nonexistent directory 'lib' to relocatable packages
Allow external code to obtain information about an error injection
point, including whether it is enabled, and importantly, what its
parameters are. Together with the `set_parameter()` added in the
previous patch, this allows tests to read out the values of internal
parameters, via a set_parameter() injection point.
Allow injection points to write values into the parameter map, which
external code can then examine. This allows exfiltrating the values if
internal variables, to be examined by tests, without exposing these
variables via an "official" path.
Most of the time this test spends waiting for a node to die. Helps 3x times
Was
real 9m21,950s
user 1m11,439s
sys 1m26,022s
Now
real 3m37,780s
user 0m58,439s
sys 1m13,698s
refs: #17764
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19222
Fixes /scylladb/scylla-enterprise#4168
Unless listing all (including system) keyspaces, filter out "extension internal"
keyspaces. These are to be considered "system" for the purposes of exposing to
end user.
Closesscylladb/scylladb#19214
This config item is propagated to the table object via table::config.
Although the field in table::config, used to propagate the value, was
utils::updateable_value<T>, it was assigned a constant and so the
live-update chain was broken.
This patch fixes this.
Consider the following:
1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to 2.
3) migration is at write_both_read_old stage
4) coordinator will push writes to both nodes (pending and leaving)
5) A has view, so writes to it will also result in reads (table::push_view_replica_updates())
6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in)
7) so read on step 5 is not being able to find sstable set for tablet migrating in
Causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"
which means loss of write on pending replica.
The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions for example.
Fixes#19052.
Fixes#19033.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#19099
Fixesscylladb/scylla-pkg#3845
Don't overwrite (or rather change) AWS credentials variables if already set in
enclosing environment. Ensures EAR tests for AWS KMS can run properly in CI.
v2:
* Allow environment variables in reading obj storage config - allows CI to
use real credentials in env without risking putting them info less seure
files
* Don't write credentials info from miniserver into config, instead use said
environment vars to propagate creds.
v3:
* Fix python launch scripts to not clear environment, thus retaining above aws envs.
Closesscylladb/scylladb#19086
This is a translation of Cassandra's CQL unit test source file
DistinctQueryPagingTest.java into our cql-pytest framework.
The 5 tests did not reproduce any previously-unknown bug, but did provide
additional reproducers for one already-known issue:
Refs #10354: SELECT DISTINCT should allow filter on static columns,
not just partition keys
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18971
Adds an util for measuring the disk usage of the given table on the given
node.
Will be used in a follow-up patch for testing that sstables are freed
properly.
This commit removes the information that tablets are an experimental feature
from the CREATE KEYSPACE section.
In addition, it removes the notes and cautions that are redundant when
a feature is GA, especially the information and warnings about the future
plans.
Fixes https://github.com/scylladb/scylladb/issues/18670Closesscylladb/scylladb#19063
C++ standard enforces copy elision in this case. and copy elision is
more performant than constructing the return value with a move
constructor, so no need to use `std:move()` here.
and GCC-14 rightfully points this out:
```
/home/kefu/dev/scylladb/lang/lua.cc: In member function ‘data_value {anonymous}::from_lua_visitor::operator()(const utf8_type_impl&)’:
/var/ssd/scylladb/lang/lua.cc:797:25: error: redundant move in return statement [-Werror=redundant-move]
797 | return std::move(s);
| ~~~~~~~~~^~~
/home/kefu/dev/scylladb/lang/lua.cc:797:25: note: remove ‘std::move’ call
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19187
before this change, when building abseil, we don't pass cxxflags
to compiler, and abseil libraries are build with the default
optimization level. in the case of clang, its default optimization
level is `-O0`, it compiles the fastest, but the performance of
the emitted code is not optimized for runtime performance. but we
expect good performance for the release build. a typical command line
for building abseil looks like
```
clang++ -I/home/kefu/dev/scylladb/master/abseil -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -std=gnu++20 -Wall -Wextra -Wcast-qual -Wconversion -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wfor-loop-analysis -Wformat-security -Wgnu-redeclared-enum -Winfinite-recursion -Winvalid-constexpr -Wliteral-conversion -Wmissing-declarations -Woverlength-strings -Wpointer-arith -Wself-assign -Wshadow-all -Wshorten-64-to-32 -Wsign-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-zero-compare -Wundef -Wuninitialized -Wunreachable-code -Wunused-comparison -Wunused-local-typedefs -Wunused-result -Wvla -Wwrite-strings -Wno-float-conversion -Wno-implicit-float-conversion -Wno-implicit-int-float-conversion -Wno-unknown-warning-option -DNOMINMAX -MD -MT absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o -MF absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o.d -o absl/base/CMakeFiles/scoped_set_env.dir/internal/scoped_set_env.cc.o -c /home/kefu/dev/scylladb/master/abseil/absl/base/internal/scoped_set_env.cc
```
so, in this change, we populate cxxflags to abseil, so that the
per-mode `-O` option can be populated when building abseil.
after this change, the command line building abseil in release mode
looks like
```
clang++ -I/home/kefu/dev/scylladb/master/abseil -ffunction-sections -fdata-sections -O3 -mllvm -inline-threshold=2500 -fno-slp-vectorize -DSCYLLA_BUILD_MODE=release -g -gz -ffile-prefix-map=/home/kefu/dev/scylladb/master=. -march=westmere -std=gnu++20 -Wall -Wextra -Wcast-qual -Wconversion -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wfor-loop-analysis -Wformat-security -Wgnu-redeclared-enum -Winfinite-recursion -Winvalid-constexpr -Wliteral-conversion -Wmissing-declarations -Woverlength-strings -Wpointer-arith -Wself-assign -Wshadow-all -Wshorten-64-to-32 -Wsign-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-zero-compare -Wundef -Wuninitialized -Wunreachable-code -Wunused-comparison -Wunused-local-typedefs -Wunused-result -Wvla -Wwrite-strings -Wno-float-conversion -Wno-implicit-float-conversion -Wno-implicit-int-float-conversion -Wno-unknown-warning-option -DNOMINMAX -MD -MT absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o -MF absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o.d -o absl/flags/CMakeFiles/flags_commandlineflag_internal.dir/internal/commandlineflag.cc.o -c /home/kefu/dev/scylladb/master/abseil/absl/flags/internal/commandlineflag.cc
```
Refs 0b0e661a85Fixes#19161
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19160
The check query may be executed on a node which doesn't yet see that
the downed server is down, as it is not shut down gracefully. The
query coordinator can choose the down node as a CL=1 replica for read
and time out.
To fix, wait for all nodes to notice the node is down before executing
the checking query.
Fixes#17938Closesscylladb/scylladb#19137
After wasm udf appeared, code in main, create_function_statement and schema_tables got some involvements into details of wasm engine management. Also, even prior to this, there was duplication in how function context is created by statement code and schema_tables code.
This PR generalizes function context creation and encapsulates the management in sharded<lang::manager> service. Also it removes the wasm::startup_context thing and makes wasm start/stop be "classical" (see #2737)
Closesscylladb/scylladb#19166
* github.com:scylladb/scylladb:
code: Enlighten wasm headers usage
lang: Unfriend wasm context from manager
lang, cql3, schema_tables: Don't mess with db::config
lang: Don't use db::config to create lua context
lang: Don't use db::config to create wasm context
lang: Drop manager::precompile() method
cql3, schema_tables: Generalize function creation
wasm: Replace startup_context with wasm_config
lang: Add manager::start() method
lang: Move manager to lang namespace
lang: Move wasm::manager to its .cc/.hh files
in general, the value_type of a `const_iterator` is `T` instead of
`const T`, what has the const specifier is `reference`. because,
when dereferencing an iterator, the value type does not matter any
more, as it always a copy.
and GCC-14 points this out:
```
/home/kefu/dev/scylladb/sstables/compress.hh:224:13: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
224 | value_type operator*() const {
| ^~~~~~~~~~
/home/kefu/dev/scylladb/sstables/compress.hh:228:13: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
228 | value_type operator[](ssize_t i) const {
| ^~~~~~~~~~
```
so, in this change, let's change the value_type to `uint64_t`.
please note, it's not typical to return `value_type` from `operator*`
or `operator[]` of an iterator. but due to the design of
segmented_offsets, we cannot return a reference, so let's keep it
this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19186
Alternator has a custom TTL implementation. This is based on a loop, which scans existing rows in the table, then decides whether each row have reached its end-of-life and deletes it if it did. This work is done in the background, and therefore it uses the maintenance (streaming) scheduling group. However, it was observed that part of this work leaks into the statement scheduling group, competing with user workloads, negatively affecting its latencies. This was found to be causes by the reads and writes done on behalf of the alternator TTL, which looses its maintenance scheduling group when these have to go to a remote node. This is because the messaging service was not configured to recognize the streaming scheduling group, when statement verbs like read or writes are invoked. The messaging service currently recognizes two statement "tenants": the user tenant (statement scheduling group) and system (default scheduling group), as we used to have only user-initiated operations and sytsem (internal) ones. With alternator TTL, there is now a need to distinguish between two kinds of system operation: foreground and background ones. The former should use the system tenant while the latter will use the new maintenance tenant (streaming scheduling group).
This series adds a streaming tenant to the messaging service configuration and it adds a test which confirms that with this change, alternator TTL is entirely contained in the maintenance scheduling group.
Fixes: #18719
- [x] Scans executed on behalf of alternator TTL are running in the statement group, disturbing user-workloads, this PR has to be backported to fix this.
Closesscylladb/scylladb#18729
* github.com:scylladb/scylladb:
alternator, scheduler: test reproducing RPC scheduling group bug
main: add maintenance tenant to messaging_service's scheduling config
GCC-14 rightfully points out:
```
/var/ssd/scylladb/mutation/mutation_rebuilder.hh: In member function ‘const mutation& mutation_rebuilder::consume_new_partition(const dht::decorated_key&)’:
/var/ssd/scylladb/mutation/mutation_rebuilder.hh:24:36: error: redundant move in initialization [-Werror=redundant-move]
24 | _m = mutation(_s, std::move(dk)); | ~~~~~~~~~^~~~
/var/ssd/scylladb/mutation/mutation_rebuilder.hh:24:36: note: remove ‘std::move’ call
```
as `dk` is passed with a const reference, `std::move()` does not help
the callee to consume from it. so drop the `std::move()` here.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19188
The Alternator test test_metrics.py::test_item_latency confirms that
for several operation types (PutItem, GetItem, DeleteItem, UpdateItem)
we did not forget to measure their latencies.
The test checked that a latency was updated by checking that two metrics
increases:
scylla_alternator_op_latency_count
scylla_alternator_op_latency_sum
However, it turns out that the "sum" is only an approximate sum of all
latencies, and when the total sum grows large it sometimes does *not*
increase when a short latency is added to the statistics. When this
happens, this test fails on the assertion that the "sum" increases after
an operation. We saw this happening sometimes in CI runs.
The simple fix is to stop checking _sum at all, and only verify that
the _count increases - this is really an integer counter that
unconditionally increases when a latency is added to the histogram.
Don't worry that the strength of this test is reduced - this test was
never meant to check the accuracy or correctness of the histograms -
we should have different (and better) tests for that, unrelated to
Alternator. The purpose of *this* test is only to verify that for some
specific operation like PutItem, Alternator didn't forget to measure its
latency and update the histogram. We want to avoid a bug like we had
in counters in the past (#9406).
Fixes#18847.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19080
struct compress if forward declared right before its definition. At some
point in the past there was probably some code there using it, but now
its gone so remove it.
Closesscylladb/scylladb#19168
in e7d4e080, we reenabled the background writes in this test, but
when running with tablets enabled, background writes are still
disabled because of #17025, which was fixed last week. so we can
enable background writes with tablets.
in this change,
* background writes are enabled with tablets.
* increase the number of nodes by 1 so that we have enough nodes
to fulfill the needs of tablets, which enforces that the number
of replicas should always satisfy RF.
* pass rf to `start_writes()` explicitly, so we have less
magic numbers in the test, and make the data dependencies
more obvious.
Fixes#17589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18707
In 58784cd, aa4b06a and other commits migrating
hinted handoff from IPs to host IDs (scylladb/scylladb#15567),
we started ignoring hint directories of invalid names,
i.e. those that represent neither an IP address, nor a host ID.
They remain on disk and are taken into account while computing
e.g. the total size of hints, but they're not used in any way.
These changes add logs informing the user when Scylla
encounters such a directory.
Closesscylladb/scylladb#17566
storage_proxy has a throttling mechanism which attempts to limit the number
of background writes by forcefully raising CL to ALL
(it's not implemented exactly like that, but that's the effect) when
the amount of background and queued writes is above some fixed threshold.
If this is applied to a write, it becomes "throttled",
and its ID is appended to into _throttled_writes.
Whenever the amount of background and queued writes falls below the threshold,
writes are "unthrottled" — some IDs are popped from _throttled_writes
and the writes represented by these IDs — if their handlers still exist —
have their CL lowered back.
The problem here is that IDs are only ever removed from _throttled_writes
if the number of queued and background writes falls below the threshold.
But this doesn't have to happen in any finite time, if there's constant write
pressure. And in fact, in one load test, it hasn't happened in 3 hours,
eventually causing the buffer to grow into gigabytes and trigger OOM.
This patch is intended to be a good-enough-in-practice fix for the problem.
Fixesscylladb/scylladb#17476Fixesscylladb/scylladb#1834Closesscylladb/scylladb#19136
Currently they both run in streaming group and it may become busy during
repair/mv building and affect group0 functionality. Move it to the
gossiper group where it should have more time to run.
Fixesscylladb/scylladb#18863Closesscylladb/scylladb#19138
Now when function context creation is encapsulated in lang::manager,
some .cc files can stop using wasm-specific headers and just go with the
lang/manager.hh one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The friendship was needed to get engine and instance cache from manager,
but there's a shorter way to create cotnext with the info it needs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Not function context creation is encapsulated in lang::manager so it's
possible to patch-out few more places that use database as config
provider.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to previous patch, lua context needs db::config for creation.
It's better to get the configurables via lang::manager::config.
One thing to note -- lua config carries updateable_values on board, but
respective db::config options and _not_ LiveUpdate-able, so the lua
config could just use simple data types. This patch keeps updateable
values intact for brevity.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The managerr needs to get two "fuel" configurables from db::config in
order to create context. Instead of carrying db config from callers,
keep the options on existing lang::manager::config and use them.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a function is created with the CREATE FUNCTION statement, the
statement handler does all the necessary preparations on its own. The
very same code exists in schema_tables, when the function is loaded on
boot. This patch generalizes both and keeps function language-specific
context creation inside lang/ code.
The creation function returns context via argument reference. It would
have been nicer if it was returned via future<>, but it's not suitable
for future<T> type :(
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lang::manager starts with the help of a context because it needs to
have std::shared_ptr<> pointg to cross-shard shared wasm engine and
runner thread. For that a context is created in advance, that then helps
sharing the engine and runner across manager instances.
This patch removes the "context" and replaces it with classical
manager::config. With it, it's lang::manager who's now responsible for
initializing itself.
In order to have cross-shard engine and thread pointers, the start()
method uses invoke_on_others() facility to share the pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just like any other sharded<> service, the lang::manager now starts and
stops in a classical sequence of
await sharded<manager>::start()
defer([] { await sharded<manager>::stop() })
await sharded<manager>::invoke_on_all(&manager::start)
For now the method is no-op, next patches will start using it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And, while at it, rename local variable to refer to it to as "manager"
not "wasm". Query processor and database also have getters named
"wasm()", these are not renamed yet to keep patch smaller (and those
getters are going to be reworked further anyway).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's going to become a facade in front of both -- wasm and lua, so keep
it in files with language independent names.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In case multiple clients issue concurrently CREATE KEYSPACE IF NOT EXISTS
and later USE KEYSPACE it can happen that schema in driver's session is
out of sync because it synces when it receives special message from
CREATE KEYSPACE response.
Similar situation occurs with other schema change statements.
In this patch we fix only create keyspace/table/type/view statements
by always sending created event. Behavior of any other schema altering
statements remains unchanged.
This should reduce the risk of re-introducing issue similar to
the one fixed in ab6988c52f
When grant code is closer to actual creation code (announcing mutations)
there is lower chance of those two effects being triggered differently,
if we ever call grant_permissions_to_creator and not announce mutations
that's very likely a security vulnerability.
Additionally comment was rewritten to be more accurate.
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog - this can be observed in the following scenario:
1. Cluster starts, all nodes gossip their empty view update backlog to one another
2. On node N, `view_update_backlog_broker` (the backlog gossiper) performs an iteration of its backlog update loop, sees no change (backlog has been empty since the start), schedules the next iteration after 1s
3. Within the next 1s, coordinator (different than N) sends a write to N causing a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which is then noted by the coordinator.
4. Still within the 1s, node N finishes the view update in the background, dropping its view update backlog to 0.
5. In the next and following iterations of `view_update_backlog_broker` on N, backlog is empty, as it was in step 2, so no change is seen and no update is sent due to the check
```
auto backlog = _sp.local().get_view_update_backlog();
if (backlog_published && *backlog_published == backlog) {
sleep_abortable(gms::gossiper::INTERVAL, _as).get();
continue;
}
```
After this scenario happens, the coordinator keeps an information about an increased view update backlog on N even though it's actually already empty
This patch fixes the issue this by notifying the gossip that a different backlog
was sent in a response, causing it to send an unchanged backlog to other
nodes in the following gossip round.
Fixes: https://github.com/scylladb/scylladb/issues/18461
Similarly to https://github.com/scylladb/scylladb/pull/18646, without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none
Tests: manual. Currently this patch only affects the length of MV flow control delay, which is not reliable to base a test on. A proper test will be added when MV admission control is added, so we'll be able to base the test on rejected requests
Closesscylladb/scylladb#18663
* github.com:scylladb/scylladb:
mv: gossip the same backlog if a different backlog was sent in a response
node_update_backlog: divide adding and fetching backlogs
Currently, when generating and propagating view updates, if we notice
that we've already exceeded the time limit, we throw an exception
inheriting from `request_timeout_exception`, to later catch and
log it when finishing request handling. However, when catching, we
only check timeouts by matching the `timed_out_error` exception,
so the exception thrown in the view update code is not registered
as a timeout exception, but an unknown one. This can cause tests
which were based on the log output to start failing, as in the past
we were noticing the timeout at the end of the request handling
and using the `timed_out_error` to keep processing it and now, even
though we do notice the timeout even earlier, due to it's type we
log an error to the log, instead of treating it as a regular timeout.
In this patch we make the error thrown on timeout during view updates
inherit from `timed_out_error` instead of the `request_timeout_exception`
(it is also moved from the "exceptions" directory, where we define
exceptions returned to the user).
Aside from helping with the issue described above, we also improve our
metrics, as the `request_timeout_exception` is also not checked for
in the `is_timeout_exception` method, and because we're using it to
check whether we should update write timeout metrics, they will only
start getting updated after this patch.
Closesscylladb/scylladb#19102
as the matcher actually applies to all warnings from clang frontend,
and hence can be reused when building the tree with clang, so let's
rename it before using it in the clang build workflows.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This commit updates the configuration for ScyllaDB documentation so that:
6.0 is the latest version.
6.0 is removed from the list of unstable versions.
It must be merged when ScyllaDB 6.0 is released.
No backport is required.
Closesscylladb/scylladb#19003
before this change, we use "RPC or native". but before thrift support
is removed "RPC" implies "thrift", now that we've dropped thrift
support, "RPC" could be confusing here, so let's be more specific,
and put all connection types in place of "RPC or native".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
replace all occurrences of "rpc" in function names and debugging
messages to "thrift", as "rpc" is way too general, and since we
are removing "thrift" support, let's take this opportunity to
use a more specific name.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so we don't create new sstables with this unused column, but we
can still open old sstables of this table which was created with
the old schema.
Refs #3811
Refs #18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we've dropped the thift support, the `client_type` is always
`cql`, there is no need to differentiate different clients anymore.
so, we change `make_client_key()` so that it only return the IP address
and port.
Refs #3811
Refs #18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we've addressed all warnings, we are ready to tighten the
standards of this workflow, so that contributors are awared of
the violation of include-what-you-use policy.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
thrift support was deprecated since ScyllaDB 5.2
> Thrift API - legacy ScyllaDB (and Apache Cassandra) API is
> deprecated and will be removed in followup release. Thrift has
> been disabled by default.
so let's drop it. in this change,
* thrift protocol support is dropped
* all references to thrift support in document are dropped
* the "thrift_version" column in system.local table is
preserved for backward compatibility, as we could load
from an existing system.local table which still contains
this clolumn, so we need to write this column as well.
* "/storage_service/rpc_server" is only preserved for
backward compatibility with java-based nodetool.
* `rpc_port` and `start_rpc` options are preserved, but
they are marked as "Unused". so that the new release
of scylladb can consume existing scylla.yaml configurations
which might contain these settings. by making them
deprecated, user will be able get warned, and update
their configurations before we actually remove them
in the next major release.
Fixes#3811Fixes#18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Due to gradual raft introduction into statements code in cases when single statement modified more than one table or mutation producing function was composed out of simpler ones we violated transactional logic and statement execution was not atomic as whole.
This patch changes that, so now either all changes resulting from statement execution are applied or none. Affected statements types are:
- schema modification
- auth modifications
- service levels modifications
Fixes https://github.com/scylladb/scylladb/issues/17738Closesscylladb/scylladb#17910
* github.com:scylladb/scylladb:
raft: rename mutations_collector to group0_batch
raft: rename announce to commit
cql3: raft: attach description to each mutations collector group
auth: unify mutations_generator type
auth: drop redundant 'this' keyword
auth: remove no longer used code from standard_role_manager::legacy_modify_membership
cql3: auth: use mutation collector for service levels statements
cql3: auth: use mutation collector for alter role
cql3: auth: use mutation collector for grant role and revoke role
cql3: auth: use mutation collector for drop role and auto-revoke
auth: add refactored modify_membership func in standard_role_manager
auth: implement empty revoke_all in allow_all_authorizer
auth: drop request_execution_exception handling from default_authorizer::revoke_all
Revert "Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks"
cql3: auth: use mutation collector for grant and revoke permissions
cql3: extract changes_tablets function in alter_keyspace_statement
cql3: auth: use mutation collector for create role statement
auth: move create_role code into service
auth: add a way to announce mutations having only client_state ref
auth: add collect_mutations common helper
auth: remove unused header in common.hh
auth: add class for gathering mutations without immediate announce
auth: cql3: use auth facade functions consistently on write path
auth: remove unused is_enforcing function
1. Fixed a single typo (send -> sent)
2. Rephrase 'How many' to 'Number of' and use less passive tense.
3. Be more specific in the description of the different metrics insteda of the more generic descriptions.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#19067
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.
One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.
Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.
Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.
A blocked tablet migration (e.g. due to down node) will block repair,
whereas before it would fail. Once admin resolves the cause of blocked migration,
repair will continue.
Fixes#17658.
Fixes#18561.
Closesscylladb/scylladb#18641
* github.com:scylladb/scylladb:
test: pylib: Do not block async reactor while removing directories
repair: Exclude tablet migrations with tablet repair
repair_service: Propagate topology_state_machine to repair_service
main, storage_service: Move topology_state_machine outside storage_service
storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
tablet_scheduler: Make disabling of balancing interrupt shuffle mode
tablet_scheduler: Log whether balancing is considered as enabled
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".
Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:
storage_proxy - No mapping for :: in the passed effective replication map
It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so noone should need to contact it.
Users which need replica list stability should use the host_id-based API.
Fixes#18843Closesscylladb/scylladb#18955
* github.com:scylladb/scylladb:
tablets: Filter-out left nodes in get_natural_endpoints()
test: pylib: Extract start_writes() load generator utility
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However if the backlog for this node changed on another
node with a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will stay with an
outdated backlog.
This patch changes this by notifying the gossip that a the backlog changed
since the last gossip round so a different backlog could have been send
through the response piggyback mechanism. With that information, gossip
will send an unchanged backlog to other nodes in the following gossip round.
Fixes: https://github.com/scylladb/scylladb/issues/18461
Currently, we only update the backlogs in node_update_backlog at the
same time when we're fetching them. This is done using storage_proxy's
method get_view_update_backlog, which is confusing because it's a getter
with side-effects. Additionally, we don't always want to update the
backlog when we're reading it (as in gossip which is only on shard 0)
and we don't always want to read it when we're updating it (when we're
not handling any writes but the backlog drops due to background work
finish).
This patch divides the node_view_backlog::add_fetch as well the
storage_proxy::get_view_update_backlog both into two methods; one
for updating and one for reading the backlog. This patch only replaces
the places where we're currently using the view backlog getter, more
situations where we should get/update the backlog should be considered
in a following patch.
It consists of reading method and parsing one and it uses class fields to carry data between those two. The former is additionally built with curly continuation chains, while it's naturally linear, so turn it into a coroutine while at it
Closesscylladb/scylladb#18994
* github.com:scylladb/scylladb:
snitch: Remove production_snitch_base::_prop_file_contents
snitch: Remove production_snitch_base::_prop_file_size
snitch: Coroutinize load_property_file()
All sharded<service>'s a supposed to have their own config and not use global db::config one. The service config, in turn, is to be created by main/cql_test_env/whatever out of db::config and, maybe, other data. Gossiper is almost there, but it still uses db::config in few places.
Closesscylladb/scylladb#19051
* github.com:scylladb/scylladb:
gossiper: Stop using db::config
gossiper: Move force_gossip_generation on gossip_config
gossiper: Move failure_detector_timeout_ms on gossip_config
main: Fix indentation after previous patch
main: Make gossiper config a sharded parameter
main: Add local variable for set of seeds
main: Add local variable for group0 id
main: Add local variable for cluster_name
Protocol servers are started last, and are registered in storage_service, which stops them. Also there are deferred actions scheduled to stop protocol servers on aborted start and a FIXME asking to make even this case rely on storage_service. Also, there's a (rather rare) aborted-start bug in alternator and redis. Yet, thrift can be left started in some weird circumstances. This patch fixes it all. As a side effect, the start-stop code becomes shorter and a bit better structured.
refs: #2737Closesscylladb/scylladb#19042
* github.com:scylladb/scylladb:
main: Start alternator expiration service earlier
main: Start redis transparently
main: Start alternator transparently
main: Start thrift transparently
main: Start native transport transparently
storage_service: Make register_protocol_server() start the server
storage_service: Turn register_protocol_server() async method
storage_service: Outline register_protocol_server()
main: Schedule deferred drain_on_shutdown() prior to protocol servers
main: Move some trailing startup earlier
this series
* add annotation to the github pull request when extraneous `#include` processor macros are identified
* add `exceptions` subdirectory to `CLEANER_DIRS` to demonstrate the annotation. we will fix the identified issue in a follow-up change.
---
* This is a CI workflow improvement. No backporting is required.
Closesscylladb/scylladb#19037
* github.com:scylladb/scylladb:
.github: add exception to CLEANER_DIRS
.github: annotate the report from clang-include-cleaner
.github: build headers before running clang-include-cleaner
Currently it gets the streaming/maintenance one from database, but it
can as well just assume that it's already running in the correct one,
and the main code fulfils this assumption.
This removes one more place that uses database as sched groups provider.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19078
clang-19 introduced a change which enforces the change proposed by [CWG 96](https://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#96), which was accepted by C++20 in [P1787R6](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1787r6.html), as [[temp.names]p5](https://eel.is/c++draft/temp.names#6).
so, to be future-proof and to be standard compliant, let's pass the
template arguments. otherwise we'd have build failure like
```
error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
```
---
no need to backport. as this change only addresses a FTBFS with a recent build of clang-19. but our CI is not a clang built from llvm's main HEAD.
Closesscylladb/scylladb#19100
* github.com:scylladb/scylladb:
util/result_try: pass template arg list explicitly
util/result_try: pass func as `const F&` instead of `F&&`
We recently added to cql-pytest tests the ability to check if tablets
are enabled or not (for some tablet-specific tests). When running
tests against Cassandra or old pre-tablet versions of Scylla, this
fact is detected and "False" is returned immediately. However, we
still look at a system table which didn't exist on really ancient
versions of Scylla, and tests couldn't run against such versions.
The fix is trivial: if that system table is missing, just ignore the
error and return False (i.e., no tablets). There were no tablets on
such ancient versions of Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19098
It's unused, populator uses it to print debugging messages, but it can
as well use table->dir() for it, just as sstable_directory does. One
message looks useless and is removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19113
Currently the code wraps simple "if" with std::invoke over a lambda.
Also, the local variable that gets the result, is declared as const one,
which prevents it from being std::move()-d in the very next line.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19106
This vector of paths is only used to generate the same vector of paths
for table config, but the latter already has all the needed info.
It's the part of the plan to stop using paths/directories in keyspaces
and tables, because with storage-options tables no longer keep their
data in "files on disk", so this information goes to sstables storage
manager (refs #12707)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19119
before this change, unlike other services in scylla,
topology_coordinator is not properly stopped when it is aborted,
because the scylla instance is no longer a leader or is being shut down.
its `run()` method just stops the grand loop and bails out before
topology_coordinator is destroyed. but we are tracking the migration
state of tablets using a bunch of futures, which might not be
handled yet, and some of them could carry failures. in that case,
when the `future` instances with failure state get destroyed,
seastar calls `report_failed_future`. and seastar considers this
practice a source a bug -- as one just fails to handle an error.
that's why we have following error:
```
WARN 2024-05-19 23:00:42,895 [shard 0:strm] seastar - Exceptional future ignored: seastar::rpc::unknown_verb_error (unknown verb), backtrace: /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56c14e /home/bhalevy/.ccm/scylla-repository/local_tarball/libre
loc/libseastar.so+0x56c770 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56ca58 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x38c6ad 0x29cdd07 0x29b376b 0x29a5b65 0x108105a /home/bhalevy/.ccm/scylla-repository/local_tarbal
l/libreloc/libseastar.so+0x3ff1df /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x400367 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff838 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36de58
/home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36d092 0x1017cba 0x1055080 0x1016ba7 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27b89 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27c4a 0x1015524
```
and the backtrace looks like:
```
seastar::current_backtrace_tasklocal() at ??:?
seastar::current_tasktrace() at ??:?
seastar::current_backtrace() at ??:?
seastar::report_failed_future(seastar::future_state_base::any&&) at ??:?
service::topology_coordinator::tablet_migration_state::~tablet_migration_state() at topology_coordinator.cc:?
service::topology_coordinator::~topology_coordinator() at topology_coordinator.cc:?
service::run_topology_coordinator(seastar::sharded<db::system_distributed_keyspace>&, gms::gossiper&, netw::messaging_service&, locator::shared_token_metadata&, db::system_keyspace&, replica::database&, service::raft_group0&, service::topology_state_machine&, seastar::abort_source&, raft::server&, seastar::noncopyable_function<seastar::future<service::raft_topology_cmd_result> (utils::tagged_tagged_integer<raft::internal::non_final, raft::term_tag, unsigned long>, unsigned long, service::raft_topology_cmd const&)>, service::tablet_allocator&, std::chrono::duration<long, std::ratio<1l, 1000l> >, service::endpoint_lifecycle_notifier&) [clone .resume] at topology_coordinator.cc:?
seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at main.cc:?
seastar::reactor::run_some_tasks() at ??:?
seastar::reactor::do_run() at ??:?
seastar::reactor::run() at ??:?
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ??:?
```
and even worse, these futures are indirectly owned by `topology_coordinator`.
so there are chances that they could be used even after `topology_coordinator`
is destroyed. this is a use-after-free issue. because the
`run_topology_coordinator` fiber exits when the scylla instance retires
from the leader's role, this use-after-free could be fatal to a
running instance due to undefined behavior of use after free.
so, in this change, we handle the futures in `_tablets`, and note
down the failures carried by them if any.
Fixes#18745
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18991
* tools/cqlsh c8158555...0d58e5ce (6):
> cqlsh.py: fix server side describe after login command
> cqlsh: try server-side DESCRIBE, then client-side
> Refactor tests to accept both client and server side describe
> github actions: support testing with enterprise release
> Add the tab-completion support of SERVICE_LEVEL statements
> reloc/build_reloc.sh: don't use `--no-build-isolation`
Closesscylladb/scylladb#18990
Fetching only the first page is not the intuitive behavior expected by users.
This causes flakiness in some tests which generate variable amount of
keys depending on execution speed and verify later that all keys were
written using a single SELECT statement. When the amount of keys
becomes larger than page size, the test fails.
Fixes#18774Closesscylladb/scylladb#19004
This fixes a problem where suite cleanup schedules lots of uninstall()
tasks for servers started in the suite, which schedules lots of tasks,
which synchronously call rmtree(). These take over a minute to finish,
which blocks other tasks for tests which are still executing.
In particular, this was observed to case
ManagerClient.server_stop_gracefully() to time-out. It has a timeout
of 60 seconds. The server was stopped quickly, but the RESTful API
response was not processed in time and the call timed out when it got
the async reactor.
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.
One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.
Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.
Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requets start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.
Fixes#17658.
Fixes#18561.
Will be used later in a place which doesn't have access to storage_service
but has to toplogy_state_machine.
It's not necessary to start group0 operation around polling because
the busy() state can be checked atomically and if it's false it means
the topology is no longer busy.
Tests will rely on that, they will run in shuffle mode, and disable
balancing around section which otherwise would be infinitely blocked
by ongoing shuffling (like repair).
Assigning to a member of an uninitialized optional
does not initialize the object before assigning to it.
This resulted in the AddressSanitizer detecting attempt
to double-free when the uninitialized string contained
apprently a bogus pointer.
The change emplaces the returned optional when needed
without resorting to the copy-assignment operator.
So it's not suceptible to assigning to uninitialized
memory, and it's more efficient as well...
Fixesscylladb/scylladb#19041
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#19043
If a node restart just before it stores bootstrapping node's IP it will
not have ID to IP mapping for bootstrapping node which may cause failure
on a write path. Detect this and fail bootstrapping if it happens.
Closesscylladb/scylladb#18927
* github.com:scylladb/scylladb:
raft topology: fix indentation after previous commit
raft topology: do not add bootstrapping node without IP as pending
test: add test of bootstrap where the coordinator crashes just before storing IP mapping
schema_tables: remove unused code
as we the functor passed to `invoke()` is not a rvalue, if we specify
the template parameter explicitly, clang errors out like:
```
/home/kefu/.local/bin/clang++ -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -I/home/kefu/dev/scylladb/build -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/build/rust -isystem /home/kefu/dev/scylladb/abseil -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT transport/CMakeFiles/transport.dir/RelWithDebInfo/server.cc.o -MF transport/CMakeFiles/transport.dir/RelWithDebInfo/server.cc.o.d -o transport/CMakeFiles/transport.dir/RelWithDebInfo/server.cc.o -c /home/kefu/dev/scylladb/transport/server.cc
In file included from /home/kefu/dev/scylladb/transport/server.cc:39:
/home/kefu/dev/scylladb/utils/result_try.hh:210:28: error: no matching function for call to 'invoke'
210 | return Converter::template invoke<const Cb, const Ex&>(_cb, ex);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/utils/result_try.hh:194:143: note: while substituting into a lambda expression here
194 | return [this, cont = std::forward<Continuation>(cont)] (bool& already_caught) mutable -> typename Converter::template wrapped_type<R> {
| ^
/home/kefu/dev/scylladb/utils/result_try.hh:327:40: note: in instantiation of function template specialization 'utils::internal::result_catcher<exceptions::unavailable_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:521:68)>::wrap_in_catch<boost::outcome_v2::basic_result<seastar::foreign_ptr<std::unique_ptr<cql_transport::response>>, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy>, utils::internal::noop_converter, (lambda at /home/kefu/dev/scylladb/utils/result_try.hh:518:76)>' requested here
327 | first_handler.template wrap_in_catch<R, Converter, Continuation>(std::forward<Continuation>(cont)),
| ^
/home/kefu/dev/scylladb/utils/result_try.hh:518:54: note: in instantiation of function template specialization 'utils::internal::try_catch_chain_impl<boost::outcome_v2::basic_result<seastar::foreign_ptr<std::unique_ptr<cql_transport::response>>, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy>, utils::internal::noop_converter, utils::internal::result_catcher<exceptions::unavailable_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:521:68)>, utils::internal::result_catcher<exceptions::read_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:526:69)>, utils::internal::result_catcher<exceptions::read_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:531:69)>, utils::internal::result_catcher<exceptions::mutation_write_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:536:79)>, utils::internal::result_catcher<exceptions::mutation_write_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:541:79)>, utils::internal::result_catcher<exceptions::already_exists_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:546:71)>, utils::internal::result_catcher<exceptions::prepared_query_not_found_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:551:81)>, utils::internal::result_catcher<exceptions::function_execution_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:556:75)>, utils::internal::result_catcher<exceptions::rate_limit_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:561:67)>, utils::internal::result_catcher<exceptions::cassandra_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:566:66)>, utils::internal::result_catcher<std::exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:578:49)>, utils::internal::result_catcher_dots<(lambda at /home/kefu/dev/scylladb/transport/server.cc:591:38)>>::invoke_in_try_catch<(lambda at /home/kefu/dev/scylladb/utils/result_try.hh:518:76)>' requested here
518 | result_type res = try_catch_chain_type::template invoke_in_try_catch<>([&fun] (bool&) { return fun(); }, handlers...);
| ^
/home/kefu/dev/scylladb/transport/server.cc:484:83: note: in instantiation of function template specialization 'utils::result_try<(lambda at /home/kefu/dev/scylladb/transport/server.cc:484:94), utils::internal::result_catcher<exceptions::unavailable_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:521:68)>, utils::internal::result_catcher<exceptions::read_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:526:69)>, utils::internal::result_catcher<exceptions::read_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:531:69)>, utils::internal::result_catcher<exceptions::mutation_write_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:536:79)>, utils::internal::result_catcher<exceptions::mutation_write_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:541:79)>, utils::internal::result_catcher<exceptions::already_exists_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:546:71)>, utils::internal::result_catcher<exceptions::prepared_query_not_found_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:551:81)>, utils::internal::result_catcher<exceptions::function_execution_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:556:75)>, utils::internal::result_catcher<exceptions::rate_limit_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:561:67)>, utils::internal::result_catcher<exceptions::cassandra_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:566:66)>, utils::internal::result_catcher<std::exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:578:49)>, utils::internal::result_catcher_dots<(lambda at /home/kefu/dev/scylladb/transport/server.cc:591:38)>>' requested here
484 | return utils::result_into_future<result_with_foreign_response_ptr>(utils::result_try([&] () -> result_with_foreign_response_ptr {
| ^
/home/kefu/dev/scylladb/utils/result_try.hh:33:5: note: candidate function template not viable: expects an rvalue for 1st argument
33 | invoke(F&& f, Args&&... args) {
| ^ ~~~~~
/home/kefu/dev/scylladb/utils/result_try.hh:210:28: error: no matching function for call to 'invoke'
210 | return Converter::template invoke<const Cb, const Ex&>(_cb, ex);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/utils/result_try.hh:194:143: note: while substituting into a lambda expression here
194 | return [this, cont = std::forward<Continuation>(cont)] (bool& already_caught) mutable -> typename Converter::template wrapped_type<R> {
| ^
/home/kefu/dev/scylladb/utils/result_try.hh:327:40: note: in instantiation of function template specialization 'utils::internal::result_catcher<exceptions::read_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:526:69)>::wrap_in_catch<boost::outcome_v2::basic_result<seastar::foreign_ptr<std::unique_ptr<cql_transport::response>>, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy>, utils::internal::noop_converter, (lambda at /home/kefu/dev/scylladb/utils/result_try.hh:194:16)>' requested here
327 | first_handler.template wrap_in_catch<R, Converter, Continuation>(std::forward<Continuation>(cont)),
| ^
/home/kefu/dev/scylladb/utils/result_try.hh:326:79: note: in instantiation of function template specialization 'utils::internal::try_catch_chain_impl<boost::outcome_v2::basic_result<seastar::foreign_ptr<std::unique_ptr<cql_transport::response>>, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy>, utils::internal::noop_converter, utils::internal::result_catcher<exceptions::read_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:526:69)>, utils::internal::result_catcher<exceptions::read_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:531:69)>, utils::internal::result_catcher<exceptions::mutation_write_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:536:79)>, utils::internal::result_catcher<exceptions::mutation_write_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:541:79)>, utils::internal::result_catcher<exceptions::already_exists_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:546:71)>, utils::internal::result_catcher<exceptions::prepared_query_not_found_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:551:81)>, utils::internal::result_catcher<exceptions::function_execution_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:556:75)>, utils::internal::result_catcher<exceptions::rate_limit_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:561:67)>, utils::internal::result_catcher<exceptions::cassandra_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:566:66)>, utils::internal::result_catcher<std::exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:578:49)>, utils::internal::result_catcher_dots<(lambda at /home/kefu/dev/scylladb/transport/server.cc:591:38)>>::invoke_in_try_catch<(lambda at /home/kefu/dev/scylladb/utils/result_try.hh:194:16)>' requested here
326 | return try_catch_chain_impl<R, Converter, CatchHandlers...>::template invoke_in_try_catch<>(
| ^
/home/kefu/dev/scylladb/utils/result_try.hh:518:54: note: in instantiation of function template specialization 'utils::internal::try_catch_chain_impl<boost::outcome_v2::basic_result<seastar::foreign_ptr<std::unique_ptr<cql_transport::response>>, utils::exception_container<exceptions::mutation_write_timeout_exception, exceptions::read_timeout_exception, exceptions::read_failure_exception, exceptions::rate_limit_exception>, utils::exception_container_throw_policy>, utils::internal::noop_converter, utils::internal::result_catcher<exceptions::unavailable_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:521:68)>, utils::internal::result_catcher<exceptions::read_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:526:69)>, utils::internal::result_catcher<exceptions::read_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:531:69)>, utils::internal::result_catcher<exceptions::mutation_write_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:536:79)>, utils::internal::result_catcher<exceptions::mutation_write_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:541:79)>, utils::internal::result_catcher<exceptions::already_exists_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:546:71)>, utils::internal::result_catcher<exceptions::prepared_query_not_found_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:551:81)>, utils::internal::result_catcher<exceptions::function_execution_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:556:75)>, utils::internal::result_catcher<exceptions::rate_limit_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:561:67)>, utils::internal::result_catcher<exceptions::cassandra_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:566:66)>, utils::internal::result_catcher<std::exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:578:49)>, utils::internal::result_catcher_dots<(lambda at /home/kefu/dev/scylladb/transport/server.cc:591:38)>>::invoke_in_try_catch<(lambda at /home/kefu/dev/scylladb/utils/result_try.hh:518:76)>' requested here
518 | result_type res = try_catch_chain_type::template invoke_in_try_catch<>([&fun] (bool&) { return fun(); }, handlers...);
| ^
/home/kefu/dev/scylladb/transport/server.cc:484:83: note: in instantiation of function template specialization 'utils::result_try<(lambda at /home/kefu/dev/scylladb/transport/server.cc:484:94), utils::internal::result_catcher<exceptions::unavailable_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:521:68)>, utils::internal::result_catcher<exceptions::read_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:526:69)>, utils::internal::result_catcher<exceptions::read_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:531:69)>, utils::internal::result_catcher<exceptions::mutation_write_timeout_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:536:79)>, utils::internal::result_catcher<exceptions::mutation_write_failure_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:541:79)>, utils::internal::result_catcher<exceptions::already_exists_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:546:71)>, utils::internal::result_catcher<exceptions::prepared_query_not_found_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:551:81)>, utils::internal::result_catcher<exceptions::function_execution_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:556:75)>, utils::internal::result_catcher<exceptions::rate_limit_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:561:67)>, utils::internal::result_catcher<exceptions::cassandra_exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:566:66)>, utils::internal::result_catcher<std::exception, (lambda at /home/kefu/dev/scylladb/transport/server.cc:578:49)>, utils::internal::result_catcher_dots<(lambda at /home/kefu/dev/scylladb/transport/server.cc:591:38)>>' requested here
484 | return utils::result_into_future<result_with_foreign_response_ptr>(utils::result_try([&] () -> result_with_foreign_response_ptr {
| ^
/home/kefu/dev/scylladb/utils/result_try.hh:33:5: note: candidate function template not viable: expects an rvalue for 1st argument
33 | invoke(F&& f, Args&&... args) {
| ^ ~~~~~
```
so to prepare for the change to pass template parameter explicitly,
let's pass `f` as a `const` reference, instead of as a rvalue refernece.
also, this parameter type matches with our usage case -- we always
pass a member variable `_cb` to `invoke`, and we don't expect that
`invoke()` would move it away.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* seastar 914a4241...9ce62705 (18):
> github: do not set --dpdk-machine haswell
> io_tester: correct calculation of writes count
> io-tester.md: update information about file size
> reactor: align used hint for extent size to 128KB for XFS
> Fix compilation failure on Ubuntu 22.04
> io_tester: align the used file size to 1MB
> circular_buffer_fixed_capacity: arrow operator instead of . operator
> posix-file-impl: Do not keep device-id on board
> github: s/clang++-18/clang++/
> include: include used headers
> include: include used headers
> iotune: allow user to set buffer size for random IO
> abort_source: add method to get exception pointer
> github: cancel a job if it takes longer than 40 minutes
> std-compat: remove #include:s which were added for pre C++17
> perf_tests: measure and report also cpu cycles
> linux_perf_events: add user_cpu_cycles_retired
> linux_perf_event: user_instructions_retired: exclude_idle
Closesscylladb/scylladb#19019
Next patches will put updateable_value's on it, but plain copy of them
across shard doesn't work (see #7316)
Indentation is deliberately left broken
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patch will do seeds assignment to gossiper config on each
shard, so it's good to have it once, then copy around
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patch will do group0_id assignment to gossiper config on each
shard, so it's good to have it once, then copy around
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's modified if its empty, next patch will make this code be called on
each shard, so modification must happen only once
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This description is readable from raft log table.
Previously single description was provided for the whole
announce call but since it can contain mutations from
various subsystems now description was moved to
add_mutation(s)/add_generator function calls.
mutation_collector supports generators but it was added to
/service/raft code so it couldn't depend on /auth/ but once
it's added we can remove generator type from /auth/ as it
can depend on /service/raft.
The main theme of this commit is executing drop
keyspace/table/aggregate/function statements in a single
transaction together with auth auto-revoke logic.
This is the logic which cleans related permissions after
resource is deleted.
It contains serveral parts which couldn't easily be split
into separate commits mainly because mutation collector related
paths can't be mixed together. It would require holding multiple
guards which we don't support. Another reason is that with mutation
collector the changes are announced in a single place, at the end
of statement execution, if we'd announce something in the middle
then it'd lead to raft concurrent modification infinite loop as it'd
invalidate our guard taken at the begining of statement execution.
So this commit contains:
- moving auto-revoke code to statement execution from migration_listener
* only for auth-v2 flow, to not break the old one
* it's now executed during statement execution and not merging schemas,
which means it produces mutations once as it should and not on each
node separately
* on_before callback family wasn't used because I consider it much
less readable code. Long term we want to remove
auth_migration_listener.
- adding mutation collector to revoke_all
* auto-revoke uses this function so it had to be changed,
auth::revoke_all free function wrapper was added as cql3
layer should not use underlying_authorizer() directly.
- adding mutation collector to drop_role
* because it depends on revoke_all and we can't mix old and new flows
* we need to switch all functions auth::drop_role call uses
* gradual use of previously introduced modify_membership, otherwise
we would need to switch even more code in this commit
The new function is simplified and handles only auth-v2 flow
with mutation_collector (single transaction logic).
It's not used in this commit and we'll switch code paths
gradually in subsequent commits.
The change applies only to auth-v2 code path.
It seems nothing in the code except cdc and truncate
throws this exception so it's probably dead code.
I'll keep it for now in other places to not accidentally
break things in auth-v1, in auth-v2 even if this exception
is used it should likely fail the query because otherwise
data consistency is silently violated.
This is done to achieve single transaction semantics.
The change includes auto-grant feature. In particular
for schema related auto-grant we don't use normal
mutation collector announce path but follow migration manager,
this may be unified in the future.
This is done to achieve single transaction semantics.
grant_permissions_to_creator is logically part of create role
but its change will be included in following commits
as it spans multiple usages.
Additinally we disabled rollback during create role as
it won't work and is not needed with single transaction logic.
We need this later as we'll add condition
based on legacy_mode(qp) and free function
doesn't have access to qp.
Moreover long term we should get rid of this
weird free function pattern bloat.
Statements code have only access to client_state from
which it takes auth::service. It doesn't have abort_source
nor group0_client so we need to add them to auth::service.
Additionally since abort_source can't be const the whole
announce_mutations method needs non const auth::service
so we need to remove const from the getter function.
To achieve write atomicity across different tables we need to announce
mutations in a single transaction. So instead of each function doing
a separate announce we need to collect mutations and announce them once
at the end.
In d0f5873, we introduced mappings IP–host ID between hint directories and the hint endpoint managers managing them. As a consequence, it may happen that one hint directory stores hints towards multiple nodes at the same time. If any of those nodes leaves the cluster, we should drain the hint directory. However, before these changes that doesn't happen – we only drain it when the node of the same host ID as the hint endpoint manager leaves the cluster.
This PR fixes that draining issue in the pre-host-ID-based hinted handoff. Now no matter which of the nodes corresponding to a hint directory leaves the cluster, the directory will be drained.
We also introduce error injections to be able to test that it indeed happens.
Fixesscylladb/scylladb#18761Closesscylladb/scylladb#18764
* github.com:scylladb/scylladb:
db/hints: Introduce an error injection to test draining
db/hints: Ensure that draining happens
Task manager's tasks stay in memory after they are finished.
Moreover, even if a child task is unregistered from task manager,
it is still alive since its parent keeps a foreign pointer to it. Also,
when a task has finished successfully there is no point in keeping
all of its descendants in memory.
The patch introduces folding of task manager's tasks. Whenever
a task which has a parent is finished it is unregistered from task
manager and foreign_ptr to it (kept in its parent) is replaced
with its status. Children's statuses of the task are dropped unless
they or one of their descendants failed. So for each operation we
keep a tree of tasks which contains:
- a root task and its direct children (status if they are finished, a task
otherwise);
- running tasks and their direct children (same as above);
- a statuses path from root to failed tasks.
/task_manager/wait_task/ does not unregister tasks anymore.
Refs: #16694.
- [ ] ** Backport reason (please explain below if this patch should be backported or not) **
Requires backport to 6.0 as task number exploded with tablets.
Closesscylladb/scylladb#18735
* github.com:scylladb/scylladb:
docs: describe task folding
test: rest_api: add test for task tree structure
test: rest_api: modify new_test_module
tasks: test: modify test_task methods
api: task_manager: do not unregister task in /task_manager/wait_task/
tasks: unregister tasks with parents when they are finished
tasks: fold finished tasks info their parents
tasks: make task_manager::task::impl::finish_failed noexcept
tasks: change _children type
Prior to registering drain_on_shutdown and all the protorocl servers.
To keep the natural sequence
- start core
- register drain-on-shutdown
- start transport(s)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now possible to start protocol server when registered. It will also
be stopped automatically on shutdown / aborted shutdown.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now possible to start protocol server when registered. It will also
be stopped automatically on shutdown / aborted shutdown.
Also move the controller variable lower to keep it all next to each
other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now possible to start protocol server when registered. It will also
be stopped automatically on shutdown / aborted shutdown.
It also fixes a rare bug. If thrifst is not asked to be started on boot,
its deferred shutdown action isn't created, so it it's later started via
the API, it won't be stopped on shutdown.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now possible to start protocol server when registered. It will also
be stopped automatically on shutdown / aborted shutdown.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Auth interface is quite mixed-up but general rule is that cql
statements code calls auth::* free functions from auth/service.hh
to execute auth logic.
There are many exceptions where underlying_authorizer or
underlying_role_manager or auth::service method is used instead.
Service should not leak it's internal APIs to upper layers so
functions like underlying_role_manager should not exists.
In this commit we fix tiny fragment related to auth write path.
Currently, the backlog used for MV flow control is only updated
after we generate view updates as a result of a write request.
However, when the resources are no longer used, we should also
notice that to prevent excessive slowdowns caused by the MV
flow control calulating the delays based of an outdated, large
backlog.
This patch makes it so the backlogs are updated every time
a view update finishes, and not only when the updates start.
Fixes#18783Closesscylladb/scylladb#18804
Tablet allocation does not guarantee fairness of
the first replica in the replicas set across dcs.
The lack of this fix cause the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc
Use the tablet_map get_primary_replica or get_primary_replica_within_dc,
respectively to see if this node is the primary replica for each tablet
or not.
Fixes https://github.com/scylladb/scylladb/issues/17752
No backport is required before 6.0 as tablets (and tablet repair) are introduced in 6.0
Closesscylladb/scylladb#18784
* github.com:scylladb/scylladb:
repair: repair_tablets: use get_primary_replica
repair: repair_tablets: no need to check ranges_specified per tablet
locator: tablet_map: add get_primary_replica_within_dc
locator: tablet_map: get_primary_replica: do not copy tablet info
locator: tablet_map: get_primary_replica: return tablet_replica
when creating the build rules using CMake 3.28 and up, it generates
the rules to scan for C++20 modules for C++20 projects by default.
but this slows down the compilation, and introduces unnecessary
dependencies for each of the targets when building .cc files. also,
it prevents the static analysis tools from running from a repo which
only have its building system generated, but not yet built. as,
these tools would need to process the source files just like a compiler
does, and if any of the included header files is missing, they just
fail.
so, before we migrate to C++20 modules, let's disable this feature.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#19038
After a protocol server is registered, it can be instantly started by
the main code. It makes sense to generalize this sequence by teaching
register_protocol_server() start it.
For now it's a no-op change, as "start_instantly" is false by default,
but next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nex patches will remove protocol servers' deferred stops and will rely
on drain_on_shutdown -> stop_transport to do it, so the drain deferred
action some come before protocol servers' registration.
This also fixes a bug. Currently alternator and redis both rely on
protocol servers to stop them on shutdown. However, when startup is
aborted prior to drain_on_shutdown() registration, protocol servers are
not stopped and alternator and redis can remain stopped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_abort_on_ebadf() call and some api endpoints registration come
after protocol servers. The latter is going to be shuffled, so move the
former earlier not to hang around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
to cover more directories to prevent regressions of violating
the "include what you use" policy in this directory.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, user has to click into the "Details" link for
access the report from clang-include-cleaner. but this is neither
convenient nor obvious.
after this change, the report is annotated in the github web interface,
this helps the reviewers and contributers to user this tool in a
more efficient way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
clang-include-cleaner actually interprets the preprocessor macros,
and looks at the symbols. so we have to prepare the included headers
before using it.
so, but in ScyllaDB, we don't have a single target for building all the
used headers, so we have to build them either in batch of separately.
in this change, we build the included headers before running
clang-include-cleaner. this allows us to run clang-include-cleaner on
more source files.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The test test_table.py::test_concurrent_create_and_delete_table failed
on Amazon DynamoDB because of a silly typo - "false" instead of "False".
A function detecting Scylla tried to return false when noticing this
isn't Scylla - but had a typo, trying to return "false" instead of "False".
This patch fixes this typo, and the test now works on DynamoDB:
test/alternator/run --aws test_table.py::test_concurrent_create_and_delete_table
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#17799
In order to avoid per-table tablet load imbalance balance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.
Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.
For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions. One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2. Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. overcommit is calculated as max load
(tablet count per node) dividied by perfect average load (all tablets / nodes):
Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}
The worst state before the patch had the following distribution of tablets for the smaller table:
Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33
One node has as many as 171 tablets of that table and another one has as few as 3.
After the patch, the worst distribution looks like this:
Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65
Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.
Refs #16824Closesscylladb/scylladb#18885
* github.com:scylladb/scylladb:
tablets: load balancer: Use random selection of candidates when moving tablets
test: perf: Add test for tablet load balancer effectiveness
load_sketch: Extract get_shard_minmax()
load_sketch: Allow populating only for a given table
Tablet allocation does not guarantee fairness of
the first replica in the replicas set across dcs.
The lack of this fix cause the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc
Use the tablet_map get_primary_replica* functions to get
the primary replica for each tablet, possibly within a dc.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The code already turns off `primary_replica_only`
if `!ranges_specified.empty()`, so there's no need to
check it again inside the per-tablet loop.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the function needlessly copies the tablet_info
(all tablet replicas in particular) to a local variable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In order to avoid per-table tablet load imbalance balance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.
Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.
For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions. One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2. Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. overcommit is calculated as max load
(tablet count per node) dividied by perfect average load (all tablets / nodes):
Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}
The worst state before the patch had the following distribution of tablets for the smaller table:
Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33
One node has as many as 171 tablets of that table and the one has as few as 3.
After the patch, the worst distribution looks like this:
Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65
Most-loaded node has 121 tablets and least loaded node has 34 tablets.
It's still not good, a better distribution is possible, but it's an improvement.
Refs #16824
We have two paths for generating the json text representation, one
for large items and one for small items, but the large item path is
lacking:
- it doesn't yield, so a response with many items will stall
- it doesn't wait for network sends to be accepted by the network
stack, so it will allocate a lot of memory
Fix by moving the generation to a thread. This allows us to wait for
the network stack, which incidentally also fixes stalls.
The cost of the thread is amortized by the fact we're emitting a large
response.
Fixes#18806Closesscylladb/scylladb#18807
Update docs for backup procedure to use `DESC SCHEMA WITH INTERNALS`
instead of plain `DESC SCHEMA`.
Add a note to use cqlsh in a proper version (at least 6.0.19).
Closesscylladb/scylladb#18953
If /task_manager/wait_task/ unregisters the task, then there is no
way to examine children failures, since their statuses can be checked
only through their parent.
Currently, when a child task is unregistered, it is still kept by its parent. This leads
to excessive memory usage, especially when the tasks are configured to be kept in task
manager after they are finished (task_ttl_in_seconds).
Introduce task_essentials struct which keeps only data necesarry for task manager API.
When a task which has a parent is finished, a foreign pointer to it in its parent is replaced
with respective task_essentials. Once a parent task is finished it is also folded into
its parent (if it has one). Children details of a folded task are lost, unless they
(or some of their subtrees) failed. That is, when a task is finished, we keep:
- a root task (until it is unregistered);
- task_essentials of root's direct children;
- a path (of task_essentials) from root to each failed task (so that the reason
of a failure could be examined).
This effectively removes "finally" block so if
authorized_prepared_cache.stop() resolves with exception, the
prepared_cache.stop() is skipped. But that's not a problem -- even if
.stop() throws the shole scylla stop aborts so we don't really care if
it was clean or not.
Also, authorized_prepared_cache.stop() closes the gate and cancels the
timer. None of those can resolve with exception.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#19001
Currently, there is no indication of tablets in the logged KSMetaData.
Print the tablets configuration of either the`initial` number of tablets,
if enabled, or {'enabled':false} otherwise.
For example:
```
migration_manager - Create new Keyspace: KSMetaData{name=tablets_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"initial":0}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004d446a8}
migration_manager - Create new Keyspace: KSMetaData{name=vnodes_ks, strategyClass=org.apache.cassandra.locator.NetworkTopologyStrategy, strategyOptions={"datacenter1": "1"}, cfMetaData={}, durable_writes=true, tablets={"enabled":false}, userTypes=org.apache.cassandra.config.UTMetaData@0x600004c33ea8}
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#18998
This patch adds a test reproducing for the known issue #7963, where
after adding a secondary-index to a table, queries might immediately
start to use this index - even before it is built - and produce wrong
results.
The issue is still open and unfixed, so the new test is marked "xfail".
Interestingly, even though Cassandra claims to have found and fixed
a similar bug in 2015 (CASSANDRA-8505), this test also fails on
Cassandra - trying a query right after CREATE INDEX and before it
was fully built may cause the query to fail.
Refs #7963
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18993
tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss because the
error is propagated back into the coordinator that decides to
retry streaming stage. So the consequence is delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so
shouldn't have drastic consequences.
Fixes#18977.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18979
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18964
Currently the loader is called via API, which inherits the maintenance scheduling group from API http server. The loader then can either do load_and_stream() or call (legacy) distributed_loader::upload_new_sstables(). The latter first switches into streaming scheduling group, but the former doesn't and continues running in the maintenance one.
All this is not really a problem, because streaming sched group and maintenance sched group is one group under two different variable names. However, it's messy and worth delegating the sched group switch (even if it's a no-op) to the sstables-loader. As a nice side effect, this patch removes one place that uses database as proxy object to get configuration parameters.
Closesscylladb/scylladb#18928
* github.com:scylladb/scylladb:
sstables-loader: Run loading in its scheduling group
sstables-loader: Add scheduling group to constructor
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18898
This fiend was used to carry string with property file contents into the
parse_property_file(), but it can go with an argument just as well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This field was used to carry property file size across then-lambdas, now
the code is coroutinized and can live with on-stack variable
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we rely on `seastar/util/std-compat.hh` to
include the used headers provided by stdandard library. this was
necessary before we moved to a C++20 compliant standard library
implementation. but since Seastar has dropped C++17 support. its
`seastar/util/std-compat.hh` is not responsible for providing these
headers anymore.
so, in this change, we include the used header directly instead
of relying on `seastar/util/std-compat.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18986
before this change, in order to avoid repeating/hardwiring the
compiling options set by Seastar, we just inherit the compiling
options of Seastar for building Abseil, as the former exposes the
options to enable sanitizers.
this works fine, despite that, strictly speaking, not all options
are necessary for building abseil, as abseil is not a Seastar
application -- it is just a C++ library.
but when we introduce dependencies which are only generated at
build time, and these dependencies are passed to the compiler
at build time, this breaks the build of Abseil. because these
dependencies are exposed by the Seastar's .pc file, and consumed
by Abseil. when building Abseil, apparently, the building process
driven by ninja is not started yet, so we are not able to build
Abseil with these settings due to missing dependencies.
so instead of inheriting the compiling options from Seastar, just
set the sanitizer related compiling options directly, to avoid
referencing these missing dependencies.
the upside is that we pass a much smaller set of compiling options
to compiler when building Abseil, the downside is that we hardwire
these options related to sanitizer manually, they are also detected
by Seastar's building system. but fortunately, these options are
relatively stable across the building environements we support.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18987
We want to verify that a hint directory is drained
when any of the nodes correspodning to it leaves
the cluster. The test scenario should happen before
the whole cluster has been migrated to
the host-ID-based hinted handoff, so when we still
rely on the mappings between hint endpoint managers
and the hint directories managed by them.
To make such a test possible, in these changes we
introduce an error injection rejecting incoming
hints. We want to test a scenario when:
1. hints are saved towards a given node -- node N1,
2. N1 changes its IP to a different one,
3. some other node -- node N2 -- changes its IP
to the original IP of N1,
4. hints are saved towards N2 and they are stored
in the same directory as the hints saved towards
N1 before,
5. we start draining N2.
Because at some point N2 needs to be stopped,
it may happen that some mutations towards
a distributed system table generate a hint
to N2 BEFORE it has finished changing its IP,
effectively creating another hint directory
where ALL of the hints towards the node
will be stored from there on. That would disturb
the test scenario. Hence, this error injection is
necessary to ensure that all of the steps in the
test proceed as expected.
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.
Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.
Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.
In the test_mv_topology_change case, we use an injection to
delay the view updates application, so that the ERMs have
a chance to change in the process. This injection was also
enabled on a new node in the test, which was later decommissioned.
During the shutdown, writes were still being performed, causing
view update generation and delays due to the injection which in
turn delayed the node shutdown, causing the test to timeout.
This patch removes the injection for the node being shut down.
At the same time, the force_gossip_topology_changes=True option
is also removed from its config, but for that option it's enough
to enable on the first node in the cluster and all nodes use it.
Fixes: https://github.com/scylladb/scylladb/issues/18941Closesscylladb/scylladb#18958
The Alternator test suite usually runs on a specific configuration of
Scylla set up by test.py or test/alternator/run. However, we do consider
it an important design goal of this test suite that developers should be
able to run these tests against any DynamoDB-API implementation, including
any version Scylla manually run by the developer in *any way* he or she
pleases.
The recent commit dc80b5dafe changed the way
we retrieve the configured autentication key, which is needed if Scylla is
run with --alternator-enforce-authorization. However, the new code assumed
that Scylla was also run with
--authenticator PasswordAuthenticator --authorizer CassandraAuthorizer
so that the default role of "cassandra" has a valid, non-null, password
(namely, "cassandra"). If the developer ran Scylla manually without
these options, the test initialization code broke, and all tests in the
suite failed.
This patch fixes this breakage. You can now run the Alternator test
suite against Scylla run manually without any of the aforementioned
options, and everything will work except some tests in test_authorization.py
will fail as expected.
This patch has no affect on the usual test.py or test/alternator/run
runs, as they already run Scylla with all the aforementioned options
and weren't exposed to the problem fixed here.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18957
storage_proxy is responsible for intersecting the range of the read
with tablets, and calling replica with a single tablet range, therefore
it makes sense to avoid touching memtables of tablets that don't
intersect with a particular range.
Note this is a performance issue, not correctness one, as memtable
readers that don't intersect with current range won't produce any
data, but cpu is wasted until that's realized (they're added to list
of readers in mutation_reader_merger, more allocations, more data
sources to peek into, etc).
That's also important for streaming e.g. after decommission, that
will consume one tablet at a time through a reader, so we don't want
memtables of streamed tablets (that weren't cleaned up yet) to
be consumed.
Refs #18904.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18907
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".
Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:
storage_proxy - No mapping for :: in the passed effective replication map
It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so noone should need to contact it.
Users which need replica list stability should use the host_id-based API.
Fixes#18843
This change supports changing replication factor in tablets-enabled keyspaces.
This covers both increasing and decreasing the number of tablets replicas through
first building topology mutations (`alter_keyspace_statement.cc`) and then
tablets/topology/schema mutations (`topology_coordinator.cc`).
For the limitations of the current solution, please see the docs changes attached to this PR.
Fixes: #16129Closesscylladb/scylladb#16723
* github.com:scylladb/scylladb:
test: Do not check tablets mutations on nodes that don't have them
test: Fix the way tablets RF-change test parses mutation_fragments
test/tablets: Unmark RF-changing test with xfail
docs: document ALTER KEYSPACE with tablets
Return response only when tablets are reallocated
cql-pytest: Verify RF is changes by at most 1 when tablets on
cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
Reject ALTER with 'replication_factor' tag
Implement ALTER tablets KEYSPACE statement support
Parameterize migration_manager::announce by type to allow executing different raft commands
Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks
Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Allow query_processor to check if global topo queue is empty
Introduce new global topo `keyspace_rf_change` req
New raft cmd for both schema & topo changes
Add storage service to query processor
tablets: tests for adding/removing replicas
tablet_allocator: make load_balancer_stats_manager configurable by name
If there is no mapping from host id to ip while a node is in bootstrap
state there is no point adding it to pending endpoint since write
handler will not be able to map it back to host id anyway. If the
transition sate requires double writes though we still want to fail.
In case the state is write_both_read_old we fail the barrier that will
cause topology operation to rollback and in case of write_both_read_new
we assert but this should not happen since the mapping is persisted by
this point (or we failed in write_both_read_old state).
Fixes: scylladb/scylladb#18676
On the next boot there is no host ID to IP mapping which causes node to
crash again with "No mapping for :: in the passed effective replication map"
assertion.
`tablets_options->erase(it);` invalidates `it`, but it's still referred
to later in the code in the last `else`, and when that code is invoked,
we get a `heap-use-after-free` crash.
Fixes: #18926Closesscylladb/scylladb#18936
This patch cleans up a bit the code in Alternator which splits up
the operation's X-Amz-Target header (the second part of it is the
name of the operation, e.g., CreateTable).
The patch doesn't change any functionality or change performance in
any meaningful way. I was just reviewing this code and was annoyed by
the unnecessary variable and unnecessary creation of strings and
vectors for such a simple operation - and wanted to clean it up.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18830
The test creates two partitions and passes them through the reader, but
the partitions are out-of-order. This is benign but best to fix it
anyway.
Found after bumping validation level inside the compactor.
Closesscylladb/scylladb#18848
Tests in test_tombstone_gc.py are parametrized with string instead
of bool values. Fix that. Use the value to create a keyspace with
or without tablets.
Fixes: #18888.
Closesscylladb/scylladb#18893
otherwise when compiling with the new seastar, which removed
`#include <variant>` from `std-compat.hh`, the {mode}-headers
target would fail to build, like:
```
./data_dictionary/storage_options.hh:34:29: error: no template named 'variant' in namespace 'std'
10:45:15 using value_type = std::variant<local, s3>;
10:45:15 ~~~~~^
10:45:15 ./data_dictionary/storage_options.hh:35:5: error: unknown type name 'value_type'; did you mean 'std::_Bit_const_iterator::value_type'?
10:45:15 value_type value = local{};
10:45:15 ^~~~~~~~~~
10:45:15 std::_Bit_const_iterator::value_type
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18921
This commit adds the information that the manual recovery procedure
is not supported if tablets are enabled.
In addition, the content in the Manual Recovery Procedure is reorganized
by adding the Prerequisites and Procedure subsections - in this way,
we can limit the number of Note and Warning boxes that made the page
hard to follow.
Fixes https://github.com/scylladb/scylladb/issues/18895Closesscylladb/scylladb#18935
This patch adds a test for issue #18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.
Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.
The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.
The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before #18719 was fixed, a lot of work
was done there (more than half of the work done in the right group).
After the issue was fixed in the previous patch, the work on the wrong
scheduling group went down to zero.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently only the user tenant (statement scheduling group) and system
(default scheduling group) tenants exist, as we used to have only
user-initiated operations and sytem (internal) ones. Now there is need
to distinguish between two kinds of system operation: foreground and
background ones. The former should use the system tenant while the
latter will use the new maintenance tenant (streaming scheduling group).
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.
Until now, we assumed that the ERMs are synchronized while calling
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.
This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.
Fixes: #17786Fixes: #18709Closesscylladb/scylladb#18816
This series ignores errors in `load_history()` to prevent `abort_requested_exception` coming from `get_repair_module().check_in_shutdown()` from escaping during `repair_service::stop()`, causing
```
repair_service::~repair_service(): Assertion `_stopped' failed.
```
Fixes https://github.com/scylladb/scylladb/issues/18889
Backport to 6.0 required due to 523895145dClosesscylladb/scylladb#18890
* github.com:scylladb/scylladb:
repair: load_history: warn and ignore all errors
repair_service: debug stop
The check is performed by selecting from mutation_fragments(table), but
it's known that this query crashes Scylla when there's no tablet replica
on that node.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the test changes RF from 2 to 3, the extra node executes "rebuild"
transition which means that it streams tablets replicas from two other
peers. When doing it, the node receives two sets of sstables with
mutations from the given tablet. The test part that checks if the extra
node received the mutations notices two mutation fragments on the new
replica and errorneously fails by seeing, that RF=3 is not equal to the
number of mutations found, which is 4.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Up until now we waited until mutations are in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.
We want to ensure that when the replication factor
of a keyspace changes, it changes by at most 1 per DC
if it uses tablets. The rationale for that is to make
sure that the old and new quorums overlap by at least
one node.
After these changes, attempts to change the RF of
a keyspace in any DC by more than 1 will fail.
This patch removes the support for the "wildcard" replication_factor
option for ALTER KEYSPACE when the keyspace supports tablets.
It will still be supported for CREATE KEYSPACE so that a user doesn't
have to know all datacenter names when creating the keyspace,
but ALTER KEYSPACE will require that and the user will have to
specify the exact change in replication factors they wish to make by
explicitly specifying the datacenter names.
Expanding the replication_factor option in the ALTER case is
unintuitive and it's a trap many users fell into.
See #8881, #15391, #16115
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.
Since ALTER KS requires creating topology_change raft command, some
functions need to be extended to handle it. RAFT commands are recognized
by types, so some functions are just going to be parameterized by type,
i.e. made into templates.
These templates are instantiated already, so that only 1 instances of
each template exists across the whole code base, to avoid compiling it
in each translation unit.
Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through sytem.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.
With current implementation only 1 global topo req can be executed at a
time, so when ALTER KS is executed, we'll have to check if any other
global topo req is ongoing and fail the req if that's the case.
This PR ensures that CDC keeps working correctly in the recovery
mode after leaving the raft-based topology.
We update `system.cdc_local` in `topology_state_load` to ensure
a node restarting in the recovery mode sees the last CDC generation
created by the topology coordinator.
Additionally, we extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator.
Fixesscylladb/scylladb#17409Fixesscylladb/scylladb#17819Closesscylladb/scylladb#18820
* github.com:scylladb/scylladb:
test: test_topology_recovery_basic: test CDC during recovery
test: util: start_writes_to_cdc_table: add FIXME to increase CL
test: util: start_writes_to_cdc_table: allow restarting with new cql
storage_service: update system.cdc_local in topology_state_load
This patch fixes two bugs in `test_topology_ops`:
1. The values of `tablets_enabled` were nonempty strings, so they
always evaluated to `True` in the if statement responsible for
enabling writing workers only if tablets are disabled. Hence, the
writing workers were always disabled.
2. The `topology_experimental_raft suite` uses tablets by default,
so we need a config with empty `experimental_features` to disable
them.
Ensuring this test works with and without tablets is considered
a part of 6.0, so we should backport this patch.
Closesscylladb/scylladb#18900
Now the loading code has two different paths, and only one of them
switches sched group. It's cleaner and more natural to switch the sched
group in the loader itself, so that all code paths run in it and don't
care switching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the user wants to change the default initial tablets value, it uses ALTER KEYSPACE statement. However, specifying `WITH tablets = { initial: $value }` will take no effect, because statement analyzer only applies `tablets` parameters together with the `replication` ones, so the working statement should be `WITH replication = $old_parameters AND tablets = ...` which is not very convenient.
This PR changes the analyzer so that altering `tablets` happens independently from `replication`. Test included.
fixes: #18801Closesscylladb/scylladb#18899
* github.com:scylladb/scylladb:
cql-pytest: Add validation of ALTER KEYSPACE WITH TABLETS
cql3: Fix parsing of ALTER KEYSPACE's tablets parameters
cql3: Remove unused ks_prop_defs/prepare_options() argument
before this change, we rely on `seastar/util/std-compat.hh` to
include the used headers provided by stdandard library. this was
necessary before we moved to a C++20 compliant standard library
implementation. but since Seastar has dropped C++17 support. its
`seastar/util/std-compat.hh` is not responsible for providing these
headers anymore.
so, in this change, we include the used headers directly instead
of relying on `seastar/util/std-compat.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18883
This commit adds the main description of tablets and their
benefits.
The article can be used as a reference in other places
across the docs where we mention tablets.
Closesscylladb/scylladb#18619
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count.
Fixes#18085.
Closesscylladb/scylladb#18287
* github.com:scylladb/scylladb:
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
There's a test that checks how ALTER changes the initial tablets value,
but it equips the statement with `replication` parameters because of
limitations that parser used to impose. Now the `tablets` parameters can
come on their own, so add a new test. The old one is kept from
compatibility considerations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the `WITH` doesn't include the `replication` parameters, the
`tablets` one is ignoded, even if it's present in the statement. That's
not great, those two parameter sets are pretty much independent and
should be parsed individually.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, the call to `get_repair_module().check_in_shutdown()`
may throw `abort_requested_exception` that causes
`repair_service::stop()` to fail, and trigger assertion
failure in `~repair_service`.
We alredy ignore failure from `update_repair_time`,
so expand the logic to cover the whole function body.
Fixesscylladb/scylladb#18889
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`test_topology_ops` is flaky, which has been uncovered by gating
in scylladb/scylladb#18707. However, debugging it is harder than it
should be because write workers can flood the logs. They may send
a lot of failed writes before the test fails. Then, the log file
can become huge, even up to 20 GB.
Fix this issue by stopping a write worker after the first error.
This test is important for 6.0, so we can backport this change.
Closesscylladb/scylladb#18851
Raft service levels are read-only in recovery mode. This patch adds check and proper error message when a user tries to modify service levels in recovery mode.
Fixes https://github.com/scylladb/scylladb/issues/18827Closesscylladb/scylladb#18841
* github.com:scylladb/scylladb:
test/auth_cluster/test_raft_service_levels: try to create sl in recovery
service/qos/raft_sl_dda: reject changes to service levels in recovery mode
service/qos/raft_sl_dda: extract raft_sl_dda steps to common function
iwyu is short for "include what you use". this workflow is added to
identify missing "#include" and extraneous "#include" in C++ source
files.
This workflow is triggered when a pull request is created targetting
the "master" branch. It uses the clang-include-cleaner tool provided
by clang-tools package to analyze all the ".cc" and ".hh" source files.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18122
Note we're suppressing a UBSanitizer overflow error in UTs. That's
because our linter complains about a possible overflow, which never
happens, but tests are still failing because of it.
This is needed, because the same name cannot be used for 2 separate
entities, because we're getting double-metrics-registration error, thus
the names have to be configurable, not hardcoded.
Users of prepared statement reference it with the help of "smart" pointers. None of the users are supposed to modify the object they point to, so mark the respective pointer type as `pointer<const prepared_statement>`. Also mark the fields of prepared statement itself with const's (some of them already are)
Closesscylladb/scylladb#18872
* github.com:scylladb/scylladb:
cql3: Mark prepared_statement's fields const
cql3: Define prepared_statement weak pointer as const
In topology on raft, management of CDC generations is moved to the
topology coordinator. We extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator. A node restarting in the recovery mode
should learn about the active generation from `system.cdc_local`
(or from gossip, but we don't want to rely on it). Then, it should
load its data from `system.cdc_generations_v3`.
Fixesscylladb/scylladb#17409
This patch allows us to restart writing (to the same table with
CDC enabled) with a new CQL session. It is useful when we want to
continue writing after closing the first CQL session, which
happens during the `reconnect_driver` call. We must stop writing
before calling `reconnect_driver`. If a write started just before
the first CQL session was closed, it would time out on the client.
We rename `finish_and_verify` - `stop_and_verify` is a better
name after introducing `restart`.
When the node with CDC enabled and with the topology on raft
disabled bootstraps, it reads system.cdc_local for the last
generation. Nodes with both enabled use group0 to get the last
generation.
In the following scenario with a cluster of one node:
1. the node is created with CDC and the topology on raft enabled
2. the user creates table T
3. the node is restarted in the recovery mode
4. the CDC log of T is extended with new entries
5. the node restarts in normal mode
The generation created in the step 3 is seen in
system_distributed.cdc_generation_timestamps but not in
system.cdc_generations_v3, thus there are used streams that the CDC
based on raft doesn't know about. Instead of creating a new
generation, the node should use the generation already committed
to group0.
Save the last CDC generation in the system.cdc_local during loading
the topology state so that it is visible for CDC not based on raft.
Fixesscylladb/scylladb#17819
The system-distributed-keyspace and view-update-generator often go in pair, because streaming, repair and sstables-loader (via distributed-loader) need them booth to check if sstable is staging and register it if it's such. The check is performed by messing directly with system_distributed.view_build_status table, and the registration happens via view-update-generator.
That's not nice, other services shouldn't know that view status is kept in system table. Also view-update-generator is a service to generae and push view updates, the fact that it keeps staging sstables list is the implementation detail.
This PR replaces dependencies on the mentioned pair of services with the single dependency on view-builder (repair, sstables-loader and stream-manager are enlightened) and hides the view building-vs-staging details inside the view_builder.
Along the way, some simplification of repair_writer_impl class is done.
Closesscylladb/scylladb#18706
* github.com:scylladb/scylladb:
stream_manager: Remove system_distributed_keyspace and view_update_generator
repair: Remove system_distributed_keyspace and view_update_generator
streaming: Remove system_distributed_keyspace and view_update_generator
sstables_loader: Remove system_distributed_keyspace and view_update_generator
distributed_loader: Remove system_distributed_keyspace and view_update_generator
view: Make register_staging_sstable() a method of view_builder
view: Make check_view_build_ongoing() helper a method of view_builder
streaming: Proparage view_builder& down to make_streaming_consumer()
repair: Keep view_builder& on repair_writer_impl
distributed_loader: Propagate view_builder& via process_upload_dir()
stream_manager: Add view builder dependency
repair_service: Add view builder dependency
sstables_loader: Add view_bulder dependency
main: Start sstables loader later
repair: Remove unwanted local references from repair_meta
Database used to be (and still is in many ways) an object used to get configuration from. Part of the configuration is the set of pre-configured scheduling groups. That's not nice, services should use each other for some real need, not as proxies to configuration. This patch patches the places that explicitly switch to statement group _not_ to use database to get the group itself.
fixes: #17643Closesscylladb/scylladb#18799
* github.com:scylladb/scylladb:
database: Don't export statement scheduling group
test: Use async attrs and cql-test-env scheduling groups
test: Use get_scheduling_groups() to get scheduling groups
api: Don't switch sched group to start/stop protocol servers
main: Don't switch sched group to start protocol servers
code: Switch to sched group in request_stop_server()
code: Switch to server sched group in start()
protocol_server: Keep scheduling group on board
code: Add scheduling group to controllers
redis: Coroutinize start() method
in 4c1b6f04, we added a concept for fmt::is_formattable<>. but it
was not ncessary. the fmt::is_formattable<> trait was enough. the
reason 4c1b6f04 was actually a leftover of a bigger change which
tried to add trait for the cases where fmt::is_formattable<> was
not able to cover. but that was based on the wrong impression that
fmt::is_formattable<> should be able to work with container types
without including, for instance `fmt/ranges.h`. but in 222dbf2c,
we include `fmt/ranges.h` in tests, where the range-alike formatter
is used, that enables `fmt::is_formattable<>` to tell that container
types are formattable.
in short, 4c1b6f04 was created based on a misunderstanding, and
it was a reduced type trait, which is proved to be not necessary.
so, in this change, it is dropped. but the type constraints is
preserved to make the build failure more explicit, if the fallback
formatter does not match with the type to be formatted by Boost.test.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18879
Separate keyspace which also behaves as system brings
little benefit while creating some compatibility problems
like schema digest mismatch during rollback. So we decided
to move auth tables into system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098Closesscylladb/scylladb#18769
this series disables operator<<:s for vector and unordered_map, and drop operator<< for mutation, because we don't have to keep it to work with these operator:s anymore. this change is a follow up of https://github.com/scylladb/seastar/issues/1544
this change is a cleanup. so no need to backport
Closesscylladb/scylladb#18866
* github.com:scylladb/scylladb:
mutation,db: drop operator<< for mutation and seed_provider_type&
build: disable operator<< for vector and unordered_map
db/heat_load_balance: include used header
test: define a more generic boost_test_print_type
test/boost: define fmt::formatter for service_level_controller_test.cc
test/boost: include test/lib/test_utils.hh
in theory, std::result_of_t should have been removed in C++20. and
std::invoke_result_t is available since C++17. thanks to libstdc++,
the tree is compiling. but we should not rely on this.
so, in this change, we replace all `std::result_of_t` with
`std::invoke_result_t`. actually, clang + libstdc++ is already warning
us like:
```
In file included from /home/runner/work/scylladb/scylladb/multishard_mutation_query.cc:9:
In file included from /home/runner/work/scylladb/scylladb/schema/schema_registry.hh:11:
In file included from /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/unordered_map:38:
Warning: /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/type_traits:2624:5: warning: 'result_of<void (noop_compacted_fragments_consumer::*(noop_compacted_fragments_consumer &))()>' is deprecated: use 'std::invoke_result' instead [-Wdeprecated-declarations]
2624 | using result_of_t = typename result_of<_Tp>::type;
| ^
/home/runner/work/scylladb/scylladb/mutation/mutation_compactor.hh:518:43: note: in instantiation of template type alias 'result_of_t' requested here
518 | if constexpr (std::is_same_v<std::result_of_t<decltype(&GCConsumer::consume_end_of_stream)(GCConsumer&)>, void>) {
|
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18835
When selecting from mutation_fragments(table) one may want to apply
token() filtering againts partition key. This doesn't work currently,
but used to crash. This patch adds a regression test for that
refs: #18637
refs: #18768
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18759
this change is inspired by clang-tidy. it warns like:
```
[752/852] Building CXX object service/CMakeFiles/service.dir/migration_manager.cc.o
Warning: /home/runner/work/scylladb/scylladb/service/migration_manager.cc:891:71: warning: 'view' used after it was moved [bugprone-use-after-move]
891 | db.get_notifier().before_create_column_family(*keyspace, *view, mutations, ts);
| ^
/home/runner/work/scylladb/scylladb/service/migration_manager.cc:886:86: note: move occurred here
886 | auto mutations = db::schema_tables::make_create_view_mutations(keyspace, std::move(view), ts);
| ^
```
in which, `view` is an instance of view_ptr which is a type with the
semantics of shared pointer, it's backed by a member variable of
`seastar::lw_shared_ptr<const schema>`, whose move-ctor actually resets
the original instance. so we are actually accessing the moved-away
pointer in
```c++
db.get_notifier().before_create_column_family(*keyspace, *view, mutations, ts)
```
so, in this change, instead of moving away from `view`, we create
a copy, and pass the copy to
`db::schema_tables::make_create_view_mutations()`. this should be fine,
as the behavior of `db::schema_tables::make_create_view_mutations()`
does not rely on if the `view` passed to it is a moved away from it or not.
the change which introduced this use-after-move was 88a5ddabce
Refs 88a5ddabceFixes#18837
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18838
since we've migrated away from the generic homebrew formatters
for range-alike containers, there is no need to keep there operator<<
around -- they were preserved in order to work with the container
formatters which expect operator<< of the elements.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
seastar provides an option named `Seastar_DEPRECATED_OSTREAM_FORMATTERS`
to enable the operator<< for `std::vector` and `std::unordered_map`,
and this option is enabled by default. but we intent to avoid using
them, so that we can use the fmt::formatter specializations when
Boost.test prints variables. if we keep these two operator<< enabled,
Boost.test would use them when printing variables to be compaired
then the check fails, but if elements in the vector or unordered_map
to be compaired does do not provide operator<<, compiling would fail.
so, in this change, let's disable these operator<< implementations.
this allows us to ditch the operator<< implementations which are
preserved only for testing.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in this header, we use `hr_logger.trace("returned _pp={}", p)` to
print a `vector<float>`, so we we need to include `fmt/ranges.h`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
fmt::is_formattable<T>::value is false, even if
* T is a container of U, and
* fmt::is_formattable<U>, and
* U can be formatted using fmt::formatter
so, we have to define a more generic boost_test_print_type()
for the all types supported by {fmt}. it will help us to ditch the
operator<< for vector and unordered_map in Seastar, and allow us
to use the fmt::formatter specialization of the element
types.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since we are moving away for operator<< based formatter, more and more
types now only have {fmt} based formatters. the same will apply to the
STL container types after ditching the generic homebrew formatter in
to_string.hh, so to be prepared for the change, let's add the
fmt::formatter for tests as well.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change was created in the same spirit of 505900f18f. because
we are deprecating the operator<< for vector and unorderd_map in
Seastar, some tests do not compile anymore if we disable these
operators. so to be prepared for the change disabling them, let's
include test/lib/test_utils.hh for accessing the printer dedicated
for Boost.test. and also '#include <fmt/ranges.h>' when necessary,
because, in order to format the ranges using {fmt}, we need to
use fmt/ranges.h.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Not only users of prepared_statement point to immutable object, but the
class itself doesn't assume modifications of its fields, so mark them
const too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The pointer points to immutable prepared_statement, so tune up the type
respectively. Tracing has its own alieas for it, fix one too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The modified lines of code intend to await the first appearance of a log
on one of the nodes.
But due to misplaced parentheses, instead of creating a list of log-awaiting
tasks with a list comprehension, they pass a generator expression to
asyncio.create_task().
This is nonsense, and it fails immediately with a type error.
But since they don't actually check the result of the await,
the test just assumes that the search completed successfully.
This was uncovered by an upgrade to Python 3.12, because its typing is stronger
and asyncio.create_task() screams when it's passed a regular generator.
This patch fixes the bad list comprehension, and also adds an error check
on the completed awaitables (by calling `await` on them).
Fixes#18740Closesscylladb/scylladb#18754
Continuation of the prevuous patch, but with its own flavor. There's a
manual test that wants to run seastar thread in statement scheduling
group and gets one from database. This patch makes it get the group from
cql-test-env and, while at it, makes it switch to that group using
thread attributes passed to async() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's such a helper in cql-test-env that other tests use to get sched
groups from. Few other tests (ab)use databse for that, this patch fixes
those remnants.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the protocol servers implementations now maintain scheduling group
on their own, so the API handler can stop caring
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method is used to stop protocol server in the runtime (via the
API). Since it's not just "kick it and wait to wrap up", it's needed to
perform this in the inherited sched group too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch makes all protocol servers implementations use the inherited
sched group in their start methods.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The groups is now mandatory for the real protocol server implementation
to initialize. Previous patch make all of them get the sched group as
constructor argument, so that's where to take it from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are four of them currently -- transport, thrift, alternator and
redis. This patch makes main pass to all the statement scheduling group
as constructor argument. Next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Callers of it had just checked if an sstable still has some views
building, so the should talk to view-builder to register the sstable
that's now considered to be staging.
Effectively. this is to hide the view-update-generator from other
services and make them communicate with the builder only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper checks if there's an ongoing build of a view, and it's in
fact internal to view-builder, who keeps its status in one of its
system tables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is on its own, nothing depends on it. Neither it can work
before system distributed keyspace is started, so move it lower.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When constructed, the class copies local references to services just to
push them into make_repair_writer() later in the same initializers list.
There's no need in keeping those references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This doesn't apply for auth-v2 as we improved data placement and
removed cassandra quirk which was setting different CL for some
default superuser involved operations.
Fixes#18773Closesscylladb/scylladb#18785
This commit enables publishing documentation
from branch-6.0. The docs will be published
as UNSTABLE (the warning about version 6.0
being unstable will be displayed).
Closesscylladb/scylladb#18832
mode
When a cluster goes into recovery mode and service levels were migrated
to raft, service levels become temporarily read-only.
This commit adds a proper error message in case a user tries to do any
changes.
When setting/dropping a service level using raft data accessor, the same
validation steps are executed (this_shard_id = 0 and guard is present).
To not duplicate the calls in both functions, they can be extracted to a
helper function.
One source of flakiness is in test_tablet_metadata_propagates_with_schema_changes_in_snapshot_mode
due to gossiper being aborted prematurely, and causing reconnection
storm.
Another is test_tablet_missing_data_repair which is flaky due an issue
in python driver that session might not reconnect on rolling restart
(tracked by https://github.com/scylladb/python-driver/issues/230)
Refs #15356.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
File-based tablet streaming calls every shard to return data of every
group that intersects with a given range.
After dynamic group allocation, that breaks as the tablet range will
only be present in a single shard, so an exception is thrown causing
migration to halt during streaming phase.
Ideally, only one shard is invoked, but that's out of the scope of this
fix and compaction_groups_for_token_range() should return empty result
if none of the local groups intersect with the range.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18798
This commit fixes the incorrect Raft-related information on the Handling Cluster Membership Change Failures page
introduced with https://github.com/scylladb/scylladb/pull/17500.
The page describes the procedure for when Raft is disabled. Since 6.0, Raft for consistent schema management
is enabled and mandatory (cannot be disabled), this commit adds the procedure for Raft-enabled setups.
Closesscylladb/scylladb#18803
Since we introduced the ability to revert migrations, we can no longer
rely on ordering of transition stages to determine whether to account
pending or leaving replica. Let's use read selector instead, which
correctly has info which replica type has correct stats info.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. It can either cause incorrect behavior or
crash if id is not available.
It's worked by feeding local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (latter should complete pretty fast, so it shouldn't block
the former for any significant time).
Fixes#18085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since commit 952dfc6157 "repair: Introduce
repair_partition_count_estimation_ratio config option", get_config() is
used. We need to include db/config.hh for that.
Spotted when backporting to 5.4 branch.
Refs #18615Closesscylladb/scylladb#18780
As part of the Alternator test suite, we check Alternator's support for
authentication. Alternator maps Scylla's existing CQL roles to AWS's
authentication:
* AWS's access_key_id <- the name of the CQL role
* AWS's secret_access_key <- the salted hash of the password of the CQL role
Before this patch, the Alternator test suite created a new role with a
preset salted hash (role "alternator", salted hash "secret_pass")
and than used that in the tests. However, with the advent of Raft-based
metadata it is wrong to write directly to the roles table, and starting
with #17952 such writes will be outright forbidden.
But we don't actually need to create a new CQL role! We already have
a perfectly good CQL role called "cassandra", and our tests already use
it. So what this patch does is to have the Alternator tests (conftest.py)
read from the roles system-table the salted hash of the "cassandra" role,
and then use that - instead of the hard-coded pair alternator/secret_pass -
in the tests.
A couple more tests assumed that the role name that was used was
"alternator", but now it was changed to "cassandra" so those tests
needed minor fixes as well.
After this patch, the Alternator tests no longer *write* to the roles
system table. Moreover, after this patch, test/alternator/run and
test/alternator/suite.yaml (used when testing with test.py) no longer
need to do extra ugly CQL setup before starting the Alternator tests.
Fixes#18744
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18771
* seastar 42f15a5f...914a4241 (33):
> sstring: deprecate formatters for vector and unordered_map
> github: use fedora:40 image for testing
> github: add 2 testing combinations back to the matrix
> github: extract test.yaml into a resusable workflow
> build: use initial-exec TLS when building seastar as shared library
> coroutine: preserve this->container before calling dtor
> smp: allocate hugepages eagerly when kernel support is available
> shared_mutex: Add tests for std::shared_lock and std::unique_lock
> shared_mutex: Add RAII locks
> README.md: replace C++17 with C++23
> treewide: do not check for SEASTAR_COROUTINES_ENABLED
> build: support enabled options when building seastar-module
> treewide: include required header files
> build: move add_subdirectory(src) down
> README.md: replace CircleCI badge with GitHub badge
> weak_ptr: Make it possible to convert to "compatible" pointers
> circleci: remove circleci CI tests
> build: use DPDK_MACHINE=haswell when testing dpdk build on github-hosted runner
> build: add --dpdk-machine option to configure.py
> build: stop translating -march option to names recognized by DPDK
> github: encode matrix.enables in cache key
> doc/prometheus.md: add metrics? in URL exporter URI
> tests/unit/metrics_tester: use deferred_stop() when appropriate
> httpd: mark http_server_control::stop() noexcept
> reactor: print scheduling group along with backtrace
> reactor: update lowres_clock when max_task_backlog is exceeded
> tests: add test for prometheus exporter
> tests: move apps/metrics_tester to tests/unit
> apps/metrics_tester: keep metrics with "private" labels
> apps/metrics_tester: support "labels" in conf.yaml
> apps/metrics_tester: stop server properly
> apps/metrics_tester: always start exporter
> apps/metrics_tester: fix typo in conf-example.yaml
Closesscylladb/scylladb#18800
There's a test that checks system.tablets contents to see that after
changing ks replication factor via ALTER KEYSPACE the tablet map is
updated properly. This patch extends this test that also validates that
mutations themselves are replicated according to the desired replication
factor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18644
Coroutinization will help improve readability and allow easier changes planned for this code.
This work was separated from https://github.com/scylladb/scylladb/pull/17910 to make it smoother to review and merge.
Closesscylladb/scylladb#18788
* github.com:scylladb/scylladb:
cql3: coroutinize create/alter/drop service levels
auth: coroutinize alter_role and drop_role
auth: coroutinize grant_permissions and revoke_permissions
auth: coroutinize create_role
cql3: statements: co-routinize auth related statements
cql3: statements: release unused guard explicitly in auth related statements
The sl-controller is stopped in three steps. The first (and instantly the second) is unsubscribing from lifecycle notification and draining. The third is stop itself. First two steps are "out of order" as compared to the desired start-stop sequence of any service, this patch fixes these steps.
After this PR the drain_on_shutdown() (the call that drains the node upon stop) finally becomes clean and tidy and is no longer accompanied by ad-hoc fellow drains/stops/aborts/whatever.
refs: #2737Closesscylladb/scylladb#18731
* github.com:scylladb/scylladb:
sl_controller: Remove drain() method
sl_controller: Move abort kicking into do_abort()
main,sl_controller: Subscribe for early abort
main: Unsubscribe sl controller next to subscribing
Currently guard is released immediately because those functions are
based on continuations and guard lifetime is not extended. In the following
commit we rewrite those functions to coroutines and lifetime will be
automatically extended. This would deadlock the client because we'd
try to take second guard inside auth code without releasing this unused
one.
In the future commits auth guard will be removed and the one from
statement will be used but this needs some more code re-arrangements.
The draining now only consists of waiting for the data update future to
resolve. It can be safely moved to .stop() (i.e. -- later) because its
stopping had already been initiated by abort-source, and no other
services depend on sl-controller to be stopped and drained.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Draining sl controller consists of two parts -- first, kicks the wrap-up
process by aborting operations, breaking semaphores, etc. It's
no-waiting part. At last there goes co_await of the completion future.
This part moves the no-waiting part into recently introduced abort
subscription, so that wrap-up starts few bits earlier.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's stop-signal in main that fires an abort source on stop. Lots of
other services are subscribed in it, add the sl-controller too. For now
it's a no-op, but next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The subscription only handles on_leave_cluster() and only for local
node, so even if controller gets subscribed for longer, it won't do any
harm.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-20 21:26:31 +03:00
1097 changed files with 13223 additions and 17942 deletions
parser.add_argument('--commits',default=None,type=str,help='Range of promoted commits.')
parser.add_argument('--pull-request',type=int,help='Pull request number to be backported')
parser.add_argument('--head-commit',type=str,required=is_pull_request(),help='The HEAD of target branch after the pull request specified by --pull-request is merged')
@@ -199,7 +199,7 @@ The `scylla.yaml` file in the repository by default writes all database data to
Scylla has a number of requirements for the file-system and operating system to operate ideally and at peak performance. However, during development, these requirements can be relaxed with the `--developer-mode` flag.
Additionally, when running on under-powered platforms like portable laptops, the `--overprovisined` flag is useful.
Additionally, when running on under-powered platforms like portable laptops, the `--overprovisioned` flag is useful.
[](https://github.com/scylladb/scylladb/actions/workflows/seastar.yaml) [](https://github.com/scylladb/scylladb/actions/workflows/reproducible-build.yaml) [](https://github.com/scylladb/scylladb/actions/workflows/clang-nightly.yaml)
See [test.py manual](docs/dev/testing.md).
## Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and
Thrift. There is also support for the API of Amazon DynamoDB™,
By default, Scylla is compatible with Apache Cassandra and its API - CQL.
There is also support for the API of Amazon DynamoDB™,
which needs to be enabled and configured in order to be used. For more
information on how to enable the DynamoDB™ API in Scylla,
and the current compatibility of this feature as well as Scylla-specific extensions, see
"description":"The plugin ID, describe the component the metric belongs to. Examples are cache, thrift, etc'. Regex are supported.The plugin ID, describe the component the metric belong to. Examples are: cache, thrift etc'. regex are supported",
"description":"The plugin ID, describe the component the metric belongs to. Examples are cache and alternator, etc'. Regex are supported.",
"summary":"Triggers read barrier for the given Raft group to wait for previously committed commands in this group to be applied locally. For example, can be used on group 0 to wait for the node to obtain latest schema changes.",
"type":"string",
"nickname":"read_barrier",
"produces":[
"application/json"
],
"parameters":[
{
"name":"group_id",
"description":"The ID of the group. When absent, group0 is used.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"timeout",
"description":"Timeout in seconds after which the endpoint returns a failure. If not provided, 60s is used.",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.