The goal of merge is to reduce the tablet count for a shrinking table, just as split increases the count while the table is growing. The load balancer's decision to merge was already implemented (it came with the infrastructure introduced for split), but it wasn't handled until now.
The initial tablet count is respected while the table is in "growing mode"; the table leaves that mode once it needs to split above the initial tablet count. After the table leaves the mode, the average tablet size can be trusted to determine that the table is shrinking. A merge decision is emitted if the average tablet size drops to 50% of the target. Hysteresis is applied to avoid oscillation between splits and merges.
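To make the thresholds concrete, here is a minimal sketch of the decision with hysteresis; the names and the split threshold are illustrative assumptions, not the balancer's actual code:
```
#include <cstdint>

enum class resize_decision { none, split, merge };

// avg and target are the average and target tablet sizes in bytes.
resize_decision decide(uint64_t avg, uint64_t target) {
    if (avg > target) {           // illustrative split threshold
        return resize_decision::split;
    }
    if (avg < target / 2) {       // merge only at 50% of the target...
        return resize_decision::merge;
    }
    return resize_decision::none; // ...the gap in between is the hysteresis
}
```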
Similar to split, the decision to merge is recorded in the tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so the new coordinator continues from where the old one left off.
Unlike split, the preparation phase for merge is not done by the replica (with split compactions), but rather by the coordinator, by co-locating sibling tablets in the same shard of the same node. Sibling tablets are tablets that cover contiguous ranges and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets with ids 0 and 1 are siblings, and so are 2 and 3.
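Under the power-of-two constraint, the sibling relation (and the id a pair collapses to after merge) comes down to simple bit arithmetic; a sketch, with a plain integer standing in for the actual tablet id type:
```
#include <cstdint>

using tablet_id = uint64_t; // stand-in for locator's tablet id type

// The sibling of tablet i is i with the lowest bit flipped: 0<->1, 2<->3, ...
tablet_id sibling_of(tablet_id id) {
    return id ^ 1;
}

// After merge halves the tablet count, the pair (2k, 2k+1) becomes tablet k.
tablet_id merged_id(tablet_id id) {
    return id >> 1;
}
```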
The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it emits migrations so that the "odd" tablet follows the "even" one; for example, tablet 1 is migrated to wherever tablet 0 lives. Co-location is low in priority: it's not the end of the world to delay a merge, but it is not ideal to delay e.g. decommission or even regular load balancing, as that could translate into temporary imbalance, impacting user activity. So co-location migrations will happen only when there is no more important work to do.
While regular balancing is higher in priority, it will not undo the co-location work done so far; it achieves that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so the balancer understands when two tablets are being migrated instead of one, to avoid oscillations.
When the balancer completes co-location work for a table undergoing merge, it puts the table's id into the resize_plan, which communicates to the topology coordinator that the table is ready. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map in group0. All the replicas react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.
Fixes #18181.
system test details:
test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml
instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.
latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode: write
99.9th: 3.145727ms
99th: 1.998847ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: read
99.9th: 3.145727ms
99th: 2.031615ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: write
99.9th: 3.047423ms
99th: 1.933311ms
99.9th: 3.047423ms
99th: 1.933311ms
Mode: read
99.9th: 3.145727ms
99th: 1.900543ms
99.9th: 3.145727ms
99th: 1.900543ms
Mode: write
99.9th: 5.079039ms
99th: 3.604479ms
99.9th: 35.389439ms
99th: 25.624575ms
Mode: write
99.9th: 3.047423ms
99th: 1.998847ms
99.9th: 3.047423ms
99th: 1.998847ms
Mode: read
99.9th: 3.080191ms
99th: 2.031615ms
99.9th: 3.112959ms
99th: 2.031615ms
```
Closes scylladb/scylladb#20572
* github.com:scylladb/scylladb:
docs: Document tablet merging
tests/boost: Add test to verify correctness of balancer decisions during merge
tests/topology_experimental_raft: Add tablet merge test
service: Handle exception when retrying split
service: Co-locate sibling tablets for a table undergoing merge
gms: Add cluster feature for tablet merge
service: Make merge of resize plan commutative
replica: Implement merging of compaction groups on merge completion
replica: Handle tablet merge completion
service: Implement tablet map resize for merge
locator: Introduce merge_tablet_info()
service: Rename topology::transition_state::tablet_split_finalization
service: Respect initial_tablet_count if table is in growing mode
service: Wire migration_tablet_set into the load balancer
locator: Add tablet_map::sibling_tablets()
service: Introduce sorted_replicas_for_tablet_load()
locator/tablets: Extend tablet_replica equality comparator to three-way
service: Introduce alias to per-table candidate map type
service: Add replication constraint check variant for migration_tablet_set
service: Add convergence check variant for migration_tablet_set
service: Add migration helpers for migration_tablet_set
service/tablet_allocator: Introduce migration_tablet_set
service: Introduce migration_plan::add(migrations_vector)
locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
locator/tablets: Introduce tablet_map::needs_merge()
locator/tablets: Introduce resize_decision::initial_decision()
locator/tablets: Fix return type of three-way comparison operators
service: Extract update of node load on migrations
service: Extract converge check for intra-node migration
service: Extract erase of tablet replicas from candidate list
scripts/tablet-mon: Allow visualization of tablet id
In the current scenario, if during startup a node crashes after initiating gossip and before joining group0, it keeps floating in the gossiper forever, because the raft-based gossiper purging logic is only effective once the node joins group0. This orphan node hinders a successor node with the same IP from joining the cluster, since the successor collides with it during the gossiper shadow round.
This commit fixes the issue by adding a background thread which periodically checks for such orphan entries in the gossiper and removes them.
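A minimal sketch of the periodic-cleanup idea; the names here are hypothetical and a raw std::jthread stands in for the gossiper's actual background task machinery:
```
#include <chrono>
#include <functional>
#include <stop_token>
#include <thread>

std::jthread start_orphan_reaper(std::function<void()> purge_orphans,
                                 std::chrono::seconds period) {
    return std::jthread([purge = std::move(purge_orphans), period] (std::stop_token st) {
        while (!st.stop_requested()) {
            purge(); // drop gossiper entries for nodes that never joined group0
            std::this_thread::sleep_for(period);
        }
    });
}
```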
A test is also added to verify this logic. The test fails without the background thread enabled, hence verifying the behavior.
Fixes: scylladb/scylladb#20082
Closes scylladb/scylladb#21600
The migration process performs reads with consistency level ALL, requiring all nodes to be alive.
Fixes scylladb/scylladb#20754
The PR should be backported to 6.2, since that version has the view builder on group0.
Closes scylladb/scylladb#21708
* github.com:scylladb/scylladb:
test/topology_custom/test_view_build_status: add reproducer
service/topology_coordinator: migrate view builder only if all nodes are up
This series adds WCU support for the delete item operation.
It also splits the Alternator WCU metric by an ops label, to give us better visibility of how much each op contributes to the WCU calculation.
No need to backport to the open source.
Closes scylladb/scylladb#21709
* github.com:scylladb/scylladb:
test_returnconsumedcapacity.py: Add delete Item tests
alternator/executor: Add WCU support for delete item
alternator/executor: use uint in describe_item
alternator/consumed_capacity.hh: Make the total_bytes public
test_metrics validate split wcu_total to ops
Alternator: split WCU metrics into ops
And split it into two -- one for materialized views, another for secondary indexes. This is to fit the current cqlpy layout, which has different files for views and indexes.
refs: #21552
refs: #21551 (detached this patch from there, as that PR needs fix in
the core code)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#21677
With commits ed7d352e7d and bb1867c7c7, we now have input streams for both compressed and uncompressed SSTables that provide seamless checksum and digest checking. The code for these was based on `validate_checksums()`, which implements its own validation logic over raw streams. This has led to some duplicate code.
This PR deduplicates the uncompressed case by modifying `validate_checksums()` to use a checksummed input stream instead of a raw stream. The same cannot be done for compressed SSTables though. The reason is that `validate_checksums()` needs to examine the whole data file, even if an invalid chunk is encountered. In the checksummed case we support that by offloading the error handling logic from the data source via a function parameter. In the compressed data source we cannot do that because it needs to return decompressed data and decompression may fail if the data are invalid.
This PR also enables `validate_checksums()` to partially verify SSTables with just the per-chunk checksums if the digest is missing.
In more detail, this PR consists of:
* Port of some integrity checks from `do_validate_uncompressed()` to the checksummed data source. It should now be able to detect corruption due to truncated or appended chunks (expected number of chunks is retrieved from the CRC component).
* Introduction of an `error_handler` parameter in the checksummed data source and `data_stream()` (see the sketch after this list).
* Refactoring of `validate_checksums()`. The JSON response of `sstable validate-checksums` was also modified to report a missing digest.
* Tests for `validate_checksums()` against SSTables with truncated data, appended data, invalid digests, or no digest.
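A rough sketch of the error_handler idea: instead of throwing on the first bad chunk, the reader reports each problem through a callback and keeps scanning, so the whole file gets examined. All names and the chunk layout here are illustrative, not Scylla's actual code:
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <span>
#include <string>
#include <vector>

using error_handler = std::function<void(const std::string&)>;

// Stand-in checksum for the sketch; the real code uses CRC32.
uint32_t checksum_of(std::span<const uint8_t> data) {
    uint32_t h = 0;
    for (auto b : data) { h = h * 31u + b; }
    return h;
}

bool validate_chunks(const std::vector<std::vector<uint8_t>>& chunks,
                     const std::vector<uint32_t>& expected, // from the CRC component
                     const error_handler& on_error) {
    bool valid = true;
    if (chunks.size() != expected.size()) {
        // Truncated or appended chunks show up as a count mismatch.
        on_error("unexpected number of chunks");
        valid = false;
    }
    for (std::size_t i = 0; i < std::min(chunks.size(), expected.size()); ++i) {
        if (checksum_of(chunks[i]) != expected[i]) {
            on_error("checksum mismatch in chunk " + std::to_string(i));
            valid = false; // report and continue with the next chunk
        }
    }
    return valid;
}
```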
Refs #19058.
This PR is a hybrid of cleanup and feature. No backport is needed.
Closes scylladb/scylladb#20933
* github.com:scylladb/scylladb:
tools/scylla-sstable: Rename valid_checksums -> valid
test: Check validate_checksums() with missing digest
sstables: Allow validate_checksums() to report missing digests
sstables: Refactor validate_checksums() to use checksummed data stream
sstables: Add error_handler parameter to data_stream()
sstables: Add error handler in checksummed data source
sstables: Check for excessive chunks in checksummed data source
sstables: Check for premature EOF in checksummed data source
test: test_validate_checksums: Check SSTable with invalid digest
test: test_validate_checksums: Check SSTable with appended data
test: test_validate_checksums: Complement test for truncated SSTable
This implements the ability for the balancer to co-locate sibling tablets on the same shard. Co-location is low in priority, so regular load balancing is preferred over it. Previous changes allowed the balancer to move co-located sibling tablets together, so as not to undo the co-location work done so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This transition state will be reused by merge completion, so let's rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename the functions involved similarly.
The old name "tablet split finalization" is deprecated but still recognized, and points to the correct transition. Otherwise, the reverse lookup would fail when populating the topology system table whose last state was split finalization.
NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This rather large patch series moves the storage proxy and some adjacent services (like the migration manager) to use host ids rather than IPs to identify nodes. The messaging service gains the capability to address nodes by host ids (which allows dropping translations from topology coordinator code that already worked on host ids) and also makes sure that a node with an incorrect host id will reject a message (which can happen during address changes).
The series gets rid of the raft address map completely and replaces it with the gossiper address map, which is managed by the gossiper, since translation is now done in the layer below raft.
Fixes: scylladb/scylladb#6403
perf-simple-query -- smp 1 -m 1G output
Before:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41291 insns/op, 24485 cycles/op, 0 errors)
62669.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41277 insns/op, 24695 cycles/op, 0 errors)
69172.12 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41326 insns/op, 24463 cycles/op, 0 errors)
56706.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41143 insns/op, 24513 cycles/op, 0 errors)
56416.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41186 insns/op, 24851 cycles/op, 0 errors)
throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70
After:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 40733 insns/op, 23145 cycles/op, 0 errors)
59283.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40624 insns/op, 23948 cycles/op, 0 errors)
70851.03 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40625 insns/op, 23027 cycles/op, 0 errors)
70549.61 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40650 insns/op, 23266 cycles/op, 0 errors)
68634.96 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40622 insns/op, 22935 cycles/op, 0 errors)
throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/
Tested mixed cluster manually.
* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
group0: drop unused field from replace_info struct
test: rename raft_address_map_test to address_map_test and move it from raft tests
raft_address_map: remove raft address map
topology coordinator: do not modify expire state for left/new nodes any more in raft address map
topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
group0: drop raft address map dependency from raft_rpc
group0: move raft_ticker_type definition from raft_address_map.hh
storage_service: do not update raft address map on gossiper events
group0: drop raft address map dependency from raft_server_with_timeouts
group0: move group0 upgrade code to host ids
repair: drop raft address map dependency
group0: remove unused raft address map getter from raft_group0
group0: drop raft address map from group0_state_machine dependency since it is not used there any more
group0: remove dependency on raft address map from group0_state_id_handler
gossiper: add get_application_state_ptr that searches by host_id
gossiper: change get_live_token_owners to return host ids
view: move view building to host id
hints: use host id to send hints
storage_proxy: remove id_vector_to_addr since it is no longer used
db: consistency_level: change is_sufficient_live_nodes to work on host ids
...
This patch adds three basic tests for delete item: a simple one that validates that a short delete item returns 1 WCU.
The second tries to delete a missing item.
The third stores a bigger item and uses ReturnValues='ALL_OLD', so that the API returns the previously stored item, and checks that the WCU is as expected.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch modifies the post_item WCU test to validate that it uses the right ops. Note that the test will pass even before this change, but we want to validate the extra label.
Currently, the tri-compare operator for big_decimal (operator <=>) uses a precise but potentially very expensive algorithm for comparing the numbers: it first brings them to the same scale, then compares the normalized unscaled values. big_decimal has arbitrary precision, therefore the stored numbers can be arbitrarily large.
In extreme cases, comparing two numbers can result in huge amounts of memory being allocated, and in stalls. If this type is used in the primary key of a table, these comparisons can make the node completely unresponsive.
This patch adds the following fast-paths to operator <=> (see the sketch after this list):
* An early return for the case of equal scales.
* An early return for different signs.
* An early return for the case where one or both of the numbers are 0.
* A fast algorithm for detecting the case where there is a big
difference between the two numbers. This algorithm works only with the
scales and is able to compare the two numbers using only one division
and some additions and subtractions. It is imprecise, and when the
numbers are closer than its confidence window, it falls back to the
current slow but precise tri-compare.
All but the last case should have been fast before as well, but the
scale-compare algorithm makes a huge difference. Numbers which would
previously make the node unresponsive now compare in constant time.
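A simplified sketch of these fast paths, with a fixed-width unscaled value standing in for big_decimal's arbitrary-precision integer, and digit counting standing in for the patch's scale-only magnitude estimate:
```
#include <compare>
#include <cstdint>

// value = unscaled * 10^(-scale); illustrative stand-in for big_decimal.
struct decimal {
    int64_t unscaled;
    int32_t scale;
};

// floor(log10(|value|)), estimated as digits10(|unscaled|) - 1 - scale.
int approx_exponent(const decimal& d) {
    uint64_t u = d.unscaled < 0 ? -static_cast<uint64_t>(d.unscaled)
                                : static_cast<uint64_t>(d.unscaled);
    int digits = 1;
    while (u >= 10) { u /= 10; ++digits; }
    return digits - 1 - d.scale;
}

// Exact fallback: bring both to the common scale (fine for this sketch;
// the real type uses arbitrary-precision arithmetic here).
std::strong_ordering slow_precise_compare(const decimal& a, const decimal& b) {
    __int128 ua = a.unscaled, ub = b.unscaled;
    for (int s = a.scale; s < b.scale; ++s) ua *= 10;
    for (int s = b.scale; s < a.scale; ++s) ub *= 10;
    return ua < ub ? std::strong_ordering::less
         : ua > ub ? std::strong_ordering::greater
                   : std::strong_ordering::equal;
}

std::strong_ordering compare(const decimal& a, const decimal& b) {
    int sa = (a.unscaled > 0) - (a.unscaled < 0);
    int sb = (b.unscaled > 0) - (b.unscaled < 0);
    if (sa != sb) return sa <=> sb;                           // different signs
    if (sa == 0) return std::strong_ordering::equal;          // both zero
    if (a.scale == b.scale) return a.unscaled <=> b.unscaled; // equal scales
    // Magnitudes far apart (beyond the estimate's confidence window):
    // for positive numbers the larger exponent wins; reversed for negative.
    int ea = approx_exponent(a), eb = approx_exponent(b);
    if (ea > eb + 1) return sa > 0 ? std::strong_ordering::greater
                                   : std::strong_ordering::less;
    if (eb > ea + 1) return sa > 0 ? std::strong_ordering::less
                                   : std::strong_ordering::greater;
    return slow_precise_compare(a, b);
}
```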
Fixes: scylladb/scylladb#21716
Closes scylladb/scylladb#21715
now that we are allowed to use C++23, we have the luxury of using `std::views::transform`.
in this change, we:
- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate, where `boost::algorithm::join()` is not applicable to a range view returned by `std::views::transform`
- use `std::ranges::fold_left()` to accumulate the range returned by `std::views::transform`
- use `std::ranges::fold_left()` to get the maximum element in the range returned by `std::views::transform`
- use `std::ranges::min()` to get the minimal element in the range returned by `std::views::transform`
- use `std::ranges::equal()` to compare the range views returned by `std::views::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`, to feed `std::views::transform()` a view range
to reduce the dependency on boost for better maintainability, and to leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible.
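as a small illustration of the first two items, with hypothetical data (not taken from the codebase):
```
#include <ranges>
#include <string>
#include <vector>
#include <fmt/ranges.h>

int main() {
    std::vector<std::string> names{"ks1", "ks2", "ks3"};
    auto quote = [] (const std::string& s) { return "'" + s + "'"; };
    // before: boost::algorithm::join(names | boost::adaptors::transformed(quote), ", ")
    // after: fmt::join() over the standard view, since boost::algorithm::join()
    // cannot consume the view returned by std::views::transform.
    fmt::print("{}\n", fmt::join(names | std::views::transform(quote), ", "));
}
```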
limitations:
there are still a couple of places where we keep using
`boost::adaptors::transformed`, due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21700
Add tablet task manager module and keep it in storage_service.
Introduce tablet_virtual_task that covers tablet repair.
Thanks to a repair virtual task, a user can check the list of pending
repairs, get the status of a specific repair, or abort it using the task
manager API.
Fixes: #21368.
No backport, new feature
Closes scylladb/scylladb#21624
* github.com:scylladb/scylladb:
test: add test to check tablet repair tasks
test: topology_tasks: enable tablets
service: keep tablets module in storage_service
service: rename storage_service::_task_manager_module
service: add tablet_virtual_task
tasks: utilize preliminary virtual task lookup
In commit 8bf62a0 we introduced a test/pytest.ini which affects every
run of pytest in the project. One specific line in that file,
log_cli = true
overrides pytest's standard CLI output, which is traditionally short unless the "-v" (verbose) option is used, to be always long and spammy.
There is absolutely no reason to do that - if the user wants to run
"pytest -v", they can do that - it doesn't need to be the default.
Moreover, as https://docs.pytest.org/en/stable/how-to/logging.html
explains, the "log_cli = true" was added in pytest 3.4 to revert to
pytest 3.3 behavior that "community feedback" showed was NOT LIKED.
Why would we want to revert to behavior that wasn't liked?
After this patch, which removes that line, the output of commands
like
cd test/cqlpy; pytest
returns to what it used to be before commit 8bf62a0 and what the
pytest developers intended. Users who like verbose output can use
"pytest -v".
Fixes #21712
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#21717
Scrub compaction can pick up input sstables from the maintenance sstable set, but on compaction completion it doesn't update the maintenance set, leaving the original sstable in the set after it has been scrubbed. To fix this, compaction completion has to update the maintenance sstable set if the input originated from there. This PR solves the issue by updating the correct sstable_sets on compaction completion.
Fixes #20030
This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.
Closes scylladb/scylladb#21582
* github.com:scylladb/scylladb:
compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
compaction_group: rename `update_main_sstable_list_on_compaction_completion`
compaction_group: update maintenance sstable set on scrub compaction completion
compaction_group: store table::sstable_list_builder::result in replacement_desc
table::sstable_list_builder: remove old sstables only from current list
table::sstable_list_builder: return removed sstables from build_new_list
In commit 9ff9cd37c3 we added in
test/alternator/test_number.py a workaround for a boto3 bug that
prevented us (and still prevents us) from testing numbers with high
precision. Because the workaround was so bizarre, the three lines it
requires - two imports and an assignment - were preceded by a 5-line
comment explaining it.
Unfortunately, a later commit 93b9b85c12
went and arbitrarily moved import lines around to satisfy some PEP-8
"requirements", resulting in the comment being separated from the lines
it was supposed to explain.
This patch moves the comment in front of the main line it explains.
The two imports that are needed just for this line and aren't used
elsewhere remain in their current place (where the PEP8 police demands
they stay), but this is less important for the understanding of this
trick so it's fine.
No functionality of the test was changed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#21635
In the current scenario, 'test_replace_with_encryption' only confirms replacement with inter-dc encryption for normal nodes. This commit increases the coverage of the test by parametrizing it to confirm the behavior for zero token node replacement as well. The test also implicitly provides coverage for bootstrap with encryption of zero token nodes.
This PR increases coverage for existing code, hence we need to backport it. Since only version 6.2 has zero token node support, we only backport it to 6.2.
Fixes: scylladb/scylladb#21096
Closes scylladb/scylladb#21609
The raft address map is no longer used to resolve addresses anyway, so drop the dependency on it from raft_ip_address_updater, and rename the class to reflect that it is no longer raft address map specific.
In this rather large patch we move to addressing nodes in the storage proxy by host ids instead of IPs. Some subsystems that the storage proxy calls into are not yet converted to host ids, so we translate back and forth when we interact with them.
Currently the locator::topology object, when created, does not contain the local node, but it is already used to access the local database. This sort of works now because there are explicit checks in the code to handle this special case, like in topology::get_location for instance. We do not want to hack around it, and instead rely on an invariant that the local node is always there. To that end we add the local node during locator::topology creation. There is a catch though: unlike the IP, the host ID is not known during startup. We actually need to read from the database to know it, so the topology starts with host ID zero, which then changes once to the real one. This is not a problem though: as long as the (one node) topology is consistent (_cfg.this_host_id is equal to the node's id), local access will work.
The function looks up the provided host id in gossip_address_map and throws unknown_address if the mapping is not available. Otherwise it sends the message to the IP found.
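A minimal sketch of the lookup-then-send flow, with simple stand-in types (none of these are the real messaging service interfaces):
```
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

using host_id = uint64_t;          // stand-in for locator::host_id
using inet_address = std::string;  // stand-in for gms::inet_address

struct unknown_address : std::runtime_error {
    explicit unknown_address(host_id id)
        : std::runtime_error("no address mapping for host " + std::to_string(id)) {}
};

class messaging {
    std::unordered_map<host_id, inet_address> _gossip_address_map;

    void send_to_ip(const inet_address& ip, const std::string& msg) {
        std::cout << "sending to " << ip << ": " << msg << "\n";
    }
public:
    void add_mapping(host_id id, inet_address ip) { _gossip_address_map[id] = std::move(ip); }

    // Resolve the host id via the gossiper-managed address map and send
    // to the IP found; throw if the mapping is not available.
    void send(host_id id, const std::string& msg) {
        auto it = _gossip_address_map.find(id);
        if (it == _gossip_address_map.end()) {
            throw unknown_address(id);
        }
        send_to_ip(it->second, msg);
    }
};
```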
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.
Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency
Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions
Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios
Fixes scylladb/scylladb#21724
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21726
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Demote --scylla-data-dir and --scylla-yaml-file to schema source helpers, rather than schema sources in themselves. This practically means that when these options are used, they won't define where the tool will attempt to load the schema from; they will just help locate the schema, for whichever schema source the tool was instructed to use (or left to choose).
--scylla-data-dir and --scylla-yaml-file being schema sources were
problematic with encryption at rest and for S3 support (not yet
implemented). With encryption, the tool needs access to the
configuration, so --scylla-yaml-file is often used to provide the path
to the configuration file, which contains encryption configuration,
needed for the tool to decrypt the sstable. Currently, using this option
implies forcing the tool to read the schema from the schema tables,
which is a problematic option for tests -- Scylla might be compacting a
schema sstable and this will make the tool fail to load the schema.
Demoting these options to schema helpers allows providing them, while at the same time having the option to use a different schema source.
To allow the user to force the tool to load the schema from the schema
tables, a new --schema-tables option is added. Similarly, a
--sstable-schema option is introduced to force the tool to load the
schema from the sstable itself.
With this, each of the 4 schema sources now has an option to force its use. There are various helper options to be used along
with these.
The documentation as well as the tests are updated with the changes.
The schema-related documentation gets a rather extensive facelift,
because it was a bit out-of-date and incomplete.
Fixes: scylladb/scylladb#20534
Closes scylladb/scylladb#21678
Once e.g. `ALTER KEYSPACE` is performed, all in-memory objects should be updated accordingly, but this is not entirely true for the keyspace metadata object. The reason is that keyspace metadata is stored in 2 system tables: `system_schema.keyspaces` and `system_schema.scylla_keyspaces`. Up until now the in-memory keyspace metadata object has been updated only with entries from the first table, and it missed updates when entries from the 2nd table changed. Such entries are e.g. initial tablets or storage options.
This change fixes the oversight by considering both tables when checking if keyspace metadata needs to be updated. From the implementation point of view, the change is simple: we also consider `system_schema.scylla_keyspaces` in `merge_keyspaces()`, and if the old and new schemas differ, we include that when altering the keyspace.
Fixes #20768
Backport: no need; I don't think the issue is severe. At the moment it seems like it can only influence the tablets number, which should not bring the cluster down nor result in returning bad data; it mostly influences the speed of the db.
Closes scylladb/scylladb#20852
Tablets are no longer an experimental feature, but the topology_tasks test suite treats them as if they were.
Enable tablets with their own config option in topology_tasks suite.
Scrub compaction can pick up input sstables from the maintenance sstable set, but on compaction completion it doesn't update the maintenance set, leaving the original sstable in the set after it has been scrubbed. To fix this, compaction completion has to update the maintenance sstable set if the input originated from there.
This patch modifies `update_sstable_sets_on_compaction_completion` to remove the input sstable from the maintenance sstable set if it exists in that set (see the sketch below).
Also added a testcase to verify the fix.
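A sketch of the fix's shape, with stand-in types rather than the real compaction_group interfaces:
```
#include <memory>
#include <unordered_set>

struct sstable {};
using sstable_set = std::unordered_set<std::shared_ptr<sstable>>;

void update_sstable_sets_on_compaction_completion(
        sstable_set& main_set,
        sstable_set& maintenance_set,
        const sstable_set& inputs,
        const sstable_set& outputs) {
    for (const auto& in : inputs) {
        // A scrubbed sstable may have originated from the maintenance set,
        // so erase it from whichever set actually holds it.
        if (!main_set.erase(in)) {
            maintenance_set.erase(in);
        }
    }
    // Compaction outputs land in the main set.
    main_set.insert(outputs.begin(), outputs.end());
}
```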
Fixes #20030
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Enable the `enable_tablets` configuration flag in object store tests to better
align with production environments, where it is enabled by default via the
`scylla.yaml` in Scylla's relocatable tarball. This change will improve test
coverage of tablet-related features.
Previously, `enable_tablets` defaulted to false in tests, creating a mismatch
with typical production deployments.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Previously, the progress of download_task_impl launched by the "restore" API
was not tracked. Since restore operations can involve large data transfers,
this makes it difficult for users to monitor progress.
The restore process happens in two sequential steps:
1. Open specified SSTables from object storage
2. Download and stream mutation fragments from the opened SSTables to
mapped destinations
While both steps contribute to overall progress, they use different units
of measurement, making a unified progress metric challenging. Because the load-and-stream step (step 2) is the most time-consuming part of the restore, this change implements progress tracking for that step as an initial improvement, to provide users with partial visibility into the restore operation.
Fixes scylladb/scylladb#21427
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
`coroutine::parallel_for_each` accepts both a range and a pair of
iterators. let's use the former when appropriate. it is simpler this way.
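for illustration, the range form, assuming a trivial per-item task (not taken from the patch):
```
#include <chrono>
#include <vector>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <seastar/coroutine/parallel_for_each.hh>

seastar::future<> process_all(std::vector<int> items) {
    // before: coroutine::parallel_for_each(items.begin(), items.end(), ...)
    co_await seastar::coroutine::parallel_for_each(items, [] (int i) -> seastar::future<> {
        co_await seastar::sleep(std::chrono::milliseconds(i)); // stand-in for real work
    });
}
```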
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21684
ScyllaDB doesn't support counters with tablets yet. So scylla-sstable
tests which use counter schema are marked with xfail, but this is done
too aggressively, disabling too many tests that are otherwise fine.
There are two tests affected:
* test_scylla_sstable_script - this test uses early return when the
schema parameter is the one with counters and tablets are enabled.
This is still too eager because tablets are now always enabled. Also, the early return hides the fact that this test is disabled.
So change the check to test whether tablets are used on the test keyspace, and use xfail instead of a sneaky early return.
* test_scylla_sstable_dump_data - this test is blanket-disabled when run with the tablets parameter, even though only 1 out of the 5 schemas tested uses counters. Remove the blanket xfail and only add it when the test keyspace uses tablets and the schema parameter is the one with counters.
This makes dozens of tests run again, restoring the test coverage lost to the too-eager use of xfail (and the sneaky return).
Refs: #18180
Closes scylladb/scylladb#21685
in 0dff187b7a, we dropped `InjectingHandler.log_message()`, but this
method was defined to override the default implementation provided by
`BaseHTTPRequestHandler.log_message()`. this change flooded the standard
output when testing `aws_error_injection_test` with `test.py` with
logging messages like:
```
127.0.0.1 - - [26/Nov/2024 17:27:34] "PUT /?Policy=0&Key=%2Ftest%2Ftestobject-large-817295 HTTP/1.1" 200
127.0.0.1 - - [26/Nov/2024 17:27:34] "PUT /?Policy=1&Key=%2Ftest%2Ftestobject-large-817306 HTTP/1.1" 200
```
this is unexpected.
in this change, we bring this method back, and additionally, we
format the logging message lazily.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21689
When users start an operation asynchronously with the API, they are expected to check the operation's status. Hence, the status should be kept in the task manager for a reasonable time after the operation is done. Operations that are started internally usually don't need to stay in the task manager for that long.
Add api_task_ttl, which will be used for tasks started with the API. By default it's 1 hour. The time for which non-API tasks stay in the task manager isn't changed.
Fixes: #21499.
Refs: #21425.
No backport needed - previous versions may use task_ttl
Closes scylladb/scylladb#21505
* github.com:scylladb/scylladb:
test: add test to check user_task_ttl
tasks: api: move make_task method
docs: nodetool: update backup and restore commands docs
docs: update task manager docs
nodetool: add nodetool tasks user-ttl command
node_ops: use user task ttl for node ops virtual task
tasks: use user_task_ttl for tasks started by user
api: task_manager: add /task_manager/user_ttl to get and set user task ttl
tasks: add task_manager::task::is_user_task method
tasks: keep updateable_value of task_ttl in task manager
db: config: add user_task_ttl_seconds named value
Stop taking snapshots of MVs, and allow taking a snapshot of individual tables: now one can take a snapshot of any base table, any view, or any index. Also add tests to cover the new cases, both a boost test (using C++ code) and a pytest (using the API).
Also, update the documentation to reflect the change.
Fixes: #21339
Fixes: #20760
Closes scylladb/scylladb#21433
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.
Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views
Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices
Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
- `std::ranges::to` adaptor requires rvalue references
- Necessitated updates to variable and parameter constness
- Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
from `common` to enable efficient range conversion
2. Range Iteration and Mutation
- Range views may mutate internal state during iteration
- Cannot pass ranges by const reference in some scenarios
- Solution: Pass ranges by rvalue reference to explicitly indicate
state invalidation
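A small sketch of the pattern, with hypothetical data (not from the codebase): the standard filter view materialized with `std::ranges::to`, which is where the rvalue/constness considerations above come in:
```
#include <ranges>
#include <vector>

int main() {
    std::vector<int> xs{1, 2, 3, 4, 5, 6};
    // before: xs | boost::adaptors::filtered(is_even)
    auto is_even = [] (int x) { return x % 2 == 0; };
    auto evens = xs | std::views::filter(is_even)
                    | std::ranges::to<std::vector>();
    (void)evens; // {2, 4, 6}
}
```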
Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change
This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21648