Commit Graph

7938 Commits

Author SHA1 Message Date
Michael Litvak
373855b493 service/qos/service_level_controller: update cache on startup
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.

The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.

Fixes scylladb/scylladb#21763

Closes scylladb/scylladb#21773
2024-12-11 12:05:28 +01:00
Tomasz Grabiec
440a96605f Merge 'topology_custom/test_tablets: add remove/replace tests for edge cases' from Benny Halevy
Test cases related to #21826:

1. test_remove_failure_with_no_normal_token_owners_in_dc: attempts to
    remove a node with another node down in the datacenter, leaving
    no normal token owners in that dc (reproducing #21826).
    Removenode is expected to fail in this case since it
    should have no place to rebuild the removed node replicas,
    yet it currently succeeds unexpectedly.

2. test_remove_failure_then_replace: verify that removenode
    fails as expected when there are not enough nodes to
    rebuild its replicas on, with and without additional zero-token nodes.

3. test_replace_with_no_normal_token_owners_in_dc: verify that
    nodes can be replaced in a datacenter that has no live
    token owners, with and without additional zero-token nodes.
    Tablet replace uses all replicas to rebuild the lost replicas
    and therefore should succeed in the edge case.
    The restored data is verified as well.

Refs #21826

* New tests, no backport needed

Closes scylladb/scylladb#21827

* github.com:scylladb/scylladb:
  topology_custom/test_tablets: add remove/replace tests for edge cases
  test: pylib: _cluster_remove_node: log message on successful paths
  test: pylib: _cluster_remove_node: mark server as removed only when removenode succeeded
2024-12-11 12:04:14 +01:00
Benny Halevy
b95312064f topology_custom/test_tablets: add remove/replace tests for edge cases
Test cases related to #21826:

1. test_remove_failure_with_no_normal_token_owners_in_dc: attempts to
remove a node with another node down in the datacenter, leaving
no normal token owners in that dc (reproducing #21826).
Removenode is expected to fail in this case since it
should have no place to rebuild the removed node replicas,
yet it currently succeeds unexpectedly.

2. test_remove_failure_then_replace: verify that removenode
fails as expected when there are not enough nodes to
rebuild its replicas on, with and without additional zero-token nodes.

3. test_replace_with_no_normal_token_owners_in_dc: verify that
nodes can be replaced in a datacenter that has no live
token owners, with and without additional zero-token nodes.
Tablet replace uses all replicas to rebuild the lost replicas
and therefore should succeed in the edge case.
The restored data is verified as well.

Refs #21826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-10 21:39:15 +02:00
Tomasz Grabiec
8e60a0b831 Merge 'truncate: make TRUNCATE TABLE safe with tablets' from Ferenc Szili
Currently truncating a table works by issuing an RPC to all the nodes which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped.

It works with tablets, but is not safe. A concurrent replication process may bring back old data.

This change makes makes TRUNCATE TABLE a topology operation, so that it excludes with other processes in the system which could interfere with it. More specifically, it makes TRUNCATE a global topology request.

Backporting is not needed.

Fixes #16411

Closes scylladb/scylladb#19789

* github.com:scylladb/scylladb:
  docs: docs: topology-over-raft: Document truncate_table request
  storage_proxy: fix indentation and remove empty catch/rethrow
  test: add tests for truncate with tablets
  storage_proxy: use new TRUNCATE for tablets
  truncate: make TRUNCATE a global topology operation
  storage_service: move logic of wait_for_topology_request_completion()
  RPC: add truncate_with_tablets RPC with frozen_topology_guard
  feature_service: added cluster feature for system.topology schema change
  system.topology_requests: change schema
  storage_proxy: propagate group0 client and TSM dependency
2024-12-10 17:50:50 +01:00
Tomasz Grabiec
bf18a17bd6 tablets: scheduler: Fix temporary imbalance in a mixed-capacity cluster on decommission
When tablet scheduler drains nodes, it chooses target location based
on "badness" metric. Nodes with lowest score are preferred. Before the
patch, the score which was used was the number of tablets on that node
post-movement. This way we populate least-loaded node first. But this
works only if nodes have equal number of shards. If nodes have different
capacity, then number of tablets is not a good metric, because we don't
aim to equalize per-node count, but per-shard count. We assume that each
shard has equal capacity.

Because of this bug, during decommission, the nodes with fewer shards
would be preferred to receive replicas, which may lead to overloading
of those nodes. This imbalance would be later fixed by the normal load
balancing logic, but it's still problematic.

Fixes #21783

Closes scylladb/scylladb#21860
2024-12-10 14:18:03 +02:00
Benny Halevy
eeb6d3dd74 test: pylib: _cluster_remove_node: log message on successful paths
Log a message when removenode succeeded as expected
or when it failed as expected with the `expected_error`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-10 11:55:27 +02:00
Benny Halevy
cd566924b9 test: pylib: _cluster_remove_node: mark server as removed only when removenode succeeded
Currently, we call server_mark_removed also when removenode
failed with the `expected_error`, where the function returns success
but the server is not supposed to be in a removed state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-10 11:55:27 +02:00
Botond Dénes
5d040e0206 Merge 'truncate: commit log replay positions are not saved correctly' from Ferenc Szili
TRUNCATE TABLE saves the current commit log replay positions in case there is a crash so that replay knows where to begin replaying the mutations. These are collected and saved per shard into `system.truncated`. In case a shard received no mutations, its replay position will be an empty, default constructed object of type `db::replay_position` with its members set to 0. Truncate will incorrectly interpret these empty replay positions as if they were coming from shard 0, and save them as such, potentially overwriting an actual valid replay position coming from the actual shard 0. In the case of a crash, this will cause the commit log on shard 0 to be replayed from the beginning, and result with data resurrection.

Fixes #21719

Closes scylladb/scylladb#21722

* github.com:scylladb/scylladb:
  test: add test for truncate saving replay positions
  database: correctly save replay position for truncate
2024-12-10 10:05:30 +02:00
Botond Dénes
924189c50e Merge 'replica/table: improve error message when encountering orphaned sstables' from Lakshmi Narayanan Sreethar
On startup, if a server reads an sstable that belongs to a tablet that
doesn't have any local replica, it throws an error in the following
format and refuses to start :

```
Storage wasn't found for tablet 1 of table test.test
```

This patch updates the code path to throw a nicer error that includes
the sstable name that caused the problem.

This patch also adds a testcase to verify the error being thrown.

Fixes https://github.com/scylladb/scylladb/issues/18038

PR improves an error message - no need to backport.

Closes scylladb/scylladb#21805

* github.com:scylladb/scylladb:
  replica/table: fix indent in compaction_group_for_sstable
  replica/table: improve error message when encountering orphaned sstables
2024-12-10 06:34:12 +02:00
Kefu Chai
ce2f80c227 treewide: migrate from boost::make_iterator_range to ranges::subrange
Replace boost::make_iterator_range() with std::ranges::subrange.

This change improves code modernization and reduces external dependencies:

- Replace boost::make_iterator_range() with std::ranges::subrange
- Remove boost/range/iterator_range.hpp include
- Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range>

This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21787
2024-12-09 21:31:53 +02:00
Kefu Chai
48c8d24345 treewide: drop support for fmt < v10
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to support fmt < 10.

in this change, we drop the support fmt < 10.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21847
2024-12-09 20:42:38 +02:00
Ferenc Szili
e65a235fd5 test: add tests for truncate with tablets
This patch adds the unit tests for truncate with tablets.

test_truncate_while_migration() triggers a tablet migration, then runs
a TRUNCATE TABLE for the table containing the tablet being migrated.
test_truncate_with_concurrent_drop() starts a truncate, then attempts to
drop the table while it is being truncated.
test_truncate_while_node_restart() validates the case where a replica
node is restarted while truncate is running.
test_truncate_with_coordinator_crash() validates if truncate is
correctly completed in cases where the topology coordinator has crashed
or restarted after the truncate session is cleared, but before the
truncate request is finalized.
2024-12-09 16:38:50 +01:00
Tomasz Grabiec
7e2875d648 Merge 'Add tablet merge support' from Raphael Raph Carvalho
The goal of merge is to reduce the tablet count for a shrinking table. Similar to how split increases the count while the table is growing. The load balancer decision to merge is implemented today (came with infrastructure introduced for split), but it wasn't handled until now.

Initial tablet count is respected while the table is in "growing mode". For example, the table leaves it if there was a need to split above the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. Merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillations between split and merges.

Similar to split, the decision to merge is recorded in tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so new coordinator continues from where the old left off.

Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator by co-locating sibling tablets in the same node's shard. We can define sibling tablets as tablets that have contiguous range and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets of ids 0 and 1 are siblings, 2 and 3 are also siblings.

The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it will emit migrations so that "odd" tablet will follow the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location is low in priority, it's not the end of the world to delay merge, but it's not ideal to delay e.g. decommission or even regular load balancing as that can translate into temporary unbalancing, impacting the user activities. So co-location migrations will happen when there is no more important work to do.
While regular balancing is higher in priority, it will not undo the co-location work done so far. It does that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so balancer understand when two tablets are being migrated instead of one, to avoid oscillations.

When balancer completes co-location work for a table undergoing merge, it will put the id of the table into the resize_plan, which is about communicating with the topology coordinator that a table is ready for it. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map into group0. All the replicas will react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.

Fixes #18181.

system test details:

test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml

instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.

latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode:			 write
  99.9th:	 3.145727ms
  99th:		 1.998847ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 2.031615ms
  99.9th:	 3.145727ms
  99th:		 2.031615ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.933311ms
  99.9th:	 3.047423ms
  99th:		 1.933311ms
Mode:			 read
  99.9th:	 3.145727ms
  99th:		 1.900543ms
  99.9th:	 3.145727ms
  99th:		 1.900543ms
Mode:			 write
  99.9th:	 5.079039ms
  99th:		 3.604479ms
  99.9th:	 35.389439ms
  99th:		 25.624575ms
Mode:			 write
  99.9th:	 3.047423ms
  99th:		 1.998847ms
  99.9th:	 3.047423ms
  99th:		 1.998847ms
Mode:			 read
  99.9th:	 3.080191ms
  99th:		 2.031615ms
  99.9th:	 3.112959ms
  99th:		 2.031615ms
```

Closes scylladb/scylladb#20572

* github.com:scylladb/scylladb:
  docs: Document tablet merging
  tests/boost: Add test to verify correctness of balancer decisions during merge
  tests/topology_experimental_raft: Add tablet merge test
  service: Handle exception when retrying split
  service: Co-locate sibling tablets for a table undergoing merge
  gms: Add cluster feature for tablet merge
  service: Make merge of resize plan commutative
  replica: Implement merging of compaction groups on merge completion
  replica: Handle tablet merge completion
  service: Implement tablet map resize for merge
  locator: Introduce merge_tablet_info()
  service: Rename topology::transition_state::tablet_split_finalization
  service: Respect initial_tablet_count if table is in growing mode
  service: Wire migration_tablet_set into the load balancer
  locator: Add tablet_map::sibling_tablets()
  service: Introduce sorted_replicas_for_tablet_load()
  locator/tablets: Extend tablet_replica equality comparator to three-way
  service: Introduce alias to per-table candidate map type
  service: Add replication constraint check variant for migration_tablet_set
  service: Add convergence check variant for migration_tablet_set
  service: Add migration helpers for migration_tablet_set
  service/tablet_allocator: Introduce migration_tablet_set
  service: Introduce migration_plan::add(migrations_vector)
  locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
  locator/tablets: Introduce tablet_map::needs_merge()
  locator/tablets: Introduce resize_decision::initial_decision()
  locator/tablets: Fix return type of three-way comparison operators
  service: Extract update of node load on migrations
  service: Extract converge check for intra-node migration
  service: Extract erase of tablet replicas from candidate list
  scripts/tablet-mon: Allow visualization of tablet id
2024-12-06 18:06:20 +01:00
Lakshmi Narayanan Sreethar
fa10b0b390 replica/table: improve error message when encountering orphaned sstables
On startup, if a server reads an sstable that belongs to a tablet that
doesn't have any local replica, it throws an error in the following
format and refuses to start :

```
Storage wasn't found for tablet 1 of table test.test
```

This patch updates the code path to throw a nicer error that includes
the sstable name that caused the problem.

This patch also adds a testcase to verify the error being thrown.

Fixes #18038
2024-12-06 21:22:24 +05:30
Abhinav
6c90a25014 Fix gossiper orphan node floating problem by adding a remover fiber
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.

This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.

A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.

Fixes: scylladb/scylladb#20082

Closes scylladb/scylladb#21600
2024-12-06 10:45:07 +01:00
Piotr Dulikowski
def51e252d Merge 'service/topology_coordinator: migrate view builder only if all nodes are up' from Michał Jadwiszczak
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.

Fixes scylladb/scylladb#20754

The PR should be backported to 6.2, this version has view builder on group0.

Closes scylladb/scylladb#21708

* github.com:scylladb/scylladb:
  test/topology_custom/test_view_build_status: add reproducer
  service/topology_coordinator: migrate view builder only if all nodes are up
2024-12-06 09:07:07 +01:00
Avi Kivity
9024e4940c counters.hh: drop unused boost includes
Re-add them to source files that need them.

Closes scylladb/scylladb#21738
2024-12-05 12:27:41 +02:00
Nadav Har'El
86a8ca8a9f Merge 'Alternator add WCU for delelte item' from Amnon Heiman
This series adds WCU support for the delete item operation.
It also splits the Alternator WCU metric by an ops label to give us better visibility of how much each ops contributes to the WCU calculation.

No need to backport to the open source

Closes scylladb/scylladb#21709

* github.com:scylladb/scylladb:
  test_returnconsumedcapacity.py: Add delete Item tests
  alternator/executor: Add WCU support for delete item
  alternator/executer use uint in describe_item
  alternator/consumed_capacity.hh: Make the total_bytes public
  test_metrics validate split wcu_total to ops
  Alternato: split WCU metrics into ops
2024-12-05 11:27:20 +02:00
Pavel Emelyanov
dd8f56ad3a test: Move test_query_built_indexes_virtual_table from boost to cqlpy
And split it into two -- one for materialized view, another for
secondary index. This is to fit current cqlpy layout that has different
files for views and indexes.

refs: #21552
refs: #21551 (detached this patch from there, as that PR needs fix in
the core code)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21677
2024-12-05 09:17:23 +02:00
Raphael S. Carvalho
8344722a26 tests/boost: Add test to verify correctness of balancer decisions during merge
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-04 13:11:11 -03:00
Raphael S. Carvalho
76ab293505 tests/topology_experimental_raft: Add tablet merge test
Passed ./test.py --mode=dev ... --repeat=50.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-04 13:11:07 -03:00
Ferenc Szili
7f29b7d8f6 storage_proxy: propagate group0 client and TSM dependency
This commit makes storage_proxy::remote dependent on raft_group0_client
and topology_state_machine. storage_proxy::remote gets references to these via
the call to start_remote(). These references will be needed to call
storage_service::truncate_table_with_tablets().
2024-12-04 11:30:06 +01:00
Botond Dénes
f55dc71c3f Merge 'Use checksummed input streams in validate_checksums()' from Nikos Dragazis
With commits ed7d352e7d and bb1867c7c7, we now have input streams for both compressed and uncompressed SSTables that provide seamless checksum and digest checking. The code for these was based on `validate_checksums()`, which implements its own validation logic over raw streams. This has led to some duplicate code.

This PR deduplicates the uncompressed case by modifying `validate_checksums()` to use a checksummed input stream instead of a raw stream. The same cannot be done for compressed SSTables though. The reason is that `validate_checksums()` needs to examine the whole data file, even if an invalid chunk is encountered. In the checksummed case we support that by offloading the error handling logic from the data source via a function parameter. In the compressed data source we cannot do that because it needs to return decompressed data and decompression may fail if the data are invalid.

This PR also enables `validate_checksums()` to partially verify SSTables with just the per-chunk checksums if the digest is missing.

In more detail, this PR consists of:
* Port of some integrity checks from `do_validate_uncompressed()` to the checksummed data source. It should now be able to detect corruption due to truncated or appended chunks (expected number of chunks is retrieved from the CRC component).
* Introduction of `error_handler` parameter in checksummed data source and `data_stream()`.
* Refactoring of `validate_checksums()`. The JSON response of `sstable validate-checksums` was also modified to report a missing digest.
*  Tests for `validate_checksums()` against SSTables with truncated data, appended data, invalid digests, or no digest.

Refs #19058.

This PR is a hybrid of cleanup and feature. No backport is needed.

Closes scylladb/scylladb#20933

* github.com:scylladb/scylladb:
  tools/scylla-sstable: Rename valid_checksums -> valid
  test: Check validate_checksums() with missing digest
  sstables: Allow validate_checksums() to report missing digests
  sstables: Refactor validate_checksums() to use checksummed data stream
  sstables: Add error_handler parameter to data_stream()
  sstables: Add error handler in checksummed data source
  sstables: Check for excessive chunks in checksummed data source
  sstables: Check for premature EOF in checksummed data source
  test: test_validate_checksums: Check SSTable with invalid digest
  test: test_validate_checksums: Check SSTable with appended data
  test: test_validate_checksums: Complement test for truncated SSTable
2024-12-04 10:46:18 +02:00
Raphael S. Carvalho
3e518c7b23 service: Co-locate sibling tablets for a table undergoing merge
This implements the ability for the balancer to co-locate sibling
tablets on the same shard. Co-location is low in priority, so
regular load balancer is preferred over it. Previous changes
allowed balancer to move co-located sibling tablets together,
to not undo the co-location work done so far.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 23:55:43 -03:00
Raphael S. Carvalho
e00798f1b1 service: Rename topology::transition_state::tablet_split_finalization
This transition state will be reused by merge completion, so let's
rename it to tablet_resize_finalization.
The completion handling path will also be reused, so let's rename
functions involved similarly.

The old name "tablet split finalization" is deprecated but still
recognized and points to the correct transition. Otherwise, the
reverse lookup would fail when populating topology system table
which last state was split finalization.

NOTE:
I thought of adding a new tablet_merge_finalization, but it would
complicate things since more than one table could be ready for
either split or merge, so you need a generic transition state
for handling resize completion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Avi Kivity
841481c202 Merge "move storage proxy and adjacent services to identify hosts by ids" from Gleb
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).

The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.

Fixes: scylladb/scylladb#6403

perf-simple-query -- smp 1 -m 1G output

Before:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41291 insns/op,   24485 cycles/op,        0 errors)
62669.58 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41277 insns/op,   24695 cycles/op,        0 errors)
69172.12 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41326 insns/op,   24463 cycles/op,        0 errors)
56706.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41143 insns/op,   24513 cycles/op,        0 errors)
56416.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41186 insns/op,   24851 cycles/op,        0 errors)

         throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
  cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70

After:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   40733 insns/op,   23145 cycles/op,        0 errors)
59283.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40624 insns/op,   23948 cycles/op,        0 errors)
70851.03 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40625 insns/op,   23027 cycles/op,        0 errors)
70549.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40650 insns/op,   23266 cycles/op,        0 errors)
68634.96 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40622 insns/op,   22935 cycles/op,        0 errors)

         throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
  cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/

Tested mixed cluster manually.
"

* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
  group0: drop unused field from replace_info struct
  test: rename raft_address_map_test to address_map_test and move if from raft tests
  raft_address_map: remove raft address map
  topology coordinator: do not modify expire state for left/new nodes any more in raft address map
  topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
  group0: drop raft address map dependency from raft_rpc
  group0: move raft_ticker_type definition from raft_address_map.hh
  storage_service: do not update raft address map on gossiper events
  group0: drop raft address map dependency from raft_server_with_timeouts
  group0: move group0 upgrade code to host ids
  repair: drop raft address map dependency
  group0: remove unused raft address map getter from raft_group0
  group0: drop raft address map from group0_state_machine dependency since it is not used there any more
  group0: remove dependency on raft address map from group0_state_id_handler
  gossiper: add get_application_state_ptr that searches by host_id
  gossiper: change get_live_token_owners to return host ids
  view: move view building to host id
  hints: use host id to send hints
  storage_proxy: remove id_vector_to_addr since it is no longer used
  db: consistency_level: change is_sufficient_live_nodes to work on host ids
  ...
2024-12-03 18:18:48 +02:00
Amnon Heiman
d2ca1ebfa0 test_returnconsumedcapacity.py: Add delete Item tests
This patch adds three basic tests for delete item. A simple one that
validate that a simple short delete item returns 1 WCU.
The second tries to delete a missing item.
The third stores a bigger item and use the ReturnValues='ALL_OLD' to
make the API gets the previous stored item and see that the WCU is as
expected.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-12-03 15:55:41 +02:00
Amnon Heiman
f4c79d7728 test_metrics validate split wcu_total to ops
This patch modify the post_item WCU test to validate that it uses the
right ops. Note that the test will pass even before this change but we
want to validate the extra label.
2024-12-03 15:55:41 +02:00
Botond Dénes
b6a9c79af3 utils/big_decimal: add fast paths to operator <=>
Currently, the tri-compare operator for big_decimal (operator <=>), uses
a precise but potentially very expensive algorithm for comparing the
numbers: it first brings them to the same scale, then compares the
normalized unscaled values. big_decimal has abritrary precisions,
therefore the stored numbers can be arbitrarily large.
In extreme cases, comparing two numbers can result in huge amount of
memory allocated and stalls. If this type is used int he primary key of
a table, these comparisons can make the node completely unresponsive.

This patch adds the following fast-paths to operator <=>:
* An early return for the case of equal scales.
* An early return for different signs.
* An early return for the case where one or both of the numbers are 0.
* A fast algorithm for detecting the case where the there is a big
  difference between the two numbers. This algorithm works only with the
  scales and is able to compare the two numbers by using only one division
  and some additions and substractions. This algorithm is imprecise and
  when the numbers are closer than its confidence window, it will
  fall-back to the current slow but precise tri-compare.

All but the last case should have been fast before as well, but the
scale-compare algorithm makes a huge difference. Numbers, which would
previously make the node unresponsive, now compare in constant-time.

Fixes: scylladb/scylladb#21716

Closes scylladb/scylladb#21715
2024-12-03 14:56:51 +02:00
Michał Jadwiszczak
dab3256dc1 test/topology_custom/test_view_build_status: add reproducer
The test reproduces scylladb/scylladb#20754
2024-12-03 10:17:26 +01:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Botond Dénes
b87fb94a5e Merge 'tasks: add tablet repair virtual task' from Aleksandra Martyniuk
Add tablet task manager module and keep it in storage_service.
Introduce tablet_virtual_task that covers tablet repair.

Thanks to a repair virtual task, a user can check the list of pending
repairs, get the status of a specific repair, or abort it using the task
manager API.

Fixes: #21368.

No backport, new feature

Closes scylladb/scylladb#21624

* github.com:scylladb/scylladb:
  test: add test to check tablet repair tasks
  test: topology_tasks: enable tablets
  service: keep tablets module in storage_service
  service: rename storage_service::_task_manager_module
  service: add tablet_virtual_task
  tasks: utilize preliminary virtual task lookup
2024-12-02 17:22:44 +02:00
Nadav Har'El
c45ddb964f pytest: don't override default live-logging setting
In commit 8bf62a0 we introduced a test/pytest.ini which affects every
run of pytest in the project. One specific line in that file

    log_cli = true

Overrides pytest's standard CLI output, which is traditionally short
unless the "-v" (verbose) option is used, to be always long and spammy.
There is absolutely no reason to do that - if the user wants to run
"pytest -v", they can do that - it doesn't need to be the default.

Moreover, as https://docs.pytest.org/en/stable/how-to/logging.html
explains, the "log_cli = true" was added in pytest 3.4 to revert to
pytest 3.3 behavior that "community feedback" showed was NOT LIKED.
Why would we want to revert to behavior that wasn't liked?

After this patch, which removes that line, the output of commands
like
    cd test/cqlpy; pytest

return to what they used to be before commit 8bf62a0 and what the
pytest developers intended. Users who like verbose output can use
"pytest -v".

Fixes #21712

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21717
2024-12-02 17:00:51 +02:00
Avi Kivity
58baeac0ad Merge 'compaction: update maintenance sstable set on scrub compaction completion' from Lakshmi Narayanan Sreethar
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.

Fixes #20030

This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.

Closes scylladb/scylladb#21582

* github.com:scylladb/scylladb:
  compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
  compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
  compaction_group: rename `update_main_sstable_list_on_compaction_completion`
  compaction_group: update maintenance sstable set on scrub compaction completion
  compaction_group: store table::sstable_list_builder::result in replacement_desc
  table::sstable_list_builder: remove old sstables only from current list
  table::sstable_list_builder: return removed sstables from build_new_list
2024-12-02 13:32:49 +02:00
Nadav Har'El
6d37b53653 test/alternator: move comment next to bizarre code that it explains
In commit 9ff9cd37c3 we added in
test/alternator/test_number.py a workaround for a boto3 bug that
prevented us (and still prevents us) from testing numbers with high
precision. Because the workaround was so bizarre, the three lines it
requires - two imports and an assignment - were preceded by a 5-line
comment explaining it.

Unfortunately, a later commit 93b9b85c12
went and arbitrarily moved import lines around to satisfy some PEP-8
"requirements", resulting in the comment being separated from the lines
it was supposed to explain.

This patch moves the comment in front of the main line it explains.
The two imports that are needed just for this line and aren't used
elsewhere remain in their current place (where the PEP8 police demands
they stay), but this is less important for the understanding of this
trick so it's fine.

No functionality of the test was changed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21635
2024-12-02 10:56:09 +01:00
Abhinav
acd643bd75 test: Parametrize 'replacement with inter-dc encryption' test to confirm behavior in zero token node cases.
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.

This PR increases coverage for existing code. Hence we need to backport it. Since only 6.2 version has zero
token node support, hence we only backport it to 6.2

Fixes: scylladb/scylladb#21096

Closes scylladb/scylladb#21609
2024-12-02 10:32:46 +01:00
Gleb Natapov
1028ce17cd test: rename raft_address_map_test to address_map_test and move if from raft tests
It has nothing to do with raft now.
2024-12-02 10:31:14 +02:00
Gleb Natapov
96309224ff raft_address_map: remove raft address map
It is no longer used.
2024-12-02 10:31:14 +02:00
Gleb Natapov
c65f64cc5f storage_service: do not update raft address map on gossiper events
Raft address map is not use any longer to resolve addresses anyway, so
drop dependency on it from raft_ip_address_updater and rename it to
reflect that it is no longer raft address map specific.
2024-12-02 10:31:13 +02:00
Gleb Natapov
12937aeb7f storage_proxy: move to addressing nodes by host ids instead of ips
In this rather large path we mode to address nodes in storage proxy by
host ids instead of ips. Some subsystems storage proxy calls to are
not yet converted to host ids, so we translate back and forth when we
interact with them.
2024-12-02 10:31:11 +02:00
Gleb Natapov
0882f2024c locator: topology: make topology object always contain local node
Currently the locator::topology object, when created, does not contain
local node, but it is started to be used to access local database. It
sort of work now because there are explicit checks in the code to handle
this special case like in topology::get_location for instance. We do not
want to hack around it and instead rely on an invariant that the local
node is always there. To do that we add local node during
locator::topology creation. There is a catch though. Unlike with IP host
ID is not known during startup. We actually need to read from the
database to know it, so the topology starts with host ID zero and then
it changes once to the real one. This is not a problem though. As long as
the (one node) topology is consistent (_cfg.this_host_id is equal to the
node's id) local access will work.
2024-12-02 10:31:11 +02:00
Gleb Natapov
1c5a7826dc storage_service: pass gossip_address_map
It will be used in the following patches.
2024-12-02 10:31:11 +02:00
Gleb Natapov
79358278f2 service: raft: move raft pinger to sending messages by host id
This allows us to drop dependency on raft_address_map from direct_fd_pinger.
2024-12-02 10:31:11 +02:00
Andrei Chekun
6c267bbc70 test.py: Make it test/cqlpy python module
Removed all path modification and migrated to python way of importing packages. This is another small step to the one pool cluster for better scheduling and better resource utilization.

Fixes: https://github.com/scylladb/scylladb/issues/21644

Closes scylladb/scylladb#21585
2024-12-01 18:26:17 +02:00
Gleb Natapov
76aa41dfcf messaging_service: pass gossip_address_map to the mm and introduce send by id functions
The function looks up provided host id in gossip_address_map and throws
unknown_address if the mapping is not available. Otherwise it sends the
message by IP found.
2024-12-01 12:12:30 +02:00
Gleb Natapov
ca2544e57e gossiper: introduce gossip address map
Introduce new address map that will be populated by the gossiper. Create
in during initialization and pass it to the gossiper.
2024-12-01 12:12:29 +02:00
Gleb Natapov
be5caec54e service: make address_map raft independent
We want to start using address map class outside for raft, so lets make
it work on host_id instead of raft::servers_id and move is outside of
raft.
2024-12-01 12:12:29 +02:00
Kefu Chai
65949ce607 test: topology_custom: ensure node visibility before keyspace creation
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.

Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency

Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions

Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios

Fixes scylladb/scylladb#21724

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21726
2024-11-29 17:13:21 +01:00
Kefu Chai
f436edfa22 mutation: remove unused "#include"s
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-29 14:01:44 +08:00
Botond Dénes
ff90a77f5b scylla-sstable: revamp schema sources
Demote --scylla-data-dir and --scylla-yaml-file to schema source
helpers, rather than schema source in themselves. This practically means
that when these options are used, they won't define where the tool will
attempt to load the schema from, they will just be helpers to help locate
the schema, for whichever schema source the tool was instructed to use
(or left to choose).
--scylla-data-dir and --scylla-yaml-file being schema sources were
problematic with encryption at rest and for S3 support (not yet
implemented). With encryption, the tool needs access to the
configuration, so --scylla-yaml-file is often used to provide the path
to the configuration file, which contains encryption configuration,
needed for the tool to decrypt the sstable. Currently, using this option
implies forcing the tool to read the schema from the schema tables,
which is a problematic option for tests -- Scylla might be compacting a
schema sstable and this will make the tool fail to load the schema.
Demoting these options the schema helpers, allows providing them, while
at the same time having the option to use a different schema-source.

To allow the user to force the tool to load the schema from the schema
tables, a new --schema-tables option is added. Similarly, a
--sstable-schema option is introduced to force the tool to load the
schema from the sstable itself.

With this, each 4 schema source now has an option to force the use of
said schema source. There are various helper options to be used along
with these.

The documentation as well as the tests are updated with the changes.
The schema related documentation gets an rather extensive facelift
because it was a bit out-of-date and incomplete.

Fixes: scylladb/scylladb#20534

Closes scylladb/scylladb#21678
2024-11-28 18:36:09 +02:00