45670 Commits

Author SHA1 Message Date
Ferenc Szili
e65a235fd5 test: add tests for truncate with tablets
This patch adds the unit tests for truncate with tablets.

test_truncate_while_migration() triggers a tablet migration, then runs
a TRUNCATE TABLE for the table containing the tablet being migrated.
test_truncate_with_concurrent_drop() starts a truncate, then attempts to
drop the table while it is being truncated.
test_truncate_while_node_restart() validates the case where a replica
node is restarted while truncate is running.
test_truncate_with_coordinator_crash() validates that truncate
completes correctly when the topology coordinator crashes or restarts
after the truncate session is cleared, but before the truncate request
is finalized.
2024-12-09 16:38:50 +01:00
Ferenc Szili
4cd7a1acab storage_proxy: use new TRUNCATE for tablets
This change adds branching based on keyspace replication method, and
uses the new TRUNCATE for keyspaces with tablets.
2024-12-09 16:38:50 +01:00
Ferenc Szili
93cfeb9160 truncate: make TRUNCATE a global topology operation
This commit adds the code needed to create a TRUNCATE global topology
request. It also adds the handler for this request to the topology
coordinator.
The execution of the truncate operation is not canceled on a timeout,
but the query coordinator side will return a timeout error.
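The timeout behaviour described above (the caller sees a timeout error, but the operation itself keeps running) can be illustrated with a small Python asyncio sketch. This is not the ScyllaDB code; `wait_with_timeout`, `truncate` and `demo` are hypothetical names, and asyncio only stands in for the real coordinator machinery.

```python
import asyncio


async def wait_with_timeout(task, timeout):
    # shield() prevents wait_for() from cancelling `task` when the
    # timeout fires: only the wait is abandoned, not the work.
    return await asyncio.wait_for(asyncio.shield(task), timeout)


async def demo():
    log = []

    async def truncate():
        # Stand-in for a truncate that takes longer than the timeout.
        await asyncio.sleep(0.05)
        log.append("truncated")

    task = asyncio.create_task(truncate())
    try:
        await wait_with_timeout(task, 0.01)
    except asyncio.TimeoutError:
        # What the query coordinator side would report to the client.
        log.append("timeout reported")
    await task  # the operation was not cancelled and still completes
    return log
```

Running `asyncio.run(demo())` records the timeout first and the completed truncate afterwards.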
2024-12-09 16:38:37 +01:00
Ferenc Szili
fa3ec6e633 storage_service: move logic of wait_for_topology_request_completion()
This change moves the logic of
storage_service::wait_for_topology_request_completion() into
topology_state_machine.
2024-12-04 12:03:15 +01:00
Ferenc Szili
36d35d2297 RPC: add truncate_with_tablets RPC with frozen_topology_guard
This change introduces a new truncate_with_tablets RPC with a parameter
of type service::frozen_topology_guard. This is materialized on replica
nodes into a topology_guard which guarantees that truncate is performed
under a global session, which, in turn, makes sure that we don't execute
truncate as a result of stale RPCs.
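A rough Python sketch of the idea of guarding an RPC with a session, so that a late duplicate is refused instead of truncating again. It is an illustration only: `Replica`, `StaleSessionError` and the session bookkeeping are hypothetical and much simpler than topology_guard and frozen_topology_guard.

```python
import uuid


class StaleSessionError(Exception):
    """Raised when an RPC refers to a session that is no longer open."""


class Replica:
    def __init__(self):
        self.open_sessions = set()  # sessions currently open on this replica
        self.tables = {}            # table name -> list of rows

    def open_session(self, session_id):
        self.open_sessions.add(session_id)

    def close_session(self, session_id):
        self.open_sessions.discard(session_id)

    def truncate(self, session_id, table):
        # Only act on behalf of an open session; an RPC that arrives
        # after the session was closed is stale and must be rejected.
        if session_id not in self.open_sessions:
            raise StaleSessionError(f"session {session_id} is not open")
        self.tables[table] = []


replica = Replica()
replica.tables["ks.t"] = ["row1"]

session = uuid.uuid4()
replica.open_session(session)
replica.truncate(session, "ks.t")   # accepted: the session is open
replica.close_session(session)

replica.tables["ks.t"].append("row2")  # written after the truncate finished
try:
    replica.truncate(session, "ks.t")  # delayed duplicate of the old RPC
    stale_rejected = False
except StaleSessionError:
    stale_rejected = True
```

In this sketch the row written after the truncate survives, because the delayed RPC is rejected.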

Also, this RPC does not have a timeout. Timeout will be handled on the
coordinator side, and the truncate operation will not be allowed to time
out.
2024-12-04 11:30:07 +01:00
Ferenc Szili
bfbfc0fea9 feature_service: added cluster feature for system.topology schema change
This patch adds a cluster feature which protects the system.topology
schema change against situations where clusters are incompletely
upgraded to a new version and could be rolled back.
2024-12-04 11:30:07 +01:00
Ferenc Szili
3ac44109e3 system.topology_requests: change schema
This commit adds the new columns to the system.topology_requests
table which are needed for the new global topology request.
2024-12-04 11:30:06 +01:00
Ferenc Szili
7f29b7d8f6 storage_proxy: propagate group0 client and TSM dependency
This commit makes storage_proxy::remote dependent on raft_group0_client
and topology_state_machine. storage_proxy::remote gets references to these via
the call to start_remote(). These references will be needed to call
storage_service::truncate_table_with_tablets().
2024-12-04 11:30:06 +01:00
Avi Kivity
841481c202 Merge "move storage proxy and adjacent services to identify hosts by ids" from Gleb
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).
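A toy Python sketch of addressing by host id with rejection on the receiving side. All names here (`Network`, `Node`, `Message`, `WrongHostError`) are hypothetical and only illustrate the behaviour described above, not the messaging service API.

```python
import uuid
from dataclasses import dataclass


class WrongHostError(Exception):
    """Raised by a node that receives a message meant for another host id."""


@dataclass(frozen=True)
class Message:
    dst_host_id: uuid.UUID
    payload: str


class Node:
    def __init__(self):
        self.host_id = uuid.uuid4()

    def receive(self, msg):
        # Reject messages addressed to a different host id, e.g. when
        # the sender still maps the id to an ip this node has taken over.
        if msg.dst_host_id != self.host_id:
            raise WrongHostError(f"message for {msg.dst_host_id} reached {self.host_id}")
        return msg.payload


class Network:
    def __init__(self):
        self.address_map = {}  # host id -> ip, as seen by the sender
        self.listeners = {}    # ip -> node currently listening there

    def send(self, dst_host_id, payload):
        ip = self.address_map[dst_host_id]
        return self.listeners[ip].receive(Message(dst_host_id, payload))


net = Network()
a, b = Node(), Node()
net.address_map = {a.host_id: "10.0.0.1", b.host_id: "10.0.0.2"}
net.listeners = {"10.0.0.1": a, "10.0.0.2": b}
delivered = net.send(a.host_id, "ping")

# Address change: b now listens on a's old ip, but the map is stale.
net.listeners["10.0.0.1"] = b
try:
    net.send(a.host_id, "ping")
    rejected = False
except WrongHostError:
    rejected = True
```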

The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.

Fixes: scylladb/scylladb#6403

perf-simple-query -- smp 1 -m 1G output

Before:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41291 insns/op,   24485 cycles/op,        0 errors)
62669.58 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41277 insns/op,   24695 cycles/op,        0 errors)
69172.12 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   41326 insns/op,   24463 cycles/op,        0 errors)
56706.60 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41143 insns/op,   24513 cycles/op,        0 errors)
56416.65 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   41186 insns/op,   24851 cycles/op,        0 errors)

         throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
  cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70

After:

enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   40733 insns/op,   23145 cycles/op,        0 errors)
59283.09 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40624 insns/op,   23948 cycles/op,        0 errors)
70851.03 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40625 insns/op,   23027 cycles/op,        0 errors)
70549.61 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40650 insns/op,   23266 cycles/op,        0 errors)
68634.96 tps ( 63.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   40622 insns/op,   22935 cycles/op,        0 errors)

         throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
  cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59

CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/

Tested mixed cluster manually.
"

* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
  group0: drop unused field from replace_info struct
  test: rename raft_address_map_test to address_map_test and move it from raft tests
  raft_address_map: remove raft address map
  topology coordinator: do not modify expire state for left/new nodes any more in raft address map
  topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
  group0: drop raft address map dependency from raft_rpc
  group0: move raft_ticker_type definition from raft_address_map.hh
  storage_service: do not update raft address map on gossiper events
  group0: drop raft address map dependency from raft_server_with_timeouts
  group0: move group0 upgrade code to host ids
  repair: drop raft address map dependency
  group0: remove unused raft address map getter from raft_group0
  group0: drop raft address map from group0_state_machine dependency since it is not used there any more
  group0: remove dependency on raft address map from group0_state_id_handler
  gossiper: add get_application_state_ptr that searches by host_id
  gossiper: change get_live_token_owners to return host ids
  view: move view building to host id
  hints: use host id to send hints
  storage_proxy: remove id_vector_to_addr since it is no longer used
  db: consistency_level: change is_sufficient_live_nodes to work on host ids
  ...
2024-12-03 18:18:48 +02:00
Avi Kivity
b99d4ec055 abstract_replication_strategy.hh: apply pimpl to boost::icl::interval_map
interval_map is a heavyweight header, hide it behind the pimpl idiom
to reduce #include load.

Ref #1
2024-12-03 13:59:45 +01:00
Botond Dénes
b6a9c79af3 utils/big_decimal: add fast paths to operator <=>
Currently, the tri-compare operator for big_decimal (operator <=>), uses
a precise but potentially very expensive algorithm for comparing the
numbers: it first brings them to the same scale, then compares the
normalized unscaled values. big_decimal has arbitrary precision,
therefore the stored numbers can be arbitrarily large.
In extreme cases, comparing two numbers can result in a huge amount of
memory being allocated and in stalls. If this type is used in the primary
key of a table, these comparisons can make the node completely unresponsive.

This patch adds the following fast-paths to operator <=>:
* An early return for the case of equal scales.
* An early return for different signs.
* An early return for the case where one or both of the numbers are 0.
* A fast algorithm for detecting the case where there is a big
  difference between the two numbers. This algorithm works only with the
  scales and is able to compare the two numbers by using only one division
  and some additions and subtractions. This algorithm is imprecise and
  when the numbers are closer than its confidence window, it will
  fall back to the current slow but precise tri-compare.

All but the last case should have been fast before as well, but the
scale-compare algorithm makes a huge difference. Numbers, which would
previously make the node unresponsive, now compare in constant-time.
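
The fast paths listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code from the patch: a number is modeled as a hypothetical (unscaled, scale) pair meaning unscaled * 10**-scale, the names `compare` and `slow_compare` are made up, and the magnitude estimate uses bit lengths with a fixed confidence window of one decimal order, which may differ from what the real implementation does.

```python
import math

LOG10_2 = math.log10(2)


def sign(x):
    return (x > 0) - (x < 0)


def slow_compare(ua, sa, ub, sb):
    # Precise path: bring both numbers to the same scale, then compare
    # the unscaled values. The multiplication can get arbitrarily large.
    common = max(sa, sb)
    return sign(ua * 10 ** (common - sa) - ub * 10 ** (common - sb))


def compare(ua, sa, ub, sb):
    # Fast path: equal scales, compare unscaled values directly.
    if sa == sb:
        return sign(ua - ub)
    # Fast path: different signs (this also covers exactly one zero).
    sga, sgb = sign(ua), sign(ub)
    if sga != sgb:
        return sign(sga - sgb)
    # Fast path: both numbers are zero.
    if sga == 0:
        return 0
    # Fast path: estimate log10(|value|) from the bit length and scale.
    # Each estimate is off by less than log10(2), so a gap of more than
    # one decimal order is enough to decide without normalizing.
    est_a = abs(ua).bit_length() * LOG10_2 - sa
    est_b = abs(ub).bit_length() * LOG10_2 - sb
    if abs(est_a - est_b) > 1:
        bigger_magnitude = 1 if est_a > est_b else -1
        return bigger_magnitude * sga  # for negatives the order flips
    # Too close to call: fall back to the slow but precise comparison.
    return slow_compare(ua, sa, ub, sb)
```

For example, `compare(1, -1000000, 1, 1000000)` is decided by the estimate alone, without ever computing a power of ten with millions of digits.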

Fixes: scylladb/scylladb#21716

Closes scylladb/scylladb#21715
2024-12-03 14:56:51 +02:00
Kamil Braun
8f858325b6 Merge 'topology_coordinator: introduce reload_count in topology state and use it to prevent race' from Gleb Natapov
The topology request table may change between the code reading it and
the call to cv::when(), since reading is a preemption point. In this
case cv::signal() can be missed. Detect that there was no signal between
reading and waiting by introducing reload_count, which is increased each
time the state is reloaded and signaled. If the counter differs
before and after reading, the state may have changed, so re-check it
instead of sleeping.
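
The race and the counter-based fix can be sketched with Python asyncio. This is a simplified model, not the Seastar code: `TopologyState`, `read_requests`, `wait_changed` and `wait_for_request` are hypothetical names, and an `asyncio.Event` that is replaced on every reload stands in for the condition variable.

```python
import asyncio


class TopologyState:
    def __init__(self):
        self.reload_count = 0
        self.requests = []
        self._changed = asyncio.Event()

    def reload(self, requests):
        self.requests = list(requests)
        self.reload_count += 1
        # Signal: wake whoever is waiting right now. Anyone who starts
        # waiting later gets a fresh event and has missed this signal.
        self._changed.set()
        self._changed = asyncio.Event()

    async def read_requests(self):
        snapshot = list(self.requests)
        await asyncio.sleep(0)  # preemption point: a reload may run here
        return snapshot

    async def wait_changed(self):
        await self._changed.wait()


async def wait_for_request(state, wanted):
    reads = 0
    while True:
        seen = state.reload_count
        reads += 1
        requests = await state.read_requests()
        if wanted in requests:
            return reads
        if state.reload_count != seen:
            # The state was reloaded while we were reading, so the
            # signal is already gone: re-check instead of sleeping.
            continue
        await state.wait_changed()
```

Without the `reload_count` comparison, a reload that lands inside `read_requests()` would leave the waiter blocked until some later, unrelated reload.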

Closes scylladb/scylladb#21713

* github.com:scylladb/scylladb:
  topology_coordinator: introduce reload_count in topology state and use it to prevent race
  storage_service: use conditional_variable::when in co-routines consistently
2024-12-03 12:00:56 +01:00
Kefu Chai
4bc7e068ff locator: remove unused "#include"s
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21754
2024-12-03 11:05:35 +02:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23, we have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::views::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::views::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::views::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::views::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::views::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple of places where we use
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Kefu Chai
99de3962c3 db/schema_applier: Fix spelling annotations to pass codespell checks
This commit addresses inconsistent spelling annotations that triggered
codespell warnings in our codebase.

Problem:
- Previous annotations like "CREATEing" and "DROPing" were flagged as
  misspellings by the codespell workflow
- These annotations were used to describe CQL statement execution contexts

Solution:
- Updated annotations to "CREAT'ing" and "DROP'ing"
- Preserves the intent of the original annotations
- Silences codespell warnings without changing the underlying meaning
- Ensures consistent and spell-checker-friendly code documentation

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21741
2024-12-03 09:01:26 +02:00
Botond Dénes
b87fb94a5e Merge 'tasks: add tablet repair virtual task' from Aleksandra Martyniuk
Add tablet task manager module and keep it in storage_service.
Introduce tablet_virtual_task that covers tablet repair.

Thanks to a repair virtual task, a user can check the list of pending
repairs, get the status of a specific repair, or abort it using the task
manager API.

Fixes: #21368.

No backport, new feature

Closes scylladb/scylladb#21624

* github.com:scylladb/scylladb:
  test: add test to check tablet repair tasks
  test: topology_tasks: enable tablets
  service: keep tablets module in storage_service
  service: rename storage_service::_task_manager_module
  service: add tablet_virtual_task
  tasks: utilize preliminary virtual task lookup
2024-12-02 17:22:44 +02:00
Nadav Har'El
c45ddb964f pytest: don't override default live-logging setting
In commit 8bf62a0 we introduced a test/pytest.ini which affects every
run of pytest in the project. One specific line in that file

    log_cli = true

Overrides pytest's standard CLI output, which is traditionally short
unless the "-v" (verbose) option is used, to be always long and spammy.
There is absolutely no reason to do that - if the user wants to run
"pytest -v", they can do that - it doesn't need to be the default.

Moreover, as https://docs.pytest.org/en/stable/how-to/logging.html
explains, the "log_cli = true" was added in pytest 3.4 to revert to
pytest 3.3 behavior that "community feedback" showed was NOT LIKED.
Why would we want to revert to behavior that wasn't liked?

After this patch, which removes that line, the output of commands
like
    cd test/cqlpy; pytest

returns to what it used to be before commit 8bf62a0 and what the
pytest developers intended. Users who like verbose output can use
"pytest -v".

Fixes #21712

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21717
2024-12-02 17:00:51 +02:00
Takuya ASADA
0700b322b8 install.sh: fix incorrect variable name
$without_systemd_check is an incorrect variable name; it should be
$skip_systemd_check.

Because of the bug, "systemctl --user daemon-reload" is unexpectedly
skipped on nonroot mode installation.

This is likely the root cause of issue #21720.

Fixes #21720

Closes scylladb/scylladb#21747
2024-12-02 16:37:33 +02:00
Avi Kivity
58baeac0ad Merge 'compaction: update maintenance sstable set on scrub compaction completion' from Lakshmi Narayanan Sreethar
Scrub compaction can pick up input sstables from the maintenance sstable
set, but on compaction completion it doesn't update the maintenance set,
leaving the original sstable in the set after it has been scrubbed. To fix
this, compaction completion has to update the maintenance sstable set if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.

Fixes #20030

This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.

Closes scylladb/scylladb#21582

* github.com:scylladb/scylladb:
  compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
  compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
  compaction_group: rename `update_main_sstable_list_on_compaction_completion`
  compaction_group: update maintenance sstable set on scrub compaction completion
  compaction_group: store table::sstable_list_builder::result in replacement_desc
  table::sstable_list_builder: remove old sstables only from current list
  table::sstable_list_builder: return removed sstables from build_new_list
2024-12-02 13:32:49 +02:00
Nadav Har'El
6d37b53653 test/alternator: move comment next to bizarre code that it explains
In commit 9ff9cd37c3 we added in
test/alternator/test_number.py a workaround for a boto3 bug that
prevented us (and still prevents us) from testing numbers with high
precision. Because the workaround was so bizarre, the three lines it
requires - two imports and an assignment - were preceded by a 5-line
comment explaining it.

Unfortunately, a later commit 93b9b85c12
went and arbitrarily moved import lines around to satisfy some PEP-8
"requirements", resulting in the comment being separated from the lines
it was supposed to explain.

This patch moves the comment in front of the main line it explains.
The two imports that are needed just for this line and aren't used
elsewhere remain in their current place (where the PEP8 police demands
they stay), but this is less important for the understanding of this
trick so it's fine.

No functionality of the test was changed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21635
2024-12-02 10:56:09 +01:00
Abhinav
acd643bd75 test: Parametrize 'replacement with inter-dc encryption' test to confirm behavior in zero token node cases.
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.

This PR increases coverage for existing code, so it needs to be backported. Since only version 6.2 has zero
token node support, we backport it only to 6.2.

Fixes: scylladb/scylladb#21096

Closes scylladb/scylladb#21609
2024-12-02 10:32:46 +01:00
Gleb Natapov
052e893444 group0: drop unused field from replace_info struct
The field is no longer used.
2024-12-02 10:31:14 +02:00
Gleb Natapov
1028ce17cd test: rename raft_address_map_test to address_map_test and move it from raft tests
It has nothing to do with raft now.
2024-12-02 10:31:14 +02:00
Gleb Natapov
96309224ff raft_address_map: remove raft address map
It is no longer used.
2024-12-02 10:31:14 +02:00
Gleb Natapov
b9d454c0d5 topology coordinator: do not modify expire state for left/new nodes any more in raft address map
The map is no longer used and gossiper address map is fully managed by
the gossiper.
2024-12-02 10:31:13 +02:00
Gleb Natapov
cbb6148a36 topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
2024-12-02 10:31:13 +02:00
Gleb Natapov
fca1f90cc7 group0: drop raft address map dependency from raft_rpc
No need to update raft address map on config changes any longer.
2024-12-02 10:31:13 +02:00
Gleb Natapov
64b135db7d group0: move raft_ticker_type definition from raft_address_map.hh
It has nothing to do with raft address map after all.
2024-12-02 10:31:13 +02:00
Gleb Natapov
c65f64cc5f storage_service: do not update raft address map on gossiper events
Raft address map is not used any longer to resolve addresses anyway, so
drop dependency on it from raft_ip_address_updater and rename it to
reflect that it is no longer raft address map specific.
2024-12-02 10:31:13 +02:00
Gleb Natapov
fa1397af13 group0: drop raft address map dependency from raft_server_with_timeouts
It is only needed to translate id to ip in the log output, but there
is no point in doing so now. All the logging (in the converted code)
is id based now.
2024-12-02 10:31:13 +02:00
Gleb Natapov
fbaf0a3cce group0: move group0 upgrade code to host ids
Drop unneeded ip to id translation.
2024-12-02 10:31:13 +02:00
Gleb Natapov
4ddb925997 repair: drop raft address map dependency
Replace it with gossiper address map, but make dependency localized.
Only functions that actually use address map get it now.
2024-12-02 10:31:13 +02:00
Gleb Natapov
ef09a93843 group0: remove unused raft address map getter from raft_group0
2024-12-02 10:31:13 +02:00
Gleb Natapov
85233830cf group0: drop raft address map from group0_state_machine dependency since it is not used there any more
2024-12-02 10:31:13 +02:00
Gleb Natapov
8fbb28cfcb group0: remove dependency on raft address map from group0_state_id_handler
Now that we can look up gossip state by host id we do not need to do the
translation in group0_state_id_handler.
2024-12-02 10:31:13 +02:00
Gleb Natapov
18a9de51e7 gossiper: add get_application_state_ptr that searches by host_id
2024-12-02 10:31:13 +02:00
Gleb Natapov
7d751709e3 gossiper: change get_live_token_owners to return host ids
Also amend the only user and drop the ip to id translation.
2024-12-02 10:31:13 +02:00
Gleb Natapov
20d1b80535 view: move view building to host id
Use host ids in view building code as well.
2024-12-02 10:31:13 +02:00
Gleb Natapov
0ca14ef8b7 hints: use host id to send hints
Drop address translation that is no longer needed. Templates here are used
temporarily until another user of the function (MV) is converted as
well.
2024-12-02 10:31:12 +02:00
Gleb Natapov
5b9e4c2f07 storage_proxy: remove id_vector_to_addr since it is no longer used
Was needed during transition period only.
2024-12-02 10:31:12 +02:00
Gleb Natapov
6116751e44 db: consistency_level: change is_sufficient_live_nodes to work on host ids
It is called from storage proxy which works on host ids now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
eb3d2307ce replication_strategy: move sanity_check_read_replicas to host id
It is called from storage proxy which works on host ids now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
ccbfabb858 db: consistency_level: move filter_for_query to host id
It is called from storage proxy which works on host ids now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
474b47ed22 database: move hits rates handling to host ids
The hit rates map is currently indexed by ip. Change it to be indexed by host id since this is what
storage proxy uses now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
d2cf5ca030 messaging_service: pass host id to connection_dropped handler if available
RPC clients which are host id aware may pass the id to
connection_dropped callback and save the need for translation.
2024-12-02 10:31:12 +02:00
Gleb Natapov
9f7183286a storage_proxy: change batchlog to work on host ids
It was not translated in the first pass.
2024-12-02 10:31:12 +02:00
Gleb Natapov
a1fdc8c847 storage_proxy: change mutation rpcs to send forward and reply addresses as host ids
RPCs from old nodes will still use old format so translation will be
used in this case. The change is backwards compatible thanks to RPC
extensibility.
2024-12-02 10:31:12 +02:00
Gleb Natapov
cd9b349886 migration_manager: move to use host ids instead of ips
Users are also amended to pass ids instead of ips.
2024-12-02 10:31:12 +02:00
Gleb Natapov
2f23a21a23 raft: raft_group_registry: do not insert entry into raft address map on incoming message
Raft map is no longer used to send raft messages. We rely on gossiper
address propagation now.
2024-12-02 10:31:12 +02:00
Gleb Natapov
1f302577d0 group0: move transfer_snapshot to use host ids
No need to translate id to ip any longer.
2024-12-02 10:31:12 +02:00