Commit Graph

46934 Commits

Author SHA1 Message Date
Tomasz Grabiec
dfc9101dfd test: perf_load_balancing: Set node capacity
Otherwise, load balancer will not make any plan once it becomes
capacity-aware.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
6169401dbc test: perf_load_balancing: Convert to topology_builder
The test no longer worked because the load balancer now requires a proper
schema in the database. Convert to topology_builder, which builds the topology
in the database and creates the schema in the database (which needs a proper
topology).
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
d01cc16d1e config, disk_space_monitor: Allow overriding capacity via config
Intended for testing, or hot-fixing out-of-space issues in production.

Tablet load balancer uses this information for determining per-shard load
so reducing capacity will cause tablets to be migrated away from the node.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
A new RPC is introduced because load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Nadav Har'El
e0f24c03e7 Merge 'test.py: merge all 'Topology' suite types into one folder 'cluster'' from Artsiom Mishuta
Now that we support suite subfolders, there is no
need to create separate suites for object_store, auth_cluster, topology and topology_custom.
This PR merges all these folders into one: 'cluster'.

This PR also introduces and applies the 'prepare_3_nodes_cluster' fixture, which allows preparing a non-dirty 3-node cluster
that can be reused between tests (for tests that were in the topology folder).

number of tests in master
release: 3461
dev:     3472
debug:   3446

number of tests in this PR
release: 3460
dev:     3471
debug:   3445

There is one test fewer in each mode because there were two test_topology_failure_recovery files (in topology and topology_custom) with the same utility functions but different test cases. This PR merged them into one.

Closes scylladb/scylladb#22917

* github.com:scylladb/scylladb:
  test.py: merge object_store into cluster folder
  test.py: merge auth_cluster into cluster folder
  test.py: rename topology_custom folder to cluster
  test.py: merge topology test suite into topology_custom
  test.py delete conftest in topology_custom
  test.py apply prepare_3_nodes_cluster in topology
  test.py: introduce prepare_3_nodes_cluster marker
2025-03-04 19:26:32 +02:00
Patryk Jędrzejczak
c13b6c91d3 Merge 'raft topology: drop changing the raft voters config via storage_service' from Emil Maskovsky
For the limited voters feature to work properly we need to make sure that we are only managing the voter status through the topology coordinator. This means that we should not change the node votership from the storage_service module for the raft topology directly.

We can drop the voter status changes from the storage_service module because the topology coordinator will handle the votership changes eventually. The calls in the storage_service module were not essential and were only used for optimization (improving the HA under certain conditions).
Furthermore, the other bundled commit reacts to the node `on_up()` and `on_down()` events, which again shortens the reaction time and improves the HA.

The change affects the timing in the tablets migration test, though, as the test previously relied on the node being made a non-voter from the storage_service `raft_removenode()` function. The fix is to add another server to the topology to make sure we keep the quorum.

Previously the test worked because the test waits for an injection to be reached and it was ensured that the injection (log line) has only been triggered after the node has been made non-voter from the `raft_removenode()`. This is not the case anymore. An alternative fix would be to wait for the first node to be made non-voter before stopping the second server, but this would make the test more complex (and it is not strictly required to only use 4 servers in the test, it has been only done for optimization purposes).

Fixes: scylladb/scylladb#22860

Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969

No backport: Part of the new limited voters feature, so this shouldn't be backported.

Closes scylladb/scylladb#22847

* https://github.com/scylladb/scylladb:
  raft: use direct return of future for `run_op_with_retry`
  raft: adjust the voters interface to allow atomic changes
  raft topology: drop removing the node from raft config via storage_service
  raft topology: drop changing the raft voters config via storage_service
2025-03-04 13:59:47 +01:00
Nadav Har'El
d096aac200 test/cqlpy/run: reduce number of tablets
In commit 2463e524ed, Scylla's default changed
from starting with one tablet per shard to starting 10 per shard. The
functional tests don't need more tablets and it can only slow down the
tests, so the patch added --tablets-initial-scale-factor=1 to test/*/suite.yaml
but forgot to add it to test/cqlpy/run.py (to affect test/cqlpy/run) so
this patch does this now.

This patch should *only* be about making tests faster, although to be
honest, I don't see any measurable improvement in test speed (10 isn't
so many). But, unfortunately, this is only part of the story. Over time
we allowed a few cqlpy tests to be written in a way that relies on having
only a small number of tablets or even exactly one tablet per shard (!).
These tests are buggy and should be fixed - see issues #23115 and #23116
as examples. But adding the option --tablets-initial-scale-factor=1 also
to run.py will keep these bugs from affecting test/cqlpy/run, in the same
way that they don't affect test.py.

These buggy tests will still break with `pytest cqlpy` against a Scylla
you ran yourself manually, so we will eventually still need to fix those
test bugs.

Refs #23115
Refs #23116

Closes scylladb/scylladb#23125
2025-03-04 15:39:21 +03:00
Asias He
60913312af repair: Enable small table optimization for system_replicated_keys
This enterprise-only system table is replicated and small. It should be
included for small table optimization.

Fixes scylladb/scylla-enterprise#5256

Closes scylladb/scylladb#23135
2025-03-04 12:40:56 +02:00
Artsiom Mishuta
97a620cda9 test.py: merge object_store into cluster folder
Now that we support suite subfolders, there is no
need to create an own suite for object_store
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
a283b391c2 test.py: merge auth_cluster into cluster folder
Now that we support suite subfolders, there is no
need to create an own suite for auth_cluster
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
d1198f8318 test.py: rename topology_custom folder to cluster
rename topology_custom folder to cluster
as it contains not only topology test cases
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
d8e17c4356 test.py: merge topology test suite into topology_custom
Now that we support suite subfolders, there is no
need to create an own suite for topology
2025-03-04 10:32:44 +01:00
Artsiom Mishuta
ef62dfa6a9 test.py delete conftest in topology_custom
delete conftest in a separate commit for a better diff listing during
the merge of topology_custom and topology
2025-03-04 10:32:43 +01:00
Artsiom Mishuta
cf48444e3b test.py apply prepare_3_nodes_cluster in topology
apply prepare_3_nodes_cluster for all tests in the topology folder
via applying mark at the test module level using pytestmark
https://docs.pytest.org/en/stable/example/markers.html#marking-whole-classes-or-modules

set initial_size for topology folder to 0
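
As a minimal sketch of the module-level marking described above: pytest applies any marker assigned to the module-global `pytestmark` to every test in that module. The marker name is taken from this commit; the test function is a hypothetical placeholder, and how test.py reacts to the marker is not shown here.

```python
import pytest

# Module-level mark: pytest applies it to every test collected from this module.
pytestmark = pytest.mark.prepare_3_nodes_cluster


def test_placeholder():
    # Hypothetical test body; a real test would talk to the prepared cluster.
    assert True
```

Marking the whole module this way avoids repeating the decorator on each test function.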
2025-03-04 10:32:43 +01:00
Artsiom Mishuta
20777d7fc6 test.py: introduce prepare_3_nodes_cluster marker
prepare_3_nodes_cluster marker will allow preparing non-dirty 3 nodes cluster
that can be reused between tests
2025-03-04 10:32:43 +01:00
Nadav Har'El
a56751e71b test/cqlpy: fix test assuming just one tablet
The cqlpy test test_compaction.py::test_compactionstats_after_major_compaction
was written to assume we have just one tablet per shard - if there are many
tablets splitting the data, the test scenario might not need
compaction in the way that the test assumes it does.

Recently (commit 2463e524ed) Scylla's default
was changed to have 10 tablets per shard - not one. This broke this test.
The same commit modified test/cqlpy/suite.yaml, but that affects only test.py
and not test/cqlpy/run, and also not manual runs against a manually-installed
Scylla. If this test absolutely requires a keyspace with 1 and not 10
tablets, then it should create one explicitly. So this is what this test
does (but only if tablets are in use; if vnodes are used that's fine
too).

Before this patch,
  test/cqlpy/run test_compaction.py::test_compactionstats_after_major_compaction
fails. After the patch, it passes.
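
What "create one explicitly" could look like, as a hedged sketch rather than the code from this patch: the helper name `keyspace_ddl` and its parameters are hypothetical, the replication settings are placeholders, and the `tablets = {'initial': N}` keyspace option is assumed to be supported by the Scylla version under test.

```python
def keyspace_ddl(name: str, tablets_in_use: bool, initial_tablets: int = 1) -> str:
    """Build a CREATE KEYSPACE statement, pinning the tablet count only with tablets."""
    ddl = (
        f"CREATE KEYSPACE {name} WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'replication_factor': 1}"
    )
    if tablets_in_use:
        # Only meaningful when tablets are enabled; with vnodes nothing is added.
        ddl += f" AND tablets = {{'initial': {initial_tablets}}}"
    return ddl
```

With vnodes the statement is left without a tablets clause, matching the "only if tablets are in use" condition in the message.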

Fixes #23116

Closes scylladb/scylladb#23121
2025-03-04 10:15:29 +02:00
Kefu Chai
a43072a21e cql3,test: replace boost::range::adjacent_find with std::ranges
to reduce third-party dependencies and modernize the codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22998
2025-03-04 10:08:02 +02:00
Artsiom Mishuta
d7f9c5654b test.py: change test uname
This commit changes the test uname replacement from "_" to "." to be able to support sub-folders in
the scylla-pkg scripts logic

Closes scylladb/scylladb#23130
2025-03-04 09:58:58 +02:00
Wojciech Mitros
dae7221342 rust: update dependencies
The currently used versions of "wasmtime", "idna", "cap-std" and
"cap-primitives" packages had low to moderate security issues.
In this patch we update the dependencies to versions with these
issues fixed.
The update was performed by changing the "wasmtime" (and "wasmtime-wasi")
version in rust/wasmtime_bindings/Cargo.toml and updating rust/Cargo.lock
using the "cargo update" command with the affected package. To fix an
issue with different dependencies having different versions of
sub-dependencies, the package "smallvec" was also updated to "1.13.1".
After the dependency update, the Rust code also needed to be updated
because of the slightly changed API. One Wasm test case needed to be
updated, as it was actually using an incorrect Wat module and not
failing before. The crate also no longer allows multiple tables in
Wasm modules by default - it is now enabled by setting the "gc" crate
feature and configuring the Engine with config.wasm_reference_types(true).

Fixes https://github.com/scylladb/scylladb/issues/23127

Closes scylladb/scylladb#23128
2025-03-04 09:45:23 +02:00
Pavel Emelyanov
e4e15a00b7 Merge 'reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Botond Dénes
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.
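
The failure mode and the fix can be illustrated with a loose Python analogy, not the Seastar code: `concurrent.futures.Future` stands in for the permit's promise, and the `Permit` class, the `must_abort_waiters` flag and the returned strings are all hypothetical. Completing a `Future` twice raises `InvalidStateError`, which plays the role of the assert.

```python
from concurrent.futures import Future, InvalidStateError


class Permit:
    """Toy permit: `waiters` plays the role of the permit's promise."""

    def __init__(self):
        self.waiters = Future()
        self.aborted = False

    def abort(self, exc):
        # Completing the future a second time raises InvalidStateError,
        # loosely mirroring the assert described above.
        self.waiters.set_exception(exc)
        self.aborted = True


def register_inactive_read(permit, must_abort_waiters=False):
    if permit.aborted:
        # Separate case: already aborted, so treat it as immediate eviction.
        return "evicted"
    if must_abort_waiters:
        permit.abort(RuntimeError("evicted while waiting for memory"))
        return "evicted"
    return "registered"
```

Checking `aborted` first means the second `set_exception()` call is never attempted for a permit that has already timed out.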

Fixes: scylladb/scylladb#22919

Bug is present in all live versions, backports are required.

Closes scylladb/scylladb#23044

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
  test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
2025-03-04 10:40:28 +03:00
Botond Dénes
71d8b7aa9f querier: demote tombstone warning for range-scans to debug level
Range scans are expected to go through lots of tombstones, so there is no
need to spam the logs about this. The tombstone warning log is demoted to
debug level; anyone who wants to see it can bump the logger to debug
level.

Fixes: https://github.com/scylladb/scylladb/issues/23093

Closes scylladb/scylladb#23094
2025-03-04 10:38:06 +03:00
Kefu Chai
a483ff8647 mutation: replace boost::upper_bound with std::ranges::upper_bound
Reduces dependencies on boost/range.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23119
2025-03-04 10:36:57 +03:00
Kefu Chai
a20cd6539c cql3, dht: Remove redundant std::move() calls
These redundant `std::move()` calls were identified by GCC-14.
In general, copy elision applies to these places, so adding
`std::move()` is not only unnecessary but can actually prevent
the compiler from performing copy elision, as it causes the
return statement to fail to satisfy the requirements for
copy elision optimization.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23063
2025-03-04 10:36:49 +03:00
Botond Dénes
6f7a069bce Merge 'Label basic metrics' from Amnon Heiman
This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and the metrics collecting server (Prometheus or similar)

The idea in this series is to create an equivalent of levels with a label.
First, label a subset of the metrics used by the dashboards.
Second, the per-table metrics that are now off by default will be marked with a different label.
The following specific optional features: CDC, CAS, and Alternator have a dedicated label now.
This will allow users to disable all metrics of features that are not in use.

All the rest of the metrics are left unlabeled.

Without any changes, users would get the same metrics they are getting today.
But you could pass the `__level=1` and get only those metrics the dashboard needs. That reduces the number of metrics by between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies).

The labels are not reported based on the seastar feature of hiding labels that start with an underscore.

Closes scylladb/scylladb#12246

* github.com:scylladb/scylladb:
  db/view/view.cc: label metrics with basic_level
  transport/server.cc: label metrics with basic_level
  service/storage_proxy.cc: label metrics with basic_level and cas
  main.cc: label metrics with basic_level
  streaming/stream_manager.cc: label metrics with basic_level
  repair/repair.cc: label metrics with basic_level
  service/storage_service.cc: label metrics with basic_level
  gms/gossiper.cc: label metrics with basic_level
  replica/database.cc: label metrics with basic_level
  cdc/log.cc: label metrics with basic_level and cdc
  alternator: label metrics with basic_level and alternator
  row_cache.cc: label metrics with basic_level
  query_processor.cc: label metrics with basic_level
  sstables.cc: label metrics with basic_level
  utils/logalloc.cc label metrics with basic_level
  commitlog.cc: label metrics with basic_level
  compaction_manager.cc: label metrics with basic_level
  Adding the __level and features labels
2025-03-04 09:32:11 +02:00
Calle Wilund
2f10205714 config: Enable optional TLS1.3 session ticket usage in cert setup
Refs #22916

Adds an "enable_session_tickets" option to TLS setup for our server
endpoints (not documented for internode RPC, as we don't handle it
on the client side there), allowing TLS1.3 client session tickets
to be enabled, i.e. quicker reconnects.

Session tickets are valid within a time frame or until a node
restarts, whichever comes first.

v2:
Use "TLS1.3" in help message

Closes scylladb/scylladb#22928
2025-03-04 09:30:53 +02:00
Amnon Heiman
19a414598b db/view/view.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_view_builder_builds_in_progress

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
9518a85ad0 transport/server.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_transport_cql_errors_total
scylla_transport_current_connections
scylla_transport_requests_served
scylla_transport_requests_shed

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
cbae9a4abe service/storage_proxy.cc: label metrics with basic_level and cas
The following metrics will be marked with basic_level label:
scylla_storage_proxy_coordinator_background_reads
scylla_storage_proxy_coordinator_background_writes
scylla_storage_proxy_coordinator_cas_background
scylla_storage_proxy_coordinator_cas_dropped_prune
scylla_storage_proxy_coordinator_cas_failed_read_round_optimization
scylla_storage_proxy_coordinator_cas_foreground
scylla_storage_proxy_coordinator_cas_prune
scylla_storage_proxy_coordinator_cas_read_contention_bucket
scylla_storage_proxy_coordinator_cas_read_contention_count
scylla_storage_proxy_coordinator_cas_read_latency_count
scylla_storage_proxy_coordinator_cas_read_latency_sum
scylla_storage_proxy_coordinator_cas_read_timeouts
scylla_storage_proxy_coordinator_cas_read_unavailable
scylla_storage_proxy_coordinator_cas_read_unfinished_commit
scylla_storage_proxy_coordinator_cas_total_operations
scylla_storage_proxy_coordinator_cas_write_condition_not_met
scylla_storage_proxy_coordinator_cas_write_contention_count
scylla_storage_proxy_coordinator_cas_write_latency_count
scylla_storage_proxy_coordinator_cas_write_latency_sum
scylla_storage_proxy_coordinator_cas_write_timeout_due_to_uncertainty
scylla_storage_proxy_coordinator_cas_write_timeouts
scylla_storage_proxy_coordinator_cas_write_unavailable
scylla_storage_proxy_coordinator_cas_write_unfinished_commit
scylla_storage_proxy_coordinator_current_throttled_base_writes
scylla_storage_proxy_coordinator_foreground_reads
scylla_storage_proxy_coordinator_foreground_writes
scylla_storage_proxy_coordinator_range_timeouts
scylla_storage_proxy_coordinator_range_unavailable
scylla_storage_proxy_coordinator_read_errors_local_node
scylla_storage_proxy_coordinator_read_latency_count
scylla_storage_proxy_coordinator_read_latency_sum
scylla_storage_proxy_coordinator_reads_local_node
scylla_storage_proxy_coordinator_reads_remote_node
scylla_storage_proxy_coordinator_read_timeouts
scylla_storage_proxy_coordinator_read_unavailable
scylla_storage_proxy_coordinator_speculative_data_reads
scylla_storage_proxy_coordinator_speculative_digest_reads
scylla_storage_proxy_coordinator_total_write_attempts_local_node
scylla_storage_proxy_coordinator_write_errors_local_node
scylla_storage_proxy_coordinator_write_latency_bucket
scylla_storage_proxy_coordinator_write_latency_count
scylla_storage_proxy_coordinator_write_latency_sum
scylla_storage_proxy_coordinator_write_timeouts
scylla_storage_proxy_coordinator_write_unavailable
scylla_storage_proxy_replica_received_counter_updates

All cas related metrics are labeled with __cas label.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
fd5d1f1f6a main.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_scylladb_current_version
scylla_reactor_utilization

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
5747af8555 streaming/stream_manager.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_node_ops_finished_percentage

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
48397f8dff repair/repair.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_node_ops_finished_percentage

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
83bfcb53be service/storage_service.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_node_operation_mode

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
1b64fa2283 gms/gossiper.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_gossip_heart_beat
scylla_gossip_live
scylla_gossip_unreachable

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
cfc5c60ba5 replica/database.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_database_active_reads
scylla_database_dropped_view_updates
scylla_database_queued_reads
scylla_database_requests_blocked_memory
scylla_database_requests_blocked_memory_current
scylla_database_schema_changed
scylla_database_total_reads
scylla_database_total_reads_failed
scylla_database_total_view_updates_pushed_local
scylla_database_total_view_updates_pushed_remote
scylla_database_total_writes
scylla_database_total_writes_failed
scylla_database_total_writes_timedout
scylla_database_total_writes_rate_limited
scylla_database_view_update_backlog

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:39 +02:00
Amnon Heiman
cf50c71ef5 cdc/log.cc: label metrics with basic_level and cdc
The following metrics will be marked with basic_level label:
scylla_cdc_operations_failed
scylla_cdc_operations_total

All metrics are labeled with the __cdc label.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:38 +02:00
Amnon Heiman
a474e95ef0 alternator: label metrics with basic_level and alternator
The following metrics will be marked with basic_level label:
scylla_alternator_operation
scylla_alternator_op_latency_bucket
scylla_alternator_op_latency_count
scylla_alternator_op_latency_sum
scylla_alternator_total_operations
scylla_alternator_batch_item_count
scylla_alternator_op_latency
scylla_alternator_op_latency_summary
scylla_expiration_items_deleted

All alternator metrics are marked with __alternator label.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:38 +02:00
Amnon Heiman
f40dc4e5c4 row_cache.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_cache_bytes_total
scylla_cache_bytes_used
scylla_cache_partition_evictions
scylla_cache_partition_hits
scylla_cache_partition_insertions
scylla_cache_partition_merges
scylla_cache_partition_misses
scylla_cache_partition_removals
scylla_cache_range_tombstone_reads
scylla_cache_reads
scylla_cache_reads_with_misses
scylla_cache_row_evictions
scylla_cache_row_hits
scylla_cache_row_insertions
scylla_cache_row_misses
scylla_cache_row_removals
scylla_cache_rows
scylla_cache_rows_merged_from_memtable
scylla_cache_row_tombstone_reads

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:38 +02:00
Amnon Heiman
0dde54d053 query_processor.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_cql_authorized_prepared_statements_cache_evictions
scylla_cql_batches
scylla_cql_deletes
scylla_cql_deletes_per_ks
scylla_cql_filtered_read_requests
scylla_cql_filtered_rows_dropped_total
scylla_cql_filtered_rows_matched_total
scylla_cql_filtered_rows_read_total
scylla_cql_inserts
scylla_cql_inserts_per_ks
scylla_cql_prepared_cache_evictions
scylla_cql_reads
scylla_cql_reads_per_ks
scylla_cql_reverse_queries
scylla_cql_rows_read
scylla_cql_secondary_index_reads
scylla_cql_select_bypass_caches
scylla_cql_select_partition_range_scan_no_bypass_cache
scylla_cql_statements_in_batches
scylla_cql_unpaged_select_queries
scylla_cql_unpaged_select_queries_per_ks
scylla_cql_updates
scylla_cql_updates_per_ks
2025-03-03 16:58:38 +02:00
Amnon Heiman
94ba8af788 sstables.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_sstables_cell_tombstone_writes
scylla_sstables_range_tombstone_reads
scylla_sstables_range_tombstone_writes
scylla_sstables_row_tombstone_reads
scylla_sstables_tombstone_writes
2025-03-03 16:58:38 +02:00
Amnon Heiman
bf39a760aa utils/logalloc.cc label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_lsa_total_space_bytes
scylla_lsa_non_lsa_used_space_bytes

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:38 +02:00
Amnon Heiman
6826b98c88 commitlog.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_commitlog_segments
scylla_commitlog_allocating_segments
scylla_commitlog_unused_segments
scylla_commitlog_alloc
scylla_commitlog_flush
scylla_commitlog_bytes_written
scylla_commitlog_pending_allocations
scylla_commitlog_requests_blocked_memory
scylla_commitlog_flush_limit_exceeded
scylla_commitlog_disk_total_bytes
scylla_commitlog_disk_active_bytes
scylla_commitlog_disk_slack_end_bytes
2025-03-03 16:58:38 +02:00
Amnon Heiman
67ca02b361 compaction_manager.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_compaction_manager_compactions
2025-03-03 16:58:38 +02:00
Amnon Heiman
30b34d29b2 Adding the __level and features labels
Scylla generates many metrics, and when multiplied by the number of
shards, the total number of metrics adds a significant load to a
monitoring server.

With multi-tier monitoring, it is helpful to have a smaller subset of
metrics users care about and allow them to get only those.

This patch adds two kinds of labels. The first is a __level label, currently
with a single value, but we can add more in the future.
The second kind is a cross-feature label, currently for alternator, cdc
and cas.

We will use the __level label to mark the interesting user-facing metrics.

The current level value is:
basic - metrics for Scylla monitoring

In this phase, basic will mark all metrics used in the dashboards.
In practice, without any configuration change, Prometheus would get the
same metrics as it gets today.

It is possible to filter by the label, e.g.:
curl http://localhost:9180/metrics?__level=basic

The labels themselves are not reported, thanks to the filtering of
labels that begin with __.

The feature labels:
__cdc, __cas and __alternator can be an easy way to disable a set of
metrics when not using a feature.
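
The intended behaviour can be sketched with a toy model of the metrics endpoint. This is not Seastar's implementation: the `scrape` helper, the label values and the last metric name are illustrative assumptions; only the `__level`, `__cdc` label names and the two Scylla metric names come from this series.

```python
# Toy model of a metrics endpoint. Label values and the last metric name
# are illustrative assumptions, not taken from Scylla.
SAMPLE = [
    ("scylla_reactor_utilization", {"shard": "0", "__level": "basic"}),
    ("scylla_cdc_operations_total", {"shard": "0", "__level": "basic", "__cdc": "1"}),
    ("hypothetical_internal_counter", {"shard": "0"}),
]


def scrape(metrics, query):
    """Keep metrics matching every label in `query`; hide labels starting with "__"."""
    result = []
    for name, labels in metrics:
        if all(labels.get(key) == value for key, value in query.items()):
            visible = {k: v for k, v in labels.items() if not k.startswith("__")}
            result.append((name, visible))
    return result
```

In this model, `scrape(SAMPLE, {"__level": "basic"})` keeps only the two labeled metrics, while `scrape(SAMPLE, {})` returns everything, which corresponds to the unchanged default described above. In both cases the `__`-prefixed labels are dropped from the output.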

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-03-03 16:58:38 +02:00
Emil Maskovsky
8c67307971 raft: use direct return of future for run_op_with_retry
Clean up the code by using direct return of future for `run_op_with_retry`.

This can be done as the `run_op_with_retry` function is already returning
a future that we can reuse directly. What needs to be taken care of is
to not use temporaries referenced from inside the lambda passed to the
`run_op_with_retry`.
2025-03-03 15:19:58 +01:00
Emil Maskovsky
28d1aeb1fa raft: adjust the voters interface to allow atomic changes
Allow setting the voters and non-voters in a single operation. This
ensures that the configuration changes are done atomically.

In particular, we don't want to set voters and non-voters separately
because it could lead to inconsistencies or even the loss of quorum.

This change also partially reverts the commit 115005d, as we will only
need the convenience wrappers for removing the voters (not for adding
them).

Refs: scylladb/scylladb#18793
2025-03-03 15:19:58 +01:00
Emil Maskovsky
074f4fcdf1 raft topology: drop removing the node from raft config via storage_service
For the limited voters feature to work properly we need to make sure
that we are only managing the voter status through the topology
coordinator. This means that we should not change the node votership
from the storage_service module for the raft topology directly.

This needs to be done in addition to dropping of the votership change
from the storage_service module.

The `remove_from_raft_config` is redundant and can be removed because
a successfully completed `removenode` operation implies that the node
has been removed from group 0 by the topology coordinator.

Refs: scylladb/scylladb#22860
Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
2025-03-03 15:15:43 +01:00
Emil Maskovsky
834f506790 raft topology: drop changing the raft voters config via storage_service
For the limited voters feature to work properly we need to make sure
that we are only managing the voter status through the topology
coordinator. This means that we should not change the node votership
from the storage_service module for the raft topology directly.

We can drop the voter status changes from the storage_service module
because the topology coordinator will handle the votership changes
eventually. The calls in the storage_service module were not essential
and were only used for optimization (improving the HA under certain
conditions).

This affects the timing in the tablets migration test, though,
as it relied on the node being made a non-voter from the storage_service
`raft_removenode()` function. The fix is to add another server to the
topology to make sure we keep the quorum.

Previously the test worked because the test waits for an injection to be
reached and it was ensured that the injection (log line) has only been
triggered after the node has been made non-voter from the
`raft_removenode()`. This is not the case anymore. An alternative fix
would be to wait for the first node to be made non-voter before stopping
the second server, but this would make the test more complex (and it is
not strictly required to only use 4 servers in the test, it has been
only done for optimization purposes).

Fixes: scylladb/scylladb#22860

Refs: scylladb/scylladb#18793
Refs: scylladb/scylladb#21969
2025-03-03 15:15:43 +01:00
Artsiom Mishuta
90106c6f19 test.py: skip test_incremental_read_repair[row-tombstone]
skip test test_incremental_read_repair[row-tombstone]
due to https://github.com/scylladb/scylladb/issues/21179

Closes scylladb/scylladb#23126
2025-03-03 15:26:28 +02:00
Nadav Har'El
ea19b79fe2 Merge 'De-duplicate API's table name to table ID conversion' from Pavel Emelyanov
This is a continuation of #21533

There are two almost identical helpers in api/ -- validate_table(ks, cf) and get_uuid(ks, cf). Both check whether the ks:cf table exists, throwing bad_param_exception if it doesn't. There's a slight difference in their usage: callers of the latter get the table_id found and make use of it, while the former helper is void and its callers need to search for the uuid again if they need it (spoiler: they do).

This PR merges two helpers together, so there's less code to maintain. As a nice side effect, the existing validate_table() callers save one re-lookup of the ks:cf pair in database mappings.
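
The pattern can be sketched in Python, although the real helpers are C++ code in api/. The dictionary-based table map, the exception class and the error text are hypothetical stand-ins; only the helper name and the idea of returning the table ID after validation come from the description above.

```python
class BadParamException(Exception):
    """Stand-in for the API's bad_param_exception."""


def validate_table(tables: dict, ks: str, cf: str) -> str:
    """Check that ks:cf exists and return its table ID from the same lookup."""
    try:
        return tables[(ks, cf)]
    except KeyError:
        raise BadParamException(f"table {ks}:{cf} not found") from None
```

A caller can then write `table_id = validate_table(tables, "ks1", "cf1")` and use the result directly, instead of validating first and looking the table up a second time.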

Affected endpoints are validated by existing tests:
* column_family/{autocompaction|tombstone_gc|compaction_strategy}, validated by the tests described in #21533
* /storage_service/{range_to_endpoint_map|describe_ring|ownership}, validated by nodetool tests
* /storage_service/tablets/{move|repair}, validated by tablets move and repair tests

Closes scylladb/scylladb#22742

* github.com:scylladb/scylladb:
  api: Remove get_uuid() local helper
  api: Make use of validate_table()'s table_id
  api: Make validate_table() helper return table_id after validation
  api: Change validate_table()'s ctx argument to database
2025-03-03 13:39:50 +02:00
Kefu Chai
5571b537b5 tree: Make values mutable to enable move semantics
Previously, variables were marked as const, causing std::move() calls to
be redundant as reported by GCC warnings. This change either removes
const qualifiers or marks related lambdas as mutable, allowing the
compiler to properly utilize move constructors for better performance.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23066
2025-03-03 13:53:02 +03:00